In PySpark, the Pandas API offers powerful functionality for working with time series data. One such function is date_range(), which generates a fixed-frequency DatetimeIndex. This article explores date_range() in depth, covering its syntax, parameters, and practical applications through illustrative examples.
Understanding date_range()
The date_range() function in the Pandas API on Spark is used to generate a DatetimeIndex with a fixed frequency. It enables the creation of sequences of dates or timestamps, facilitating time series analysis, visualization, and manipulation tasks.
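As a minimal sketch of the time series manipulation mentioned above, the snippet below uses plain pandas (as the examples in this article do) to attach a generated DatetimeIndex to a Series, slice it by date, and downsample it. The variable names and the sample values are illustrative assumptions, not data from a real source.

```python
import pandas as pd

# Six consecutive daily timestamps starting on January 1, 2022
idx = pd.date_range(start="2022-01-01", periods=6, freq="D")

# Hypothetical readings, one per day (values are made up for illustration)
readings = pd.Series([10, 12, 9, 14, 11, 13], index=idx)

# Label-based slicing on a DatetimeIndex includes both endpoints
first_three = readings.loc["2022-01-01":"2022-01-03"]

# Downsample into 2-day buckets and sum the values in each bucket
two_day_totals = readings.resample("2D").sum()

print(first_three)
print(two_day_totals)
```

Because the index carries a fixed frequency, operations such as slicing by date string and resampling work without any extra parsing step.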
Syntax
The syntax for date_range() is as follows:
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
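Here, start, end, periods, freq, tz, normalize, name, and closed are the parameters that control the generation of the DatetimeIndex: start and end set the bounds, periods sets the number of timestamps, freq sets the spacing (daily by default), tz sets the timezone, normalize snaps start and end to midnight, name labels the resulting index, and closed controls whether the endpoints are included.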
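The sketch below illustrates how several of these parameters interact, using plain pandas. It deliberately avoids closed, whose availability differs between pandas versions; the index name "event_time" is a made-up label for illustration.

```python
import pandas as pd

# The same range can be described by its end date or by a number of periods
by_end = pd.date_range(start="2022-01-01", end="2022-01-10", freq="3D")
by_periods = pd.date_range(start="2022-01-01", periods=4, freq="3D")

# normalize=True snaps the start to midnight before the dates are generated
normalized = pd.date_range(
    start="2022-01-01 15:30", periods=3, freq="D", normalize=True
)

# tz makes the index timezone-aware; name labels the index
named = pd.date_range(start="2022-01-01", periods=2, tz="UTC", name="event_time")

print(by_end)
print(normalized)
print(named)
```

Here by_end and by_periods contain the same four dates (January 1, 4, 7, and 10), which shows that end and periods are interchangeable ways of bounding the range once start and freq are fixed.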
Examples
Let’s explore various scenarios to understand the functionality of date_range():
Example 1: Converting a Date Range to a Spark DataFrame
from pyspark.sql import SparkSession
import pandas as pd
# Create a SparkSession
spark = SparkSession.builder \
    .appName("date_range_example @ Learning @ Freshers.in") \
    .getOrCreate()
# Generate a date range from January 1, 2022 to January 5, 2022
date_index = pd.date_range(start='2022-01-01', end='2022-01-05')
# Convert the pandas DatetimeIndex to a Spark DataFrame
df_pandas = pd.DataFrame(date_index, columns=['date'])
df_spark = spark.createDataFrame(df_pandas)
# Show the DataFrame
df_spark.show()
+-------------------+
|               date|
+-------------------+
|2022-01-01 00:00:00|
|2022-01-02 00:00:00|
|2022-01-03 00:00:00|
|2022-01-04 00:00:00|
|2022-01-05 00:00:00|
+-------------------+
Example 2: Generating a Basic Date Range
import pandas as pd
# Generate a date range from January 1, 2022 to January 5, 2022
date_index = pd.date_range(start='2022-01-01', end='2022-01-05')
print(date_index)
# Output:
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05'],
              dtype='datetime64[ns]', freq='D')
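Once a daily index like the one above exists, its elements can be formatted or filtered directly. The following is a small sketch in plain pandas; the variable names are illustrative.

```python
import pandas as pd

# Same five-day range as in the example above
date_index = pd.date_range(start="2022-01-01", end="2022-01-05")

# Format each date as a string
labels = date_index.strftime("%Y-%m-%d").tolist()

# Look up the weekday name of each date
weekdays = date_index.day_name().tolist()

# Keep only Saturdays and Sundays (dayofweek: Monday=0 ... Sunday=6)
weekend_only = date_index[date_index.dayofweek >= 5]

print(labels)
print(weekdays)
print(weekend_only)
```

January 1, 2022 fell on a Saturday, so weekend_only keeps the first two dates of the range.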
Example 3: Generating Timestamps with Specific Frequency
import pandas as pd
# Generate timestamps every hour for 3 days
date_index = pd.date_range(start='2022-01-01', periods=72, freq='H')
print(date_index)
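# Output: 72 hourly timestamps, from 2022-01-01 00:00:00 through 2022-01-03 23:00:00
Example 4: Generating a Timezone-Aware Date Range
import pandas as pd
# Generate daily dates for February 2022 in the America/New_York timezone
date_index = pd.date_range(start='2022-02-01', end='2022-02-28', freq='D', tz='America/New_York', name='date')
print(date_index)
# Output: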
DatetimeIndex(['2022-02-01 00:00:00-05:00', '2022-02-02 00:00:00-05:00',
               '2022-02-03 00:00:00-05:00', '2022-02-04 00:00:00-05:00',
               '2022-02-05 00:00:00-05:00', '2022-02-06 00:00:00-05:00',
               '2022-02-07 00:00:00-05:00', '2022-02-08 00:00:00-05:00',
               '2022-02-09 00:00:00-05:00', '2022-02-10 00:00:00-05:00',
               '2022-02-11 00:00:00-05:00', '2022-02-12 00:00:00-05:00',
               '2022-02-13 00:00:00-05:00', '2022-02-14 00:00:00-05:00',
               '2022-02-15 00:00:00-05:00', '2022-02-16 00:00:00-05:00',
               '2022-02-17 00:00:00-05:00', '2022-02-18 00:00:00-05:00',
               '2022-02-19 00:00:00-05:00', '2022-02-20 00:00:00-05:00',
               '2022-02-21 00:00:00-05:00', '2022-02-22 00:00:00-05:00',
               '2022-02-23 00:00:00-05:00', '2022-02-24 00:00:00-05:00',
               '2022-02-25 00:00:00-05:00', '2022-02-26 00:00:00-05:00',
               '2022-02-27 00:00:00-05:00', '2022-02-28 00:00:00-05:00'],
              dtype='datetime64[ns, America/New_York]', name='date', freq='D')
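A timezone-aware index such as the one shown above can be converted to another timezone or stripped of its timezone. The sketch below assumes plain pandas with timezone data available on the system, and uses a shorter three-day range for brevity.

```python
import pandas as pd

# Three daily timestamps at midnight, New York time
ny_index = pd.date_range(
    start="2022-02-01", periods=3, freq="D", tz="America/New_York", name="date"
)

# Convert to UTC: the instants stay the same, the wall-clock time shifts
utc_index = ny_index.tz_convert("UTC")

# Drop the timezone while keeping the local wall-clock time
naive_index = ny_index.tz_localize(None)

print(utc_index)
print(naive_index)
```

In February, New York is five hours behind UTC, so midnight local time corresponds to 05:00 UTC on the same day.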