In this article, we will delve into the date_sub function in PySpark. This versatile function allows us to subtract a specified number of days from a given date, enabling us to perform date-based operations and gain valuable insights from our data.
from pyspark.sql.functions import date_sub
Understanding date_sub:
The date_sub function in PySpark subtracts a specified number of days from a given date. It helps us analyze historical data, calculate intervals, and perform various time-based computations within our Spark applications.
Syntax:
The syntax for using date_sub in PySpark is as follows:
date_sub(start_date, days)
Here, start_date is the date column (or a string in yyyy-MM-dd format that can be cast to a date) from which we want to subtract days, and days is the integer number of days to subtract.
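As a quick, self-contained sketch of the signature (the one-row DataFrame and the column name d below are placeholders used only for illustration), note that passing a negative days value moves the date forward instead of backward:
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_sub, to_date
spark = SparkSession.builder.appName("date_sub syntax sketch").getOrCreate()
# One-row DataFrame with a single string date, used only to demonstrate the call
sample = spark.createDataFrame([("2023-05-22",)], ["d"])
sample.select(
    date_sub(to_date("d"), 3).alias("minus_3"),   # 2023-05-19
    date_sub(to_date("d"), -3).alias("plus_3")    # 2023-05-25 (negative days add days)
).show()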
Example Usage:
To illustrate the usage of date_sub in PySpark, let’s consider a scenario where we have a dataset containing sales records. We want to shift each sale date back by 7 days and then use the shifted dates to filter for the most recent sales.
Step 1: Importing the necessary libraries and creating a SparkSession.
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_sub
# Create a SparkSession
spark = SparkSession.builder \
    .appName("date_sub Example at Freshers.in") \
    .getOrCreate()
Step 2: Creating a sample DataFrame with hardcoded values.
# Sample DataFrame with hardcoded values
data = [("Product A", "2023-05-15", 100),
("Product B", "2023-05-16", 150),
("Product C", "2023-05-17", 200),
("Product D", "2023-05-18", 120),
("Product E", "2023-05-19", 90),
("Product F", "2023-05-20", 180),
("Product G", "2023-05-21", 210),
("Product H", "2023-05-22", 160)]
df = spark.createDataFrame(data, ["Product", "Date", "Sales"])
# Show the initial DataFrame
df.show()
Result
+---------+----------+-----+
|  Product|      Date|Sales|
+---------+----------+-----+
|Product A|2023-05-15|  100|
|Product B|2023-05-16|  150|
|Product C|2023-05-17|  200|
|Product D|2023-05-18|  120|
|Product E|2023-05-19|   90|
|Product F|2023-05-20|  180|
|Product G|2023-05-21|  210|
|Product H|2023-05-22|  160|
+---------+----------+-----+
Step 3: Subtracting days using date_sub.
# Subtract 7 days from the Date column
df_subtracted = df.withColumn("SubtractedDate", date_sub(df.Date, 7))
# Show the resulting DataFrame
df_subtracted.show()
Result
+---------+----------+-----+--------------+
|  Product|      Date|Sales|SubtractedDate|
+---------+----------+-----+--------------+
|Product A|2023-05-15|  100|    2023-05-08|
|Product B|2023-05-16|  150|    2023-05-09|
|Product C|2023-05-17|  200|    2023-05-10|
|Product D|2023-05-18|  120|    2023-05-11|
|Product E|2023-05-19|   90|    2023-05-12|
|Product F|2023-05-20|  180|    2023-05-13|
|Product G|2023-05-21|  210|    2023-05-14|
|Product H|2023-05-22|  160|    2023-05-15|
+---------+----------+-----+--------------+
In the above code snippet, we used the `date_sub` function to subtract 7 days from the “Date” column in the DataFrame. The resulting column, “SubtractedDate,” contains the dates obtained after subtracting 7 days.
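As an optional sanity check (not part of the original walkthrough), you can confirm that date_sub produced a proper date column even though “Date” was loaded as a string; the schema shown below assumes the DataFrame was created exactly as in Step 2.
# Inspect the schema of the resulting DataFrame
df_subtracted.printSchema()
# root
#  |-- Product: string (nullable = true)
#  |-- Date: string (nullable = true)
#  |-- Sales: long (nullable = true)
#  |-- SubtractedDate: date (nullable = true)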
Step 4: Filtering data based on the subtracted date.
# Keep rows whose subtracted date is on or after 2023-05-15
# (i.e. rows whose original Date is on or after 2023-05-22)
recent_sales = df_subtracted.filter(df_subtracted.SubtractedDate >= '2023-05-15')
# Show the filtered DataFrame
recent_sales.show()
Result
+---------+----------+-----+--------------+
|  Product|      Date|Sales|SubtractedDate|
+---------+----------+-----+--------------+
|Product H|2023-05-22|  160|    2023-05-15|
+---------+----------+-----+--------------+
By filtering the DataFrame on the “SubtractedDate” column, we kept the records where the subtracted date is greater than or equal to ‘2023-05-15’. Since “SubtractedDate” is simply “Date” shifted back by 7 days, this is equivalent to keeping records dated ‘2023-05-22’ or later; in this sample, only Product H qualifies.
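If the goal is literally “sales from the last 7 days relative to today”, a more direct pattern is to compare the date column against date_sub(current_date(), 7). The sketch below is an alternative to the two-step approach above, not part of the original example, and its output depends on the day you run it.
from pyspark.sql.functions import current_date, date_sub, to_date
# Keep rows whose Date falls within the last 7 days of the current system date
last_week_sales = df.filter(to_date("Date") >= date_sub(current_date(), 7))
last_week_sales.show()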
In this article we explored PySpark’s date_sub function, which subtracts a specified number of days from a given date. By incorporating it into our PySpark workflows, we can perform date-based operations, analyze historical data, and gain insights from our datasets. Whether it is calculating intervals, filtering data by timeframe, or performing other time-based computations, date_sub is a handy tool for date arithmetic in PySpark applications.