PySpark : Understanding the PySpark next_day Function


Time series data often involves handling and manipulating dates. Apache Spark, through its PySpark interface, provides an arsenal of date-time functions that simplify this task. One such function is next_day(), a powerful function used to find the next specified day of the week from a given date. This article will provide an in-depth look into the usage and application of the next_day() function in PySpark.

The next_day() function takes two arguments: a date and a day of the week. It returns the date of the first occurrence of the specified day that falls strictly after the given date. For instance, if the given date is a Monday and the specified day is ‘Thursday’, the function will return the date of the coming Thursday.

The next_day() function treats the day-of-week argument case-insensitively and accepts it either in full (like ‘Monday’) or in abbreviated form (like ‘Mon’).
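
To illustrate both points, here is a minimal sketch; it assumes a SparkSession named spark already exists (creating one is shown in the next step). 2023-07-03 is a Monday, so asking for the next Thursday, whether spelled out or abbreviated, should return 2023-07-06.

from pyspark.sql.functions import lit, to_date, next_day
# Assumes `spark` is an existing SparkSession (see the next step).
spark.range(1).select(
    next_day(to_date(lit("2023-07-03")), "Thursday").alias("full_name"),
    next_day(to_date(lit("2023-07-03")), "thu").alias("abbreviated")
).show()
# Both columns should show 2023-07-06, the Thursday after Monday 2023-07-03.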

To begin with, let’s initialize a SparkSession, the entry point to any Spark functionality.

from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()

Create a DataFrame with a single column, date, filled with some hardcoded date values.

data = [("2023-07-04",),
        ("2023-12-31",),
        ("2022-02-28",)]
df = spark.createDataFrame(data, ["date"])
df.show()

Output

+----------+
|      date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+

Since the dates are in string format, we need to convert them to the date type using the to_date function.

from pyspark.sql.functions import col, to_date
df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
df.show()

Use the next_day() function to find the next Sunday from each date.

from pyspark.sql.functions import next_day
df = df.withColumn("next_sunday", next_day("date", 'Sunday'))
df.show()

Result DataFrame 

+----------+-----------+
|      date|next_sunday|
+----------+-----------+
|2023-07-04| 2023-07-09|
|2023-12-31| 2024-01-07|
|2022-02-28| 2022-03-06|
+----------+-----------+
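
Note that 2023-12-31 is itself a Sunday, yet the result is 2024-01-07: next_day() always returns a date strictly after the input date. To verify the earlier claim that the day name is case-insensitive and may be abbreviated, the sketch below reuses the df built above; all three spellings should produce the same dates as next_sunday.

from pyspark.sql.functions import next_day
# Each spelling of the day name should yield an identical column.
check = (df
         .withColumn("full", next_day("date", "Sunday"))
         .withColumn("lower", next_day("date", "sunday"))
         .withColumn("abbrev", next_day("date", "Sun")))
check.show()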

The next_day() function in PySpark is a powerful tool for manipulating date-time data, particularly when you need to perform operations based on the days of the week.
