PySpark: Exploring the last_day function with a detailed example


PySpark is the Python API for Apache Spark. Among its built-in date functions, last_day returns the last date of the month that contains a given date. In this article, we cover the last_day function, its syntax, and a detailed example that applies it to sample input data.

  1. The last_day function in PySpark

The last_day function is part of the pyspark.sql.functions module, which provides various functions for working with dates and times. It is useful when you need to perform time-based aggregations or calculations keyed on the end of the month, such as monthly totals or month-end reporting dates.

Syntax:

pyspark.sql.functions.last_day(date)

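Here, date is a column (or the name of a column) or an expression that evaluates to a date or a timestamp. The function returns a date column holding the last day of the month to which each input date belongs.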
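To make the semantics concrete, here is a minimal plain-Python sketch of the same calculation for a single date. It does not use Spark at all; the helper name last_day_of_month is hypothetical and exists only for illustration. It relies on the standard library's calendar.monthrange, which accounts for leap years.

```python
import calendar
from datetime import date


def last_day_of_month(d: date) -> date:
    """Return the last calendar day of the month containing d.

    Illustrative helper only; this is not part of PySpark.
    """
    # monthrange returns (weekday of the first day, number of days in the month)
    _, days_in_month = calendar.monthrange(d.year, d.month)
    return d.replace(day=days_in_month)


print(last_day_of_month(date(2023, 2, 25)))  # 2023-02-28
print(last_day_of_month(date(2024, 2, 10)))  # 2024-02-29 (leap year)
```

PySpark's last_day applies the same idea to every row of a column, so no per-row Python function is needed.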

  2. A detailed example of using the last_day function

To illustrate the usage of the last_day function, let’s create a PySpark DataFrame containing date information and apply the function to it.

First, let’s import the necessary libraries and create a sample DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import last_day, to_date

# Create a Spark session
spark = SparkSession.builder.master("local").appName("last_day Function Example @ Freshers.in").getOrCreate()

# Sample data
data = [("2023-01-15",), ("2023-02-25",), ("2023-03-05",), ("2023-04-10",)]

# Define the column names (the column types are inferred from the data)
schema = ["Date"]

# Create the DataFrame
df = spark.createDataFrame(data, schema)

# Convert the date string to date type
df = df.withColumn("Date", to_date(df["Date"], "yyyy-MM-dd"))

Now that we have our DataFrame, let’s apply the last_day function to it:

# Apply the last_day function
df = df.withColumn("Last Day of Month", last_day(df["Date"]))

# Show the results
df.show()

Output:

+----------+-----------------+
|      Date|Last Day of Month|
+----------+-----------------+
|2023-01-15|       2023-01-31|
|2023-02-25|       2023-02-28|
|2023-03-05|       2023-03-31|
|2023-04-10|       2023-04-30|
+----------+-----------------+

In this example, we created a PySpark DataFrame with a date column and applied the last_day function to compute the last day of the month for each date. The output shows each original date next to its month end. Note that February 2023 ends on the 28th because 2023 is not a leap year.

The PySpark last_day function is a convenient way to determine the last day of the month for a given date without writing your own calendar logic. With the example above as a starting point, you should be able to apply last_day in your own PySpark projects.
