Working with dates
Working with dates and times is a common task in data analysis. Apache Spark provides a variety of functions for manipulating date and time data types, including one that extracts the month from a date. In this article, we will explore how to use the month() function in PySpark to extract the month of a given date as an integer.
The month() function extracts the month part from a given date and returns it as an integer. For example, if you have a date “2023-07-04”, applying the month() function to this date will return the integer value 7.
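As a quick sanity check of the expected value, the same date can be parsed without Spark. This is only a sketch using Python's standard library, not PySpark itself; the variable names are illustrative. It relies on date.month using the same 1 (January) to 12 (December) numbering that month() returns.

```python
from datetime import date

# Parse the article's example date with the standard library (no Spark needed).
example = date.fromisoformat("2023-07-04")

# date.month is an int from 1 (January) to 12 (December),
# the same numbering Spark's month() uses.
month_number = example.month
print(month_number)  # 7
```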
First, let’s set up a SparkSession, which is the entry point to Spark’s DataFrame and SQL functionality.
Sample code for Extracting the Month from a Date in PySpark
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.getOrCreate()
Next, create a DataFrame with a single column called date that contains a few hard-coded date strings.
data = [("2023-07-04",),
("2023-12-31",),
("2022-02-28",)]
df = spark.createDataFrame(data, ["date"])
df.show()
Output
+----------+
|      date|
+----------+
|2023-07-04|
|2023-12-31|
|2022-02-28|
+----------+
As our dates are in string format, we need to convert them into date type using the to_date function.
from pyspark.sql.functions import col, to_date
df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
df.show()
Let’s use the month() function to extract the month from the date column.
from pyspark.sql.functions import month
df = df.withColumn("month", month("date"))
df.show()
Result
+----------+-----+
|      date|month|
+----------+-----+
|2023-07-04|    7|
|2023-12-31|   12|
|2022-02-28|    2|
+----------+-----+
As you can see, the month column contains the month of the corresponding date in the date column, as an integer from 1 to 12. The month() function in PySpark provides a simple way to retrieve the month part of a date and, along with the other date-time functions in PySpark, simplifies the handling of date-time data.