In PySpark, dates are stored as DateType and timestamps as TimestampType. When working with either, it is often necessary to render the value as text in a specific format. This is where the date_format function in PySpark comes in handy.
The date_format function in PySpark takes two arguments: the first is the date (or timestamp) column, and the second is a pattern string describing the desired format. The function returns a new column of string type containing the formatted date.
Here’s a simple example to demonstrate the use of the date_format function in PySpark:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, date_format

    # Initializing Spark Session
    spark = SparkSession.builder.appName("DateFormatting").getOrCreate()

    # Creating DataFrame with sample data (the dates are plain strings)
    data = [("2023-02-01",), ("2023-02-02",), ("2023-02-03",)]
    df = spark.createDataFrame(data, ["date"])

    # Formatting date column in desired format
    df = df.withColumn("formatted_date", date_format(col("date"), "dd-MM-yyyy"))

    # Showing the result
    df.show()
This will produce the following output:
    +----------+--------------+
    |      date|formatted_date|
    +----------+--------------+
    |2023-02-01|    01-02-2023|
    |2023-02-02|    02-02-2023|
    |2023-02-03|    03-02-2023|
    +----------+--------------+
In the above example, the date_format function renders each value of the date column in dd-MM-yyyy form. The date column here holds strings, so Spark implicitly casts them to timestamps before formatting, and the resulting formatted_date column is itself a string.
The date format string used in the second argument of the date_format function is made up of special characters that represent various parts of the date and time. Some of the most commonly used special characters in the date format string are:
- dd: Represents the day of the month (01 to 31).
- MM: Represents the month of the year (01 to 12).
- yyyy: Represents the year (with four digits).
There are many other pattern letters that can be used in the format string, covering hours, minutes, seconds, month names, weekday names and more. A complete list can be found on the "Datetime Patterns for Formatting and Parsing" page of the Spark SQL documentation.
In conclusion, the date_format function in PySpark is a useful tool for rendering dates and timestamps as strings in a chosen format, which comes up in many data processing tasks such as reporting and exporting data.