PySpark: A Comprehensive Guide to PySpark’s current_date and current_timestamp Functions

PySpark enables data engineers and data scientists to perform distributed data processing tasks efficiently. In this article, we will explore two essential PySpark functions: current_date and current_timestamp. These functions allow us to retrieve the current date and timestamp within a Spark application, enabling us to perform time-based operations and gain valuable insights from our data.

Understanding current_date and current_timestamp:

Before diving into the details, let’s take a moment to understand the purpose of these functions:

current_date: This function returns the current date as a DateType column, displayed as ‘yyyy-MM-dd’. Spark evaluates it once at the start of query evaluation, in the session time zone, so every call within the same query returns the same value.

current_timestamp: This function returns the current timestamp as a TimestampType column, displayed as ‘yyyy-MM-dd HH:mm:ss.SSS’ (the value itself is stored with microsecond precision). Like current_date, it is evaluated once at the start of query evaluation, so every row in the same query gets the same timestamp.

Example Usage:
To demonstrate the usage of current_date and current_timestamp in PySpark, let’s consider a scenario where we have a dataset containing customer orders. We want to stamp each order with the current date and timestamp, and then filter and compute on those columns.

Step 1: Importing the necessary libraries and creating a SparkSession.

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, current_timestamp

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Current Date and Timestamp Example at Freshers.in") \
    .getOrCreate()

Step 2: Creating a sample DataFrame and adding the current date and timestamp.

# Sample DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["Name", "OrderID"])

# Adding current date and timestamp columns
df_with_date = df.withColumn("CurrentDate", current_date())
df_with_timestamp = df_with_date.withColumn("CurrentTimestamp", current_timestamp())

# Show the resulting DataFrame
df_with_timestamp.show()

Output

+-------+-------+-----------+--------------------+
|   Name|OrderID|CurrentDate|    CurrentTimestamp|
+-------+-------+-----------+--------------------+
|  Alice|      1| 2023-05-22|2023-05-22 10:15:...|
|    Bob|      2| 2023-05-22|2023-05-22 10:15:...|
|Charlie|      3| 2023-05-22|2023-05-22 10:15:...|
+-------+-------+-----------+--------------------+

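As seen in the output, we added two new columns to the DataFrame: “CurrentDate” and “CurrentTimestamp.” Every row carries the same date and timestamp, because Spark evaluates both functions once per query. The timestamp appears cut off because show() truncates values longer than 20 characters by default.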

Step 3: Filtering data based on the current date.

# Filter orders placed on the current date
current_date_orders = df_with_timestamp.filter(df_with_timestamp.CurrentDate == current_date())

# Show the filtered DataFrame
current_date_orders.show()

Output

+-------+-------+-----------+--------------------+
|   Name|OrderID|CurrentDate|    CurrentTimestamp|
+-------+-------+-----------+--------------------+
|  Alice|      1| 2023-05-22|2023-05-22 10:15:...|
|    Bob|      2| 2023-05-22|2023-05-22 10:15:...|
|Charlie|      3| 2023-05-22|2023-05-22 10:15:...|
+-------+-------+-----------+--------------------+

Step 4: Performing time-based operations using current_timestamp.

# Calculate the seconds elapsed between the current timestamp and the stored timestamp
from pyspark.sql.functions import unix_timestamp

df_with_timestamp = df_with_timestamp.withColumn(
    "TimeElapsed",
    unix_timestamp(current_timestamp()) - unix_timestamp(df_with_timestamp.CurrentTimestamp)
)

# Show the DataFrame with the time elapsed
df_with_timestamp.show()

Output

+-------+-------+-----------+--------------------+-----------+
|   Name|OrderID|CurrentDate|    CurrentTimestamp|TimeElapsed|
+-------+-------+-----------+--------------------+-----------+
|  Alice|      1| 2023-05-22|2023-05-22 10:15:...|          0|
|    Bob|      2| 2023-05-22|2023-05-22 10:15:...|          0|
|Charlie|      3| 2023-05-22|2023-05-22 10:15:...|          0|
+-------+-------+-----------+--------------------+-----------+

In the above code snippet, we convert both timestamps to seconds since the Unix epoch with unix_timestamp and subtract them. Because “CurrentTimestamp” was itself produced by current_timestamp(), and Spark evaluates that function once per query, the difference is 0 for every row in this sample. With a real order placement timestamp in place of “CurrentTimestamp,” the “TimeElapsed” column gives the number of seconds since each order was placed. This can be useful for analyzing time-based metrics and understanding the timing patterns of the orders.

In this article, we explored the PySpark functions current_date and current_timestamp. These functions return the current date and timestamp within a Spark application, evaluated once per query. By incorporating them into our PySpark workflows, we can stamp rows with the processing time, filter records by today’s date, and measure elapsed time against stored timestamps.

Author: user
