PySpark : Truncate date and timestamp in PySpark [date_trunc and trunc]


pyspark.sql.functions.date_trunc(format, timestamp)

date_trunc() is a truncation function offered by the Spark DataFrame SQL functions. It returns a timestamp in the format "yyyy-MM-dd HH:mm:ss.SSSS", truncated to the unit specified by the format argument, which supports year, month, day, hour, minute, and second (among others).

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_trunc
# Initialize a Spark session
spark = SparkSession.builder.appName("date_trunc_example").getOrCreate()
# Create a DataFrame with a single column of timestamp strings
df = spark.createDataFrame([
    ("2023-01-01 10:10:10",),
    ("2023-02-01 11:11:11",),
    ("2023-03-01 12:12:12",),
    ("2023-03-31 11:11:11",),
    ("2023-01-31 11:11:11",),
    ("2023-02-01 11:11:11",)], ["timestamp"])
# Round down each timestamp to the start of its month
df_truncated = df.withColumn("month", date_trunc("month", df["timestamp"]))
df_truncated.show(20, False)
Output
+-------------------+-------------------+
|          timestamp|              month|
+-------------------+-------------------+
|2023-01-01 10:10:10|2023-01-01 00:00:00|
|2023-02-01 11:11:11|2023-02-01 00:00:00|
|2023-03-01 12:12:12|2023-03-01 00:00:00|
|2023-03-31 11:11:11|2023-03-01 00:00:00|
|2023-01-31 11:11:11|2023-01-01 00:00:00|
|2023-02-01 11:11:11|2023-02-01 00:00:00|
+-------------------+-------------------+

In this example, the date_trunc function rounds each value in the "timestamp" column down to the start of its month. The output is a new DataFrame with two columns: the original "timestamp" column and a new "month" column containing the truncated timestamps (note that the time portion is reset to 00:00:00).
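The same pattern works for finer-grained units. As a minimal sketch (reusing the df created above; the column name "hour" is just our choice), truncating to the hour zeroes out the minutes and seconds:

from pyspark.sql.functions import date_trunc
# Truncate each timestamp to the start of its hour
df_hourly = df.withColumn("hour", date_trunc("hour", df["timestamp"]))
df_hourly.show(truncate=False)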

pyspark.sql.functions.trunc

The trunc function in PySpark also truncates date or timestamp values, but it returns a date (not a timestamp) and supports only coarse units such as year and month (plus quarter and week in recent Spark versions). Finer units such as day are not supported, as the last example below shows.

Here’s an example of using trunc on date or timestamp values in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import trunc
# Initialize a Spark session
spark = SparkSession.builder.appName("trunc example").getOrCreate()
# Create a DataFrame with a single column of timestamp strings
df = spark.createDataFrame([
    ("2023-01-01 10:10:10",),
    ("2023-02-01 11:11:11",),
    ("2023-03-01 12:12:12",),
    ("2023-03-31 11:11:11",),
    ("2023-01-31 11:11:11",),
    ("2023-02-01 11:11:11",)], ["timestamp"])
# Truncate to the first day of the year
df.select(df.timestamp, trunc(df.timestamp, 'year').alias('year')).show()
Output
+-------------------+----------+
|          timestamp|      year|
+-------------------+----------+
|2023-01-01 10:10:10|2023-01-01|
|2023-02-01 11:11:11|2023-01-01|
|2023-03-01 12:12:12|2023-01-01|
|2023-03-31 11:11:11|2023-01-01|
|2023-01-31 11:11:11|2023-01-01|
|2023-02-01 11:11:11|2023-01-01|
+-------------------+----------+
# Truncate to the first day of the month
df.select(df.timestamp, trunc(df.timestamp, 'month').alias('month')).show()
+-------------------+----------+
|          timestamp|     month|
+-------------------+----------+
|2023-01-01 10:10:10|2023-01-01|
|2023-02-01 11:11:11|2023-02-01|
|2023-03-01 12:12:12|2023-03-01|
|2023-03-31 11:11:11|2023-03-01|
|2023-01-31 11:11:11|2023-01-01|
|2023-02-01 11:11:11|2023-02-01|
+-------------------+----------+
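In Spark 3.x, trunc also accepts 'quarter' and 'week'. A quick sketch (output omitted; this assumes the same df as above):

from pyspark.sql.functions import trunc
# 'quarter' snaps to the first day of the quarter,
# 'week' to the Monday of the week the date falls in
df.select(
    df.timestamp,
    trunc(df.timestamp, 'quarter').alias('quarter'),
    trunc(df.timestamp, 'week').alias('week'),
).show()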
# Attempt to truncate to 'day' -- not a supported format for trunc()
df.select(df.timestamp, trunc(df.timestamp, 'day').alias('day')).show()
+-------------------+----+
|          timestamp| day|
+-------------------+----+
|2023-01-01 10:10:10|null|
|2023-02-01 11:11:11|null|
|2023-03-01 12:12:12|null|
|2023-03-31 11:11:11|null|
|2023-01-31 11:11:11|null|
|2023-02-01 11:11:11|null|
+-------------------+----+
Keep in mind that trunc() does not support the 'day' format: given an unsupported format, it returns null for every row, as shown above. Use date_trunc() when you need to truncate to the day or a finer unit.
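As a workaround for day-level truncation, a minimal sketch (the column names day_ts and day_date are our own) uses date_trunc, which does support 'day', and to_date when a plain date is all you need:

from pyspark.sql.functions import date_trunc, to_date
# date_trunc('day', ...) returns a timestamp at midnight;
# to_date(...) returns a date with no time component at all
df.select(
    df.timestamp,
    date_trunc("day", df["timestamp"]).alias("day_ts"),
    to_date(df["timestamp"]).alias("day_date"),
).show(truncate=False)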