pyspark.sql.functions.datediff and pyspark.sql.functions.floor
In this article we will learn two function , mainly datediff and floor.
pyspark.sql.functions.datediff : To get difference between two dates in days in pyspark you can use datediff().
pyspark.sql.functions.floor : The column name is sent as a parameter to the function in PySpark, which rounds down the column and stores the results in a separate column as seen below.
Example : In this example you will see how to find the age of a person by giving the date of birth.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,FloatType,DateType
from pyspark.sql.functions import datediff
from pyspark.sql.functions import current_date
from pyspark.sql import functions as F
import datetime
cust_data = ([
(1,"Jack Willsmith",datetime.datetime.strptime('2000-02-01', "%Y-%m-%d").date()),\
(2,"Twinkle We",datetime.datetime.strptime('2001-03-12', "%Y-%m-%d").date()),\
(3,"Abil Wicky",datetime.datetime.strptime('2002-01-22', "%Y-%m-%d").date()),\
(4,"Kore Sam",datetime.datetime.strptime('2001-06-01', "%Y-%m-%d").date()),\
(5,"Bob Smith",datetime.datetime.strptime('2001-07-02', "%Y-%m-%d").date())])
cust_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("customer_name",StringType(),True),
StructField("customer_dob",DateType(),True)])
cust_data_df = spark.createDataFrame(data=cust_data,schema=cust_schema)
cust_data_df.printSchema()
cust_data_df.show()
cust_data_df.withColumn('age',F.floor(datediff(current_date(),cust_data_df["customer_dob"]))).show()
cust_data_df.withColumn('age',F.floor(datediff(current_date(),cust_data_df["customer_dob"])/365)).show()
Output
Source Schema
root
|-- si_no: integer (nullable = true)
|-- customer_name: string (nullable = true)
|-- customer_dob: date (nullable = true)
Raw DataFrame
+-----+--------------+------------+
|si_no| customer_name|customer_dob|
+-----+--------------+------------+
| 1|Jack Willsmith| 2000-02-01|
| 2| Twinkle We| 2001-03-12|
| 3| Abil Wicky| 2002-01-22|
| 4| Kore Sam| 2001-06-01|
| 5| Bob Smith| 2001-07-02|
+-----+--------------+------------+
Datediff with number of days : current date – customer_dob [ When I write this post its 03-09-2022 ]
+-----+--------------+------------+----+
|si_no| customer_name|customer_dob| age|
+-----+--------------+------------+----+
| 1|Jack Willsmith| 2000-02-01|8250|
| 2| Twinkle We| 2001-03-12|7845|
| 3| Abil Wicky| 2002-01-22|7529|
| 4| Kore Sam| 2001-06-01|7764|
| 5| Bob Smith| 2001-07-02|7733|
+-----+--------------+------------+----+
Calculation of age by dividing by 365
+-----+--------------+------------+---+
|si_no| customer_name|customer_dob|age|
+-----+--------------+------------+---+
| 1|Jack Willsmith| 2000-02-01| 22|
| 2| Twinkle We| 2001-03-12| 21|
| 3| Abil Wicky| 2002-01-22| 20|
| 4| Kore Sam| 2001-06-01| 21|
| 5| Bob Smith| 2001-07-02| 21|
+-----+--------------+------------+---+
Reference