I have a use case where I want to add months to a date column in spark DataFrame
Function : pyspark.sql.functions.add_months
Syntax : pyspark.sql.functions.add_months(start, months)
This will returns the date that is months months after start
Sample code :
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("Freshers.in Learn add_months").getOrCreate()
df = spark.read.csv("/content/sample_data/subscription.csv", header=True, inferSchema=True)
df.printSchema()
root
|-- sino: integer (nullable = true)
|-- name: string (nullable = true)
|-- date_of_joining: timestamp (nullable = true)
|-- subscribed_for: integer (nullable = true)
df.show(20,False)
+----+-----------+-------------------+--------------+
|sino| name| date_of_joining|subscribed_for|
+----+-----------+-------------------+--------------+
| 1| Sam Peter|2022-04-09 00:00:00| 7|
| 2|John Manual|2022-09-11 00:00:00| 12|
| 3| Eric Burst|2022-11-12 00:00:00| 6|
| 4| Tim Moris|2022-10-13 00:00:00| 8|
| 5| Jack Berry|2022-12-14 00:00:00| 10|
+----+-----------+-------------------+--------------+
Implementing add_months to the dataframe
from pyspark.sql.functions import add_months
df.select(df.sino,df.name,df.date_of_joining,df.subscribed_for,
add_months(df.date_of_joining,df.subscribed_for).alias("renewal_date")).show(20,False)
+----+-----------+-------------------+--------------+------------+
|sino|name |date_of_joining |subscribed_for|renewal_date|
+----+-----------+-------------------+--------------+------------+
|1 |Sam Peter |2022-04-09 00:00:00|7 |2022-11-09 |
|2 |John Manual|2022-09-11 00:00:00|12 |2023-09-11 |
|3 |Eric Burst |2022-11-12 00:00:00|6 |2023-05-12 |
|4 |Tim Moris |2022-10-13 00:00:00|8 |2023-06-13 |
|5 |Jack Berry |2022-12-14 00:00:00|10 |2023-10-14 |
+----+-----------+-------------------+--------------+------------+
Reference