We specify inferSchema=True when reading a CSV file. With this option set, Spark determines the data type of each column from the values stored in it. However, because Spark cannot reliably deduce a schema for date and timestamp fields, it often reads these columns as strings instead. In this recipe we look at several approaches to solving this problem.
Here we explain both the DataFrame method and the Spark SQL way of converting a string column to the date data type.
pyspark.sql.functions.to_date
Converts a Column into pyspark.sql.types.DateType using the optionally supplied format. The format should be specified using date/time pattern strings. If the format is omitted, it follows the casting conventions to pyspark.sql.types.DateType; in that case it is equivalent to col.cast("date").
Sample code to show how to_date works
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date
from pyspark.sql.types import StringType, IntegerType
from pyspark.sql.types import StructType, StructField
spark = SparkSession.builder.appName('www.freshers.in training : to_date').getOrCreate()
car_data = [
(1,"Japan","2023-01-11"),
(2,"Italy","2023-04-21"),
(3,"France","2023-05-22"),
(4,"India","2023-07-18"),
(5,"USA","2023-08-23"),
]
car_data_schema = StructType([
StructField("si_no",IntegerType(),True),
StructField("country_origin",StringType(),True),
StructField("car_make_year",StringType(),True)
])
car_df = spark.createDataFrame(data=car_data, schema=car_data_schema)
car_df.printSchema()
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- car_make_year: string (nullable = true)
Applying to_date function
car_df_updated = car_df.withColumn("car_make_year_dt",to_date("car_make_year"))
car_df_updated.show()
+-----+--------------+-------------+----------------+
|si_no|country_origin|car_make_year|car_make_year_dt|
+-----+--------------+-------------+----------------+
| 1| Japan| 2023-01-11| 2023-01-11|
| 2| Italy| 2023-04-21| 2023-04-21|
| 3| France| 2023-05-22| 2023-05-22|
| 4| India| 2023-07-18| 2023-07-18|
| 5| USA| 2023-08-23| 2023-08-23|
+-----+--------------+-------------+----------------+
Check the printed schema: you can see the date data type for the new column car_make_year_dt.
car_df_updated.printSchema()
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- car_make_year: string (nullable = true)
|-- car_make_year_dt: date (nullable = true)
The same conversion can be done the SQL way by first creating a temp view:
car_df.createOrReplaceTempView("car_table")
spark.sql("select si_no,country_origin, to_date(car_make_year) from car_table").show()
+-----+--------------+----------------------------------+
|si_no|country_origin|to_date(car_table.`car_make_year`)|
+-----+--------------+----------------------------------+
| 1| Japan| 2023-01-11|
| 2| Italy| 2023-04-21|
| 3| France| 2023-05-22|
| 4| India| 2023-07-18|
| 5| USA| 2023-08-23|
+-----+--------------+----------------------------------+
To check the schema:
spark.sql("select si_no,country_origin, to_date(car_make_year) from car_table").printSchema()
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- to_date(car_table.`car_make_year`): date (nullable = true)