PySpark: How to accept a date in a DataFrame : DateType can not accept object 'YYYY-MM-DD' in type <class 'str'>


Accepting a date in a DataFrame

When you define data as a list of tuples and try to read the date column, you will get the error: DateType can not accept object 'YYYY-MM-DD' in type <class 'str'>. The same issue can occur with a timestamp field (a sketch for that case is shown at the end of this post).

Consider the following data:

1,"Japan","2023-01-01"
2,"Italy","2023-01-01"
3,"France","2023-01-01"

We are going to read this by specifying the schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType, DateType
from pyspark.sql.types import StructType, StructField
from datetime import datetime

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()

# car_make_year is still a plain Python string here
car_data = [
    (1, "Japan", "2023-01-01"),
    (2, "Italy", "2023-01-01"),
    (3, "France", "2023-01-01"),
]
car_data_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("country_origin", StringType(), True),
    StructField("car_make_year", DateType(), True)
])
car_df = spark.createDataFrame(data=car_data, schema=car_data_schema)

Running this code raises the following error:

TypeError: field car_make_year: DateType can not accept object '2023-01-01' in type <class 'str'>

How to solve this

To fix this, the date (which is currently a string) needs to be converted to a Python datetime object (<class 'datetime.datetime'>).

For clarity, this is how the data needs to look:

car_data = [
    (1, "Japan", datetime.strptime("2023-01-01", "%Y-%m-%d")),
    (2, "Italy", datetime.strptime("2023-01-01", "%Y-%m-%d")),
    (3, "France", datetime.strptime("2023-01-01", "%Y-%m-%d"))
]
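
Note that DateType also accepts plain datetime.date objects, so you are not limited to datetime.datetime. The snippet below is a minimal sketch of that variant (it is not part of the original example):

from datetime import date, datetime

# Assumed alternative: DateType columns also accept datetime.date objects,
# so the parsed datetime can be reduced to a date explicitly.
car_data = [
    (1, "Japan", datetime.strptime("2023-01-01", "%Y-%m-%d").date()),
    (2, "Italy", date(2023, 1, 1)),   # constructing the date directly also works
    (3, "France", date(2023, 1, 1))
]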

Complete code

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType, DateType
from pyspark.sql.types import StructType, StructField
from datetime import datetime

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()

# The date strings are parsed into datetime objects, which DateType accepts
car_data = [
    (1, "Japan", datetime.strptime("2023-01-01", "%Y-%m-%d")),
    (2, "Italy", datetime.strptime("2023-01-01", "%Y-%m-%d")),
    (3, "France", datetime.strptime("2023-01-01", "%Y-%m-%d"))
]
car_data_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("country_origin", StringType(), True),
    StructField("car_make_year", DateType(), True)
])
car_df = spark.createDataFrame(data=car_data, schema=car_data_schema)
car_df.printSchema()
car_df.show()

In the printSchema() output you can see that the datatype of the column is now date:

root
 |-- si_no: integer (nullable = true)
 |-- country_origin: string (nullable = true)
 |-- car_make_year: date (nullable = true)
+-----+--------------+-------------+
|si_no|country_origin|car_make_year|
+-----+--------------+-------------+
|    1|         Japan|   2023-01-01|
|    2|         Italy|   2023-01-01|
|    3|        France|   2023-01-01|
+-----+--------------+-------------+
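
As mentioned earlier, the same TypeError appears for timestamp fields. Below is a minimal sketch, assuming the same Spark session as above, of passing datetime objects into a TimestampType column (the event_time column and its values are illustrative, not from the original example):

from datetime import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

event_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("event_time", TimestampType(), True)
])
# TimestampType, like DateType, rejects plain strings; pass datetime objects instead
event_data = [
    (1, datetime.strptime("2023-01-01 10:30:00", "%Y-%m-%d %H:%M:%S")),
    (2, datetime.strptime("2023-01-02 11:45:00", "%Y-%m-%d %H:%M:%S"))
]
event_df = spark.createDataFrame(data=event_data, schema=event_schema)
event_df.printSchema()  # event_time shows up as timestamp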
