We specify inferSchema=True when reading a CSV file. With this option set, Spark determines the data type of each column from the values stored in it. However, because Spark cannot reliably deduce a schema for date and timestamp fields, it often reads these columns as strings instead. In this recipe we look at several approaches to solving this problem.
Here we explain both the DataFrame method and the Spark SQL way of converting a string column to the date data type.
pyspark.sql.functions.to_date
Converts a Column into pyspark.sql.types.DateType using the optionally supplied format. The format should be specified using date/time pattern strings. If the format is omitted, it follows the casting conventions to pyspark.sql.types.DateType; in that case it is equivalent to col.cast("date").
Sample code to show how to_date works
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date
from pyspark.sql.types import StringType, IntegerType
from pyspark.sql.types import StructType, StructField
spark = SparkSession.builder.appName('www.freshers.in training : to_date').getOrCreate()
car_data = [
(1,"Japan","2023-01-11"),
(2,"Italy","2023-04-21"),
(3,"France","2023-05-22"),
(4,"India","2023-07-18"),
(5,"USA","2023-08-23"),
]
car_data_schema = StructType([
StructField("si_no",IntegerType(),True),
StructField("country_origin",StringType(),True),
StructField("car_make_year",StringType(),True)
])
car_df = spark.createDataFrame(data=car_data, schema=car_data_schema)
car_df.printSchema()
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- car_make_year: string (nullable = true)
Applying to_date function
car_df_updated = car_df.withColumn("car_make_year_dt",to_date("car_make_year"))
car_df_updated.show()
+-----+--------------+-------------+----------------+
|si_no|country_origin|car_make_year|car_make_year_dt|
+-----+--------------+-------------+----------------+
| 1| Japan| 2023-01-11| 2023-01-11|
| 2| Italy| 2023-04-21| 2023-04-21|
| 3| France| 2023-05-22| 2023-05-22|
| 4| India| 2023-07-18| 2023-07-18|
| 5| USA| 2023-08-23| 2023-08-23|
+-----+--------------+-------------+----------------+
Check the printed schema: you can see the date data type for the new column car_make_year_dt.
car_df_updated.printSchema()
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- car_make_year: string (nullable = true)
|-- car_make_year_dt: date (nullable = true)
The same conversion can be done the SQL way by first creating a temp view:
car_df.createOrReplaceTempView("car_table")
spark.sql("select si_no,country_origin, to_date(car_make_year) from car_table").show()
+-----+--------------+----------------------------------+
|si_no|country_origin|to_date(car_table.`car_make_year`)|
+-----+--------------+----------------------------------+
| 1| Japan| 2023-01-11|
| 2| Italy| 2023-04-21|
| 3| France| 2023-05-22|
| 4| India| 2023-07-18|
| 5| USA| 2023-08-23|
+-----+--------------+----------------------------------+
To check the schema:
spark.sql("select si_no,country_origin, to_date(car_make_year) from car_table").printSchema()
root
|-- si_no: integer (nullable = true)
|-- country_origin: string (nullable = true)
|-- to_date(car_table.`car_make_year`): date (nullable = true)