How to transform a JSON Column to multiple columns based on Key in PySpark


Consider a situation where the incoming raw data has a JSON column, and you need to transform each key into a separate column for further analysis. Here we will learn:

  1. How to read a JSON column using PySpark?
  2. How to create the schema for a JSON column?
  3. How to turn each key into a column name, with the key's value as the column value?

Source Code

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, MapType, StringType, IntegerType

spark = SparkSession.builder.appName("json-column-to-columns").getOrCreate()

data = [
    (1, {"city": "Baltimore", "zip_code": 21201, "county": "Baltimore City"}, "USA"),
    (2, {"city": "East Case", "zip_code": 21202, "county": "Baltimore City"}, "USA"),
    (3, {"city": "Ruxton", "zip_code": 21204, "county": "Baltimore County"}, "USA"),
    (4, {"city": "Orchard Beach", "county": "Anne Arundel County"}, "USA"),
    (5, {"city": "Arbutus", "zip_code": 21227, "county": "Baltimore County"}, "USA"),
]

# city_info is modeled as a map of string keys to string values
schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("city_info", MapType(StringType(), StringType(), True)),
    StructField("country", StringType(), True),
])

df = spark.createDataFrame(data, schema)
df.show(20, False)
df.printSchema()

# Promote each key of the city_info map to its own column;
# rows missing a key (e.g. zip_code for si_no 4) get null
df2 = df.select(
    df.si_no,
    df.city_info.city.alias("city"),
    df.city_info.zip_code.cast(IntegerType()).alias("zip_code"),
    df.city_info.county.alias("county"),
    df.country,
)
df2.show(20, False)
df2.printSchema()


Execution Result

PySpark transform JSON Key to Columns
