How to transform a JSON Column to multiple columns based on Key in PySpark


Consider a situation where the incoming raw data has a JSON column, and you need to transform each key into a separate column for further analysis. Here we will learn:

  1. How to read a JSON column using PySpark?
  2. How to create the schema for a JSON column?
  3. How to turn each key into a column name, with the key's value as the column value?

Source Code

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, MapType, StringType, IntegerType

spark = SparkSession.builder.appName("json-column-to-columns").getOrCreate()

data = [
    (1, {"city": "Baltimore", "zip_code": 21201, "county": "Baltimore City"}, "USA"),
    (2, {"city": "East Case", "zip_code": 21202, "county": "Baltimore City"}, "USA"),
    (3, {"city": "Ruxton", "zip_code": 21204, "county": "Baltimore County"}, "USA"),
    (4, {"city": "Orchard Beach", "county": "Anne Arundel County"}, "USA"),
    (5, {"city": "Arbutus", "zip_code": 21227, "county": "Baltimore County"}, "USA"),
]

# city_info is modeled as a map of string keys to string values
schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("city_info", MapType(StringType(), StringType(), True)),
    StructField("country", StringType(), True),
])

df = spark.createDataFrame(data, schema)
df.show(20, False)
df.printSchema()

# Promote each key of the city_info map to its own column;
# rows missing a key (e.g. zip_code for si_no 4) get null
df2 = df.select(
    df.si_no,
    df.city_info.city.alias("city"),
    df.city_info.zip_code.cast(IntegerType()).alias("zip_code"),
    df.city_info.county.alias("county"),
    df.country,
)
df2.show(20, False)
df2.printSchema()


Execution Result

PySpark transform JSON Key to Columns
