Converts a column containing a StructType, ArrayType or a MapType into a JSON string-PySpark(to_json)

PySpark @ Freshers.in

You can convert a column containing a StructType, ArrayType or a MapType into a JSON string using to_json function.

pyspark.sql.functions.to_json

Syntax : to_json(col: ColumnOrName, options: Optional[Dict[str, str]] = None) → pyspark.sql.column.Column

col : Column or str

options : dict, optional ( options to control converting. accepts the same options as the JSON data source )

to_json : Converts a column containing a StructType, ArrayType or a MapType into a JSON string. Throws an exception, in the case of an unsupported type.

Sample Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,MapType
veh_data = ([
(1,{"si_no":1,"vehicle_make":"Cadillac","make_year":2022}),\
(2,{"si_no":2,"vehicle_make":"Tesla","make_year":2022}),\
(3,{"si_no":3,"vehicle_make":"Lincoln","make_year":2021}),\
(4,{"si_no":4,"vehicle_make":"Ford","make_year":2022})])
veh_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("vehicle_details",MapType(StringType(),StringType()),True)])
veh_df = spark.createDataFrame(data=veh_data,schema=veh_schema)
veh_df.show(20,False)
+-----+---------------------------------------------------------+
|si_no|vehicle_details                                          |
+-----+---------------------------------------------------------+
|1    |[vehicle_make -> Cadillac, make_year -> 2022, si_no -> 1]|
|2    |[vehicle_make -> Tesla, make_year -> 2022, si_no -> 2]   |
|3    |[vehicle_make -> Lincoln, make_year -> 2021, si_no -> 3] |
|4    |[vehicle_make -> Ford, make_year -> 2022, si_no -> 4]    |
+-----+---------------------------------------------------------+

Applying to_json

veh_df2 = veh_df.select(veh_df.si_no,to_json(veh_df.vehicle_details).alias("json"))
veh_df2.show(20,False)
+-----+----------------------------------------------------------+
|si_no|json                                                      |
+-----+----------------------------------------------------------+
|1    |{"vehicle_make":"Cadillac","make_year":"2022","si_no":"1"}|
|2    |{"vehicle_make":"Tesla","make_year":"2022","si_no":"2"}   |
|3    |{"vehicle_make":"Lincoln","make_year":"2021","si_no":"3"} |
|4    |{"vehicle_make":"Ford","make_year":"2022","si_no":"4"}    |
+-----+----------------------------------------------------------+

Let’s compare the schema difference 

veh_df  : Will have mapType 

veh_df.printSchema()
root
 |-- si_no: integer (nullable = true)
 |-- vehicle_details: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

veh_df2 : Will have StringType

veh_df2.printSchema()
root
 |-- si_no: integer (nullable = true)
 |-- json: string (nullable = true)
Author: user

Leave a Reply