You can convert a column containing a StructType, ArrayType or a MapType into a JSON string using to_json function.
pyspark.sql.functions.to_json
Syntax : to_json(col: ColumnOrName, options: Optional[Dict[str, str]] = None) → pyspark.sql.column.Column
col : Column or str
options : dict, optional ( options to control converting. accepts the same options as the JSON data source )
to_json : Converts a column containing a StructType, ArrayType or a MapType into a JSON string. Throws an exception, in the case of an unsupported type.
Sample Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,MapType
veh_data = ([
(1,{"si_no":1,"vehicle_make":"Cadillac","make_year":2022}),\
(2,{"si_no":2,"vehicle_make":"Tesla","make_year":2022}),\
(3,{"si_no":3,"vehicle_make":"Lincoln","make_year":2021}),\
(4,{"si_no":4,"vehicle_make":"Ford","make_year":2022})])
veh_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("vehicle_details",MapType(StringType(),StringType()),True)])
veh_df = spark.createDataFrame(data=veh_data,schema=veh_schema)
veh_df.show(20,False)
+-----+---------------------------------------------------------+
|si_no|vehicle_details |
+-----+---------------------------------------------------------+
|1 |[vehicle_make -> Cadillac, make_year -> 2022, si_no -> 1]|
|2 |[vehicle_make -> Tesla, make_year -> 2022, si_no -> 2] |
|3 |[vehicle_make -> Lincoln, make_year -> 2021, si_no -> 3] |
|4 |[vehicle_make -> Ford, make_year -> 2022, si_no -> 4] |
+-----+---------------------------------------------------------+
Applying to_json
veh_df2 = veh_df.select(veh_df.si_no,to_json(veh_df.vehicle_details).alias("json")) veh_df2.show(20,False)
+-----+----------------------------------------------------------+
|si_no|json |
+-----+----------------------------------------------------------+
|1 |{"vehicle_make":"Cadillac","make_year":"2022","si_no":"1"}|
|2 |{"vehicle_make":"Tesla","make_year":"2022","si_no":"2"} |
|3 |{"vehicle_make":"Lincoln","make_year":"2021","si_no":"3"} |
|4 |{"vehicle_make":"Ford","make_year":"2022","si_no":"4"} |
+-----+----------------------------------------------------------+
Let’s compare the schema difference
veh_df : Will have mapType
veh_df.printSchema()
root
|-- si_no: integer (nullable = true)
|-- vehicle_details: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
veh_df2 : Will have StringType
veh_df2.printSchema()
root
|-- si_no: integer (nullable = true)
|-- json: string (nullable = true)