How can you convert a PySpark DataFrame to JSON?

PySpark @ Freshers.in

pyspark.sql.DataFrame.toJSON

There may be situations where you need to send a DataFrame as a file to a server or to cloud storage such as S3. In such cases it is commonly suggested to send the data as JSON. Spark provides a built-in DataFrame method for this: toJSON.

Converts a DataFrame into an RDD of strings. [ Please note: the converted result is an RDD, not a DataFrame. ]

On conversion, each row is turned into a JSON document, which becomes one element of the returned RDD.

Sample code 

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import IntegerType, StringType, FloatType

# Create (or reuse) a SparkSession; the pyspark shell already provides one as `spark`
spark = SparkSession.builder.appName("dataframe-to-json").getOrCreate()

# Sample employee rows: (si_no, name, salary, commission, address)
emp_data = [
    (1, "Wilson Sam", 10000, 12.0, "125960 W 112th PlLos Angeles, California(CA), 90045"),
    (2, "Twinkle Peter", 24000, 6.0, "38500 Beverly 66192 Blvd Los Angeles, California 90048"),
    (3, "McDonald John", 9000, 8.5, "33215 Overland Ave Cross Los Angeles, California(CA), 90034"),
    (4, "Yellow Jaison", 12000, 11.0, "321325 Overland Ave New Los Angeles, California(CA), 90034"),
    (5, "Joy Mike", 15000, 22.0, "523960 W 85th PL Base Los Angeles, California(CA), 90045"),
]

# Explicit schema; every column is nullable
emp_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("commission", FloatType(), True),
    StructField("address", StringType(), True),
])

state_data_df = spark.createDataFrame(data=emp_data, schema=emp_schema)
state_data_df.show(20, False)  # truncate=False prints the full column values
+-----+-------------+------+----------+-----------------------------------------------------------+
|si_no|name         |salary|commission|address                                                    |
+-----+-------------+------+----------+-----------------------------------------------------------+
|1    |Wilson Sam   |10000 |12.0      |125960 W 112th PlLos Angeles, California(CA), 90045        |
|2    |Twinkle Peter|24000 |6.0       |38500 Beverly 66192 Blvd Los Angeles, California 90048     |
|3    |McDonald John|9000  |8.5       |33215 Overland Ave Cross Los Angeles, California(CA), 90034|
|4    |Yellow Jaison|12000 |11.0      |321325 Overland Ave New Los Angeles, California(CA), 90034 |
|5    |Joy Mike     |15000 |22.0      |523960 W 85th PL Base Los Angeles, California(CA), 90045   |
+-----+-------------+------+----------+-----------------------------------------------------------+

Now we can convert the above DataFrame to an RDD of JSON strings, one per row:

state_data_rdd = state_data_df.toJSON()
state_data_rdd.collect()
['{"si_no":1,"name":"Wilson Sam","salary":10000,"commission":12.0,"address":"125960 W 112th PlLos Angeles, California(CA), 90045"}', 
'{"si_no":2,"name":"Twinkle Peter","salary":24000,"commission":6.0,"address":"38500 Beverly 66192 Blvd Los Angeles, California 90048"}', 
'{"si_no":3,"name":"McDonald John","salary":9000,"commission":8.5,"address":"33215 Overland Ave Cross Los Angeles, California(CA), 90034"}', 
'{"si_no":4,"name":"Yellow Jaison","salary":12000,"commission":11.0,"address":"321325 Overland Ave New Los Angeles, California(CA), 90034"}', 
'{"si_no":5,"name":"Joy Mike","salary":15000,"commission":22.0,"address":"523960 W 85th PL Base Los Angeles, California(CA), 90045"}']
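Each element returned by collect() is a plain Python string, not a dict. If you need to work with the values on the driver, one option is to parse the strings with the standard json module. The sketch below copies two of the strings from the output above; `records` and `payload` are hypothetical names, and collecting is assumed to be acceptable only because the result is small.

```python
import json

# Two of the strings returned by collect() above, copied verbatim
rows = [
    '{"si_no":1,"name":"Wilson Sam","salary":10000,"commission":12.0,"address":"125960 W 112th PlLos Angeles, California(CA), 90045"}',
    '{"si_no":2,"name":"Twinkle Peter","salary":24000,"commission":6.0,"address":"38500 Beverly 66192 Blvd Los Angeles, California 90048"}',
]

# Parse each JSON document into a Python dict
records = [json.loads(row) for row in rows]

# Optionally combine all rows into a single JSON array string
payload = json.dumps(records)

print(records[0]["name"])
print(type(records[0]["commission"]).__name__)
```

Combining the rows into one array is a design choice for small payloads; for large DataFrames it is usually better to keep the data distributed and write it out from Spark directly.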
Author: user
