Loading JSON schema from a JSON string in PySpark

In PySpark, you can load a JSON schema from a JSON string by parsing the string and building a schema object from it. This lets you define the schema for your data dynamically, which is useful when your data structure evolves or when you need the flexibility to handle different JSON structures.

1. Importing PySpark

First, make sure you have PySpark installed. You can install it using pip:

pip install pyspark

Import the necessary modules:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
import json

2. Creating a SparkSession

Create a SparkSession, the entry point for using PySpark:

spark = SparkSession.builder.appName("JSONSchema from JSONString at Freshers.in").getOrCreate()

3. Defining the JSON schema

Define your schema as a JSON string. The string uses the same layout Spark produces when it serializes a StructType: a "struct" type with a list of field definitions. Here's an example:

schema_json_string = """
{
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": true, "metadata": {}},
        {"name": "first_name", "type": "string", "nullable": true, "metadata": {}},
        {"name": "last_name", "type": "string", "nullable": true, "metadata": {}},
        {"name": "age", "type": "integer", "nullable": true, "metadata": {}},
        {"name": "salary", "type": "double", "nullable": true, "metadata": {}}
    ],
    "metadata": {}
}
"""

4. Creating a StructType schema

Parse the JSON schema string and create a StructType schema object:

schema_dict = json.loads(schema_json_string)
schema = StructType.fromJson(schema_dict)
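
To confirm the string was parsed the way you expect, you can inspect the resulting StructType before using it (a quick, optional check):

# Compact, human-readable form of the parsed schema
print(schema.simpleString())
# struct<id:int,first_name:string,last_name:string,age:int,salary:double>

# Or walk the individual fields
for field in schema.fields:
    print(field.name, field.dataType, field.nullable)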

5. Loading JSON Data with the schema

Now, you can load JSON data using the defined schema:

json_data = [
    {"id": 1, "first_name": "Sachin", "last_name": "Tendulkar", "age": 30, "salary": 50000.0},
    {"id": 2, "first_name": "Rajesh", "last_name": "Kanna", "age": 25, "salary": 60000.0},
    {"id": 3, "first_name": "Mahesh", "last_name": "Raj", "age": 35, "salary": 75000.0}
]

df = spark.createDataFrame(json_data, schema=schema)
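
The same StructType also works when reading JSON files from storage: passing it to the reader lets Spark skip schema inference, which is faster and keeps the column types predictable. A small sketch, where the file path is just a placeholder:

# "/path/to/people.json" is a hypothetical location; point it at your own data
df_from_file = spark.read.schema(schema).json("/path/to/people.json")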

6. Viewing the DataFrame

You can now perform various operations on the DataFrame, such as displaying the schema or showing the first few rows of data:

df.printSchema()
df.show()

Output

root
 |-- id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: double (nullable = true)

+---+----------+---------+---+-------+
| id|first_name|last_name|age| salary|
+---+----------+---------+---+-------+
|  1|    Sachin|Tendulkar| 30|50000.0|
|  2|    Rajesh|    Kanna| 25|60000.0|
|  3|    Mahesh|      Raj| 35|75000.0|
+---+----------+---------+---+-------+
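
Because the schema is just a JSON string, it is easy to keep it outside your code, for example in a small file checked into version control, and load it at runtime. A minimal sketch using the standard json module and a hypothetical file name schema.json:

# Write the schema string out once (hypothetical file name)
with open("schema.json", "w") as f:
    f.write(schema_json_string)

# Later, read it back and rebuild the StructType
with open("schema.json") as f:
    loaded_schema = StructType.fromJson(json.load(f))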