Loading JSON Schema from a JSON string in PySpark
In PySpark, you can load a JSON schema from a JSON string, allowing you to define the schema for your data dynamically. This is useful when your data structure evolves or when you need to handle different JSON structures without hard-coding the schema.
1. Importing PySpark
First, make sure you have PySpark installed. You can install it using pip:
pip install pyspark
Import the necessary modules:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
import json
2. Creating a SparkSession
Create a SparkSession, the entry point for using PySpark:
spark = SparkSession.builder.appName("JSONSchema from JSONString at Freshers.in").getOrCreate()
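If you are running PySpark locally rather than on a managed cluster, you may also need to set a master explicitly. A minimal sketch, assuming a single-machine local run:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")  # use all local cores; omit this on a managed cluster
    .appName("JSONSchema from JSONString at Freshers.in")
    .getOrCreate()
)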
3. Defining the JSON schema
Define your JSON schema as a JSON string that follows Spark's schema representation. Here's an example schema string:
schema_json_string = """
{
"type": "struct",
"fields": [
{"name": "id", "type": "integer", "nullable": true, "metadata": {}},
{"name": "first_name", "type": "string", "nullable": true, "metadata": {}},
{"name": "last_name", "type": "string", "nullable": true, "metadata": {}},
{"name": "age", "type": "integer", "nullable": true, "metadata": {}},
{"name": "salary", "type": "double", "nullable": true, "metadata": {}}
],
"metadata": {}
}
"""
4. Creating a StructType schema
Parse the JSON schema string and create a StructType schema object:
schema_dict = json.loads(schema_json_string)
schema = StructType.fromJson(schema_dict)
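If the schema string comes from an external source (a config file or a metadata service, for example), it can be worth wrapping the parsing step in a small helper that fails with a clear message. A minimal sketch; the helper name load_schema_from_string is our own, not a PySpark API:

import json
from pyspark.sql.types import StructType

def load_schema_from_string(schema_str: str) -> StructType:
    """Parse a JSON schema string into a StructType, with basic validation."""
    try:
        schema_dict = json.loads(schema_str)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Schema string is not valid JSON: {exc}") from exc
    if schema_dict.get("type") != "struct":
        raise ValueError("Top-level schema type must be 'struct'")
    return StructType.fromJson(schema_dict)

schema = load_schema_from_string(schema_json_string)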
5. Loading JSON data with the schema
Now, you can load JSON data using the defined schema:
json_data = [
{"id": 1, "first_name": "Sachin", "last_name": "Tendulkar", "age": 30, "salary": 50000.0},
{"id": 2, "first_name": "Rajesh", "last_name": "Kanna", "age": 25, "salary": 60000.0},
{"id": 3, "first_name": "Mahesh", "last_name": "Raj", "age": 35, "salary": 75000.0}
]
df = spark.createDataFrame(json_data, schema=schema)
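The same schema object also works when reading JSON from files, which avoids a costly schema-inference pass. A minimal sketch reusing the spark session and schema from the steps above; the path below is a placeholder for your own data location:

# Read newline-delimited JSON files with the predefined schema
df_from_files = (
    spark.read
    .schema(schema)
    .json("/data/employees.json")  # placeholder path
)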
6. Viewing the DataFrame
You can now perform various operations on the DataFrame, such as displaying the schema or showing the first few rows of data:
df.printSchema()
df.show()
Output:
root
|-- id: integer (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- age: integer (nullable = true)
|-- salary: double (nullable = true)
+---+----------+---------+---+-------+
| id|first_name|last_name|age| salary|
+---+----------+---------+---+-------+
| 1| Sachin|Tendulkar| 30|50000.0|
| 2| Rajesh| Kanna| 25|60000.0|
| 3| Mahesh| Raj| 35|75000.0|
+---+----------+---------+---+-------+
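Beyond inspecting the schema, any ordinary DataFrame operation works on the result. A small sketch of a filter and an aggregation over the sample data, reusing the df created above:

from pyspark.sql import functions as F

# Employees older than 28
df.filter(df.age > 28).show()

# Average salary across all rows
df.select(F.avg("salary").alias("avg_salary")).show()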