In this article, I’ll walk you through how to read JSON data from a Hive table using an example with multiple JSON objects as input. We’ll cover the following steps:
- Creating a Hive table to store the JSON data
- Loading JSON data into the Hive table
- Querying the Hive table to retrieve the JSON data
- Reading the JSON data using Apache Spark
- Loading NESTED JSON data into the Hive table
Step 1: Creating a Hive table to store the JSON data
To create a Hive table to store the JSON data, you’ll need to define the table schema. The schema should match the structure of the JSON data you’ll be loading into the table.
Here’s an example schema for a Hive table that can store JSON data with the following structure:
{ "id": "1",
"name": "Sachin",
"age": 30,
"email": "sachin@example.com"
}
CREATE EXTERNAL TABLE test.freshers_json_table
( id string,
name string,
age int,
email string )
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE LOCATION 's3://freshers-in/training/hive/json_sample/'
This will create a Hive table called test.freshers_json_table
with columns id
, name
, age
, and email
.
Note the ROW FORMAT SERDE and STORED AS clauses. These specify the format of the data stored in the table. In this case, we’re using the JsonSerDe
serde, which can handle JSON data.
Step 2: Loading JSON data into the Hive table
To load JSON data into the Hive table, you’ll need to use the LOAD DATA statement.
Assuming you have a file containing multiple JSON objects like this:
{
“id”: “1”,
“name”: “Sachin”,
“age”: 30,
“email”: “sachin@example.com”
}
{
“id”: “2”,
“name”: “Tony Dane”,
“age”: 25,
“email”: “tony.dane@example.com”
}
Actual JSON in file looks like
{"id": "1", "name": "Sachin", "age": 30, "email": "sachin@example.com"}
{"id": "2", "name": "Tony", "age": 25, "email": "tony@example.com"}
In the create statement LOCATION ‘s3://freshers-in/training/hive/json_sample/’ will load the data from this location.
Step 3: Querying the Hive table to retrieve the JSON data
To retrieve the JSON data from the Hive table, you can use a simple SELECT statement:
select * from test.freshers_json_table
Step 4: Reading the JSON data using Apache Spark
If you want to analyze the JSON data using Apache Spark, you can read it into a Spark DataFrame using the spark.read.json() method.
val spark = SparkSession.builder().appName("ReadJSONDataFromHiveTable").master("local[*]").getOrCreate()
val myJsonTableDF = spark.read.json("<code class="language-sql">test.freshers_json_table
") myJsonTableDF.show()
Step 5: Loading Nested JSON data into the hive tables
Parsing multiple nested JSON data in Hive can be a bit more complex, but it can be done using Hive’s get_json_object function in conjunction with the json_tuple function.
Let’s say we have a nested JSON data structure like this:
“id”: “1”,
“name”: “sachin”,
“age”: 30,
“email”: “sachin@example.com”,
“addresses”: [
{
“street”: “123MainSt”,
“city”: “Maryland”,
“state”: “MD”,
“zipcode”: “21210”
},
{
“street”: “456ElmSt”,
“city”: “Niche ville”,
“state”: “NY”,
“zipcode”: “67890”
}
]
}
In the file you need json in single file format. Each JSON object in each line. The above is just for your understanding. In file it should be like
{ "id": "1", "name": "sachin", "age": 30, "email": "sachin@example.com", "addresses": [ { "street": "123MainSt", "city": "Maryland", "state": "MD", "zipcode": "21210" }, { "street": "456ElmSt", "city": "Niche ville", "state": "NY", "zipcode": "67890" } ] }
CREATE TABLE statement for a Hive table that stores above JSON data.
CREATE EXTERNAL TABLE test.freshers_json_table2 ( id string, name string, age int, email string, addresses array<struct<street:string, city:string, state:string, zipcode:string>> ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION 's3://freshers-in/training/hive/json_nested_sample/';
In this statement, we’re defining a table with columns for id, name, age, email, and address. The address column is defined as a struct with fields for street, city, state, and zipcode. We’re using the JsonSerDe serde to serialize and deserialize JSON data. The TEXTFILE format is used to store the data in plain text files. If you prefer to use a different file format, you can specify a different storage format.Note that when inserting data into this table, you’ll need to provide the JSON data as a string. For example:
In this statement, we’re defining a table with columns for id, name, age, email, and addresses. The addresses column is defined as an array of structs with fields for street, city, state, and zipcode.
We’re using the JsonSerDe serde to serialize and deserialize JSON data. The TEXTFILE format is used to store the data in plain text files. If you prefer to use a different file format, you can specify a different storage format.
Note that the addresses column is defined as an array. This is because your JSON data contains multiple addresses for each person.
To browse the data you can have the select table as below.
SELECT
id,
name,
age,
email,
get_json_object(json_data, '$.addresses.street') as street,
get_json_object(json_data, '$.addresses.city') as city,
get_json_object(json_data, '$.addresses.state') as state,
get_json_object(json_data, '$.addresses.zipcode') as zipcode
FROM test.my_json_table2;
SELECT
id,
name,
age,
email,
address.street,
address.city
FROM
test.freshers_json_table2
LATERAL VIEW
explode(addresses) exploded_table AS address;
In you want to insert into hive you can do as follows. But in this case Hive table should not be an EXTERNAL table.
INSERT INTO my_json_table VALUES (
'1',
'Mike Walle',
30,
'mike.walle@example.com',
'[{"street":"John Herny MainSt","city":"Alabama","state":"CA","zipcode":"65252"},{"street":"456ElmSt","city":"Jachsonville","state":"OH","zipcode":"54854"}]'
);
Hive important pages to refer