Hive : Understanding Array Aggregation in Apache Hive

Hive @ Freshers.in

Apache Hive offers many inbuilt functions to process data, among which collect_list() and collect_set() are commonly used to perform array aggregation. These functions aggregate the input values into an array. In the case of an empty input, they return an empty array. This article will provide a step-by-step guide on how to utilize these functions.

Step 1: Data Definition

Let’s create a table named product_sales with three columns, salesperson_id, product_id, and product_name. Here’s the Data Definition Language (DDL) statement to create the table:

CREATE TABLE freshers_product_sales (
  salesperson_id INT, 
  product_id INT,
  product_name STRING
);

Step 2: Inserting Data

Once the table is ready, let’s populate it with some data:

INSERT INTO TABLE freshers_product_sales (salesperson_id, product_id, product_name) VALUES (1, 101, 'ProductAlpha');
INSERT INTO TABLE freshers_product_sales (salesperson_id, product_id, product_name) VALUES (1, 102, 'ProductBeta');
INSERT INTO TABLE freshers_product_sales (salesperson_id, product_id, product_name) VALUES (2, 103, 'ProductCharlie');
INSERT INTO TABLE freshers_product_sales (salesperson_id, product_id, product_name) VALUES (2, 104, 'ProductDelta');
INSERT INTO TABLE freshers_product_sales (salesperson_id, product_id, product_name) VALUES (2, 105, 'ProductEniga');

Step 3: Applying Array Aggregation Functions

Now that we have our data, let’s see how to use the collect_list() and collect_set() functions to gather the products sold by each salesperson into an array:

SELECT salesperson_id, collect_list(product_name) as products_sold
FROM freshers_product_sales
GROUP BY salesperson_id;

This query will return an array of product_name for each salesperson_id. collect_list() can contain duplicate values if a salesperson sold the same product multiple times.

If you want to avoid duplicates in the result, you can use collect_set() instead. However, note that collect_set() does not guarantee the order of elements:

SELECT salesperson_id, collect_set(product_name) as products_sold
FROM freshers_product_sales
GROUP BY salesperson_id;

Results

The results of these queries could look something like:

For the collect_list() function:

salesperson_id    products_sold
1                 ["ProductAlpha", "ProductBeta"]
2                 ["ProductCharlie", "ProductDelta", "ProductEniga"]
For the collect_set() function:
salesperson_id    products_sold
1                 ["ProductAlpha", "ProductBeta"]
2                 ["ProductCharlie", "ProductDelta", "ProductEniga"]

Hive important pages to refer

  1. Hive
  2. Hive Interview Questions
  3. Hive Official Page
  4. Spark Examples
  5. PySpark Blogs
  6. Bigdata Blogs
  7. Spark Interview Questions
  8. Spark Official Page
Author: user

Leave a Reply