How to insert from a non-partitioned table into a partitioned table in Hive?

Hive @ Freshers.in

You can insert data from a non-partitioned table into a partitioned table. In short, if you want to build a partitioned table from raw data, you can use the queries below.

Partitioned table: Hive organizes tables into partitions. A partitioned table is one whose data is divided into related parts based on the values of its partition columns.

Sample Data

36925,Montgomery,US
86556,Phoenix,US
06928,Connecticut,US
39901,Atlanta,US
67954,Kansas,US
A2H9Z9,Corner_Brook,CA
B2A9Z9,Sydney,CA
A0E9Z9,Newfoundland_Average,CA

1. Create a Hive table for your data (table name: world_zip_codes)

-- Raw, non-partitioned table over the comma-delimited sample files
CREATE TABLE world_zip_codes
(
  zip_code string,
  state_capital string,
  country string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/freshers_in/inData/zip_data'
;

2. Create the partitioned table (table name: partitioned_world_zip_codes) using the required partition column (country)

-- country is declared only in PARTITIONED BY, not in the column list
CREATE TABLE partitioned_world_zip_codes
(
  zip_code string,
  state_capital string
)
PARTITIONED BY (country string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
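
Before loading any data, it can help to confirm how Hive registered the partition column. The statement below is a minimal check, assuming the table from step 2 was created in the current database.

```sql
-- Lists zip_code, state_capital and country, followed by a
-- "# Partition Information" section that names country as the partition column
DESCRIBE partitioned_world_zip_codes;
```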

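3. Insert into the partitioned table partitioned_world_zip_codes one partition column value at a time. INSERT OVERWRITE replaces any data already in the target partition, while INSERT INTO appends to it.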

INSERT OVERWRITE TABLE partitioned_world_zip_codes
PARTITION (country='US')
SELECT zip_code, state_capital
FROM world_zip_codes
WHERE country = 'US';

INSERT INTO TABLE partitioned_world_zip_codes
PARTITION (country='CA')
SELECT zip_code, state_capital
FROM world_zip_codes
WHERE country = 'CA';
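
The two statements above name each partition value explicitly (static partitioning). When there are many country values, writing one INSERT per value becomes tedious. As an alternative that the steps above do not use, Hive's dynamic partitioning can derive the partition from the data in a single statement. The sketch below assumes dynamic partitioning is allowed on your cluster and that nonstrict mode is acceptable for the session. Note that in this form the partition column does appear in the SELECT, and it must come last.

```sql
-- Assumption: these session settings are permitted in your environment
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- No value is given for country in the PARTITION clause;
-- Hive takes it from the last column of the SELECT
INSERT OVERWRITE TABLE partitioned_world_zip_codes
PARTITION (country)
SELECT zip_code, state_capital, country
FROM world_zip_codes;
```

With INSERT OVERWRITE, only the partitions that receive rows from the query are replaced.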

Things to remember

In Hive, partition columns are managed as table metadata. Their values are not stored in the data files; they are encoded in the sub-directory names instead (for example, country=US).

If your source table has 9 columns and you partition on 1 of them, the data files of the partitioned table hold 8 columns, and the partition column lives in the sub-directory name. Therefore, when you insert into a specific partition as in step 3, your SELECT must leave out the partition column and list only the 8 remaining columns.

If you select all 9 columns (which is wrong), Hive throws an error such as “SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/types are different: Table insclause-0 has 8 columns, but query has 9 columns.”
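
Applied to the sample tables in this post, the same rule means a static-partition insert takes 2 columns in the SELECT, not 3. The sketch below shows the mistaken form as a comment so that the block stays runnable, then checks the result. It assumes the inserts from step 3 have already run.

```sql
-- Mistaken form: country is fixed in the PARTITION clause AND selected,
-- so Hive is expected to reject it with a column-count SemanticException
-- INSERT INTO TABLE partitioned_world_zip_codes
-- PARTITION (country='CA')
-- SELECT zip_code, state_capital, country
-- FROM world_zip_codes WHERE country = 'CA';

-- Each partition appears as a key=value entry, matching its sub-directory name
SHOW PARTITIONS partitioned_world_zip_codes;

-- The partition column is not in the data files, but it can still be
-- selected and filtered on like any other column
SELECT zip_code, state_capital, country
FROM partitioned_world_zip_codes
WHERE country = 'CA';
```

After step 3, SHOW PARTITIONS should list country=US and country=CA.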

Author: user