Explain how can you implement dynamic partitioning in Hive (automatically creating partition based on column value)

Dynamic partition in hive

          Dynamic partitioning is a tactical method for loading data from a non-partitioned table. Where there are large number of partition values , then its is hard to mention each partition manually. Dynamic partition columns are used mostly on columns whose values are only known at EXECUTION TIME. Dynamic Partition columns are specified the same way as it is for static partition columns, in the partition-clause. The main difference is that dynamic partition columns do not have values while static partition columns have. In the partition-clause we need to specify all partitioning columns, even though all of them are dynamic partition columns. All dynamic partition columns, only allowed in nonstrict mode. In strict mode, it should throw an error. The dynamic partitioning column’s type should be derived from the expression. The data type has to be able to be converted to a string in order to be saved as a directory name in HDFS. After converting column value to string, we still need to convert the string value to a valid directory name. Some reasons are (a)  String length is unlimited in theory, but HDFS/local FS directory name length is limited. (b) String value could contains special characters that is reserved in FS path names (such as ‘/’ or ‘..’ etc ). If there are multiple partitioning columns, their order is significant since that translates to the directory structure in HDFS / S3 etc.

Sample Data

144,Tim John,11,M,max,UK,262
145,Elon Musk,77,F,mid,USA,273
146,Jerry Jhones,17,M,max,IND,274
147,Nick Cater,52,M,mid,USA,275
148,Briam Rodgers ,59,F,mid,USA,276

Create External Table

create external table clients_pii_tbl (
id1 int,
name string,
age int,
sex string,
rge string,
country string,
id2 int) 
row format delimited 
fields terminated by ','
location 's3://freshers-in-data/hive-data/dynamicsample' ;

Create a partition table

create external table clients_pii_dynamic_tbl (
id1 int,
name string,
age int,
sex string,
rge string,
id2 int) 
partitioned by (country string)
row format delimited 
fields terminated by ',';

In the above partitioning columns should be mentioned as partitioned by , and should not used as fields name . If used will get an error as FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns.

Hive Setting 

The below need to be enabled. If using hive console, then paste this in hive terminal.

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nostrict;
set hive.exec.dynamic.partition.pernode = 3;
set hive.exec.max.created.files = 100;

Execute the insert script

insert into table clients_pii_dynamic_tbl 
partition(country)
select id1 ,name ,age ,sex ,rge ,id2,country 
from clients_pii_tbl;  

Loading Logs

Result of the above query
Loading data to table sac01.t3p_dp1 partition (country=null)
         Time taken for load dynamic partitions : 1318
        Loading partition {country=UK}
        Loading partition {country=CHN}
        Loading partition {country=USA}
        Loading partition {country=IND}

Result

show partitions t3p_dp1;
OK
country=IND
country=UK
country=USA
Time taken: 0.108 seconds, Fetched: 4 row(s)

Reference

  1. Hive reference articles
  2. Hive interview questions
  3. Spark Examples
  4. PySpark Blogs
  5. Bigdata Blogs
  6. Spark Interview Questions
Author: user

Leave a Reply