1. Partitioning: Partitioning divides the data into smaller segments. The segments are created by logical grouping on one or more columns, and the partition column should have low cardinality (a small number of distinct values).
2. Bucketing: Bucketing also divides the data into smaller segments, but the segments are created by applying a system-defined hash function to the bucketing column. Use bucketing when the column has high cardinality. Each partition is stored as a directory, and each bucket is stored as a file under that directory. For both partitioning and bucketing, choose columns that are frequently used in queries; the two techniques can be combined in the same table. The aim is to scan only one chunk of the data and ignore the rest, which gives a large performance gain.
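To make the directory-and-file layout concrete, here is a minimal Python sketch of a table partitioned by a low-cardinality column and bucketed by a high-cardinality one. It is a toy model, not Hive code: the column names `country` and `user_id`, the bucket count, the file naming, and the use of a plain modulo as the hash are all assumptions made for illustration.

```python
# Toy model of a table partitioned by `country` and bucketed by `user_id`.
# The hash (a plain modulo) and the file naming are simplified assumptions.

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a high-cardinality integer key to a bucket number."""
    return user_id % num_buckets


def file_for(country: str, user_id: int) -> str:
    """The partition value picks the directory, the bucket number picks the file."""
    return f"country={country}/bucket_{bucket_for(user_id)}"


rows = [("US", 101), ("US", 102), ("IN", 101), ("IN", 207), ("DE", 305)]

# "Write" the table: group rows by the file they would land in.
layout = {}
for country, user_id in rows:
    layout.setdefault(file_for(country, user_id), []).append((country, user_id))

# A query that filters on both columns has to read only one file.
target = file_for("IN", 207)
matches = [row for row in layout[target] if row == ("IN", 207)]
print(target, matches)
```

The lookup touches a single entry of `layout` and never looks at the other files, which is the "scan one chunk, ignore the rest" effect described above.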
3. Join Optimization
a. Map-side join: When a join is written so that the reducer has nothing to do, it is called a map-side join. It improves processing time, reduces data transfer across the cluster, and avoids the shuffle and sort between the map and reduce phases. One of the two tables has to be small, because a hash table is built from the small table and uploaded to HDFS. From HDFS it is broadcast to all the nodes and kept on the local disk of every machine; this mechanism is also called the distributed cache.
b. Bucket map join: This works on multiple big tables as well. Both tables must be bucketed on the join columns, and the number of buckets in one table must be an integral multiple of the number of buckets in the other.
c. Sort merge bucket (SMB) map join: This works on two big tables. In this case the number of buckets in both tables is exactly the same, and the data in both tables is sorted on the join columns in ascending order.
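The first two join strategies can be imitated in a few lines of Python. This is only a sketch of the idea, not how Hive implements it: the table contents, the function names, the bucket counts, and the modulo hash are hypothetical. The sorted variant in (c) is not covered here.

```python
# Toy simulation of a map-side join and a bucket map join.
# Table contents, names, and bucket counts are made up for illustration.

def map_side_join(big_rows, small_rows):
    """Hash the small table once, then stream the big table past it."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    # No shuffle or reduce step: each big-table row is matched locally.
    return [(key, big_val, small_val)
            for key, big_val in big_rows
            for small_val in lookup.get(key, [])]


def bucketize(rows, num_buckets):
    """Split rows into buckets by integer join key (simplified hash)."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def bucket_map_join(big_rows, other_rows, big_buckets=4, other_buckets=2):
    """Join bucket by bucket; one bucket count must be a multiple of the other."""
    if big_buckets % other_buckets != 0:
        raise ValueError("bucket counts must be integral multiples")
    big = bucketize(big_rows, big_buckets)
    other = bucketize(other_rows, other_buckets)
    result = []
    for i, bucket in enumerate(big):
        # Rows in bucket i can only match rows in bucket i % other_buckets.
        result.extend(map_side_join(bucket, other[i % other_buckets]))
    return result


orders = [(1, "order-a"), (2, "order-b"), (5, "order-c"), (8, "order-d")]
users = [(1, "alice"), (2, "bob"), (8, "carol")]

plain = sorted(map_side_join(orders, users))
bucketed = sorted(bucket_map_join(orders, users))
print(plain == bucketed, plain)
```

The integral-multiple rule is what makes the `i % other_buckets` pairing valid: a key that lands in bucket `i` of the 4-bucket table always lands in bucket `i % 2` of the 2-bucket table, so each mapper needs to load only one bucket of the other table rather than the whole of it.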
4. Use of a suitable file format: The Optimized Row Columnar (ORC) format is well suited to Hive. ORC stores data very efficiently and can reduce the size of the original data by up to 75%. It supports predicate push-down and uses lightweight encodings such as dictionary encoding, bit packing, delta encoding, and run-length encoding, along with general-purpose compression codecs such as Snappy.
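As a rough illustration of why the lightweight encodings shrink a column, the following Python sketch applies dictionary encoding and then run-length encoding to a small string column. It is a toy version under simplifying assumptions; real ORC writers are far more elaborate, and the sample column is invented.

```python
# Toy versions of two of the lightweight encodings named above.
# This only shows the idea; it is not the ORC on-disk format.

def dictionary_encode(values):
    """Replace each value with a small integer id into a dictionary."""
    dictionary = []
    index = {}
    ids = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


column = ["IN", "IN", "IN", "US", "US", "IN", "DE", "DE"]
dictionary, ids = dictionary_encode(column)
runs = run_length_encode(ids)

# Decoding reverses both steps and recovers the original column.
decoded = [dictionary[i] for i, count in runs for _ in range(count)]
print(dictionary, runs, decoded == column)
```

Eight strings become a three-entry dictionary plus four short runs, and the effect grows with longer, more repetitive columns.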
5. Vectorization: Vectorization improves performance by processing a batch of 1024 rows in a single operation instead of one row at a time. It speeds up operations such as filters, joins, and aggregations.
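The difference between row mode and batch mode can be sketched by counting how often an operator is invoked. The Python below is a simplified model: the batch size mirrors the 1024-row figure above, while the function names and data are hypothetical. In a real engine the gain comes from tight loops over column arrays, which pure Python cannot show; the sketch only makes the drop in per-call overhead visible.

```python
# Row-at-a-time versus batch-at-a-time filtering, counting operator calls.
# BATCH_SIZE mirrors the 1024-row figure above; the rest is illustrative.

BATCH_SIZE = 1024


def filter_row_mode(values, threshold):
    """Invoke the filter once per row."""
    calls = 0
    kept = []
    for value in values:
        calls += 1  # one operator call per row
        if value > threshold:
            kept.append(value)
    return kept, calls


def filter_batch_mode(values, threshold, batch_size=BATCH_SIZE):
    """Invoke the filter once per batch of rows."""
    calls = 0
    kept = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        calls += 1  # one operator call per batch
        kept.extend(value for value in batch if value > threshold)
    return kept, calls


values = list(range(5000))
row_result, row_calls = filter_row_mode(values, 4990)
batch_result, batch_calls = filter_batch_mode(values, 4990)
print(row_calls, batch_calls, row_result == batch_result)
```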
6. Changing the execution engine: Hive query performance can be improved by using Tez or Spark as the execution engine (selected with the hive.execution.engine property) instead of MapReduce, which is slower.
7. Windowing functions: Simplify query expressions by using windowing functions.
8. UDFs are not well optimized: Filter conditions are evaluated from left to right, so for best performance place UDF calls on the right-hand side of the WHERE clause, after the cheaper conditions.
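The effect of predicate order can be imitated with Python's short-circuiting `and`, which also evaluates left to right. This is an analogy under that assumption rather than a trace of Hive's behaviour; `expensive_udf`, the counter, and the sample rows are hypothetical stand-ins.

```python
# Short-circuit evaluation: cheap predicate first, costly UDF last.
# `expensive_udf` is a hypothetical stand-in for a user-defined function.

calls = {"udf": 0}


def expensive_udf(name: str) -> bool:
    """Pretend this is slow; count how often it is invoked."""
    calls["udf"] += 1
    return name.upper().startswith("A")


rows = [("alice", 2023), ("adam", 2019), ("bob", 2023), ("anna", 2023)]

# UDF on the left: it runs for every row.
calls["udf"] = 0
udf_first = [r for r in rows if expensive_udf(r[0]) and r[1] == 2023]
calls_udf_first = calls["udf"]

# UDF on the right: it runs only for rows that pass the cheap check.
calls["udf"] = 0
udf_last = [r for r in rows if r[1] == 2023 and expensive_udf(r[0])]
calls_udf_last = calls["udf"]

print(calls_udf_first, calls_udf_last, udf_first == udf_last)
```

Both orderings return the same rows, but the second one skips the UDF for every row the cheap year check has already rejected.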