One of the key features of Hive is the ability to define table properties, which can be used to control various aspects of table behavior and performance. In this article, we will discuss Hive table properties and how they can be used.
What are Hive Table Properties?
In Hive, a table property is a key-value pair that provides additional information about a table. Table properties can be set when a table is created or altered, and can be used to control a wide range of behaviors, including file format, storage location, partitioning, compression, and more.
Hive table properties can be classified into two categories: system-defined properties and user-defined properties. System-defined properties are predefined by Hive and are used to control the behavior of the Hive system itself. User-defined properties are defined by the user and are used to control the behavior of the user’s specific tables.
How are Hive Table Properties used?
Hive table properties can be used in a variety of ways. Some common use cases include:
File format: Hive supports several file formats, including text, sequence file, and ORC. The file format property can be used to specify the file format for a table.
For example, to create a table named freshers_in_tbl_property with the text file format, we can use the following SQL statement:
CREATE TABLE freshers_in_tbl_property (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Storage location: Hive stores table data in HDFS by default. However, it is possible to store table data in other file systems, such as Amazon S3 or Azure Blob Storage. The location property can be used to specify the storage location for a table.
For example, to create a table named freshers_in_ with the storage location in Amazon S3, we can use the following SQL statement:
CREATE TABLE freshers_in_tbl_property (id INT, name STRING)
LOCATION 's3://my-bucket/my-path/';
Partitioning: Hive supports partitioning of tables, which can improve query performance by allowing the system to skip scanning irrelevant data. The partitioned by property can be used to specify the partitioning columns for a table.
For example, to create a partitioned table named freshers_in_ partitioned by the year and month columns, we can use the following SQL statement:
CREATE TABLE freshers_in_tbl_property (id INT, name STRING, value DOUBLE)
PARTITIONED BY (year INT, month INT);
Compression: Hive supports several compression algorithms, including Snappy, LZO, and Gzip. The compressed by property can be used to specify the compression algorithm for a table.
For example, to create a table named freshers_in_tbl_property compressed with Snappy, we can use the following SQL statement:
CREATE TABLE freshers_in_tbl_property (id INT, name STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');
SerDe properties: Hive uses SerDes (Serialization Deserialization libraries) to read and write data from and to different file formats. The SerDe property can be used to specify the SerDe used for a table.
For example, to create a table named freshers_in_ with the Avro SerDe, we can use the following SQL statement:
CREATE TABLE freshers_in_tbl_property (
id INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
"namespace": "com.example",
"type": "record",
"name": "Freshers",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"}
]
}')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
In this example, we are creating a Hive table called freshers_in_ with two columns, id and name, both with data types of INT and STRING respectively. We are using the AvroSerDe (Serializer/Deserializer) to read and write data in the Avro format.
The SERDEPROPERTIES clause is used to specify the schema for the data in the Avro format. Here, we are defining the schema as a JSON string literal, with a namespace, type, record name, and fields.
Finally, we are specifying the input and output format of the data as AvroContainerInputFormat and AvroContainerOutputFormat respectively.
Overall, Hive table properties are a powerful way to customize how tables are stored, managed, and queried in Hive. The ability to specify the SerDe and schema for a table allows for greater flexibility in working with different data formats and structures.
Hive important pages to refer