Hive : How to preserve Hive metadata [Preserve the last DDL time for the table]

Hive @ Freshers.in

HOLD_DDLTIME

The “last DDL time” refers to the timestamp of the most recent DDL (Data Definition Language) operation that was performed on a table. DDL operations include commands such as CREATE, ALTER, DROP, TRUNCATE, and RENAME which are used to create, alter, or delete the structure of database objects like tables, procedures, or indices.

When a DDL operation is executed on a table, the “last DDL time” typically gets updated to reflect the time of this operation. However, if you set the HOLD_DDLTIME property to ‘true’, the last DDL time remains unchanged even after subsequent DDL operations.

In the context of Hive, this “last DDL time” is stored as a table property. You can check it by using the DESCRIBE FORMATTED <table_name> command, where it is listed under Table Parameters: as last_modified_time.

Understanding and Working with HOLD_DDLTIME in Apache Hive

Apache Hive is a data warehouse infrastructure tool that allows processing of structured and semi-structured data in Hadoop. It provides a SQL-like interface to data stored in Hadoop file systems, making it easy for developers and analysts to perform data queries and analysis. In this article, we’re going to delve into the feature named HOLD_DDLTIME which is used to preserve Hive metadata.

What is HOLD_DDLTIME?

The HOLD_DDLTIME property in Hive is a table-level property which, when set, helps preserve the last DDL time for the table. This is useful when you do not want Hive to alter the last DDL time after operations such as ALTER TABLE or MSCK REPAIR TABLE. This can be particularly important in use cases where you want to maintain consistency of metadata for auditing or regulatory compliance purposes.

Example

Now that we’ve got an understanding of what HOLD_DDLTIME is, let’s look at a working example on how to use it.

  1. First, let’s create a table in Hive:
CREATE TABLE employee(
  id int,
  name string,
  department string,
  salary float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
  1. Let’s say we’re altering this table, but we don’t want the last DDL time to change. We can set the HOLD_DDLTIME to true before executing the ALTER TABLE operation:
ALTER TABLE employee ADD COLUMNS (position string);
  1. Now, we can alter the table. For instance, let’s add a new column position:
ALTER TABLE employee ADD COLUMNS (position string);
  1. After the ALTER TABLE operation, Hive will not update the last DDL time because we’ve set the HOLD_DDLTIME property to true.
  2. To verify this, we can check the table properties:
DESCRIBE FORMATTED employee;

In the output, you can check the last_modified_time under Table Parameters: which will remain unchanged.

If you want to revert this behavior and allow Hive to update the last DDL time after subsequent operations, you can set HOLD_DDLTIME to false:

Using HOLD_DDLTIME in Apache Hive provides a level of control over how the system handles metadata changes. It is a valuable tool for preserving the integrity of data, particularly in cases where consistent historical metadata is a requirement.

Hive important pages to refer

  1. Hive
  2. Hive Interview Questions
  3. Hive Official Page
  4. Spark Examples
  5. PySpark Blogs
  6. Bigdata Blogs
  7. Spark Interview Questions
  8. Spark Official Page
Author: user

Leave a Reply