Hive: How to Delete Old Apache Hive Logs to Free Up Space and Boost Cluster Performance

Apache Hive logs are a critical component for debugging and performance optimization. However, over time, these logs can occupy significant disk space, leading to reduced performance. In such cases, it becomes necessary to delete older logs while ensuring that recent logs, which might be critical for troubleshooting recent issues, are preserved. This article provides a guide on how to delete Hive logs that are more than a month old to save space and enhance your cluster’s performance.

Prerequisites

Before starting, ensure you have:

  • Basic understanding of Linux command line
  • Basic understanding of Hadoop and Apache Hive
  • Access to your Hadoop and Hive servers
  • Necessary permissions to access and delete log files

Deleting Old Apache Hive Logs

Apache Hive does not provide a built-in command for purging old logs, but the task can be accomplished with a few commands. Below are the steps to remove logs older than a month.

Step 1: Identify the Hive Log Location

By default, Hive writes its log to /tmp/<user.name>/hive.log, where <user.name> is the user running Hive. Many installations override this, for example to the $HIVE_HOME/logs directory, so the location may vary with your installation and cluster configuration. The directory is set by the hive.log.dir property in Hive’s Log4j configuration file (hive-log4j2.properties in recent releases), while related settings such as hive.querylog.location are kept in hive-site.xml. If you are unsure, confirm the log directory with your Hive and Hadoop administrators.
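As a starting point for this check, the sketch below reads the hive.log.dir property from a Log4j properties file and prints its value. The function name hive_log_dir_from and the path in the usage comment are illustrative assumptions, not part of Hive. The value often contains placeholders such as ${sys:java.io.tmpdir} that Hive resolves at runtime, so treat the output as a hint rather than a final path.

```shell
#!/usr/bin/env bash
# Print the value of hive.log.dir from a Log4j properties file.
# Hypothetical helper: the function name and example path are assumptions.
hive_log_dir_from() {
    local props_file="$1"
    if [ ! -r "$props_file" ]; then
        echo "cannot read $props_file" >&2
        return 1
    fi
    # Accept both "hive.log.dir = value" and "property.hive.log.dir = value".
    # Commented lines do not match; only the last definition is printed.
    sed -n -E 's/^[[:space:]]*(property\.)?hive\.log\.dir[[:space:]]*=[[:space:]]*//p' "$props_file" | tail -n 1
}

# Example (the path is an assumption; adjust it to your installation):
#   hive_log_dir_from /etc/hive/conf/hive-log4j2.properties
```

If the function prints nothing, the property is not set in that file and the default location applies.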

Step 2: Backup the Logs (Optional)

Before deleting the logs, you may want to create a backup, especially if the logs are not already backed up. This step is optional but recommended to prevent accidental data loss.

You can create a backup using the cp command. Here is an example:

cp -r $HIVE_HOME/logs /freshers-in/oldlogs/backup
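Because the goal is to free disk space, a compressed archive may be preferable to a plain copy. The following is a minimal sketch under that assumption; the function name backup_logs and the archive naming scheme are hypothetical, and the paths in the usage comment simply reuse the ones from the cp example.

```shell
#!/usr/bin/env bash
# Archive a log directory into a dated .tar.gz before deleting anything.
# Hypothetical helper: the function name and archive name are assumptions.
backup_logs() {
    local src_dir="$1"
    local dest_dir="$2"
    local archive
    [ -d "$src_dir" ] || { echo "no such directory: $src_dir" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    archive="$dest_dir/hive-logs-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps paths inside the archive relative to the parent of src_dir.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1
    # Print the archive path so the caller can verify or record it.
    echo "$archive"
}

# Example:
#   backup_logs "$HIVE_HOME/logs" /freshers-in/oldlogs/backup
```

Listing the archive with tar -tzf before deleting anything is a cheap way to confirm the backup is readable.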

Step 3: Craft the Hive Deletion Script

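Once you’ve determined the log directories and retention period, it’s time to craft a Hive script that performs the deletion. Hive’s dfs command passes its arguments to the Hadoop filesystem shell, which can match files by name but cannot filter them by age. The script therefore relies on the date suffix that log rotation adds to each file (hive.log.yyyy-MM-dd in a typical daily-rotation setup) and removes one month of logs per run.

Below is an example Hive script that deletes the rotated Hive logs for a given month. Choose a month that lies entirely outside the retention period:

-- Make sure you are running this as a user with appropriate permissions.
-- Usage (the script file name is only an example):
--   hive --hivevar purge_month=<yyyy-MM> -f delete_old_hive_logs.hql

SET mapred.job.queue.name=<YOUR_YARN_QUEUE_NAME>;

-- Directory that holds the rotated logs. The path is resolved against the
-- default filesystem (normally HDFS); use a file:/// URI for local disk.
SET hivevar:hive_logs_dir=/freshers-in/logs;

-- Delete every rotated log whose date suffix falls in the given month
dfs -rm ${hivevar:hive_logs_dir}/hive.log.${hivevar:purge_month}-*;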
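For logs that live on the local filesystem, the same retention rule can be sketched in shell with find, which filters by file age directly. This is an alternative under stated assumptions, not the method above: the function name purge_old_hive_logs and the DRY_RUN switch are hypothetical, rotated files are assumed to be named hive.log.<suffix>, and -maxdepth and -delete are extensions available in GNU and BSD find.

```shell
#!/usr/bin/env bash
# Delete rotated Hive logs older than a retention period on the local disk.
# Hypothetical helper: function name, arguments, and DRY_RUN are assumptions.
purge_old_hive_logs() {
    local logs_dir="$1"
    local retention_days="${2:-30}"
    [ -d "$logs_dir" ] || { echo "no such directory: $logs_dir" >&2; return 1; }
    # -name 'hive.log.*' skips the active hive.log file.
    # -mtime +N matches files whose age, rounded down to whole days, exceeds N.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only list what would be removed.
        find "$logs_dir" -maxdepth 1 -type f -name 'hive.log.*' \
            -mtime +"$retention_days" -print
    else
        find "$logs_dir" -maxdepth 1 -type f -name 'hive.log.*' \
            -mtime +"$retention_days" -print -delete
    fi
}

# Example (the directory is an assumption):
#   purge_old_hive_logs /freshers-in/logs 30            # dry run, lists files
#   DRY_RUN=0 purge_old_hive_logs /freshers-in/logs 30  # actually deletes
```

Defaulting to a dry run is a deliberate choice here: the list can be reviewed before anything is removed.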

Step 4: Run the Hive Deletion Script

Execute the Hive script you created in Step 3 using the Hive CLI or any other Hive interface available in your cluster. Ensure that you have the necessary permissions to perform the deletion operation. Depending on the volume of logs and hardware resources, the process might take some time to complete.
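One way to run the script from a shell and capture whether it succeeded is sketched below. It assumes the hive command is on the PATH and that the script was saved to a file; the function name run_hive_script and the file name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Run a HiveQL script file with the Hive CLI and report the outcome.
# Hypothetical helper: function name and script file name are assumptions.
run_hive_script() {
    local script_file="$1"
    local rc
    [ -r "$script_file" ] || { echo "cannot read $script_file" >&2; return 1; }
    shift
    # Extra arguments (for example --hivevar name=value) are passed through.
    if hive "$@" -f "$script_file"; then
        echo "OK: $script_file"
    else
        rc=$?
        echo "FAILED ($rc): $script_file" >&2
        return "$rc"
    fi
}

# Example (the file name is an assumption):
#   run_hive_script delete_old_hive_logs.hql
```

On clusters that use Beeline instead of the Hive CLI, the equivalent call would be beeline with a JDBC URL and the same -f option.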

Step 5: Monitor and Verify

After the script execution, monitor the process to ensure logs are deleted as expected. Verify that the logs older than one month are no longer present in the specified directories. Additionally, keep an eye on the cluster performance and disk space usage to confirm the improvements.
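The verification can be sketched as a small check that counts rotated logs still older than the retention period, followed by the standard du and df commands for disk usage. The function name count_stale_hive_logs and the directory in the usage comment are assumptions, and the check applies to logs on the local filesystem.

```shell
#!/usr/bin/env bash
# Count rotated Hive logs that are still older than the retention period.
# Hypothetical helper: function name and arguments are assumptions.
count_stale_hive_logs() {
    local logs_dir="$1"
    local retention_days="${2:-30}"
    [ -d "$logs_dir" ] || { echo "no such directory: $logs_dir" >&2; return 1; }
    # Same filter as a find-based cleanup; tr strips padding some wc builds add.
    find "$logs_dir" -maxdepth 1 -type f -name 'hive.log.*' \
        -mtime +"$retention_days" | wc -l | tr -d ' '
}

# Example check after a cleanup (the directory is an assumption):
#   count_stale_hive_logs /freshers-in/logs 30   # 0 means nothing stale remains
#   du -sh /freshers-in/logs                     # space still used by logs
#   df -h /freshers-in/logs                      # free space on that filesystem
```

Recording the du and df figures before and after the cleanup gives a concrete measure of the space recovered.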

Author: user