Apache Hive users often find that running Hive queries from different directories creates a new metastore_db in each directory. This article explains the reason behind this behavior and offers guidance on how to manage it effectively.
Why Does Hive create metastore_db in each directory?
Default Embedded Derby database
In a default setup, Hive keeps its metastore in Apache Derby, an embedded database. Embedded Derby is intended for lightweight, single-user use: only one process can open a given database at a time.
Working Directory Dependent
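When you run a Hive query, Hive looks for the metastore in the current working directory, because the default Derby connection URL (the javax.jdo.option.ConnectionURL property) names the database with a relative path, metastore_db. If it doesn’t find an existing metastore_db there, it creates a new one. This is why executing Hive queries from different directories results in multiple metastore_db instances.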
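To make the working-directory dependence concrete: Hive's default connection URL is jdbc:derby:;databaseName=metastore_db;create=true, and the database name in it is a relative path. The shell sketch below only mimics how such a name resolves; it does not call Hive or Derby. The helper name derby_db_dir and the example directories are hypothetical, and the sketch assumes derby.system.home is not set, in which case Derby resolves relative names against the JVM's working directory.

```shell
#!/bin/sh
# Hypothetical helper: print the directory an embedded Derby URL ($1) would
# use when the JVM is started from working directory $2.
derby_db_dir() {
  url=$1
  cwd=$2
  name=${url#*databaseName=}   # drop everything up to "databaseName="
  name=${name%%;*}             # drop trailing attributes such as ";create=true"
  case "$name" in
    /*) printf '%s\n' "$name" ;;                 # absolute path: same from anywhere
    *)  printf '%s/%s\n' "${cwd%/}" "$name" ;;   # relative path: follows the cwd
  esac
}

default_url='jdbc:derby:;databaseName=metastore_db;create=true'
derby_db_dir "$default_url" /home/alice/project_a
derby_db_dir "$default_url" /home/alice/project_b
```

The two calls print different paths, one under each project directory, which is why each directory ends up with its own metastore.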
Implications of Multiple metastore_db Instances
- Inconsistency: Each metastore_db instance holds its own metadata, so tables and databases created from one directory are not visible when Hive is run from another.
- Space Utilization: Each new metastore_db consumes disk space, potentially leading to inefficient space usage.
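To see how many stray copies already exist and how much space they take, something like the following shell sketch can help. The function name list_metastore_dbs is hypothetical; the sketch assumes a POSIX find and du, and it only reports directories without deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: list every metastore_db directory under a root ($1,
# default: current directory) with its disk usage in kilobytes.
list_metastore_dbs() {
  root=${1:-.}
  # -prune stops find from descending into a metastore_db once it is matched;
  # du -sk prints one "size<TAB>path" line per directory.
  find "$root" -type d -name metastore_db -prune -exec du -sk {} +
}

# Example: list_metastore_dbs "$HOME"
```

Each output line corresponds to one independent metastore, so more than one line for the same user usually means metadata is split across directories.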
Managing Hive Metastore for consistency
Configuring a shared Metastore
To avoid the creation of multiple metastore_db directories, configure Hive to use a shared, central metastore. This can be achieved by setting up a standalone metastore service backed by a more robust database such as MySQL or PostgreSQL.
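With a standalone metastore service, Hive clients are typically pointed at it through the hive.metastore.uris property instead of opening a database themselves. A minimal shell sketch, assuming a hypothetical host metastore.example.com and the conventional Thrift port 9083; write_client_hive_site is an illustrative helper, not a Hive tool, and it overwrites any hive-site.xml already in the target directory:

```shell
#!/bin/sh
# Hypothetical helper: write a client-side hive-site.xml into directory $1
# that points Hive at a remote metastore service at URI $2.
# Note: this replaces an existing hive-site.xml in that directory.
write_client_hive_site() {
  conf_dir=$1
  uri=$2
  mkdir -p "$conf_dir"
  cat > "$conf_dir/hive-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>$uri</value>
  </property>
</configuration>
EOF
}

# On the metastore host, the service itself is started with:
#   hive --service metastore
# Example client setup (host name is an assumption):
#   write_client_hive_site "$HIVE_HOME/conf" thrift://metastore.example.com:9083
```

Because every client reads the same URI from its configuration, the directory Hive is launched from no longer affects which metastore is used.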
Steps to Configure a shared Metastore
- Install a Database Server: Choose a database like MySQL or PostgreSQL and install it on a server.
- Configure Hive to Use the Database: Update the Hive configuration (hive-site.xml) to point to the database server for the metastore.
- Initialize the Metastore Schema: Use the Hive Schema Tool (schematool) to initialize the database schema for the metastore.
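The last two steps can be sketched in shell as follows, here for PostgreSQL. The host db.example.com, the database name metastore, and the user and password are placeholders, and both function names are hypothetical. The sketch also assumes the PostgreSQL JDBC driver jar has already been placed in Hive's lib directory and that an empty database exists on the server.

```shell
#!/bin/sh
# Hypothetical helper: write metastore database settings into $1/hive-site.xml.
# Host, database name, user and password below are placeholders.
# Note: this replaces an existing hive-site.xml in that directory.
write_metastore_hive_site() {
  conf_dir=$1
  mkdir -p "$conf_dir"
  cat > "$conf_dir/hive-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://db.example.com:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>change-me</value>
  </property>
</configuration>
EOF
}

# Create the metastore tables in the (already created, empty) database.
# Requires Hive's bin directory on PATH; run once per database.
init_metastore_schema() {
  schematool -dbType postgres -initSchema
}
```

For MySQL the same properties apply with a jdbc:mysql:// URL, the MySQL driver class, and -dbType mysql.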
Benefits of a shared Metastore
- Consistency: Ensures metadata consistency across different Hive sessions and directories.
- Scalability: More robust for handling larger, multi-user environments.
- Central Management: Simplifies the management of the metastore.