Introduction to Custom Input/Output Formats in Hive
Hive allows users to define custom input and…
Tag: Big Data
Hive : Comparison between the ORC and Parquet file formats in Hive
ORC (Optimized Row Columnar) and Parquet are two popular file formats for storing and processing large datasets in Hadoop-based systems…
Hive : Different types of storage formats supported by Hive [16 formats supported by Hive]
Apache Hive is an open-source data warehousing tool that was developed to provide an SQL-like interface to query and analyze…
PySpark : Setting PySpark parameters – A complete walkthrough [3 Ways]
In PySpark, you can set various parameters to configure your Spark application. These parameters can be set in different ways…
PySpark : Using CASE WHEN for Spark SQL to conditionally execute expressions : Dataframe and SQL way explained
The WHEN clause is used in Spark SQL to conditionally execute expressions. It’s similar to a CASE statement in SQL…
Spark : Calculation of executor memory in Spark – A complete guide
The executor memory is the amount of memory allocated to each executor in a Spark cluster. It determines the amount…
Hive : How to load JSON and nested JSON in Hive and how to view it [Sample code with Data]
In this article, I’ll walk you through how to read JSON data from a Hive table using an example with…
PySpark : PySpark program to write DataFrame to Snowflake table.
Overview of Snowflake and PySpark. Snowflake is a cloud-based data warehousing platform that allows users to store and analyze large…
Hive : Role of Hive type coercion and how you can perform type coercion in Hive
In Hive, type coercion is the process of converting one data type to another data type during query execution. Type…
Hive : Role of Hive CBO (cost-based optimization) and how you can enable CBO in Hive
Hive’s Cost-Based Optimization (CBO) is a powerful feature that enables Hive to optimize queries based on the estimated cost of…
Hive : Hive’s dynamic partitioning and how you can use it in your Hive queries
Hive’s dynamic partitioning is a feature that enables the automatic partitioning of data in Hive tables based on the data’s…