Tag: big_data_interview
PySpark : Dropping duplicate rows in PySpark – A Comprehensive Guide with examples
PySpark provides several methods to remove duplicate rows from a DataFrame. In this article, we will go over the steps…
PySpark : Replacing null values in a PySpark DataFrame column with 0 or any value you wish
To replace null values in a PySpark DataFrame column with a numeric value (e.g., 0), you can…
PySpark : unix_timestamp function – A comprehensive guide
One of the key functionalities of PySpark is the ability to transform data into the desired format. In some cases,…
PySpark : Reading a Parquet file stored on Amazon S3 using PySpark
To read a Parquet file stored on Amazon S3 using PySpark, you can use the following code: from pyspark.sql import…
Hive : Hive Table Properties : How are Hive Table Properties used?
One of the key features of Hive is the ability to define table properties, which can be used to control…
Hive : Implementing a UDF in Hive using Python – A Comprehensive Guide
A User-Defined Function (UDF) in Hive is a function that is defined by the user and can be used in…
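Hive commonly runs Python "UDFs" as streaming scripts via `SELECT TRANSFORM(...) USING 'python script.py'`: Hive pipes tab-separated rows to the script's stdin and reads tab-separated rows back from stdout. A minimal sketch of such a script (the two-column layout and the upper-casing logic are illustrative assumptions):

```python
import sys

def transform(line: str) -> str:
    """Upper-case the second tab-separated field of one input row."""
    user_id, name = line.rstrip("\n").split("\t")
    return f"{user_id}\t{name.upper()}"

if __name__ == "__main__":
    # Hive would invoke this as:
    #   ADD FILE script.py;
    #   SELECT TRANSFORM(id, name) USING 'python script.py' AS (id, name) FROM t;
    for line in sys.stdin:
        print(transform(line))
```

The script must be shipped to the cluster with `ADD FILE` before it can be referenced in the `USING` clause.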
Hive : Hive metastore and its importance.
The Hive Metastore is an important component of the Apache Hive data warehouse software. It acts as a central repository…
Hive : Hive Optimizers: A Comprehensive Guide
Hive is a data warehousing tool that provides a SQL-like interface for querying large datasets stored in Hadoop Distributed File…
Hive : Comparison between the ORC and Parquet file formats in Hive
ORC (Optimized Row Columnar) and Parquet are two popular file formats for storing and processing large datasets in Hadoop-based systems…
Hive : Different types of storage formats supported by Hive [16 formats supported by Hive]
Apache Hive is an open-source data warehousing tool that was developed to provide an SQL-like interface to query and analyze…