Hive is a data warehousing tool built on top of Hadoop, which allows us to…
Tag: SparkExamples
Hive : Different types of Hive execution engines
Hive is an open-source data warehouse tool built on top of Hadoop. It allows users to write SQL-like queries, called…
Hive : Difference between the MapReduce execution engine and the Tez execution engine in Hive
MapReduce and Tez are two popular execution engines used in Apache Hive for processing large-scale datasets. While both engines are…
PySpark : LongType and ShortType data types in PySpark
pyspark.sql.types.LongType, pyspark.sql.types.ShortType: In this article, we will explore PySpark’s LongType and ShortType data types, their properties, and how to work…
PySpark : HiveContext in PySpark – A brief explanation
One of the key components of PySpark is the HiveContext, which provides a SQL-like interface to work with data stored…
PySpark : Explanation of PySpark Full Outer Join with an example
One of the most commonly used operations in PySpark is joining two dataframes together. Full outer join is one of…
PySpark : Reading from multiple files, how to get the file that contains each record in PySpark [input_file_name]
pyspark.sql.functions.input_file_name: One of the most useful features of PySpark is the ability to access metadata about the input files being…
PySpark : Exploding a column of arrays or maps into multiple rows in a Spark DataFrame [posexplode_outer]
pyspark.sql.functions.posexplode_outer: The posexplode_outer function in PySpark is part of the pyspark.sql.functions module and is used to explode a column of…
PySpark : Transforming a column of arrays or maps into multiple columns, with one row for each element in the array or map [posexplode]
pyspark.sql.functions.posexplode: The posexplode function in PySpark is part of the pyspark.sql.functions module and is used to transform a column of…
PySpark : Calculate the percent rank of a set of values in a DataFrame column using PySpark [percent_rank]
pyspark.sql.functions.percent_rank: PySpark provides a percent_rank function as part of the pyspark.sql.functions module, which is used to calculate the percent rank…
PySpark : Extracting the minute of a given date as an integer in PySpark [minute]
pyspark.sql.functions.minute: The minute function in PySpark is part of the pyspark.sql.functions module and is used to extract the minute from…