Tag: Big Data
Hive : How to drop duplicate rows from Hive table.
This is a workaround that shows how we can drop duplicate rows from a Hive table. Here is how to…
PySpark : Understanding the PySpark next_day Function
Time series data often involves handling and manipulating dates. Apache Spark, through its PySpark interface, provides an arsenal of date-time…
PySpark : Extracting the Month from a Date in PySpark
Working with dates and time is a common task in data analysis. Apache Spark provides a variety…
PySpark : Calculating the Difference Between Dates with PySpark: The months_between Function
When working with time series data, it is often necessary to calculate the time difference between two dates. Apache Spark…
PySpark : Retrieving Unique Elements from two arrays in PySpark
Let’s start by creating a DataFrame named freshers_in. We’ll make it contain two array columns named ‘array1’ and ‘array2’, filled…
Hive : How to preserve Hive metadata [Preserve the last DDL time for the table]
HOLD_DDLTIME: The “last DDL time” refers to the timestamp of the most recent DDL (Data Definition Language) operation that was…
Extracting Unique Values From Array Columns in PySpark
When dealing with data in Spark, you may find yourself needing to extract distinct values from array columns. This can…
PySpark : Returning an Array that Contains Matching Elements in Two Input Arrays in PySpark
This article will focus on a particular use case: returning an array that contains the matching elements in two input…
PySpark : Creating Ranges in PySpark DataFrame with Custom Start, End, and Increment Values
In PySpark, there isn’t a built-in function to create an array sequence given a start, end, and increment value. In PySpark,…
PySpark : How to Prepend an Element to an Array on a Specific Condition in PySpark
If you want to prepend an element to the array only when the array contains a specific word, you can…