PySpark: How to keep only the last occurrence of each value in a dataset with duplicate values

Duplicate values in a dataset can cause problems for data analysis and modeling. It is often necessary to remove duplicates and keep only the last occurrence of each value to ensure the accuracy and reliability of the results. PySpark provides a simple and efficient way to remove duplicates and keep only the last occurrence of each value using its built-in functions.

Input Data

Let’s assume we have the following dataset that contains duplicate values:

+------+-------+-------+
| Name | Score | Grade |
+------+-------+-------+
| John |   80  |   B   |
| Jane |   90  |   A   |
| John |   85  |   B   |
| Jane |   95  |   A   |
| Mary |   75  |   C   |
| John |   90  |   A   |
+------+-------+-------+

Removing Duplicates and Keeping the Last Occurrence in PySpark

A first instinct is the dropDuplicates function, which removes duplicate rows based on the column(s) passed to its subset parameter. However, dropDuplicates keeps only one occurrence of each value (in practice, typically the first encountered) and provides no option to keep the last, so on its own it cannot solve this problem.
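For illustration, here is a minimal sketch of that default behavior; the small demo DataFrame and the app name are arbitrary, chosen to mirror the data above:

from pyspark.sql import SparkSession
# create a SparkSession
spark = SparkSession.builder.appName("DropDuplicatesDefault").getOrCreate()
# three rows for the same Name
df_demo = spark.createDataFrame([
    ("John", 80, "B"),
    ("John", 85, "B"),
    ("John", 90, "A")
], ["Name", "Score", "Grade"])
# dropDuplicates keeps a single row per Name (typically the first one
# encountered) and has no parameter to request the last occurrence
df_demo.dropDuplicates(["Name"]).show()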

Instead, to remove duplicates and keep the last occurrence of each value in PySpark, we can combine a window specification with the row_number() function. A window defines a group of rows based on one or more columns, and row_number() assigns a unique sequential number to each row within its window. By ordering the rows within each window in descending order based on a timestamp, an ID, or another ordering column (here, Score), the last occurrence of each value receives row_number 1, and we can keep just those rows.

Here is an example code snippet that demonstrates how to remove duplicates and keep the last occurrence of each value based on the Name column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
# create a SparkSession
spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()
# load the input data into a DataFrame
df = spark.createDataFrame([
    ("John", 80, "B"),
    ("Jane", 90, "A"),
    ("John", 85, "B"),
    ("Jane", 95, "A"),
    ("Mary", 75, "C"),
    ("John", 90, "A")
], ["Name", "Score", "Grade"])
# define a window that partitions the data by Name and orders by Score in descending order
window = Window.partitionBy(df["Name"]).orderBy(df["Score"].desc())
# assign a sequential number to each row within the window
df_last = df.select("*", row_number().over(window).alias("row_number"))
# keep only the first row in each window (row_number == 1), i.e. the last occurrence
df_last = df_last.filter(df_last.row_number == 1).drop("row_number")
# show the result
df_last.show()
Output
+----+-----+-----+
|Name|Score|Grade|
+----+-----+-----+
|Jane|   95|    A|
|John|   90|    A|
|Mary|   75|    C|
+----+-----+-----+

As we can see, the output DataFrame contains only the last occurrence of each unique Name value, with its Score and Grade preserved. One caveat: ordering by Score keeps the highest-scoring row per name, which matches the last occurrence here only because each name's final row also has its highest score. When row position itself must decide, order the window by a column that reflects it, as sketched below.
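If the dataset has no natural ordering column, one option (a sketch under that assumption, not part of the example above) is to tag each row with monotonically_increasing_id() when the data is loaded and order the window by that ID instead; the helper column name _row_id below is arbitrary. Note that these IDs are only guaranteed to increase within each partition, so this tracks true row order for data that arrives in order, such as a single input file.

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window
# create a SparkSession
spark = SparkSession.builder.appName("KeepLastByRowOrder").getOrCreate()
# same input data as above
df = spark.createDataFrame([
    ("John", 80, "B"),
    ("Jane", 90, "A"),
    ("John", 85, "B"),
    ("Jane", 95, "A"),
    ("Mary", 75, "C"),
    ("John", 90, "A")
], ["Name", "Score", "Grade"])
# tag each row with an ID that grows with row position
df_ids = df.withColumn("_row_id", monotonically_increasing_id())
# order each Name's rows by ID descending so the last occurrence gets row_number 1
window = Window.partitionBy("Name").orderBy(df_ids["_row_id"].desc())
df_last = df_ids.select("*", row_number().over(window).alias("row_number"))
# keep the first row per window (the last occurrence) and drop the helper columns
df_last = df_last.filter(df_last.row_number == 1).drop("row_number", "_row_id")
df_last.show()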
