Duplicate values in a dataset can cause problems for data analysis and modeling. It is often necessary to remove duplicates and keep only the last occurrence of each value to ensure the accuracy and reliability of the results. PySpark provides a simple and efficient way to remove duplicates and keep only the last occurrence of each value using its built-in functions.
Input Data
Let’s assume we have the following dataset that contains duplicate values:
+------+-------+-------+
| Name | Score | Grade |
+------+-------+-------+
| John | 80 | B |
| Jane | 90 | A |
| John | 85 | B |
| Jane | 95 | A |
| Mary | 75 | C |
| John | 90 | A |
+------+-------+-------+
Removing Duplicates and Keeping the Last Occurrence in PySpark
To remove duplicates in PySpark we would normally reach for the built-in dropDuplicates function and pass the column(s) to compare in its subset parameter. However, dropDuplicates keeps only one arbitrary occurrence of each value (typically the first one encountered) and offers no way to control which row survives, so it cannot keep the last occurrence on its own.
To keep the last occurrence of each value, we can instead combine a Window with the row_number() function. The Window defines a group of rows partitioned by one or more columns, and row_number() assigns a unique sequential number to each row within that window. By ordering the rows within each window in descending order on a timestamp, an ID, or another column that defines what "last" means (in this example, the Score column), the last occurrence of each value receives row number 1 and can be kept with a simple filter.
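For comparison, here is a minimal sketch of the default dropDuplicates behavior, using the df DataFrame created in the full example below; it removes duplicates based on the Name column but keeps an arbitrary surviving row rather than the last one:
# keep one row per Name; which duplicate survives is not guaranteed
df.dropDuplicates(["Name"]).show()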
Here is an example code snippet that demonstrates how to remove duplicates and keep the last occurrence of each value based on the Name column:
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
# create a SparkSession
spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()
# load the input data into a DataFrame
df = spark.createDataFrame([
("John", 80, "B"),
("Jane", 90, "A"),
("John", 85, "B"),
("Jane", 95, "A"),
("Mary", 75, "C"),
("John", 90, "A")
], ["Name", "Score", "Grade"])
# define a window specification that partitions the data by Name and orders by Score in descending order
window = Window.partitionBy(df["Name"]).orderBy(df["Score"].desc())
# assign a sequential number to each row within the window
df_last = df.select("*", row_number().over(window).alias("row_number"))
# keep only the first row in each window (the highest Score for each Name) and drop the helper column
df_last = df_last.filter(df_last.row_number == 1).drop("row_number")
# show the result
df_last.show()
+----+-----+-----+
|Name|Score|Grade|
+----+-----+-----+
|Jane| 95| A|
|John| 90| A|
|Mary| 75| C|
+----+-----+-----+
As we can see, the output DataFrame keeps a single row per unique Name value (the one with the highest Score, which here corresponds to the last occurrence), and the Score and Grade columns are retained.
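If the dataset has no timestamp, ID, or score column that defines "last", a possible workaround is to add a surrogate ordering column. The sketch below assumes the DataFrame's load order is a usable proxy for row order; the names order_id, window_by_order, and df_last_seen are illustrative, not part of the original example:
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window
# add a surrogate ordering column whose values increase with the DataFrame's row order
df_ordered = df.withColumn("order_id", monotonically_increasing_id())
# within each Name partition, put the most recently appearing row first
window_by_order = Window.partitionBy("Name").orderBy(df_ordered["order_id"].desc())
# keep only the row with the largest order_id per Name, then drop the helper columns
df_last_seen = (df_ordered
    .withColumn("rn", row_number().over(window_by_order))
    .filter("rn = 1")
    .drop("rn", "order_id"))
df_last_seen.show()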