PySpark provides several methods to remove duplicate rows from a DataFrame. In this article, we will go over the steps to drop duplicate rows in PySpark.
First, let’s create a sample DataFrame with 5 columns using the createDataFrame() method of the SparkSession object.
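Here is a minimal sketch; the column names and sample values (emp_id, name, dept, salary, city) are illustrative, with two pairs of rows intentionally duplicated.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to the DataFrame API.
spark = SparkSession.builder.appName("DropDuplicatesDemo").getOrCreate()

# Illustrative sample data with 5 columns; rows 1 & 3 and rows 2 & 5
# are exact duplicates.
data = [
    (1, "Alice", "HR", 3000, "NY"),
    (2, "Bob", "IT", 4000, "LA"),
    (1, "Alice", "HR", 3000, "NY"),
    (3, "Cara", "Sales", 3500, "SF"),
    (2, "Bob", "IT", 4000, "LA"),
]
columns = ["emp_id", "name", "dept", "salary", "city"]

df = spark.createDataFrame(data, columns)
df.show()
```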
Output
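```
+------+-----+-----+------+----+
|emp_id| name| dept|salary|city|
+------+-----+-----+------+----+
|     1|Alice|   HR|  3000|  NY|
|     2|  Bob|   IT|  4000|  LA|
|     1|Alice|   HR|  3000|  NY|
|     3| Cara|Sales|  3500|  SF|
|     2|  Bob|   IT|  4000|  LA|
+------+-----+-----+------+----+
```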
As you can see, there are duplicate rows in the DataFrame. Now, let’s drop them.
Method 1: Using dropDuplicates()
The simplest way to drop duplicate rows in PySpark is to use the dropDuplicates() method. This method returns a new DataFrame with the duplicate rows removed.
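A minimal sketch, assuming the sample DataFrame df created above; the (emp_id, name) subset in the second call is an illustrative choice:

```python
# With no arguments, dropDuplicates() compares entire rows.
df_unique = df.dropDuplicates()
df_unique.show()

# With a list of column names, it deduplicates on just those columns;
# one arbitrary row is kept per distinct (emp_id, name) pair.
df_by_key = df.dropDuplicates(["emp_id", "name"])
df_by_key.show()
```

Note that distinct() is equivalent to calling dropDuplicates() with no arguments.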
Output: the duplicate rows have been removed. (The row order of the result may differ from the input, since deduplication involves a shuffle.)
Method 2: Using groupBy() and agg() functions
Another way to drop duplicate rows is to use the groupBy() and agg() functions. Grouping the DataFrame by all of its columns collapses each set of identical rows into a single group; alternatively, you can group by a key column and aggregate the remaining columns with a function such as first() or last(). This method is useful when you want to retain exactly one row for each combination of column values, as shown in the sketch below.
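A minimal sketch, again using the sample DataFrame df from above; the cnt alias and the choice of emp_id as the grouping key are illustrative assumptions:

```python
from pyspark.sql.functions import count, first

# Variant A: group by every column, so each group collapses to one
# distinct row. The cnt column is only a by-product of agg().
df_grouped = (
    df.groupBy(df.columns)
      .agg(count("*").alias("cnt"))
      .drop("cnt")
)
df_grouped.show()

# Variant B: group by a key column and keep the first() value of the
# remaining columns, retaining exactly one row per emp_id.
df_first = df.groupBy("emp_id").agg(
    first("name").alias("name"),
    first("dept").alias("dept"),
    first("salary").alias("salary"),
    first("city").alias("city"),
)
df_first.show()
```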
Any extra column produced by agg() (cnt in the sketch above) can be dropped if it is not required; it is included purely for understanding, since agg() needs at least one aggregate expression.
As you can see, the duplicate rows have been removed and only one row is retained for each combination of column values.