In most data-cleansing work, one of the first things you may need to do is drop the nulls in the data set. In PySpark we have the "drop" function in the class pyspark.sql.DataFrameNaFunctions. Let's see how to use it.
class pyspark.sql.DataFrameNaFunctions
Syntax : drop(how='any', thresh=None, subset=None)
drop returns a new DataFrame that omits rows containing null values.
how :- Controls how a row qualifies for dropping:
“any” : Drop a row if it contains any nulls.
“all” : Drop a row only if all its values are null.
thresh :- Should be an integer; the default is None. If thresh is specified, rows that have fewer than thresh non-null values are dropped. When set, thresh overrides the "how" parameter. For example, df.na.drop(thresh=1) keeps every row that has at least 1 non-null value, so only rows that are entirely null are dropped.
subset :- An optional list of column names; only these columns are considered when looking for nulls.
Let's look at examples for each of the above.
#Create sample data
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,FloatType,BooleanType
# Create a SparkSession (already available as `spark` in the PySpark shell)
spark = SparkSession.builder.getOrCreate()
veh_data = [
(1,"Cadillac",2022,28.0,"USA",True),
(2,"Tesla",2022,80.0,"USA",False),
(3,"Lincoln",2021,25.5,None,True),
(None,None,None,None,None,None),
(4,"Ford",2022,None,"USA",None),
(5,"Benz",2020,20.0,"USA",False)]
veh_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("vehicle_make",StringType(),True),
StructField("make_year",IntegerType(),True),
StructField("price",FloatType(),True),
StructField("county",StringType(),True),
StructField("available_statue",BooleanType(),True)])
veh_df = spark.createDataFrame(data=veh_data,schema=veh_schema)
veh_df.show(20,False)
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|1 |Cadillac |2022 |28.0 |USA |true |
|2 |Tesla |2022 |80.0 |USA |false |
|3 |Lincoln |2021 |25.5 |null |true |
|null |null |null |null |null |null |
|4 |Ford |2022 |null |USA |null |
|5 |Benz |2020 |20.0 |USA |false |
+-----+------------+---------+-----+------+----------------+
Calling drop() with no arguments removes every row that contains at least one null value:
veh_df.na.drop().show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
| 1| Cadillac| 2022| 28.0| USA| true|
| 2| Tesla| 2022| 80.0| USA| false|
| 5| Benz| 2020| 20.0| USA| false|
+-----+------------+---------+-----+------+----------------+
Next we use how="any": any row containing at least one null is removed, so the result is the same as above.
veh_df.na.drop(how='any').show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
| 1| Cadillac| 2022| 28.0| USA| true|
| 2| Tesla| 2022| 80.0| USA| false|
| 5| Benz| 2020| 20.0| USA| false|
+-----+------------+---------+-----+------+----------------+
With how="all", a row is dropped only if all of its values are null:
veh_df.na.drop(how='all').show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
| 1| Cadillac| 2022| 28.0| USA| true|
| 2| Tesla| 2022| 80.0| USA| false|
| 3| Lincoln| 2021| 25.5| null| true|
| 4| Ford| 2022| null| USA| null|
| 5| Benz| 2020| 20.0| USA| false|
+-----+------------+---------+-----+------+----------------+
Use subset to drop a row only when a particular column contains null. Below, rows where the "price" column is null are dropped.
veh_df.na.drop(subset=["price"]).show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
| 1| Cadillac| 2022| 28.0| USA| true|
| 2| Tesla| 2022| 80.0| USA| false|
| 3| Lincoln| 2021| 25.5| null| true|
| 5| Benz| 2020| 20.0| USA| false|
+-----+------------+---------+-----+------+----------------+
Same as above, but here we pass two column names; a row is dropped if either column is null.
veh_df.na.drop(subset=["price","county"]).show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
| 1| Cadillac| 2022| 28.0| USA| true|
| 2| Tesla| 2022| 80.0| USA| false|
| 5| Benz| 2020| 20.0| USA| false|
+-----+------------+---------+-----+------+----------------+
Thresh : A row is dropped when its total number of non-null values is less than the limit. If we give thresh=5, then any row containing fewer than 5 non-null values will be dropped.
veh_df.na.drop(thresh=5).show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
| 1| Cadillac| 2022| 28.0| USA| true|
| 2| Tesla| 2022| 80.0| USA| false|
| 3| Lincoln| 2021| 25.5| null| true|
| 5| Benz| 2020| 20.0| USA| false|
+-----+------------+---------+-----+------+----------------+
veh_df.na.drop(thresh=6).show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
| 1| Cadillac| 2022| 28.0| USA| true|
| 2| Tesla| 2022| 80.0| USA| false|
| 5| Benz| 2020| 20.0| USA| false|
+-----+------------+---------+-----+------+----------------+
FYI : The same effect, or its inverse, can also be achieved with isNotNull() and/or isNull().
See the similar posts below:
How to replace null value for all columns or for each column separately ?
How to replace a value with another value in a column in Pyspark Dataframe ?