How to replace null values for all columns, or for each column separately, in PySpark (na.fill)


Spark api : pyspark.sql.DataFrameNaFunctions.fill

Syntax : fill(value, subset=None)

value : the value to replace nulls with; it can only be an int, long, float, string, bool or dict. If value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value.

subset : optional list of column names to consider. Note that if value is a string and the subset contains a non-string column, the non-string column is simply ignored.

In this article we will see how to replace null column values with a defined value. You can replace every null with a single value, or fill each column's nulls with a column-specific value.

We use na.fill() to replace null values. DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.

The replacement value can be an int, float, string, bool or dict. Columns specified in subset (str, tuple or list, optional) that do not have a matching data type are ignored.

Sample code

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import IntegerType, StringType, FloatType, BooleanType
spark = SparkSession.builder.appName("na-fill-demo").getOrCreate()
emp_data = [
    (1, "Eric", 90000, 11.0, "USA", True),
    (2, "Gates", 14000, 1.0, "USA", False),
    (3, "Elon", 18000, 3.5, None, True),
    (4, "Steve", 22000, None, "USA", None),
    (5, "Tim", None, 2.0, "USA", False)]
emp_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("commission", FloatType(), True),
    StructField("county", StringType(), True),
    StructField("status", BooleanType(), True)])
state_data_df = spark.createDataFrame(data=emp_data, schema=emp_schema)
state_data_df.show(20, False)
+-----+-----+------+----------+------+------+
|si_no|name |salary|commission|county|status|
+-----+-----+------+----------+------+------+
|1    |Eric |90000 |11.0      |USA   |true  |
|2    |Gates|14000 |1.0       |USA   |false |
|3    |Elon |18000 |3.5       |null  |true  |
|4    |Steve|22000 |null      |USA   |null  |
|5    |Tim  |null  |2.0       |USA   |false |
+-----+-----+------+----------+------+------+

If you want to fill all integer or float nulls with 0 or 0.0, you can do the following:

state_data_df.na.fill(0).show()
+-----+-----+------+----------+------+------+
|si_no| name|salary|commission|county|status|
+-----+-----+------+----------+------+------+
|    1| Eric| 90000|      11.0|   USA|  true|
|    2|Gates| 14000|       1.0|   USA| false|
|    3| Elon| 18000|       3.5|  null|  true|
|    4|Steve| 22000|       0.0|   USA|  null|
|    5|  Tim|     0|       2.0|   USA| false|
+-----+-----+------+----------+------+------+

Now if you want to fill a different value for each column that has nulls, you can pass a dict:

state_data_df.na.fill({'salary':0,'county':'Not mentioned','status': False}).show()
+-----+-----+------+----------+-------------+------+
|si_no| name|salary|commission|       county|status|
+-----+-----+------+----------+-------------+------+
|    1| Eric| 90000|      11.0|          USA|  true|
|    2|Gates| 14000|       1.0|          USA| false|
|    3| Elon| 18000|       3.5|Not mentioned|  true|
|    4|Steve| 22000|      null|          USA| false|
|    5|  Tim|     0|       2.0|          USA| false|
+-----+-----+------+----------+-------------+------+

See the similar posts below:

How to drop nulls in a dataframe ?
How to replace a value with another value in a column in Pyspark Dataframe ?
