In PySpark we can replace a value in one column or multiple column or multiple values in a column to another value.
This feature is provided by class pyspark.sql.DataFrameNaFunctions(df)
Syntax : replace(to_replace, value=<no value>, subset=None)
replace : will returns a new DataFrame after replacing a value with another given value.
to_replace : This can have bool, int, long, float, string, list or dict. value to be replaced. If the value is a dict, then value is ignored or can be omitted, and to_replace must be a mapping between a value and a replacement.
value : The value can be of bool, int, long, float, string, list or None. The replacement value must be a bool, int, long, float, string or None. If value is a list, value should be of the same length and type as to_replace.
subset : Optional list of column names that need to consider.
Example program for na.replace
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,FloatType,BooleanType
emp_data = ([
(1,"Eric",90000,11.0,"USA",True),\
(2,"Gates",14000,1.0,"USA",False),\
(3,"Elon",18000,3.5,None,True),\
(4,"Steve",22000,None,"UK",None),\
(5,"Tim",None,2.0,"INDIA",False)])
emp_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("name",StringType(),True),
StructField("salary",IntegerType(),True),
StructField("commission",FloatType(),True),
StructField("country",StringType(),True),
StructField("status",BooleanType(),True)])
state_data_df = spark.createDataFrame(data=emp_data,schema=emp_schema)
state_data_df.show(20,False)
+-----+-----+------+----------+-------+------+
|si_no|name |salary|commission|country|status|
+-----+-----+------+----------+-------+------+
|1 |Eric |90000 |11.0 |USA |true |
|2 |Gates|14000 |1.0 |USA |false |
|3 |Elon |18000 |3.5 |null |true |
|4 |Steve|22000 |null |UK |null |
|5 |Tim |null |2.0 |INDIA |false |
+-----+-----+------+----------+-------+------+
Now lets replace USA with United States
state_data_df.na.replace(['USA'],['United States'],'country').show()
+-----+-----+------+----------+-------------+------+
|si_no| name|salary|commission| country|status|
+-----+-----+------+----------+-------------+------+
| 1| Eric| 90000| 11.0|United States| true|
| 2|Gates| 14000| 1.0|United States| false|
| 3| Elon| 18000| 3.5| null| true|
| 4|Steve| 22000| null| UK| null|
| 5| Tim| null| 2.0| INDIA| false|
+-----+-----+------+----------+-------------+------+
If we need to replace multiple values in the same time , for example USA and UK need to be replace as United States and United Kingdom you can use as below .
state_data_df.na.replace(['USA','UK'],['United States','United Kingdom'],'country').show()
+-----+-----+------+----------+--------------+------+
|si_no| name|salary|commission| country|status|
+-----+-----+------+----------+--------------+------+
| 1| Eric| 90000| 11.0| United States| true|
| 2|Gates| 14000| 1.0| United States| false|
| 3| Elon| 18000| 3.5| null| true|
| 4|Steve| 22000| null|United Kingdom| null|
| 5| Tim| null| 2.0| INDIA| false|
+-----+-----+------+----------+--------------+------+
Now if we want to replace multiple values in different column m for example we want to replace name in the name column and country in the country column at the same time ,you can use as below.
state_data_df.na.replace(['USA','UK'],['United States','United Kingdom'],'country').replace(['Eric','Steve'],['Eric Walter','Steve John'],'name').show()
+-----+-----+------+----------+--------------+------+
|si_no| name|salary|commission| country|status|
+-----+-----+------+----------+--------------+------+
| 1| Eric| 90000| 11.0| United States| true|
| 2|Gates| 14000| 1.0| United States| false|
| 3| Elon| 18000| 3.5| null| true|
| 4|Steve| 22000| null|United Kingdom| null|
| 5| Tim| null| 2.0| INDIA| false|
+-----+-----+------+----------+--------------+------+
See the below similar posts
How to replace null value for all columns or for each column separately ?