How to replace a value with another value in a column in Pyspark Dataframe ?

PySpark @ Freshers.in

In PySpark we can replace a value in one column or multiple column or multiple values in a column to another value. 

This feature is provided by class pyspark.sql.DataFrameNaFunctions(df)

Syntax : replace(to_replace, value=<no value>, subset=None) 

replace : will returns a new DataFrame after replacing a value with another given value.

to_replace : This can have  bool, int, long, float, string, list or dict. value to be replaced. If the value is a dict, then value is ignored or can be omitted, and to_replace must be a mapping between a value and a replacement.

value : The value can be of bool, int, long, float, string, list or None. The replacement value must be a bool, int, long, float, string or None. If value is a list, value should be of the same length and type as to_replace.

subset :  Optional list of column names that need to consider.

Example program for na.replace

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,FloatType,BooleanType
emp_data = ([
(1,"Eric",90000,11.0,"USA",True),\
(2,"Gates",14000,1.0,"USA",False),\
(3,"Elon",18000,3.5,None,True),\
(4,"Steve",22000,None,"UK",None),\
(5,"Tim",None,2.0,"INDIA",False)])
emp_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("name",StringType(),True),
StructField("salary",IntegerType(),True),
StructField("commission",FloatType(),True),
StructField("country",StringType(),True),
StructField("status",BooleanType(),True)])
state_data_df = spark.createDataFrame(data=emp_data,schema=emp_schema)
state_data_df.show(20,False)
+-----+-----+------+----------+-------+------+
|si_no|name |salary|commission|country|status|
+-----+-----+------+----------+-------+------+
|1    |Eric |90000 |11.0      |USA    |true  |
|2    |Gates|14000 |1.0       |USA    |false |
|3    |Elon |18000 |3.5       |null   |true  |
|4    |Steve|22000 |null      |UK     |null  |
|5    |Tim  |null  |2.0       |INDIA  |false |
+-----+-----+------+----------+-------+------+

Now lets replace USA with United States

state_data_df.na.replace(['USA'],['United States'],'country').show()
+-----+-----+------+----------+-------------+------+
|si_no| name|salary|commission|      country|status|
+-----+-----+------+----------+-------------+------+
|    1| Eric| 90000|      11.0|United States|  true|
|    2|Gates| 14000|       1.0|United States| false|
|    3| Elon| 18000|       3.5|         null|  true|
|    4|Steve| 22000|      null|           UK|  null|
|    5|  Tim|  null|       2.0|        INDIA| false|
+-----+-----+------+----------+-------------+------+

If we need to replace multiple values in the same time , for example USA and UK need to be replace as United States and United Kingdom you can use as below .

state_data_df.na.replace(['USA','UK'],['United States','United Kingdom'],'country').show()
+-----+-----+------+----------+--------------+------+
|si_no| name|salary|commission|       country|status|
+-----+-----+------+----------+--------------+------+
|    1| Eric| 90000|      11.0| United States|  true|
|    2|Gates| 14000|       1.0| United States| false|
|    3| Elon| 18000|       3.5|          null|  true|
|    4|Steve| 22000|      null|United Kingdom|  null|
|    5|  Tim|  null|       2.0|         INDIA| false|
+-----+-----+------+----------+--------------+------+

Now if we want to replace multiple values in different column m for example we want to replace name in the name column and country in the country column at the same time ,you can use as below.

state_data_df.na.replace(['USA','UK'],['United States','United Kingdom'],'country').replace(['Eric','Steve'],['Eric Walter','Steve John'],'name').show()
+-----+-----+------+----------+--------------+------+
|si_no| name|salary|commission|       country|status|
+-----+-----+------+----------+--------------+------+
|    1| Eric| 90000|      11.0| United States|  true|
|    2|Gates| 14000|       1.0| United States| false|
|    3| Elon| 18000|       3.5|          null|  true|
|    4|Steve| 22000|      null|United Kingdom|  null|
|    5|  Tim|  null|       2.0|         INDIA| false|
+-----+-----+------+----------+--------------+------+

See the below similar posts

How to replace null value for all columns or for each column separately ?

How to drop nulls in a dataframe ?

Author: user

Leave a Reply