PySpark: Replacing null values in a DataFrame column with 0 or any value you wish

PySpark @ Freshers.in

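To replace null values in a PySpark DataFrame column with a numeric value (e.g., 0), you can use the na.fill() method, which replaces nulls in the columns whose data type matches the fill value (optionally limited to a subset of columns). Alternatively, you can build a conditional expression with when() and otherwise(), which is the approach the example below takes.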

In this example, we create a PySpark DataFrame with 3 columns: “id”, “name”, and “age”. The “id” and “age” columns are of IntegerType, and the “name” column is of StringType. We also include some sample data, including a null value in the “age” column, to demonstrate how to handle null values in PySpark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Create a SparkSession
spark = SparkSession.builder.appName("None to 0 pyspark @ Freshers.in").getOrCreate()
# Define the schema for the DataFrame
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
# Create a list of tuples containing sample data
data = [
    (1, "Barry", 25),
    (2, "Charlie", 30),
    (3, "Marrie", 35),
    (4, "Gold", None),
    (5, "Twinkle", 28)
]
# Create a DataFrame from the list of tuples and the schema
df = spark.createDataFrame(data, schema)
# Show the DataFrame
df.show()
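# Replace null values in the "age" column with 0 and keep non-null values as they are
df = df.withColumn("age", when(col("age").isNull(), 0).otherwise(col("age")))
# Show the DataFrame after the replacement
df.show()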

In the code above, the DataFrame is named df and the column whose null values are replaced with 0 is age.
Here, when(col("age").isNull(), 0) creates a conditional expression that checks whether the value in the age column is null. If it is null, the expression returns the integer value 0. The otherwise(col("age")) call ensures that the original value is retained for any non-null values.

The withColumn() method is used to apply the above transformation to the age column in the df DataFrame. The resulting DataFrame will have null values in the age column replaced with the integer value 0.

Output before the replacement

+---+-------+----+
| id|   name| age|
+---+-------+----+
|  1|  Barry|  25|
|  2|Charlie|  30|
|  3| Marrie|  35|
|  4|   Gold|null|
|  5|Twinkle|  28|
+---+-------+----+

Output after the replacement

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Barry| 25|
|  2|Charlie| 30|
|  3| Marrie| 35|
|  4|   Gold|  0|
|  5|Twinkle| 28|
+---+-------+---+

Author: user
