Dealing with missing or null values is a common challenge in data preprocessing and cleaning tasks. PySpark, the Python API for Apache Spark, offers several DataFrame methods to handle missing values efficiently. In this article, we’ll explore different strategies for handling missing or null values in PySpark, along with practical examples and their outputs.
1. Dropping Rows with Null Values:
One approach to handling missing values is to drop every row that contains a null value, using the DataFrame's dropna() method. By default, a row is dropped if any of its columns is null.
# Importing PySpark modules
from pyspark.sql import SparkSession
# Creating a SparkSession
spark = SparkSession.builder.appName("HandleNullValues").getOrCreate()
# Creating a DataFrame with null values
data = [(1, "Alice", None), (2, "Bob", 25), (3, None, 30)]
df = spark.createDataFrame(data, ["ID", "Name", "Age"])
# Dropping rows with null values
cleaned_df = df.dropna()
# Displaying the cleaned DataFrame
cleaned_df.show()
Output:
+---+----+---+
| ID|Name|Age|
+---+----+---+
|  2| Bob| 25|
+---+----+---+
2. Filling Null Values with a Specific Value:
Another approach is to fill null values with a predefined value using the fillna() method. Passing a dictionary lets you choose a different replacement value for each column.
# Filling null values with a specific value
filled_df = df.fillna({"Name": "Unknown", "Age": 0})
# Displaying the DataFrame with filled values
filled_df.show()
Output:
+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice|  0|
|  2|    Bob| 25|
|  3|Unknown| 30|
+---+-------+---+
3. Imputing Null Values with Mean or Median:
Imputing null values with the mean or median of the respective column is another commonly used technique.
# Importing PySpark modules
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col
# Casting Age to double: it is inferred as an integer (long) column, which
# older Spark versions reject and newer ones would truncate the mean back to
df_double = df.withColumn("Age", col("Age").cast("double"))
# Creating an Imputer object
imputer = Imputer(strategy="mean", inputCols=["Age"], outputCols=["Age_imputed"])
# Fitting the imputer model
imputer_model = imputer.fit(df_double)
# Transforming the DataFrame to impute null values
imputed_df = imputer_model.transform(df_double)
# Displaying the DataFrame with imputed values
imputed_df.show()
Output:
+---+-----+----+-----------+
| ID| Name| Age|Age_imputed|
+---+-----+----+-----------+
|  1|Alice|null|       27.5|
|  2|  Bob|25.0|       25.0|
|  3| null|30.0|       30.0|
+---+-----+----+-----------+