This article provides a comprehensive guide on counting missing values (null, None, NaN) in a PySpark DataFrame, a crucial step in data cleaning and preprocessing. Identifying and counting missing values in a dataset is important for:
- Data Quality Assessment: Understanding the extent of missing data to evaluate data quality.
- Data Cleaning: Informing the strategy for handling missing data, such as imputation or deletion (a short sketch follows this list).
- Analytical Accuracy: Ensuring accurate analysis by acknowledging data incompleteness.
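Once the extent of missing data is known, PySpark's built-in na functions cover both strategies mentioned above. The snippet below is a minimal illustrative sketch; the column names and fill values are assumptions for demonstration, not part of the example that follows.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HandleMissingValues").getOrCreate()
# Tiny illustrative DataFrame (hypothetical values)
df = spark.createDataFrame(
    [("Sachin", None, 35), ("Manju", "Female", None)],
    ["Name", "Gender", "Age"]
)
# Deletion: drop any row that contains a null
dropped_df = df.na.drop()
# Imputation: fill nulls with a per-column constant
filled_df = df.na.fill({"Gender": "Unknown", "Age": 0})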
Counting missing values in PySpark
PySpark provides functions to efficiently count null, None, and NaN values in DataFrames. Let’s walk through a method to perform this task.
Step-by-step guide
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when, isnan
# Initialize Spark Session
spark = SparkSession.builder.appName("CountMissingValues").getOrCreate()
# Sample Data
data = [
("Sachin", None, 35),
("Manju", "Female", None),
("Ram", "Male", 40),
("Raju", None, None),
("David", "Male", 50),
("Wilson", "Male", None)
]
columns = ["Name", "Gender", "Age"]
# Creating DataFrame
df = spark.createDataFrame(data, columns)
# Count null/None and NaN values in every column
# (isNull catches null/None; isnan catches NaN)
null_counts = df.select([count(when(col(c).isNull() | isnan(col(c)), c)).alias(c) for c in df.columns])
# Show Results
null_counts.show()
Output
+----+------+---+
|Name|Gender|Age|
+----+------+---+
|   0|     2|  3|
+----+------+---+
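If the counts are needed programmatically rather than just for display, the single result row can be collected back to the driver as a plain Python dictionary (a small usage sketch, not part of the original output):
# Collect the one-row result as a {column: missing_count} dict
missing_counts = null_counts.first().asDict()
print(missing_counts)  # {'Name': 0, 'Gender': 2, 'Age': 3}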
In this example, we use the when, col, isNull, and isnan functions from PySpark to count null, None, and NaN values across all columns of the DataFrame.
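Note that isnan is defined for float and double columns; applying it to string columns relies on implicit casting and can fail or behave differently depending on the Spark version and ANSI settings. A safer, type-aware variant (a sketch, with the helper name missing_value_counts chosen here for illustration) applies isnan only to floating-point columns:
from pyspark.sql.functions import col, count, when, isnan
def missing_value_counts(df):
    # Apply isnan only to float/double columns; isNull alone covers the rest
    exprs = []
    for c, dtype in df.dtypes:
        if dtype in ("float", "double"):
            cond = col(c).isNull() | isnan(col(c))
        else:
            cond = col(c).isNull()
        exprs.append(count(when(cond, c)).alias(c))
    return df.select(exprs)
missing_value_counts(df).show()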