Identifying and counting missing values (null, None, NaN) in a dataset is a crucial step in data cleaning and preprocessing, and this article provides a comprehensive guide on how to accomplish it in PySpark. Counting missing values matters for:
- Data Quality Assessment: Understanding the extent of missing data to evaluate data quality.
- Data Cleaning: Informing the strategy for handling missing data, like imputation or deletion.
- Analytical Accuracy: Ensuring accurate analysis by acknowledging data incompleteness.
Counting missing values in PySpark
PySpark provides functions to efficiently count null, None, and NaN values in DataFrames. Let’s walk through a method to perform this task.
Step-by-step guide
Example:
Output
In this example, we use the when, col, isNull, and isnan functions from PySpark to count null, None, and NaN values across all columns of the DataFrame.