Replacing NaN (Not a Number) values with a specified value in a column : nanvl

user November 21, 2023

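The nanvl function in PySpark takes two columns and returns the value from the first column where it is not NaN (Not a Number), and the value from the second column otherwise. Both inputs should be floating-point columns (DoubleType or FloatType). This is particularly useful in datasets where NaN values need to be handled differently from regular null values: Spark treats the two as distinct, and nanvl replaces only NaN, leaving nulls untouched. This article explores the nanvl function in PySpark, its advantages, and demonstrates its application through a practical use case.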

from pyspark.sql.functions import nanvl
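The per-row behaviour can be sketched in plain Python, without Spark. The helper name nanvl_model is hypothetical and is not part of PySpark; it only models what happens to a single value, under the assumption that a null first argument is passed through unchanged.

```python
import math
from typing import Optional


def nanvl_model(value: Optional[float], fallback: Optional[float]) -> Optional[float]:
    """Plain-Python model of nanvl for one row (hypothetical helper, not a PySpark API)."""
    if value is None:
        # A null first argument stays null; only NaN triggers the fallback.
        return None
    if math.isnan(value):
        return fallback
    return value


print(nanvl_model(float("nan"), 0.0))  # 0.0
print(nanvl_model(22.0, 0.0))          # 22.0
print(nanvl_model(None, 0.0))          # None
```

The third call is the case that most often surprises: a missing (null) value is not a NaN, so it is not replaced.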

Advantages of using nanvl

Data Integrity: Replaces NaN with an explicit, known default while leaving every other value in the column, including nulls, unchanged.

Flexibility: The replacement can be a literal or the value of another column. This can be crucial in statistical computations or data visualizations, because a single NaN makes aggregates such as sum or avg return NaN.

Simplicity: A single function call performs the replacement, with no need to build a conditional expression from when and isnan.

Example : Handling employee attendance records

Consider a dataset representing the attendance records of employees, where NaN values indicate days not recorded. We want to replace these NaN values with a default value.

Dataset

Name	Attendance Days
Sachin	NaN
Ram	22
Raju	NaN
David	18
Wilson	20

Objective

Replace NaN values in the ‘Attendance Days’ column with the default value of 0.

Implementation in PySpark

Setting up the PySpark environment and creating the DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import nanvl
# Initialize Spark Session
spark = SparkSession.builder.appName("Nanvl Example").getOrCreate()
# Sample Data
data = [("Sachin", float('nan')), ("Ram", 22.0), ("Raju", float('nan')), ("David", 18.0), ("Wilson", 20.0)]
df = spark.createDataFrame(data, ["Name", "Attendance Days"])
df.show()

Output

+------+---------------+
|  Name|Attendance Days|
+------+---------------+
|Sachin|            NaN|
|   Ram|           22.0|
|  Raju|            NaN|
| David|           18.0|
|Wilson|           20.0|
+------+---------------+

Applying the nanvl function:

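from pyspark.sql.functions import lit
# Keep "Attendance Days" where it holds a number; use 0.0 where it is NaN.
# A float literal matches the column's double type.
nanvl_df = df.withColumn("Adjusted Attendance", nanvl(df["Attendance Days"], lit(0.0)))
nanvl_df.show()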

Output

+------+---------------+-------------------+
|  Name|Attendance Days|Adjusted Attendance|
+------+---------------+-------------------+
|Sachin|            NaN|                0.0|
|   Ram|           22.0|               22.0|
|  Raju|            NaN|                0.0|
| David|           18.0|               18.0|
|Wilson|           20.0|               20.0|
+------+---------------+-------------------+
