PySpark-How to returns the first column that is not null

PySpark @ Freshers.in

pyspark.sql.functions.coalesce

If you want to return the first non zero from list of column you can use coalesce function in PySpark. COALESCE simply returns the first value out of a list that is not NULL. COALESCE is a powerful tool if you are returning numerous null values and if you need the first non null value.

Syntax 

pyspark.sql.functions.coalesce(*cols)

Sample Code

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,FloatType,BooleanType
from pyspark.sql.functions import coalesce
cust_data = ([
(1,"Bric",150.0,None,None,'09-03-2022'),\
(2,"Gates",None,300.00,None,'09-09-2022'),\
(3,"ElonMusk",None,None,450.00,'09-13-2022'),\
(4,"SteveJob",350.0,None,None,'09-15-2022'),\
(5,"Tim Berner Lee",650.0,None,None,'09-17-2022')])
cust_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("customer_name",StringType(),True),
StructField("yearly",FloatType(),True),
StructField("quarterly",FloatType(),True),
StructField("monthly",FloatType(),True),
StructField("date_string",StringType(),True)])
cust_order_df = spark.createDataFrame(data=cust_data,schema=cust_schema)
cust_order_df.show(20,False)
+-----+--------------+------+---------+-------+-----------+
|si_no|customer_name |yearly|quarterly|monthly|date_string|
+-----+--------------+------+---------+-------+-----------+
|1    |Bric          |150.0 |null     |null   |09032022   |
|2    |Gates         |null  |300.0    |null   |09092022   |
|3    |ElonMusk      |null  |null     |450.0  |09132022   |
|4    |SteveJob      |350.0 |null     |null   |09152022   |
|5    |Tim Berner Lee|650.0 |null     |null   |09172022   |
+-----+--------------+------+---------+-------+-----------+
cust_order_df.select(\
cust_order_df["si_no"],\
cust_order_df["customer_name"],\
coalesce(cust_order_df["yearly"],\
         cust_order_df["quarterly"],\
         cust_order_df["monthly"]).alias("first_payment"),\
cust_order_df["date_string"]).show()
+-----+--------------+-------------+-----------+
|si_no| customer_name|first_payment|date_string|
+-----+--------------+-------------+-----------+
|    1|          Bric|        150.0|   09032022|
|    2|         Gates|        300.0|   09092022|
|    3|      ElonMusk|        450.0|   09132022|
|    4|      SteveJob|        350.0|   09152022|
|    5|Tim Berner Lee|        650.0|   09172022|
+-----+--------------+-------------+-----------+

Reference

  1. Spark Interview Questions
  2. Spark Examples
  3. PySpark Blogs
  4. Bigdata Blogs
  5. Official Page
Author: user

Leave a Reply