What is the difference between concat and concat_ws in Pyspark

user February 16, 2022 Leave a Comment

concat vs concat_ws

Syntax:
pyspark.sql.functions.concat(*cols)
pyspark.sql.functions.concat_ws(sep, *cols)

concat : concat concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
concat_ws : concat_ws concatenates multiple input string columns together into a single string column, using the given separator.
concat returns None when any of the column/columns is null. All keys in the column is lost.
concat_ws returns non empty column’s after concatenating with the separator.
In short in general scenario concat_ws is recommended rather than concat.

#Sample Code 
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType,IntegerType
from pyspark.sql.functions import concat,concat_ws,col,lit
car_emp =[
("Finance",("Tom","Jack","Mark"),101),
("Technology",("Barry","Mike","Jacob"),None),
("Human_Resource",("Tim","","More"),301),
("Administrative",("Edisson","","Kevin"),None),
("Contractors",("Trunp","Mark","Elvin"),501),
]
car_emp_schema = StructType([
StructField("dept",StringType(),True),
StructField("name",StructType([
StructField("first_name",StringType(),True),
StructField("middle_name",StringType(),True),
StructField("last_name",StringType(),True)])),
StructField("dept_id",IntegerType(),True),])
car_emp_df_1 = spark.createDataFrame(data=car_emp,schema=car_emp_schema)
car_emp_df_1.printSchena()
car_emp_df_1.show(20,False)
"""
Note: Here the dept_id is null for second and fourth row
+--------------+--------------------+-------+
|dept          |name                |dept_id|
+--------------+--------------------+-------+
|Finance       |[Tom, Jack, Mark]   |101    |
|Technology    |[Barry, Mike, Jacob]|null   |
|Human_Resource|[Tim, , More]       |301    |
|Administrative|[Edisson, , Kevin]  |null   |
|Contractors   |[Trunp, Mark, Elvin]|501    |
+--------------+--------------------+-------+
"""
car_emp_df_1.withColumn("dept_with_id",concat(car_emp_df_1.dept,car_emp_df_1.dept_id)).show(20,False)
"""
On using concat ,it return null as dept_id is null for second and fourth row. No concatenation happened there.
+--------------+--------------------+-------+-----------------+
|dept          |name                |dept_id|dept_with_id     |
+--------------+--------------------+-------+-----------------+
|Finance       |[Tom, Jack, Mark]   |101    |Finance101       |
|Technology    |[Barry, Mike, Jacob]|null   |null             |
|Human_Resource|[Tim, , More]       |301    |Human_Resource301|
|Administrative|[Edisson, , Kevin]  |null   |null             |
|Contractors   |[Trunp, Mark, Elvin]|501    |Contractors501   |
+--------------+--------------------+-------+-----------------+
"""
car_emp_df_1.withColumn("dept_with_id",concat_ws("_",car_emp_df_1.dept,car_emp_df_1.dept_id)).show(20,False)
"""
On using concat_ws ,it return non empty colum values( Without dept_id , it returns dept alone making meaning full)
+--------------+--------------------+-------+------------------+
|dept          |name                |dept_id|dept_with_id      |
+--------------+--------------------+-------+------------------+
|Finance       |[Tom, Jack, Mark]   |101    |Finance_101       |
|Technology    |[Barry, Mike, Jacob]|null   |Technology        |
|Human_Resource|[Tim, , More]       |301    |Human_Resource_301|
|Administrative|[Edisson, , Kevin]  |null   |Administrative    |
|Contractors   |[Trunp, Mark, Elvin]|501    |Contractors_501   |
+--------------+--------------------+-------+------------------+
"""

Reference

Post Views: 1,025

Author: user

What is the difference between concat and concat_ws in Pyspark

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget

How PARTITION BY Works in Snowflake, and SQL in general

Stash a specific file using Git

Prevent your computer from locking : Python to simulate mouse movements

AWS EC2 vs Azure Virtual Machines

Production and Industrial Engineering

Engineering Technical campus placement question and answers

JavaScript’s reduceRight() method to iterate over an array from right to left

Merging Multiple Images into a Single PDF File Using Python

Nanotechnology

Electronics and Instrumentation

Most Viewed Posts

Related Posts

Related Articles

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget