concat vs concat_ws
Syntax: pyspark.sql.functions.concat(*cols) pyspark.sql.functions.concat_ws(sep, *cols)
concat : concat concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
concat_ws : concat_ws concatenates multiple input string columns together into a single string column, using the given separator.
concat returns None when any of the column/columns is null. All keys in the column is lost.
concat_ws returns non empty column’s after concatenating with the separator.
In short in general scenario concat_ws is recommended rather than concat.
#Sample Code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType,IntegerType
from pyspark.sql.functions import concat,concat_ws,col,lit
car_emp =[
("Finance",("Tom","Jack","Mark"),101),
("Technology",("Barry","Mike","Jacob"),None),
("Human_Resource",("Tim","","More"),301),
("Administrative",("Edisson","","Kevin"),None),
("Contractors",("Trunp","Mark","Elvin"),501),
]
car_emp_schema = StructType([
StructField("dept",StringType(),True),
StructField("name",StructType([
StructField("first_name",StringType(),True),
StructField("middle_name",StringType(),True),
StructField("last_name",StringType(),True)])),
StructField("dept_id",IntegerType(),True),])
car_emp_df_1 = spark.createDataFrame(data=car_emp,schema=car_emp_schema)
car_emp_df_1.printSchena()
car_emp_df_1.show(20,False)
"""
Note: Here the dept_id is null for second and fourth row
+--------------+--------------------+-------+
|dept |name |dept_id|
+--------------+--------------------+-------+
|Finance |[Tom, Jack, Mark] |101 |
|Technology |[Barry, Mike, Jacob]|null |
|Human_Resource|[Tim, , More] |301 |
|Administrative|[Edisson, , Kevin] |null |
|Contractors |[Trunp, Mark, Elvin]|501 |
+--------------+--------------------+-------+
"""
car_emp_df_1.withColumn("dept_with_id",concat(car_emp_df_1.dept,car_emp_df_1.dept_id)).show(20,False)
"""
On using concat ,it return null as dept_id is null for second and fourth row. No concatenation happened there.
+--------------+--------------------+-------+-----------------+
|dept |name |dept_id|dept_with_id |
+--------------+--------------------+-------+-----------------+
|Finance |[Tom, Jack, Mark] |101 |Finance101 |
|Technology |[Barry, Mike, Jacob]|null |null |
|Human_Resource|[Tim, , More] |301 |Human_Resource301|
|Administrative|[Edisson, , Kevin] |null |null |
|Contractors |[Trunp, Mark, Elvin]|501 |Contractors501 |
+--------------+--------------------+-------+-----------------+
"""
car_emp_df_1.withColumn("dept_with_id",concat_ws("_",car_emp_df_1.dept,car_emp_df_1.dept_id)).show(20,False)
"""
On using concat_ws ,it return non empty colum values( Without dept_id , it returns dept alone making meaning full)
+--------------+--------------------+-------+------------------+
|dept |name |dept_id|dept_with_id |
+--------------+--------------------+-------+------------------+
|Finance |[Tom, Jack, Mark] |101 |Finance_101 |
|Technology |[Barry, Mike, Jacob]|null |Technology |
|Human_Resource|[Tim, , More] |301 |Human_Resource_301|
|Administrative|[Edisson, , Kevin] |null |Administrative |
|Contractors |[Trunp, Mark, Elvin]|501 |Contractors_501 |
+--------------+--------------------+-------+------------------+
"""
Reference