How to concatenate multiple columns in a Spark dataframe

PySpark @ Freshers.in

concat_ws : With concat_ws () function you can  concatenates multiple input string columns together into a single string column, using the given separator.  You need to include concat_ws from pyspark.sql.functions

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws
spark = SparkSession.builder.appName("Concatenate Sample @ Freshers.in ").getOrCreate()
raw_data =[('01','BMW','1-Series','2021'),
('02','BMW','3-Series','2021'),
('03','BMW','5-Series','2021'),
('04','BMW','7-Series','2021'),
('05','Audi','Q3','2021'),
('06','Audi','Q5','2021'),
('07','Audi','Q7','2021'),
('08','Benz','GLS','2021')]
raw_data_column_name = ["ID","Make","Model","Year"]
df = spark.createDataFrame(data=raw_data, schema = raw_data_column_name)
df.show(truncate=False)
df2=df.select(concat_ws('_',df.ID,df.Make,df.Model).\
alias("concatenated_value"),"ID","Make","Model")
df2.show(truncate=False)

Result : Before and After concat_ws

>>> df.show(truncate=False)
[Stage 6:> (0 + 0) / 1]
+---+----+--------+----+
|ID |Make|Model |Year|
+---+----+--------+----+
|01 |BMW |1-Series|2021|
|02 |BMW |3-Series|2021|
|03 |BMW |5-Series|2021|
|04 |BMW |7-Series|2021|
|05 |Audi|Q3 |2021|
|06 |Audi|Q5 |2021|
|07 |Audi|Q7 |2021|
|08 |Benz|GLS |2021|
+---+----+--------+----+

>>> df2=df.select(concat_ws('_',df.ID,df.Make,df.Model).alias("concatenated_value"),"ID","Make","Model")
>>> df2.show(truncate=False)
+------------------+---+----+--------+
|concatenated_value|ID |Make|Model |
+------------------+---+----+--------+
|01_BMW_1-Series |01 |BMW |1-Series|
|02_BMW_3-Series |02 |BMW |3-Series|
|03_BMW_5-Series |03 |BMW |5-Series|
|04_BMW_7-Series |04 |BMW |7-Series|
|05_Audi_Q3 |05 |Audi|Q3 |
|06_Audi_Q5 |06 |Audi|Q5 |
|07_Audi_Q7 |07 |Audi|Q7 |
|08_Benz_GLS |08 |Benz|GLS |
+------------------+---+----+--------+

 

Author: user

Leave a Reply