How to add a new column in PySpark using withColumn

user February 16, 2022 Leave a Comment

withColumn

Syntax:
DataFrame.withColumn(column_name, col)

withColumn is comonly used to add a column on an existing dataframe. withColumn returns a new DataFrame by adding a column or replacing the existing column that has the same name. The column expression must be an expression over this DataFrame. Attempting to add a column from some other DataFrame will raise an error. If you attemp to add a column from some other DataFrame will raise an error. withColumn method introduces a projection internally. So if this is called multiple times, in order to add multiple columns can generate big plans which may cause performance issues. So try avoiding calling this function inside a loop, if the loop is large.

#Sample Code
from pyspark.sql import SparkSession
from pyspark.sql.types import MapType
from pyspark.sql.types import StructType,StructField, StringType,IntegerType
from pyspark.sql.functions import current_date,concat_ws,col,lit
emp_data =[
("Finance",("Sam","James","John"),101),
("Technology",("Michael","David","William"),201),
("Human_Resource",("Richard","","John"),301),
("Administrative",("Thomas","Christopher","Daniel"),401),
("Contractors",("Donald","Kenneth","Kevin"),501),
]
emp_data_schema = StructType([
StructField("dept",StringType(),True),
StructField("name",StructType([
StructField("first_name",StringType(),True),
StructField("middle_name",StringType(),True),
StructField("last_name",StringType(),True)])),
StructField("dept_id",IntegerType(),True),])
emp_df_1 = spark.createDataFrame(data=emp_data,schema=emp_data_schema)
emp_df_1.printSchema()
"""
root
 |-- dept: string (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- middle_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- dept_id: integer (nullable = true)
"""
emp_df_1.show(20,False)
"""
+--------------+-----------------------------+-------+
|dept          |name                         |dept_id|
+--------------+-----------------------------+-------+
|Finance       |[Sam, James, John]           |101    |
|Technology    |[Michael, David, William]    |201    |
|Human_Resource|[Richard, , John]            |301    |
|Administrative|[Thomas, Christopher, Daniel]|401    |
|Contractors   |[Donald, Kenneth, Kevin]     |501    |
+--------------+-----------------------------+-------+
"""
#Adding date to a column and concatinating first name and last name 
emp_df_2 = emp_df_1.withColumn("processed_date",current_date())\
				   .withColumn("emp_full_name",concat_ws(" ",emp_df_1.name.first_name,emp_df_1.name.last_name))
emp_df_2.show(20,False)
"""
+--------------+-----------------------------+-------+--------------+---------------+
|dept          |name                         |dept_id|processed_date|emp_full_name  |
+--------------+-----------------------------+-------+--------------+---------------+
|Finance       |[Sam, James, John]           |101    |2022-02-16    |Sam John       |
|Technology    |[Michael, David, William]    |201    |2022-02-16    |Michael William|
|Human_Resource|[Richard, , John]            |301    |2022-02-16    |Richard John   |
|Administrative|[Thomas, Christopher, Daniel]|401    |2022-02-16    |Thomas Daniel  |
|Contractors   |[Donald, Kenneth, Kevin]     |501    |2022-02-16    |Donald Kevin   |
+--------------+-----------------------------+-------+--------------+---------------+
"""
#Concatinating two column values with a seperator and adding constant value to a new column
emp_df_3 = emp_df_2.withColumn("dept_with_id",concat_ws("_",emp_df_2.dept,emp_df_2.dept_id)).withColumn("batch_no",lit("batch_001"))
emp_df_3.show(20,False)
"""
+--------------+-----------------------------+-------+--------------+---------------+------------------+---------+
|dept          |name                         |dept_id|processed_date|emp_full_name  |dept_with_id      |batch_no |
+--------------+-----------------------------+-------+--------------+---------------+------------------+---------+
|Finance       |[Sam, James, John]           |101    |2022-02-16    |Sam John       |Finance_101       |batch_001|
|Technology    |[Michael, David, William]    |201    |2022-02-16    |Michael William|Technology_201    |batch_001|
|Human_Resource|[Richard, , John]            |301    |2022-02-16    |Richard John   |Human_Resource_301|batch_001|
|Administrative|[Thomas, Christopher, Daniel]|401    |2022-02-16    |Thomas Daniel  |Administrative_401|batch_001|
|Contractors   |[Donald, Kenneth, Kevin]     |501    |2022-02-16    |Donald Kevin   |Contractors_501   |batch_001|
+--------------+-----------------------------+-------+--------------+---------------+------------------+---------+
"""

Reference

Post Views: 23

Author: user

How to add a new column in PySpark using withColumn

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget

Electronics and Instrumentation

Chemical Engineering

Civil Engineering

Backpressure in AWS Kinesis Streams: Optimizing Data Processing

Troubleshooting Data Ingestion and Processing Issues with AWS Kinesis Streams

Impact of Shard Count Modification on AWS Kinesis Streams

How to map values of a Series according to an input correspondence:SSeries.map()

Understanding Series.transform(func[, axis])

Series.aggregate(func) : Pandas API on Spark

Series.agg(func) : Pandas API on Spark

Most Viewed Posts

Related Posts

Related Articles

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget