How to rename columns in a Spark DataFrame with a complex schema using AWS Glue – PySpark

PySpark @ Freshers.in

There can be multiple reasons to rename columns in a Spark DataFrame. Even though withColumnRenamed can be used to rename columns at the root level, it won't work on nested columns. Renaming nested columns can be achieved with the following method.

Example

from pyspark.sql import Row

df = spark.createDataFrame([
    Row(struct_type_A=Row(col_1=99, col_2=88.21),
        struct_type_B=Row(col_3="New York", col_4=True))
])

df.printSchema()

root
|-- struct_type_A: struct (nullable = true)
| |-- col_1: long (nullable = true)
| |-- col_2: double (nullable = true)
|-- struct_type_B: struct (nullable = true)
| |-- col_3: string (nullable = true)
| |-- col_4: boolean (nullable = true)

To change the names of nested columns, create a new schema with StructType() that carries the desired field names, then cast the original struct column to it. The cast matches fields by position, so the field types must line up with the original struct.

from pyspark.sql.types import StructType, StructField, LongType, DoubleType
from pyspark.sql.functions import col

new_struct = StructType([
    StructField("new_Column_1", LongType()),
    StructField("new_Column_2", DoubleType())
])
df_renamed = df.withColumn("struct_type_A", col("struct_type_A").cast(new_struct))
df_renamed.printSchema()
root
|-- struct_type_A: struct (nullable = true)
| |-- new_Column_1: long (nullable = true)
| |-- new_Column_2: double (nullable = true)
|-- struct_type_B: struct (nullable = true)
| |-- col_3: string (nullable = true)
| |-- col_4: boolean (nullable = true)

