MapType in PySpark is a data type used to represent a column of key-value pairs, similar to Python’s built-in dictionary. All keys in a map must share one data type, and all values must share another (possibly the same) data type.
Advantages of MapType in PySpark:
- It allows for a flexible schema, as the number of keys and their values can vary for each row in a DataFrame.
- MapType is particularly useful when working with semi-structured data, where the set of attributes varies from record to record.
Example: Let’s say we have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- hobbies: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
We can create this DataFrame using the following code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("maptype-example").getOrCreate()

data = [("John", 30, {"reading": 3, "traveling": 5}),
        ("Jane", 25, {"cooking": 2, "painting": 4})]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("hobbies", MapType(StringType(), IntegerType()), True)
])

df = spark.createDataFrame(data, schema)
df.show(20, False)
+----+---+------------------------------+
|name|age|hobbies |
+----+---+------------------------------+
|John|30 |[reading -> 3, traveling -> 5]|
|Jane|25 |[painting -> 4, cooking -> 2] |
+----+---+------------------------------+