MapType in PySpark is a data type used to represent a column of key-value pairs, similar to Python’s built-in dictionary. All keys in a map must share one data type, and all values must share another (possibly the same) data type.
Advantages of MapType in PySpark:
- It allows for a flexible schema, as the number of keys and their values can vary for each row in a DataFrame.
- MapType is particularly useful when working with semi-structured data, where the set of attributes varies from record to record.
Example: Let’s say we have a DataFrame with the following schema:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- hobbies: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)
We can create this DataFrame using the following code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("maptype-example").getOrCreate()

data = [("John", 30, {"reading": 3, "traveling": 5}),
        ("Jane", 25, {"cooking": 2, "painting": 4})]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("hobbies", MapType(StringType(), IntegerType()), True)
])

df = spark.createDataFrame(data, schema)
df.show(20, False)
+----+---+------------------------------+
|name|age|hobbies |
+----+---+------------------------------+
|John|30 |[reading -> 3, traveling -> 5]|
|Jane|25 |[painting -> 4, cooking -> 2] |
+----+---+------------------------------+