In PySpark, the map_values function extracts the values from a map column and returns them as an array. It is the DataFrame-level analogue of calling .values() on a Python dictionary, applied row by row to a single column.
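For intuition, this is the plain-Python behaviour being mirrored (the names and numbers are just the sample data used later in this post):
# Plain Python: .values() on a dict returns the values only, keys are dropped
ages = {"Sachin": 10, "India": 20}
print(list(ages.values()))  # [10, 20]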
Use map_values for:
Value Analysis: To understand the distribution or characteristics of the values stored in a map column.
Data Transformation: Before reshaping the extracted values into distinct columns or rows (see the explode sketch after the output below).
Filtering Data: To keep or drop rows based on the presence or absence of specific values in a map column (see the filtering sketch after the output below).
Advantages of map_values:
Performance: Thanks to Spark’s distributed execution, map_values scales to very large datasets.
Intuitive: It states the intent of the code explicitly, improving readability.
Flexibility: Seamless integration with other DataFrame operations allows for comprehensive data processing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import map_values
# Setting up Spark Session
spark = SparkSession.builder.appName("map_values_demo Learning @ Freshers.in").getOrCreate()
# Crafting a DataFrame with a map column
data = [
    (1, {"Sachin": 10, "India": 20}),
    (2, {"Ramesh": 30, "USA": 40}),
    (3, {"Raju": 50, "Ireland": 60}),
]
df = spark.createDataFrame(data, ["id", "country"])
df.show(20, False)
# Deploying map_values to extract the values from the map column
df_values = df.select("id", map_values(df["country"]).alias("age"))
df_values.show(20, False)
Output
+---+---------------------------+
|id |country |
+---+---------------------------+
|1 |{India -> 20, Sachin -> 10}|
|2 |{USA -> 40, Ramesh -> 30} |
|3 |{Raju -> 50, Ireland -> 60}|
+---+---------------------------+
+---+--------+
|id |age |
+---+--------+
|1 |[20, 10]|
|2 |[40, 30]|
|3 |[50, 60]|
+---+--------+
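As promised in the use-case list, here is a minimal filtering sketch against the same df. It combines map_values with array_contains (both from pyspark.sql.functions); the value 30 is just an arbitrary example to match against.
from pyspark.sql.functions import array_contains
# Keep only rows whose map contains the value 30 somewhere among its values
df_filtered = df.filter(array_contains(map_values(df["country"]), 30))
df_filtered.show(20, False)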
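And for the data-transformation use case, a common next step is to reshape the extracted array into one row per value with explode. A sketch, again assuming the df built above:
from pyspark.sql.functions import explode
# Explode the array produced by map_values into one row per value
df_exploded = df.select("id", explode(map_values(df["country"])).alias("value"))
df_exploded.show(20, False)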