pyspark.sql.functions.map_from_entries
map_from_entries(col) is a collection function in PySpark that creates a map column from a column containing an array of two-field structs: the first field of each struct becomes a key and the second field becomes the corresponding value.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, map_from_entries

spark = SparkSession.builder.appName("map_from_entries_example").getOrCreate()

# Each "person_map" value is an array of 2-tuples; map_from_entries turns the
# first tuple field into the key and the second into the value. Ages are kept
# as strings so every array element shares one struct type during inference.
df2 = spark.createDataFrame([
    (1, "John", 25000, [("name", "John"), ("age", "25")]),
    (2, "Mike", 30000, [("name", "Mike"), ("age", "30")]),
    (3, "Sophia", 35000, [("name", "Sophia"), ("age", "35")])
], ["id", "name", "salary", "person_map"])

df2 = df2.select("id", "name", "salary", map_from_entries("person_map").alias("map_col"))
df2.show(truncate=False)
In this example, we first import the necessary functions and create a SparkSession. We then create a DataFrame with a column called "person_map" that holds an array of two-field structs; the first field of each struct supplies the key and the second supplies the value.
We then apply map_from_entries() to "person_map" to build the map, using alias() to name the new column "map_col".
The final DataFrame keeps "id", "name", and "salary" and replaces "person_map" with "map_col", where "map_col" contains a map created from the structs in "person_map".
For reference, the schema will be:
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- salary: long (nullable = true)
|-- map_col: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Result
+---+------+------+---------------------------+
|id |name |salary|map_col |
+---+------+------+---------------------------+
|1 |John |25000 |[name -> John, age -> 25] |
|2 |Mike |30000 |[name -> Mike, age -> 30] |
|3 |Sophia|35000 |[name -> Sophia, age -> 35]|
+---+------+------+---------------------------+
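Once the map column exists, individual values can be looked up by key with bracket notation on the column. A minimal sketch against the df2 DataFrame built above:
# Look up the "name" key in each row's map; a missing key would yield null.
df2.select(col("map_col")["name"]).show()
This produces a column named map_col[name]:
+-------------+
|map_col[name]|
+-------------+
|         John|
|         Mike|
|       Sophia|
+-------------+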
In PySpark, creating a map column from entries lets you convert an existing array-of-structs column into a map, where each struct in the array becomes a key-value pair in that row's map. This can make the data more readable and easier to organize, and the resulting map column supports operations such as filtering, aggregation, and joining.
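For instance, the map column can be filtered on a value or expanded back into rows. A minimal sketch using the df2 DataFrame from above; the cast() is needed only because this example stores ages as strings:
from pyspark.sql.functions import explode, map_keys

# Filter rows by a value stored in the map (ages are strings here, so cast first).
df2.filter(col("map_col")["age"].cast("int") >= 30).show(truncate=False)

# Explode the map into one (key, value) row per entry.
df2.select("id", explode("map_col").alias("key", "value")).show()

# Inspect which keys each row's map contains.
df2.select("id", map_keys("map_col").alias("keys")).show(truncate=False)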