There are three complex data types in PySpark: (1) ArrayType, (2) MapType, and (3) StructType.
ArrayType
ArrayType represents a sequence of elements, each of the given elementType. The containsNull flag indicates whether elements of an ArrayType value may be null. In Python, an ArrayType column accepts a list, tuple, or array as its value.
Function : pyspark.sql.types.ArrayType(elementType, containsNull=True)
Example for ArrayType
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType,IntegerType
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()
car_data = [
(1,"Japan",["Datsun","Honda","Infiniti","Isuzu","Lexus"],1001),
(2,"Italy",["Ferrari","Lamborghini","Maserati"],2001),
(3,"France",["Bugatti","Citroen","Peugeot","Renault","Alpine"],3001),
(4,"South Korea",["Daewoo","Hyundai","KIA","SsangYong"],4001),
(5,"United States",["Cadillac","Chevrolet","Chrysler","Dodge","Fisker"],5001),
(6,"China",None,5001)
]
car_data_schema = StructType([
StructField("si_no",IntegerType(),True),
StructField("country_origin",StringType(),True),
StructField("car_make",ArrayType(StringType()),True),
StructField("ledger_no",IntegerType(),True)])
car_df = spark.createDataFrame(data=car_data,schema=car_data_schema)
car_df.printSchema()
car_df.show()
Output
root
 |-- si_no: integer (nullable = true)
 |-- country_origin: string (nullable = true)
 |-- car_make: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ledger_no: integer (nullable = true)

>>> car_df.show()
+-----+--------------+--------------------+---------+
|si_no|country_origin|            car_make|ledger_no|
+-----+--------------+--------------------+---------+
|    1|         Japan|[Datsun, Honda, I...|     1001|
|    2|         Italy|[Ferrari, Lamborg...|     2001|
|    3|        France|[Bugatti, Citroen...|     3001|
|    4|   South Korea|[Daewoo, Hyundai,...|     4001|
|    5| United States|[Cadillac, Chevro...|     5001|
|    6|         China|                null|     5001|
+-----+--------------+--------------------+---------+
MapType
MapType is well suited to key-value pairs of arbitrary length. A MapType object comprises three fields: keyType, valueType, and valueContainsNull.
Function : pyspark.sql.types.MapType(keyType, valueType, valueContainsNull=True)
keyType: DataType (DataType of the keys in the map)
valueType: DataType (DataType of the values in the map)
valueContainsNull: bool, optional (indicates whether values can contain null (None) values)
Example for MapType
from pyspark.sql import SparkSession
from pyspark.sql.types import MapType
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()
student_data =[
("class10",{"name":"Sam","age":"14"},101),
("class11",{"name":"Jack","age":"15"},102),
("class12",{"name":"Jim","age":"16"},103),
("class11",{"name":"Nancy","age":"15"},102),
("class11",{"name":"Houdy","age":"15"},102),]
student_data_schema = StructType([
StructField("stu_class",StringType(),True),
StructField("students",MapType(StringType(),StringType()),True),
StructField("class_id",IntegerType(),True),])
student_df = spark.createDataFrame(data=student_data,schema=student_data_schema)
student_df.printSchema()
student_df.show(20,False)
Output
root
 |-- stu_class: string (nullable = true)
 |-- students: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- class_id: integer (nullable = true)

>>> student_df.show(20,False)
+---------+--------------------------+--------+
|stu_class|students                  |class_id|
+---------+--------------------------+--------+
|class10  |[name -> Sam, age -> 14]  |101     |
|class11  |[name -> Jack, age -> 15] |102     |
|class12  |[name -> Jim, age -> 16]  |103     |
|class11  |[name -> Nancy, age -> 15]|102     |
|class11  |[name -> Houdy, age -> 15]|102     |
+---------+--------------------------+--------+
StructType
StructType is a struct type consisting of a list of StructFields; it is the data type that represents a Row. A StructType iterates over its StructFields, and a contained StructField can be accessed by name or by position. The StructType and StructField classes are used to specify a schema programmatically, which makes it possible to build complex columns (nested struct, array, and map columns). A StructField defines a column's name, its data type, and a flag for whether it is nullable.
Example for StructType
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()
student_data =[
("class10",("Sam","James","John"),15),
("class11",("Michael","David","William"),16),
("class12",("Richard","","John"),17),
("class10",("Thomas","Christopher","Daniel"),15),
("class10",("Donald","Kenneth","Kevin"),15),
]
student_data_schema = StructType([
StructField("stu_class",StringType(),True),
StructField("name",StructType([
StructField("first_name",StringType(),True),
StructField("middle_name",StringType(),True),
StructField("last_name",StringType(),True)])),
StructField("age",IntegerType(),True),])
student_df = spark.createDataFrame(data=student_data,schema=student_data_schema)
student_df.printSchema()
student_df.show(20,False)
Output
root
 |-- stu_class: string (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- middle_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- age: integer (nullable = true)

>>> student_df.show(20,False)
+---------+-----------------------------+---+
|stu_class|name                         |age|
+---------+-----------------------------+---+
|class10  |[Sam, James, John]           |15 |
|class11  |[Michael, David, William]    |16 |
|class12  |[Richard, , John]            |17 |
|class10  |[Thomas, Christopher, Daniel]|15 |
|class10  |[Donald, Kenneth, Kevin]     |15 |
+---------+-----------------------------+---+