pyspark.sql.DataFrame.groupBy
PySpark groupby functions groups the DataFrame using the specified columns to run aggregation ( count,sum,mean, min, max) on them. We can use groupby for single column and multiple column. There is no partial aggregation with group aggregate UDFs, i.e., a full shuffle is required. Also, all the data of a group will be loaded into memory, so the user should be aware of the potential Out of memory risk if data is skewed
and certain groups are too large to fit in memory.
Sample code with hardcoded data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark groupby @ Freshers.in ").getOrCreate()
emp_schema= ["name","id","age","gender","dept","state","salary","increment"]
emp_data = [("Sam",1010,30,"Male","IT","CT",125000,8.1),
("Peter",1020,32,"Male","HR","NY",90000,7.0),
("Tom",1030,34,"Male","IT","CT",85000,5.0),
("Joe",1040,55,"Male","HR","GA",60000,4.0),
("Barry",1050,34,"Male","IT","OR",95000,6.0),
("Mike",1060,42,"Male","SALES","OR",70000,4.0),
("Suzy",1070,34,"Female","IT","GA",99000,6.0),
("Annie",1080,54,"Female","SALES","GA",80000,3.0),
("Bush",1090,34,"Male","IT","AL",85000,5.0),
("Evan",1100,44,"Female","CRM","GA",75000,8.0)]
df = spark.createDataFrame(data=emp_data,schema=emp_schema)
# PySpark groupby with sum
grp_by_dept = df.groupby('dept').agg({'salary' :'sum'})
grp_by_dept_sum.show()
+-----+-----------+
| dept|sum(salary)|
+-----+-----------+
| CRM| 75000|
|SALES| 150000|
| IT| 489000|
| HR| 150000|
+-----+-----------+
# PySpark groupby with mean
grp_by_dept_mean = df.groupby('dept').agg({'salary' :'mean'}).show()
+-----+-----------+
| dept|avg(salary)|
+-----+-----------+
| CRM| 75000.0|
|SALES| 75000.0|
| IT| 97800.0|
| HR| 75000.0|
+-----+-----------+
# PySpark groupby with min
grp_by_dept_mean = df.groupby('dept').agg({'salary' :'min'}).show()
+-----+-----------+
| dept|min(salary)|
+-----+-----------+
| CRM| 75000|
|SALES| 70000|
| IT| 85000|
| HR| 60000|
+-----+-----------+
# PySpark groupby with max
grp_by_dept_mean = df.groupby('dept').agg({'salary' :'max'}).show()
+-----+-----------+
| dept|max(salary)|
+-----+-----------+
| CRM| 75000|
|SALES| 80000|
| IT| 125000|
| HR| 90000|
+-----+-----------+
# PySpark groupby with count
grp_by_dept_mean = df.groupby('dept',).agg({'*' :'count'}).show()
+-----+--------+
| dept|count(1)|
+-----+--------+
| CRM| 1|
|SALES| 2|
| IT| 5|
| HR| 2|
+-----+--------+
# PySpark groupby with multiple column
grp_by_dept_mean = df.groupby('dept','gender').agg({'salary' :'mean'}).show()
+-----+------+-----------+
| dept|gender|avg(salary)|
+-----+------+-----------+
| CRM|Female| 75000.0|
|SALES| Male| 70000.0|
| IT| Male| 97500.0|
|SALES|Female| 80000.0|
| IT|Female| 99000.0|
| HR| Male| 75000.0|
+-----+------+-----------+
Reference.