PySpark – groupby with aggregation (count, sum, mean, min, max)

user August 25, 2021 Leave a Comment

pyspark.sql.DataFrame.groupBy

PySpark groupby functions groups the DataFrame using the specified columns to run aggregation ( count,sum,mean, min, max) on them. We can use groupby for single column and multiple column. There is no partial aggregation with group aggregate UDFs, i.e., a full shuffle is required. Also, all the data of a group will be loaded into memory, so the user should be aware of the potential Out of memory risk if data is skewed
and certain groups are too large to fit in memory.

Sample code with hardcoded data

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark groupby @ Freshers.in ").getOrCreate()
emp_schema= ["name","id","age","gender","dept","state","salary","increment"]
emp_data = [("Sam",1010,30,"Male","IT","CT",125000,8.1),
("Peter",1020,32,"Male","HR","NY",90000,7.0),
("Tom",1030,34,"Male","IT","CT",85000,5.0),
("Joe",1040,55,"Male","HR","GA",60000,4.0),
("Barry",1050,34,"Male","IT","OR",95000,6.0),
("Mike",1060,42,"Male","SALES","OR",70000,4.0),
("Suzy",1070,34,"Female","IT","GA",99000,6.0),
("Annie",1080,54,"Female","SALES","GA",80000,3.0),
("Bush",1090,34,"Male","IT","AL",85000,5.0),
("Evan",1100,44,"Female","CRM","GA",75000,8.0)]
df = spark.createDataFrame(data=emp_data,schema=emp_schema)
# PySpark groupby with sum
grp_by_dept = df.groupby('dept').agg({'salary' :'sum'})
grp_by_dept_sum.show()
+-----+-----------+
| dept|sum(salary)|
+-----+-----------+
|  CRM|      75000|
|SALES|     150000|
|   IT|     489000|
|   HR|     150000|
+-----+-----------+
# PySpark groupby with mean
grp_by_dept_mean = df.groupby('dept').agg({'salary' :'mean'}).show()
+-----+-----------+
| dept|avg(salary)|
+-----+-----------+
|  CRM|    75000.0|
|SALES|    75000.0|
|   IT|    97800.0|
|   HR|    75000.0|
+-----+-----------+
# PySpark groupby with min
grp_by_dept_mean = df.groupby('dept').agg({'salary' :'min'}).show()
+-----+-----------+
| dept|min(salary)|
+-----+-----------+
|  CRM|      75000|
|SALES|      70000|
|   IT|      85000|
|   HR|      60000|
+-----+-----------+
# PySpark groupby with max
grp_by_dept_mean = df.groupby('dept').agg({'salary' :'max'}).show()
+-----+-----------+
| dept|max(salary)|
+-----+-----------+
|  CRM|      75000|
|SALES|      80000|
|   IT|     125000|
|   HR|      90000|
+-----+-----------+
# PySpark groupby with count
grp_by_dept_mean = df.groupby('dept',).agg({'*' :'count'}).show()
+-----+--------+
| dept|count(1)|
+-----+--------+
|  CRM|       1|
|SALES|       2|
|   IT|       5|
|   HR|       2|
+-----+--------+
# PySpark groupby with multiple column
grp_by_dept_mean = df.groupby('dept','gender').agg({'salary' :'mean'}).show()
+-----+------+-----------+
| dept|gender|avg(salary)|
+-----+------+-----------+
|  CRM|Female|    75000.0|
|SALES|  Male|    70000.0|
|   IT|  Male|    97500.0|
|SALES|Female|    80000.0|
|   IT|Female|    99000.0|
|   HR|  Male|    75000.0|
+-----+------+-----------+

Reference.

Post Views: 1,923

Author: user

PySpark – groupby with aggregation (count, sum, mean, min, max)

pyspark.sql.DataFrame.groupBy

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget

How PARTITION BY Works in Snowflake, and SQL in general

Stash a specific file using Git

Prevent your computer from locking : Python to simulate mouse movements

AWS EC2 vs Azure Virtual Machines

Production and Industrial Engineering

Engineering Technical campus placement question and answers

JavaScript’s reduceRight() method to iterate over an array from right to left

Merging Multiple Images into a Single PDF File Using Python

Nanotechnology

Electronics and Instrumentation

Most Viewed Posts

pyspark.sql.DataFrame.groupBy

Related Posts

Related Articles

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget