GroupedData in PySpark is a powerful tool for data grouping and aggregation, enabling detailed and complex data analysis. Mastering this feature is crucial for data scientists and analysts dealing with large-scale data.
Features and Functions of PySpark GroupedData
Essential Grouping and Aggregation Methods
- Grouping Data: Learn about the groupBy() function and its applications.
- Aggregation Functions: Dive into methods like agg(), count(), max(), mean(), and sum() for summarizing grouped data.
Advanced GroupedData Techniques
- Custom Aggregations: Explore the use of custom aggregation functions for tailored data analysis (a sketch appears after the worked example below).
- Pivot Tables: Understand the creation and utility of pivot tables with GroupedData (also illustrated after the worked example).
Example: PySpark GroupedData
Dataset and Scenario
Suppose we have a dataset of employee records with names, departments, and years of experience. We will use PySpark to analyze this data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Learning @ Freshers.in - GroupedData Example").getOrCreate()

# Sample employee records: name, department, years of experience
data = [("Sachin", "Sales", 5),
("Manju", "Marketing", 3),
("Ram", "Sales", 4),
("Raju", "IT", 6),
("David", "Marketing", 2),
("Freshers_in", "IT", 1),
("Wilson", "Sales", 8)]
columns = ["Name", "Department", "Experience"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Output
+-----------+----------+----------+
| Name|Department|Experience|
+-----------+----------+----------+
| Sachin| Sales| 5|
| Manju| Marketing| 3|
| Ram| Sales| 4|
| Raju| IT| 6|
| David| Marketing| 2|
|Freshers_in| IT| 1|
| Wilson| Sales| 8|
+-----------+----------+----------+
Grouping and Aggregating Data
Group by Department and Calculate Average Experience:
df.groupBy("Department").avg("Experience").show()
Output
+----------+-----------------+
|Department| avg(Experience)|
+----------+-----------------+
| Sales|5.666666666666667|
| Marketing| 2.5|
| IT| 3.5|
+----------+-----------------+
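The same grouping can be expressed with agg(), which also lets you compute several aggregates in one pass and rename the result columns. A minimal sketch on the df above (the alias names are illustrative):
from pyspark.sql import functions as F

df.groupBy("Department").agg(
    F.avg("Experience").alias("avg_exp"),
    F.max("Experience").alias("max_exp"),
    F.sum("Experience").alias("total_exp"),
    F.count("*").alias("employees")
).show()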
Counting Employees in Each Department:
df.groupBy("Department").count().show()
Output
+----------+-----+
|Department|count|
+----------+-----+
| Sales| 3|
| Marketing| 2|
| IT| 2|
+----------+-----+
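The advanced techniques listed earlier can be sketched on the same DataFrame. GroupedData exposes pivot() for pivot tables, and a grouped-aggregate pandas UDF is one common route to custom aggregations. The seniority band and the median metric below are illustrative choices, and the pandas UDF requires pyarrow to be installed:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

# Pivot: average experience per department, split by an illustrative
# seniority band derived from the Experience column.
banded = df.withColumn(
    "Seniority",
    F.when(F.col("Experience") >= 5, "Senior").otherwise("Junior")
)
banded.groupBy("Department").pivot("Seniority").avg("Experience").show()

# Custom aggregation: a grouped-aggregate pandas UDF that computes the
# median experience per department.
@pandas_udf("double")
def median_exp(exp: pd.Series) -> float:
    return float(exp.median())

df.groupBy("Department").agg(median_exp("Experience").alias("median_exp")).show()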
Best Practices and Optimization Techniques
Efficient Use of GroupedData in PySpark
- Filter rows and drop unneeded columns before calling groupBy(): grouping triggers a shuffle, and the less data shuffled, the faster the job (see the sketch below).
- Prefer the built-in aggregation functions in pyspark.sql.functions over plain Python UDFs; built-ins are optimized by Catalyst and avoid Python serialization overhead.
- For heavily skewed grouping keys, consider salting the key or tuning spark.sql.shuffle.partitions so the work is spread more evenly across executors.
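As an illustration of the first tip, filtering before the groupBy means only the surviving rows are shuffled to the aggregation stage. A minimal sketch on the same DataFrame (the 2-year threshold is arbitrary):
from pyspark.sql import functions as F

# Only rows passing the filter are shuffled for grouping,
# instead of the full DataFrame.
df.filter(F.col("Experience") > 2).groupBy("Department").avg("Experience").show()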