The mean function in PySpark computes the average value of a numeric column. It is one of PySpark's aggregate functions and gives a simple way to measure the central tendency of numerical data. This article covers the function's syntax, its benefits, and a practical example.
The function is imported from pyspark.sql.functions and takes a column name or a Column expression:
from pyspark.sql.functions import mean
mean(col)
Advantages of using mean
- Statistical Insights: Provides a quick overview of the central tendency of numeric data.
- Data Reduction: Summarizes large datasets into a single representative value.
- Versatility: Can be used in various contexts, from financial analysis to scientific research.
Example: Analyzing employee salaries
Consider a dataset with the names of employees and their salaries. Our goal is to calculate the average salary.
Dataset
| Name | Salary |
|---|---|
| Sachin | 70000 |
| Ram | 48000 |
| Raju | 54000 |
| David | 62000 |
| Wilson | 58000 |
Objective
Compute the average salary of the employees.
Implementation in PySpark
Setting up the PySpark environment and creating the DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean
# Initialize Spark Session
spark = SparkSession.builder.appName("Mean Example").getOrCreate()
# Sample Data
data = [("Sachin", 70000), ("Ram", 48000), ("Raju", 54000), ("David", 62000), ("Wilson", 58000)]
# Creating DataFrame
df = spark.createDataFrame(data, ["Name", "Salary"])
df.show()
Output
+------+------+
|  Name|Salary|
+------+------+
|Sachin| 70000|
|   Ram| 48000|
|  Raju| 54000|
| David| 62000|
|Wilson| 58000|
+------+------+
Applying the mean function:
# Calculating Mean Salary
mean_salary = df.select(mean("Salary")).collect()[0][0]
print("Average Salary:", mean_salary)
Output
Average Salary: 58400.0
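The result can be cross-checked by hand: the five salaries sum to 292000, and dividing by 5 gives 58400. The sketch below repeats that arithmetic in plain Python, without Spark, using the same salary values as the dataset above; the variable names are illustrative only.

```python
from statistics import fmean

# Same salaries as in the example dataset
salaries = [70000, 48000, 54000, 62000, 58000]

# Mean = sum of values / number of values
total = sum(salaries)
expected_mean = total / len(salaries)

print("Sum of salaries:", total)
print("Expected average:", expected_mean)

# The standard library's fmean should agree with the manual calculation
assert fmean(salaries) == expected_mean
```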