The ntile function in PySpark is used for dividing an ordered dataset into a specified number of approximately equal segments, or “tiles”. It’s particularly useful in scenarios involving percentile calculations, data stratification, or when dividing a dataset into quantiles.The ntile function in PySpark is invaluable for data analysts and scientists looking to segment data effectively. This article aims to demystify the ntile function with a comprehensive guide, bolstered by a practical example.
Syntax:
from pyspark.sql.window import Window
from pyspark.sql.functions import ntile
windowSpec = Window.orderBy("column_to_order")
df.withColumn("tile_column", ntile(number_of_tiles).over(windowSpec))
Let’s consider an example where we have a dataset of employees with their respective salaries. We aim to segment this data into 4 quartiles based on their salary.
Sample data
Suppose we have the following data in a DataFrame named employee_df
:
Name | Salary |
---|---|
Sachin | 70000 |
Manju | 80000 |
Ram | 55000 |
Raju | 65000 |
David | 72000 |
Wilson | 60000 |
Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import ntile
from pyspark.sql.window import Window
from pyspark.sql.types import *
# Initialize Spark Session
spark = SparkSession.builder.appName("NtileExample").getOrCreate()
# Sample data
data = [("Sachin", 70000),
("Manju", 80000),
("Ram", 55000),
("Raju", 65000),
("David", 72000),
("Wilson", 60000)]
# Define schema
schema = StructType([
StructField("Name", StringType(), True),
StructField("Salary", IntegerType(), True)
])
# Create DataFrame
employee_df = spark.createDataFrame(data, schema)
# Define Window Specification
windowSpec = Window.orderBy(employee_df["Salary"])
# Apply ntile function
employee_df_with_quartiles = employee_df.withColumn("Quartile", ntile(4).over(windowSpec))
# Show results
employee_df_with_quartiles.show()
Output
+------+------+--------+
| Name|Salary|Quartile|
+------+------+--------+
| Ram| 55000| 1|
|Wilson| 60000| 1|
| Raju| 65000| 2|
|Sachin| 70000| 2|
| David| 72000| 3|
| Manju| 80000| 4|
+------+------+--------+
The output will display the original data along with a new column, Quartile. This column indicates the quartile to which each employee belongs based on their salary, effectively dividing the dataset into four segments.
Spark important urls to refer