# PySpark: How to compute the cumulative distribution of a column in a DataFrame

## pyspark.sql.functions.cume_dist

The cumulative distribution is a method used in probability and statistics to determine the distribution of a random variable, X, at any given point. The cumulative distribution function (CDF) of X, denoted by F(x), is defined as the probability that X will take a value less than or equal to x.
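As a quick illustration outside Spark (a plain-Python sketch, not part of the PySpark API), the empirical CDF at a point x is simply the fraction of observations less than or equal to x:

```python
def empirical_cdf(values, x):
    # F(x) = (number of observations <= x) / (total observations)
    return sum(1 for v in values if v <= x) / len(values)

ages = [20, 25, 30, 35, 40]
print(empirical_cdf(ages, 30))  # 3 of 5 ages are <= 30, so 0.6
```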

In PySpark, the cume_dist window function computes the cumulative distribution of a column's values: for each row, it returns the fraction of rows whose value is less than or equal to the current row's value, with respect to the ordering specified in the window's sort order.

Here’s an example to demonstrate the usage of the cume_dist function in PySpark:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import cume_dist

# Initialize Spark session
spark = SparkSession.builder.appName("CumeDistExample").getOrCreate()

# Create a DataFrame with sample data
data = [("Mike Jack", 30), ("King Elene", 40), ("Barry Tim", 25),
        ("Yang Jakie", 35), ("Joby John", 20)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Sort the DataFrame by Age in ascending order
df = df.sort("Age")

# Compute the cumulative distribution of the Age column
df.selectExpr("cume_dist() over (order by Age) as cum_dist").show()
```

Output

```
+--------+
|cum_dist|
+--------+
|     0.2|
|     0.4|
|     0.6|
|     0.8|
|     1.0|
+--------+
```

In this example, the cumulative distribution of the Age column is calculated with respect to the ascending order of the column. The first row has a cumulative distribution of 0.2, meaning 20% of the values are less than or equal to it, and the last row has a cumulative distribution of 1.0, indicating that 100% of the values are less than or equal to the largest Age.
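Since cume_dist for a row equals the count of rows with a value less than or equal to the current row's value, divided by the total row count, the output above can be cross-checked in plain Python (a sketch outside Spark, using the same Age values):

```python
# cume_dist(row) = count(values <= current value) / total count
ages = [20, 25, 30, 35, 40]  # the Age column, sorted ascending
cum_dist = [sum(1 for v in ages if v <= a) / len(ages) for a in ages]
print(cum_dist)  # [0.2, 0.4, 0.6, 0.8, 1.0]
```

This matches the values produced by `cume_dist() over (order by Age)` in the example above.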
