Spark : Calculate the number of unique elements in a column using PySpark

user February 8, 2023

pyspark.sql.functions.countDistinct

In PySpark, the countDistinct function calculates the number of unique elements in a column, also known as the number of distinct values. By contrast, DataFrame distinct() removes duplicate rows and returns a new DataFrame, with distinctness evaluated across all columns. To obtain the distinct count for a selection of one or more columns, use the PySpark SQL function countDistinct(). The result of this function is the number of unique items in a group.

Here is an example of how to use the countDistinct function in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct
spark = SparkSession.builder.appName('Freshers.in countDistinct Learning').getOrCreate()
# Sample rows of (Name, Dept); the Dept column holds three different values
data = [("John", "Finance"), ("Jane", "IT"), ("Jim", "Finance"), ("Wilson John", "Travel"), ("Mike", "Travel")]
columns = ["Name", "Dept"]
df = spark.createDataFrame(data=data, schema=columns)
# Using countDistinct() to count the unique values in the Dept column
df.select(countDistinct("Dept")).show()

Output

+--------------------+
|count(DISTINCT Dept)|
+--------------------+
|                   3|
+--------------------+

countDistinct returns a new Column for the distinct count of col or cols.

