pyspark.sql.functions.countDistinct
In PySpark, the countDistinct function returns the number of distinct (unique) values in a column. It is different from DataFrame distinct(), which removes duplicate rows (comparing all columns) and returns a new DataFrame rather than a count. To get the distinct count for a selection of multiple columns, pass those columns to the PySpark SQL function countDistinct(). Because countDistinct is an aggregate function, its result is the number of unique items in a group.
Here is an example of how to use the countDistinct function in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct
spark = SparkSession.builder.appName('Freshers.in countDistinct Learning').getOrCreate()
data = [("John", "Finance"), ("Jane", "IT"), ("Jim", "Finance"), ("Wilson John", "Travel"), ("Mike", "Travel")]
columns = ["Name","Dept"]
df = spark.createDataFrame(data=data,schema=columns)
# Using countDistinct() to count the unique values in the Dept column
df.select(countDistinct("Dept")).show()
+--------------------+
|count(DISTINCT Dept)|
+--------------------+
|                   3|
+--------------------+