pyspark.sql.functions.array_position
The array_position function finds the position of a given value in an array column. It is a collection function that returns the 1-based position of the first occurrence of the specified value in the specified array, or 0 if the value is not found. If either argument is null, the return value is null.
Here is an example of how to use array_position:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("array_position example").getOrCreate()
# Create a DataFrame with an array column
data = [("Alice", ["apple", "banana", "orange"]),
        ("Bob", ["grape", "apple", "kiwi"]),
        ("Charlie", ["banana", "orange", "mango"])]
df = spark.createDataFrame(data, ["name", "fruits"])
# Use the array_position function to find the position of "apple" in the "fruits" column
from pyspark.sql.functions import array_position
df.select("name", array_position("fruits", "apple").alias("apple_position")).show(20,False)
Result
+-------+--------------+
|   name|apple_position|
+-------+--------------+
|  Alice|             1|
|    Bob|             2|
|Charlie|             0|
+-------+--------------+
Here, we first create a SparkSession and then a DataFrame. The DataFrame has two columns: “name” and “fruits”, where “fruits” is an array column. We then use the array_position function to find the position of the value “apple” in the “fruits” column. The function returns the position of the first occurrence of the value in the array. If the value is not found in the array, it returns 0. In this example, we see that “Alice” has “apple” at position 1, “Bob” has “apple” at position 2 and “Charlie” doesn’t have “apple” in fruits, so the function returns 0.
We also use alias() to rename the new column generated by the array_position function to apple_position.
It’s important to note that the positions are 1-based, not zero-based: the first element is at position 1, the second at position 2, and so on. This is why a result of 0 can signal that the value is absent from the array.
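To make the return rules concrete, here is a minimal plain-Python sketch that models them on ordinary lists. It is only an illustration, not Spark's implementation: the helper name array_position_model is hypothetical, and Python's None stands in for a SQL null.

```python
from typing import Any, List, Optional


def array_position_model(arr: Optional[List[Any]], value: Any) -> Optional[int]:
    """Hypothetical helper that mimics the return rules of array_position."""
    # A null (None) array or a null search value yields null.
    if arr is None or value is None:
        return None
    # Positions are 1-based: the first element is at position 1.
    for position, element in enumerate(arr, start=1):
        if element == value:
            return position  # first occurrence only
    # 0 means the value was not found.
    return 0


# Same data as the DataFrame example, as plain Python lists.
rows = {
    "Alice": ["apple", "banana", "orange"],
    "Bob": ["grape", "apple", "kiwi"],
    "Charlie": ["banana", "orange", "mango"],
}
positions = {name: array_position_model(fruits, "apple") for name, fruits in rows.items()}
print(positions)  # {'Alice': 1, 'Bob': 2, 'Charlie': 0}
```

Because positions start at 1, subtract 1 from a non-zero result if you need a Python-style list index.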