PySpark : Concatenating elements of an array into a single string

PySpark @ Freshers.in

pyspark.sql.functions.array_join

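PySpark’s array_join function concatenates the elements of an array into a single string, with the elements separated by a specified delimiter. The function takes two required arguments, the array to be concatenated and the delimiter to use, plus an optional third argument, nullReplacement, which is substituted for null elements. If nullReplacement is not given, null elements are skipped.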

Syntax
array_join(array, delimiter [, nullReplacement])
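To make the role of the three arguments concrete, here is a minimal plain-Python sketch of what array_join computes for a single row. The helper name array_join_model is hypothetical and is not part of PySpark; it only models the behaviour that null elements are skipped unless a nullReplacement is supplied.

```python
from typing import List, Optional


def array_join_model(
    values: Optional[List[Optional[str]]],
    delimiter: str,
    null_replacement: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical plain-Python model of array_join for one row (not PySpark code)."""
    if values is None:
        # A null array yields a null result.
        return None
    parts = []
    for value in values:
        if value is None:
            # Null elements are dropped unless a replacement string is supplied.
            if null_replacement is not None:
                parts.append(null_replacement)
        else:
            parts.append(value)
    return delimiter.join(parts)


print(array_join_model(["apple", None, "orange"], ","))        # apple,orange
print(array_join_model(["apple", None, "orange"], ",", "NA"))  # apple,NA,orange
```

This is only an aid to intuition; in Spark the work is done by the built-in function, not by Python code running per row.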

Here is an example of how to use the array_join function in PySpark:

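from pyspark.sql import SparkSession
from pyspark.sql.functions import array_join

# Create (or reuse) a SparkSession; in the pyspark shell, "spark" already exists
spark = SparkSession.builder.getOrCreate()

# Create a sample dataframe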
data = [("John", ["apple", "banana", "orange"]), ("Jane", ["grapes", "pineapple", "kiwi"])]
df = spark.createDataFrame(data, ["name", "fruits"])

# Use the array_join function to concatenate the elements of the "fruits" column into a single string
df = df.withColumn("fruits_list", array_join("fruits", ","))

# Show the result
df.show(20, False)

This will output:

+----+-------------------------+---------------------+
|name|fruits                   |fruits_list          |
+----+-------------------------+---------------------+
|John|[apple, banana, orange]  |apple,banana,orange  |
|Jane|[grapes, pineapple, kiwi]|grapes,pineapple,kiwi|
+----+-------------------------+---------------------+

In this example, the array_join function concatenates the elements of the “fruits” column, which is an array of strings, into a single string. The delimiter is a comma. The result is stored in a new column named “fruits_list”.

You can also call array_join inside a SQL expression with selectExpr, like this:

df.selectExpr("name", "array_join(fruits, ',') as fruits_list").show(20, False)

This will output:

+----+---------------------+
|name|fruits_list          |
+----+---------------------+
|John|apple,banana,orange  |
|Jane|grapes,pineapple,kiwi|
+----+---------------------+

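This produces the same fruits_list values as the previous example. The difference is that array_join is written inside a SQL expression string instead of being called as a Python function, and only the columns listed in selectExpr are returned.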

It’s important to note that the array_join function expects a column of type array (typically an array of strings), and the resulting column is always of type string. The delimiter passed to the function must be a string, as must the optional nullReplacement.

Author: user