PySpark : Function to perform simple column transformations [expr]


pyspark.sql.functions.expr

expr is a function in the pyspark.sql.functions module that builds a column expression from a SQL expression string. These expressions can be used to transform columns, derive new columns from existing ones, and perform various other operations on Spark dataframes.

One of the most common uses for expr is to perform simple column transformations. For example, you can use expr to convert a string column to a numeric column with a SQL cast expression. Here is an example:

from pyspark.sql.functions import expr
df = spark.createDataFrame([(1, "100"), (2, "200"), (3, "300")], ["id", "value"])
df.printSchema()
root
 |-- id: long (nullable = true)
 |-- value: string (nullable = true)

Use expr

df = df.withColumn("value", expr("cast(value as int)"))
df.printSchema()
root
 |-- id: long (nullable = true)
 |-- value: integer (nullable = true)

In this example, we create a Spark dataframe with two columns, id and value. The value column is a string column, but we want to convert it to a numeric column. To do this, we use the expr function to create a column expression that casts the value column as an integer. The result is a new Spark dataframe with the value column converted to a numeric column.
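
For comparison, the same cast can also be written with the Column API (col().cast()), which produces the same result as the expr version above. A minimal sketch:

from pyspark.sql.functions import col
# Equivalent cast using the Column API instead of a SQL expression string
df = df.withColumn("value", col("value").cast("int"))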

Another common use for expr is to perform operations on columns. For example, you can use expr to create a new column that is the result of a calculation involving multiple columns. Here is an example:

from pyspark.sql.functions import expr
df = spark.createDataFrame([(1, 100, 10), (2, 200, 20), (3, 300, 30)], ["id", "value1", "value2"])
df = df.withColumn("sum", expr("value1 + value2"))
df.show()

Result

+---+------+------+---+
| id|value1|value2|sum|
+---+------+------+---+
|  1|   100|    10|110|
|  2|   200|    20|220|
|  3|   300|    30|330|
+---+------+------+---+

In this example, we create a Spark dataframe with three columns, id, value1, and value2. We use the expr function to create a new column, sum, that is the result of adding value1 and value2. The result is a new Spark dataframe with the sum column containing the result of the calculation.

Because expr accepts any Spark SQL expression, you can also use built-in SQL functions inside it. For example, coalesce selects the first non-null value from a set of columns, ifnull returns a specified value when a column is null, and a CASE WHEN expression performs conditional logic on columns.
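
Here is a minimal sketch of these SQL constructs used inside expr, assuming a hypothetical dataframe with nullable columns (the column names and sample values are for illustration only):

from pyspark.sql.functions import expr
df = spark.createDataFrame(
    [(1, None, 50), (2, 200, None), (3, None, None)],
    ["id", "value1", "value2"],
)
# coalesce: first non-null value; ifnull: default when null; CASE WHEN: conditional logic
df = (
    df.withColumn("first_non_null", expr("coalesce(value1, value2, 0)"))
      .withColumn("value1_or_default", expr("ifnull(value1, -1)"))
      .withColumn("size_label", expr("CASE WHEN coalesce(value1, 0) > 150 THEN 'big' ELSE 'small' END"))
)
df.show()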

In conclusion, the expr function in PySpark provides a convenient and flexible way to perform operations on Spark dataframes. Whether you want to transform columns, calculate new columns, or perform other operations, expr gives you the tools to do so with concise SQL expressions.
