PySpark can convert decimal values to integers, which is useful when a calculation or analysis does not need the fractional part of a number. The conversion also handles missing data predictably: a NULL input produces a NULL output, so missing values stay missing instead of being replaced by a default.
This article walks through converting decimal values to integers in PySpark, step by step.
PySpark’s Integer Casting Function
The conversion from decimal to integer is done with cast, a method on a DataFrame column that changes the column's data type to another type. In our case, we are changing a decimal (fractional) type to an integer type.
Here’s the general syntax to convert a decimal column to integer:
from pyspark.sql.functions import col
df.withColumn("integer_column", col("decimal_column").cast("integer"))
In the above code:
df is your DataFrame.
integer_column is the new column with integer values.
decimal_column is the column you want to convert from decimal to integer.
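Now, let’s illustrate this process with a practical example. We will first create a SparkSession and build a DataFrame. Note that Spark infers Python float values as a double column rather than a true decimal type; cast to integer works the same way for both.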
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("DecimalToIntegers").getOrCreate()
data = [("Sachin", 10.5), ("Ram", 20.8), ("Vinu", 30.3), (None, None)]
df = spark.createDataFrame(data, ["Name", "Score"])
df.show()
+------+-----+
|  Name|Score|
+------+-----+
|Sachin| 10.5|
|   Ram| 20.8|
|  Vinu| 30.3|
|  null| null|
+------+-----+
Let’s convert the ‘Score’ column to integer:
df = df.withColumn("Score", col("Score").cast("integer"))
df.show()
+------+-----+
|  Name|Score|
+------+-----+
|Sachin|   10|
|   Ram|   20|
|  Vinu|   30|
|  null| null|
+------+-----+
The ‘Score’ column values are now integers. The fractional parts were truncated, not rounded: 20.8 became 20, not 21. Also, observe that the NULL value remained NULL after the conversion.
PySpark’s flexible and powerful data manipulation functions, like cast, make it a highly capable tool for data analysis.