pyspark.sql.functions.format_number
PySpark's format_number function formats a numeric column to a fixed number of decimal places. This article covers what the function does and how to use it.
format_number takes two arguments: the column to format and the number of decimal places, d. It rounds each value to d decimal places using HALF_EVEN rounding, inserts commas as thousands separators (in the style of '#,###,###.##'), and returns the result as a string column rather than a numeric one. This makes it well suited to display tasks, such as showing a currency value with two decimal places.
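To get a feel for the shape of the output without starting a Spark session, the sketch below uses Python's built-in format specification, which produces similar comma-grouped, fixed-decimal strings. The helper name approx_format_number is hypothetical, and this is only a local analogy, not Spark's implementation; edge cases such as null values or a negative d may behave differently in Spark.

```python
def approx_format_number(value: float, d: int) -> str:
    """Rough local analogy of format_number: comma grouping plus d decimals.

    Hypothetical helper for illustration only; it does not call Spark.
    """
    if d < 0:
        raise ValueError("d must be non-negative in this sketch")
    # "," adds thousands separators; ".{d}f" fixes the number of decimals
    return format(value, f",.{d}f")


print(approx_format_number(3.14159, 2))      # 3.14
print(approx_format_number(1234567.891, 2))  # 1,234,567.89
print(approx_format_number(5, 4))            # 5.0000
```

Note that the return value is text in every case, which mirrors the string column that format_number produces.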
Here is an example of how to use the format_number function in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import format_number
# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark Format Number").getOrCreate()
# Create a dataframe with a column of numbers
data = [(1, 3.14159),
(2, 2.71828),
(3, 1.41421)]
df = spark.createDataFrame(data, ["id", "value"])
# Use the format_number function to format the numbers to two decimal places
df = df.select("id", format_number("value", 2).alias("value_formatted"))
# Show the resulting dataframe
df.show()
In this example, we start by creating a SparkSession, which is the entry point for PySpark. We then create a DataFrame with two columns, id and value, where the value column holds the numbers to format.
Next, we pass the value column and the number of decimal places, 2, to format_number, and use the alias method to name the resulting column value_formatted.
Finally, we display the resulting dataframe using the show method, which outputs the following result:
+---+---------------+
| id|value_formatted|
+---+---------------+
|  1|           3.14|
|  2|           2.72|
|  3|           1.41|
+---+---------------+
As you can see, the format_number function has formatted the numbers in the value column to two decimal places, as specified in the second argument of the function.
In conclusion, format_number is a simple way to format numeric columns to a fixed number of decimal places for display. Because it returns strings rather than numbers, it is best applied as a final presentation step, after any calculations, sorting, or aggregation on the original numeric column.