PySpark: Formatting numbers to a specific number of decimal places

pyspark.sql.functions.format_number

PySpark's format_number function formats a numeric column to a fixed number of decimal places. This article describes the function and walks through an example.

format_number is useful when numbers need to be displayed in a fixed format, such as a currency value with two decimal places. It takes two arguments: the column to format and the number of decimal places. The value is rounded to that many places using HALF_EVEN rounding, grouped with comma thousands separators (1234567.891 becomes 1,234,567.89), and returned as a string rather than a number.

Here is an example of how to use the format_number function in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import format_number

# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark Format Number").getOrCreate()

# Create a dataframe with a column of numbers
data = [(1, 3.14159),
        (2, 2.71828),
        (3, 1.41421)]
df = spark.createDataFrame(data, ["id", "value"])

# Use the format_number function to format the numbers to two decimal places
df = df.select("id", format_number("value", 2).alias("value_formatted"))

# Show the resulting dataframe
df.show()

In this example, we start by creating a SparkSession, which is the entry point for PySpark. Then we create a dataframe with two columns, id and value, where the value column contains floating-point numbers.

Next, we pass the value column and 2 to format_number to format the numbers to two decimal places, and use the alias method to name the resulting column value_formatted.

Finally, we display the resulting dataframe using the show method, which outputs the following result:

+---+---------------+
| id|value_formatted|
+---+---------------+
|  1|           3.14|
|  2|           2.72|
|  3|           1.41|
+---+---------------+

As you can see, the format_number function has formatted the numbers in the value column to two decimal places, as specified in the second argument of the function.

In conclusion, the format_number function in PySpark is a simple way to format numbers to a specific number of decimal places. It helps you present data in a clear, consistent format, which makes results easier to read.
