pyspark.sql.functions.format_string
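‘format_string’ is a function in the pyspark.sql.functions module. It takes a printf-style format string followed by one or more columns and returns a new string column containing the formatted values. It is typically used inside the select method of a DataFrame to control how column values are rendered.
Here is a full code example that demonstrates the use of the ‘format_string’ function in PySpark: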
from pyspark.sql import SparkSession
from pyspark.sql.functions import format_string
# Create a SparkSession
spark = SparkSession.builder.appName("format_string_example").getOrCreate()
# Create a sample DataFrame
data = [(1, "foo", 3.14), (2, "bar", 2.71), (3, "baz", 1.41)]
columns = ["id", "name", "value"]
df = spark.createDataFrame(data, columns)
df.show()
Input DataFrame
+---+----+-----+
| id|name|value|
+---+----+-----+
|  1| foo| 3.14|
|  2| bar| 2.71|
|  3| baz| 1.41|
+---+----+-----+
Use the ‘format_string’ function to specify the output format of each column
df2 = df.select(
    format_string("%05d", "id").alias("ID"),
    format_string("%10s", "name").alias("NAME"),
    format_string("%.2f", "value").alias("VALUE"),
)
df2.show()
Result
+-----+----------+-----+
|   ID|      NAME|VALUE|
+-----+----------+-----+
|00001|       foo| 3.14|
|00002|       bar| 2.71|
|00003|       baz| 1.41|
+-----+----------+-----+
In this example, we create a sample DataFrame, df, with columns “id”, “name”, and “value”. We then call the ‘format_string’ function inside the select method to format each column: “id” becomes an integer zero-padded to a minimum width of 5 digits (%05d), “name” becomes a string right-justified in a field at least 10 characters wide (%10s, which pads shorter values with spaces on the left and does not truncate longer ones), and “value” becomes a floating point number with 2 decimal places (%.2f). Every column in the resulting DataFrame is a string column holding the formatted text.
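What the width in %10s does with longer values is easy to check outside Spark. The following is a minimal sketch in plain Python, used here only as a stand-in: Spark evaluates the format string with Java's Formatter, but Python's % operator treats these particular specifiers the same way. The value long_name is a hypothetical example that does not appear in the DataFrame above.

```python
# Plain-Python stand-in for the %s specifiers; no Spark session required.
short_name = "foo"
long_name = "a_very_long_name"  # hypothetical value, 16 characters long

padded = "%10s" % short_name          # width 10: padded with spaces on the left
unchanged = "%10s" % long_name        # width is a minimum, so nothing is cut off
truncated = "%.10s" % long_name       # precision 10: at most 10 characters are kept
left_aligned = "%-10s|" % short_name  # the '-' flag pads on the right instead

for label, text in [
    ("padded", padded),
    ("unchanged", unchanged),
    ("truncated", truncated),
    ("left_aligned", left_aligned),
]:
    print(label, repr(text))
```

If a column must be cut to a fixed length, a precision such as %.10s is the specifier to reach for; a bare width only pads.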
Additional notes:
As another example, suppose you have a DataFrame called df with columns “A”, “B”, and “C”. You can use the ‘format_string’ function to specify that column “A” should be a string right-justified in a field at least 10 characters wide, column “B” should be formatted as a floating point number with 2 decimal places, and column “C” should be formatted as an integer.
df.select(
    format_string("%10s", "A").alias("A"),
    format_string("%.2f", "B").alias("B"),
    format_string("%d", "C").alias("C"),
).show()
The string passed to ‘format_string’ uses printf-style conversion specifiers, much like Python’s % string formatting (Spark itself evaluates it with Java’s Formatter). For example, in format_string("%10s", "A"), %10s is the format string and A is the column name.
The above example outputs a DataFrame whose columns “A”, “B”, and “C” are string columns formatted as specified.
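Because the format string follows printf conventions, the specifiers can be tried out in plain Python before running a Spark job. The sketch below applies the three specifiers from the first example to the same sample rows. The helper format_row is a hypothetical name introduced for illustration, and Python's % operator is assumed to be an acceptable stand-in for Spark's formatter for these simple specifiers.

```python
# Sample rows copied from the first example: (id, name, value).
rows = [(1, "foo", 3.14), (2, "bar", 2.71), (3, "baz", 1.41)]


def format_row(row):
    """Apply %05d, %10s and %.2f to one (id, name, value) tuple."""
    id_, name, value = row
    return ("%05d" % id_, "%10s" % name, "%.2f" % value)


formatted = [format_row(row) for row in rows]

for row in formatted:
    print(row)
```

Every element of each tuple is a Python str, which mirrors the fact that ‘format_string’ always returns a string column.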