In PySpark, the Pandas API provides a range of functionalities, including the to_numeric() function, which converts its argument to a numeric type. This article explores the usage, syntax, and practical applications of to_numeric() with detailed examples.
Understanding to_numeric()
The to_numeric() function in the Pandas API on Spark converts argument values to a numeric type, facilitating data manipulation and analysis. It offers flexibility in handling errors during conversion, which helps preserve data integrity and reliability.
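As a quick illustration, applying to_numeric() to a pandas-on-Spark Series of numeric strings returns a float Series (a minimal sketch; the values are illustrative):
import pyspark.pandas as ps
# A pandas-on-Spark Series of numeric strings
psser = ps.Series(['1.0', '2', '-3'])
# Convert the strings to a numeric (float) Series
print(ps.to_numeric(psser))
# 0    1.0
# 1    2.0
# 2   -3.0
# dtype: float32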
Syntax
The syntax for to_numeric() is as follows:
pyspark.pandas.to_numeric(arg, errors='raise')
Here, arg is the scalar, list, tuple, 1-d array, or Series to be converted to a numeric type, and errors (optional) specifies how invalid values are handled during conversion: 'raise' (the default) raises an exception, while 'coerce' replaces them with NaN.
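With the default errors='raise', an unparseable value aborts the conversion with a ValueError (a minimal sketch; for list input, the Pandas API on Spark delegates to plain pandas):
import pyspark.pandas as ps
try:
    ps.to_numeric(['1', 'two'])  # 'two' cannot be parsed as a number
except ValueError as err:
    print(f"Conversion failed: {err}")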
Examples
Let’s explore various scenarios to understand the functionality of to_numeric():
Example 1: Basic Conversion
import pyspark.pandas as ps
# Define a list of numeric strings
data = ['10', '20', '30', '40']
# Convert the strings to numeric type
numeric_data = ps.to_numeric(data)
print(numeric_data)
# Output: [10 20 30 40]
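Note that for list input, to_numeric() returns a NumPy array rather than a distributed object; pass a pandas-on-Spark Series instead to get a Series back, as shown in Example 3 below.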
Example 2: Handling Errors
import pyspark.pandas as ps
# Define a list of strings with an invalid value
data = ['10', '20', '30', 'invalid']
# Convert to numeric type, coercing invalid values to NaN
numeric_data = ps.to_numeric(data, errors='coerce')
print(numeric_data)
# Output: [10. 20. 30. nan]
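Because NaN cannot be stored in an integer array, the coerced result is promoted to a floating-point dtype, which is why the valid values come back as 10., 20., and 30. rather than integers.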
Example 3: Using with a Spark DataFrame
import pyspark.pandas as ps
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("to_numeric Example : Learning @ Freshers.in ") \
    .getOrCreate()
# Sample data with the value column stored as strings
data = [(1, "15"), (2, "25"), (3, "35"), (4, "forty")]
columns = ["id", "value"]
# Create a Spark DataFrame and convert it to a pandas-on-Spark DataFrame
df = spark.createDataFrame(data, columns)
psdf = df.pandas_api()
# Convert the string column to numeric, coercing invalid values to NaN
psdf["value"] = ps.to_numeric(psdf["value"], errors="coerce")
# Show the converted DataFrame
print(psdf)
Output
   id  value
0   1   15.0
1   2   25.0
2   3   35.0
3   4    NaN
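Once the column is cleaned, the pandas-on-Spark DataFrame can be handed back to Spark SQL (a short sketch continuing Example 3):
# Convert the cleaned pandas-on-Spark DataFrame back to a Spark DataFrame
cleaned_df = psdf.to_spark()
cleaned_df.printSchema()  # value is now float instead of string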