to_timedelta() proves invaluable for handling time-related data. Let’s delve into how it works and explore its utility with practical examples.
Understanding to_timedelta()
The to_timedelta() function in the Pandas API on Spark converts its argument into a timedelta. It comes in handy when dealing with time durations, making them easy to manipulate and analyze.
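As a minimal sketch (assuming Spark 3.3 or later, where pyspark.pandas exposes to_timedelta), a single duration string converts straight to a pandas Timedelta:

import pyspark.pandas as ps

# A duration string becomes a pandas Timedelta
ps.to_timedelta("1 days 06:05:01.00003")
# Timedelta('1 days 06:05:01.000030')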
Syntax
to_timedelta(arg, unit=None, errors='raise')

arg: The object to convert into a timedelta; a string, a number, or a list-like of these.
unit (optional): The unit of the argument, used when arg is numeric. If not specified, defaults to ‘ns’ (nanoseconds).
errors (optional): Specifies how parsing failures are handled. Defaults to ‘raise’, which raises an error on invalid input; ‘coerce’ converts invalid values to NaT instead.
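To see the argument types in play, here is a sketch: a list-like input returns a TimedeltaIndex, since non-Series inputs follow plain pandas semantics (Series inputs return a Series instead):

import pyspark.pandas as ps

# List-like input yields a TimedeltaIndex; 'nan' becomes NaT
ps.to_timedelta(["1 days 06:05:01.00003", "15.5us", "nan"])
# TimedeltaIndex(['1 days 06:05:01.000030', '0 days 00:00:00.000015500', NaT],
#                dtype='timedelta64[ns]', freq=None)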
Practical Examples
Let’s walk through some examples to see how to_timedelta() behaves in practice.
Example 1: Basic Usage
from pyspark.sql import SparkSession
import pyspark.pandas as ps

# Initialize Spark Session (pandas-on-Spark reuses the active session)
spark = SparkSession.builder \
    .appName("to_timedelta_example @ Learning @ Freshers.in") \
    .getOrCreate()

# Create a pandas-on-Spark DataFrame of duration strings
psdf = ps.DataFrame({"time": ["1 days", "3 hours", "2 weeks"]})

# Convert the 'time' column to timedelta
psdf["timedelta"] = ps.to_timedelta(psdf["time"])

# Show DataFrame
print(psdf)
Output
      time        timedelta
0   1 days  1 days 00:00:00
1  3 hours  0 days 03:00:00
2  2 weeks 14 days 00:00:00
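Under the hood, pandas-on-Spark backs timedelta columns with Spark’s DayTimeIntervalType. A quick way to confirm this (a sketch using the Series.spark accessor on the psdf DataFrame from Example 1):

# Inspect the Spark type backing the new column
print(psdf["timedelta"].spark.data_type)
# e.g. DayTimeIntervalType(0, 3), i.e. interval day to second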
Example 2: Specifying Units
import pyspark.pandas as ps

# Numeric values carry no intrinsic unit, so specify one explicitly
psser = ps.Series([1, 3, 14])

# Interpret the numbers as days
print(ps.to_timedelta(psser, unit="D"))

# Interpret the numbers as hours
print(ps.to_timedelta(psser, unit="h"))
Output
0    1 days
1    3 days
2   14 days
dtype: timedelta64[ns]

0   0 days 01:00:00
1   0 days 03:00:00
2   0 days 14:00:00
dtype: timedelta64[ns]
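Scalars accept a unit as well; a non-Series input simply falls back to plain pandas behavior (a sketch under that assumption):

import pyspark.pandas as ps

# 90 minutes expressed as a Timedelta
ps.to_timedelta(90, unit="m")
# Timedelta('0 days 01:30:00')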
Example 3: Error Handling
import pyspark.pandas as ps

# 'xyz' is not a parseable duration string
psser = ps.Series(["1 days", "xyz"])

# errors='coerce' turns unparseable values into NaT instead of raising
print(ps.to_timedelta(psser, errors="coerce"))
Output:
0   1 days
1      NaT
dtype: timedelta64[ns]
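For contrast, the default errors='raise' fails fast on the same bad input (a sketch assuming scalar inputs fall back to pandas, which raises a ValueError):

import pyspark.pandas as ps

# With the default errors='raise', invalid input raises an error
try:
    ps.to_timedelta("xyz")
except ValueError as exc:
    print("Could not parse:", exc)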