The pandas API on Spark combines Spark's distributed execution with the familiar pandas interface, so pandas-style code can run on datasets too large for a single machine. Among the features it carries over are the binary operator functions on Series. These functions, including Series.rmul(), Series.rsub(), Series.rtruediv(), Series.sub(), and Series.truediv(), perform element-wise arithmetic between a series and another series or a scalar. In this article, we'll look at each of these functions, explain what it computes, and demonstrate its usage with an example.
1. Series.rmul(other) in Spark
The Series.rmul() function performs reverse multiplication element-wise: series1.rmul(series2) computes series2 * series1. It multiplies each element of the second series (other) by the corresponding element of the first series (the caller), yielding a new series with the result. Because multiplication of numbers is commutative, the values are the same as those of series1.mul(series2).
# Example of Series.rmul()
import pandas as pd
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("Learning @ Freshers.in Pandas API on Spark").getOrCreate()
# Sample data
data1 = {'A': [1, 2, 3, 4, 5]}
data2 = {'A': [10, 20, 30, 40, 50]}
df1 = spark.createDataFrame(pd.DataFrame(data1))
df2 = spark.createDataFrame(pd.DataFrame(data2))
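# Collect each column to the driver as a pandas Series.
# Note: toPandas() materializes the data locally, so the operations below run in
# plain pandas; pandas-on-Spark Series (pyspark.pandas) expose the same method names.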
series1 = df1.select('A').toPandas()['A']
series2 = df2.select('A').toPandas()['A']
# Perform reverse multiplication
result = series1.rmul(series2)
# Print the result
print("Result of reverse multiplication:")
print(result)
Output:
Result of reverse multiplication:
0 10
1 40
2 90
3 160
4 250
Name: A, dtype: int64
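As a quick check of the operand order, the sketch below shows that rmul(other) gives the same values as writing other * self, and that a scalar operand is applied to every element. It uses plain pandas Series built from lists (an assumption to keep the snippet self-contained), and the names base and factors are hypothetical.

```python
import pandas as pd

base = pd.Series([1, 2, 3, 4, 5], name="A")
factors = pd.Series([10, 20, 30, 40, 50], name="A")

# rmul(other) evaluates other * self
via_method = base.rmul(factors)
via_operator = factors * base

# A scalar operand is broadcast to every element
doubled = base.rmul(2)

print(via_method.equals(via_operator))  # True
print(doubled.tolist())                 # [2, 4, 6, 8, 10]
```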
2. Series.rsub(other) in Spark
The Series.rsub() function performs reverse subtraction element-wise: series1.rsub(series2) computes series2 - series1. It subtracts each element of the first series (the caller) from the corresponding element of the second series (other), generating a new series with the result. This function is valuable when the series you call the method on is the quantity to be subtracted rather than the starting value.
# Example of Series.rsub()
# Assumes series1 and series2 are defined as in the first example
# Perform reverse subtraction
result = series1.rsub(series2)
# Print the result
print("Result of reverse subtraction:")
print(result)
Output:
Result of reverse subtraction:
0 9
1 18
2 27
3 36
4 45
Name: A, dtype: int64
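The direction of the operands is easy to get wrong, so the short sketch below compares rsub() with sub() on the same values. It uses plain pandas Series built directly from lists (an assumption made to keep it self-contained; no SparkSession is involved), and the scalar 100 is only an illustrative value.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5], name="A")
series2 = pd.Series([10, 20, 30, 40, 50], name="A")

# rsub(other) evaluates other - self
reverse = series1.rsub(series2)   # series2 - series1

# sub(other) evaluates self - other
forward = series1.sub(series2)    # series1 - series2

# A scalar works too: every element is subtracted from 100
from_scalar = series1.rsub(100)   # 100 - series1

print(reverse.tolist())      # [9, 18, 27, 36, 45]
print(forward.tolist())      # [-9, -18, -27, -36, -45]
print(from_scalar.tolist())  # [99, 98, 97, 96, 95]
```

The two results are negations of each other, which is a quick way to check which operand ended up on which side.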
3. Series.rtruediv(other)
The Series.rtruediv() function performs reverse floating-point division element-wise: series1.rtruediv(series2) computes series2 / series1. It divides each element of the second series (other) by the corresponding element of the first series (the caller), yielding a new series with the result. The result is floating point even when both inputs hold integers.
# Example of Series.rtruediv()
# Assumes series1 and series2 are defined as in the first example
# Perform reverse division
result = series1.rtruediv(series2)
# Print the result
print("Result of reverse division:")
print(result)
Output:
Result of reverse division:
0 10.0
1 10.0
2 10.0
3 10.0
4 10.0
Name: A, dtype: float64
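A common use of rtruediv() is taking reciprocals by passing a scalar as other. The sketch below is a minimal illustration in plain pandas (an assumption; it does not start a SparkSession), with hypothetical sample values chosen so the results are exact.

```python
import pandas as pd

values = pd.Series([1, 2, 4, 5], name="A")

# rtruediv(1) evaluates 1 / values, i.e. the reciprocal of each element
reciprocals = values.rtruediv(1)

print(reciprocals.tolist())  # [1.0, 0.5, 0.25, 0.2]
print(reciprocals.dtype)     # float64
```

Note that the integer input produces a float64 result, since true division always returns floating-point values.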
4. Series.sub(other)
The Series.sub() function performs subtraction element-wise: series1.sub(series2) computes series1 - series2. It subtracts each element of the second series (other) from the corresponding element of the first series (the caller), generating a new series with the result. This function is useful for calculating the difference between datasets.
# Example of Series.sub()
# Assumes series1 and series2 are defined as in the first example
# Perform subtraction
result = series1.sub(series2)
# Print the result
print("Result of subtraction:")
print(result)
Output:
Result of subtraction:
0 -9
1 -18
2 -27
3 -36
4 -45
Name: A, dtype: int64
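In the examples so far both series share the index 0 to 4, so "corresponding element" is unambiguous. In pandas, elements are matched by index label rather than by position, and the sketch below illustrates what happens when the labels only partly overlap. It uses plain pandas with hypothetical labels; in the pandas API on Spark, combining Series that come from different DataFrames may additionally require enabling the compute.ops_on_diff_frames option, depending on the version.

```python
import pandas as pd

left = pd.Series([10, 20, 30], index=["a", "b", "c"])
right = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Elements are paired by label: "b" and "c" exist in both inputs,
# while "a" and "d" exist in only one, so their results are NaN.
diff = left.sub(right)

print(diff.index.tolist())  # ['a', 'b', 'c', 'd']
print(diff["b"], diff["c"])  # 19.0 28.0
print(diff.isna().tolist())  # [True, False, False, True]
```

The result is float64 because NaN cannot be stored in an integer series.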
5. Series.truediv(other)
The Series.truediv() function performs floating-point division element-wise: series1.truediv(series2) computes series1 / series2. It divides each element of the first series (the caller) by the corresponding element of the second series (other), yielding a new series with the result. As with rtruediv(), the result is floating point even for integer inputs.
# Example of Series.truediv()
# Assumes series1 and series2 are defined as in the first example
# Perform division
result = series1.truediv(series2)
# Print the result
print("Result of division:")
print(result)
Output:
Result of division:
0 0.1
1 0.1
2 0.1
3 0.1
4 0.1
Name: A, dtype: float64
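Division raises the question of zero denominators. The sketch below shows how plain pandas handles them for numeric series: it returns inf, -inf, or NaN instead of raising an error. The sample values are hypothetical, and this is a pandas-only illustration; it is worth confirming the behavior on your own Spark version before relying on it.

```python
import math
import pandas as pd

numerators = pd.Series([1, -1, 0, 6])
denominators = pd.Series([0, 0, 0, 3])

# Positive / 0 -> inf, negative / 0 -> -inf, 0 / 0 -> NaN, 6 / 3 -> 2.0
quotients = numerators.truediv(denominators)

print(quotients.tolist())  # [inf, -inf, nan, 2.0]
```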