PySpark : Correlation Analysis in PySpark with a detailed example

PySpark @ Freshers.in

In this article, we will explore correlation analysis in PySpark, a statistical technique used to measure the strength and direction of the relationship between two continuous variables. We will provide a detailed example using hardcoded values as input.

Prerequisites

  • Python 3.7 or higher
  • PySpark library
  • Java 8 or higher

Creating a PySpark DataFrame with Hardcoded Values

First, let’s create a PySpark DataFrame with hardcoded values:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder \
    .appName("Correlation Analysis Example") \
    .getOrCreate()

data_schema = StructType([
    StructField("name", StringType(), True),
    StructField("variable1", DoubleType(), True),
    StructField("variable2", DoubleType(), True),
])

data = spark.createDataFrame([
    ("A", 1.0, 2.0),
    ("B", 2.0, 3.0),
    ("C", 3.0, 4.0),
    ("D", 4.0, 5.0),
    ("E", 5.0, 6.0),
], data_schema)

data.show()
Output
+----+---------+---------+
|name|variable1|variable2|
+----+---------+---------+
|   A|      1.0|      2.0|
|   B|      2.0|      3.0|
|   C|      3.0|      4.0|
|   D|      4.0|      5.0|
|   E|      5.0|      6.0|
+----+---------+---------+

Calculating Correlation

Now, let’s calculate the correlation between variable1 and variable2:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

vector_assembler = VectorAssembler(inputCols=["variable1", "variable2"], outputCol="features")
data_vector = vector_assembler.transform(data).select("features")

correlation_matrix = Correlation.corr(data_vector, "features").collect()[0][0]
correlation_value = correlation_matrix[0, 1]
print(f"Correlation between variable1 and variable2: {correlation_value:.2f}")
Output
Correlation between variable1 and variable2: 1.00
In this example, we used the VectorAssembler to combine the two variables into a single feature vector column called features. Then, we used the Correlation module from pyspark.ml.stat to calculate the correlation between the two variables. The corr function returns a correlation matrix, from which we can extract the correlation value between variable1 and variable2.

Interpreting the Results

The correlation value ranges from -1 to 1, where:

  • -1 indicates a strong negative relationship
  • 0 indicates no relationship
  • 1 indicates a strong positive relationship

In our example, the correlation value is 1.0, which indicates a strong positive relationship between variable1 and variable2. This means that as variable1 increases, variable2 also increases, and vice versa.

In this article, we explored correlation analysis in PySpark, a statistical technique used to measure the strength and direction of the relationship between two continuous variables. We provided a detailed example using hardcoded values as input, showcasing how to create a DataFrame, calculate the correlation between two variables, and interpret the results. Correlation analysis can be useful in various fields, such as finance, economics, and social sciences, to understand the relationships between variables and make data-driven decisions.

Author: user

Leave a Reply