PySpark: Splitting a DataFrame into multiple smaller DataFrames [randomSplit function in PySpark]

In this article, we will discuss the randomSplit function in PySpark, which is useful for splitting a DataFrame into multiple smaller DataFrames based on specified weights. This function is particularly helpful when you need to divide a dataset into training and testing sets for machine learning tasks. We will provide a detailed example using hardcoded values as input.

First, let’s create a PySpark DataFrame:

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder \
    .appName("RandomSplit @ Freshers.in Example") \
    .getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("timestamp", TimestampType(), True)
])

data = spark.createDataFrame([
    ("Sachin", 30, datetime.strptime("2022-12-01 12:30:15.123", "%Y-%m-%d %H:%M:%S.%f")),
    ("Barry", 25, datetime.strptime("2023-01-10 16:45:35.789", "%Y-%m-%d %H:%M:%S.%f")),
    ("Charlie", 35, datetime.strptime("2023-02-07 09:15:30.246", "%Y-%m-%d %H:%M:%S.%f")),
    ("David", 28, datetime.strptime("2023-03-15 18:20:45.567", "%Y-%m-%d %H:%M:%S.%f")),
    ("Eva", 22, datetime.strptime("2023-04-21 10:34:25.890", "%Y-%m-%d %H:%M:%S.%f"))
], schema)

data.show(20, False)

Output

+-------+---+-----------------------+
|name   |age|timestamp              |
+-------+---+-----------------------+
|Sachin |30 |2022-12-01 12:30:15.123|
|Barry  |25 |2023-01-10 16:45:35.789|
|Charlie|35 |2023-02-07 09:15:30.246|
|David  |28 |2023-03-15 18:20:45.567|
|Eva    |22 |2023-04-21 10:34:25.89 |
+-------+---+-----------------------+

Using the randomSplit function

Now, let’s use the randomSplit function to split the DataFrame into two smaller DataFrames. In this example, we will split the data into 70% for training and 30% for testing:

# Split roughly 70% of the rows into train_data and 30% into test_data;
# the seed makes the split reproducible across runs.
train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)
train_data.show(20, False)
test_data.show(20, False)

Output

+------+---+-----------------------+
|name  |age|timestamp              |
+------+---+-----------------------+
|Barry |25 |2023-01-10 16:45:35.789|
|Sachin|30 |2022-12-01 12:30:15.123|
|David |28 |2023-03-15 18:20:45.567|
|Eva   |22 |2023-04-21 10:34:25.89 |
+------+---+-----------------------+

+-------+---+-----------------------+
|name   |age|timestamp              |
+-------+---+-----------------------+
|Charlie|35 |2023-02-07 09:15:30.246|
+-------+---+-----------------------+

The randomSplit function takes a list of weights, one per resulting DataFrame, and an optional seed for reproducibility; weights that do not sum to 1 are normalized. In this example, we used the weights [0.7, 0.3] to allocate approximately 70% of the data to the training set and 30% to the testing set. The seed value 42 ensures that the split is the same every time we run the code.
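
randomSplit is not limited to two partitions: pass one weight per DataFrame you want back. As a minimal sketch (reusing the data DataFrame from above; the variable names train, validation, and test are illustrative), a three-way split might look like this:

# Three weights yield three DataFrames; a list such as [6.0, 2.0, 2.0]
# would be normalized to [0.6, 0.2, 0.2] before the split.
train, validation, test = data.randomSplit([0.6, 0.2, 0.2], seed=42)
print(train.count(), validation.count(), test.count())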

Please note that the actual number of rows in the resulting DataFrames might not exactly match the specified weights due to the random nature of the function. However, with a larger dataset, the split will be closer to the specified weights.
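
To see this in practice, here is a small sketch against a larger, synthetically generated DataFrame (spark.range creates a single-column DataFrame of consecutive ids; the exact counts below are approximate and depend on the seed):

# With 10,000 rows, the observed split lands close to the requested 70/30.
big = spark.range(0, 10000)
big_train, big_test = big.randomSplit([0.7, 0.3], seed=42)
print(big_train.count(), big_test.count())  # roughly 7000 and 3000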

In this article, we demonstrated how to use the randomSplit function in PySpark to divide a DataFrame into smaller DataFrames based on specified weights, using hardcoded values as input. This function is particularly useful for creating training and testing sets for machine learning tasks.
