PySpark : Splitting a DataFrame into multiple smaller DataFrames [randomSplit function in PySpark]

user, April 11, 2023

In this article, we will discuss the randomSplit function in PySpark, which is useful for splitting a DataFrame into multiple smaller DataFrames based on specified weights. This function is particularly helpful when you need to divide a dataset into training and testing sets for machine learning tasks. We will provide a detailed example using hardcoded values as input.

First, let’s create a PySpark DataFrame:

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder \
    .appName("RandomSplit @ Freshers.in Example") \
    .getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("timestamp", TimestampType(), True)
])

data = spark.createDataFrame([
    ("Sachin", 30, datetime.strptime("2022-12-01 12:30:15.123", "%Y-%m-%d %H:%M:%S.%f")),
    ("Barry", 25, datetime.strptime("2023-01-10 16:45:35.789", "%Y-%m-%d %H:%M:%S.%f")),
    ("Charlie", 35, datetime.strptime("2023-02-07 09:15:30.246", "%Y-%m-%d %H:%M:%S.%f")),
    ("David", 28, datetime.strptime("2023-03-15 18:20:45.567", "%Y-%m-%d %H:%M:%S.%f")),
    ("Eva", 22, datetime.strptime("2023-04-21 10:34:25.890", "%Y-%m-%d %H:%M:%S.%f"))
], schema)

data.show()

Output

+-------+---+--------------------+
|   name|age|           timestamp|
+-------+---+--------------------+
| Sachin| 30|2022-12-01 12:30:...|
|  Barry| 25|2023-01-10 16:45:...|
|Charlie| 35|2023-02-07 09:15:...|
|  David| 28|2023-03-15 18:20:...|
|    Eva| 22|2023-04-21 10:34:...|
+-------+---+--------------------+

Using randomSplit Function

Now, let’s use the randomSplit function to split the DataFrame into two smaller DataFrames. In this example, we will split the data into 70% for training and 30% for testing:

train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)
train_data.show(truncate=False)
test_data.show(truncate=False)

Output

+------+---+-----------------------+
|name  |age|timestamp              |
+------+---+-----------------------+
|Barry |25 |2023-01-10 16:45:35.789|
|Sachin|30 |2022-12-01 12:30:15.123|
|David |28 |2023-03-15 18:20:45.567|
|Eva   |22 |2023-04-21 10:34:25.89 |
+------+---+-----------------------+

+-------+---+-----------------------+
|name   |age|timestamp              |
+-------+---+-----------------------+
|Charlie|35 |2023-02-07 09:15:30.246|
+-------+---+-----------------------+

The randomSplit function accepts two arguments: a list of weights, one per output DataFrame, and an optional seed for reproducibility. In this example, we’ve used the weights [0.7, 0.3] to allocate approximately 70% of the data to the training set and 30% to the testing set; weights that do not sum to 1 are normalized. The seed value 42 makes the split repeatable as long as the input DataFrame and its partitioning stay the same.

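Please note that the number of rows in the resulting DataFrames will usually not match the specified weights exactly, because each row is assigned to a split at random rather than by counting rows. In the example above, the five input rows were split 4 to 1 (80% and 20%) rather than 70% and 30%. With a larger dataset, the proportions will be closer to the specified weights.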
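
To see why small inputs can drift from the requested proportions, here is a toy plain-Python model of a weighted random split. It is only a sketch of the idea (one random draw per row, compared against cumulative weights) and not Spark's actual implementation; the function name `model_random_split` is hypothetical.

```python
import random


def model_random_split(rows, weights, seed):
    """Toy model: normalize the weights, draw one uniform number per row,
    and place the row in the split whose cumulative interval contains it."""
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits


# Compare the observed fractions for a tiny and a large input.
for n in (5, 100_000):
    train, test = model_random_split(range(n), [0.7, 0.3], seed=42)
    print(n, len(train) / n, len(test) / n)
```

With only five rows, a single row moving between splits changes the fraction by 20 percentage points, so the tiny case can land far from 70/30, while the large case stays close to it.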

Here we demonstrated how to use the randomSplit function in PySpark to divide a DataFrame into smaller DataFrames based on specified weights, using a small hardcoded DataFrame as input. This function is particularly useful for creating training and testing sets for machine learning tasks.
