What is the difference between repartition() and coalesce()?

user July 27, 2022

repartition() performs a full shuffle and creates new partitions with the data distributed roughly evenly across them (the distribution is more even for larger data sets). Repartition can also increase the size of the data on disk.
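To see the even distribution in practice, the following is a minimal sketch rather than a definitive recipe. It assumes a local Spark session; the application name, the `local[4]` master, the generated data set, and the target of 8 partitions are all illustrative choices, not values from this post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("repartition-sketch")
      .getOrCreate()

    // Hypothetical data set: one million consecutive ids.
    val df = spark.range(0, 1000000)
    println(s"partitions before: ${df.rdd.getNumPartitions}")

    // repartition(8) triggers a full shuffle into 8 new partitions.
    val evened = df.repartition(8)
    println(s"partitions after: ${evened.rdd.getNumPartitions}")

    // Count the rows in each partition to see how evenly the data is spread.
    evened
      .groupBy(spark_partition_id().as("partition"))
      .count()
      .orderBy("partition")
      .show()

    spark.stop()
  }
}
```

The per-partition counts printed by `show()` should be close to one another, which is the behaviour the paragraph above describes.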

coalesce() reuses existing partitions to minimize the amount of data that is shuffled, so it results in partitions holding different amounts of data. Coalesce may run faster than repartition, but unequal-sized partitions are generally slower to work with than equal-sized partitions. In Spark, coalesce() is mainly used to reduce the number of partitions.

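If the business needs a single output file (inside a folder), you can repartition to one partition. This is preferable when the upstream data is large, because a drastic coalesce can pull the upstream computation onto fewer tasks than you would like, but it requires a shuffle. With repartition() the number of partitions can be increased or decreased, whereas with coalesce() the number of partitions can only be decreased.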

For example, when using coalesce to go from 1000 partitions to 100 partitions, there is no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions. If a larger number of partitions is requested, the data stays at the current number of partitions. Note that coalesce(1) or repartition(1) may work for small data sets, but with large data sets all of the data is pushed into one partition on one node. This is likely to throw OOM errors or, at best, to process slowly.
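The 1000-to-100 example can be tried out with a sketch like the one below. It is only an illustration under assumptions: the local session, the app name, and the generated data set (created directly with 1000 partitions through the `numPartitions` argument of `spark.range`) are not from this post.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("coalesce-sketch")
      .getOrCreate()

    // Hypothetical data set created directly with 1000 partitions (no shuffle involved).
    val df = spark.range(0, 1000000, 1, 1000)
    println(s"start: ${df.rdd.getNumPartitions}")

    // Shrinking with coalesce merges existing partitions instead of shuffling.
    val fewer = df.coalesce(100)
    println(s"coalesce(100): ${fewer.rdd.getNumPartitions}")

    // Asking coalesce for more partitions than exist leaves the count unchanged.
    val notMore = df.coalesce(2000)
    println(s"coalesce(2000): ${notMore.rdd.getNumPartitions}")

    // repartition can grow the partition count, at the cost of a full shuffle.
    val more = df.repartition(2000)
    println(s"repartition(2000): ${more.rdd.getNumPartitions}")

    // Compare the physical plans: only the repartition plan contains an Exchange (shuffle) step.
    fewer.explain()
    more.explain()

    spark.stop()
  }
}
```

Reading the two `explain()` outputs side by side is a quick way to confirm whether a given call introduces a shuffle.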

FYI: the default number of shuffle partitions is 200; this is controlled by the spark.sql.shuffle.partitions configuration.
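As a rough sketch of how this setting can be read and changed at runtime, consider the following. The local session, the app name, the generated data, the `bucket` column, and the value 50 are illustrative assumptions. Adaptive query execution is switched off here only so that the configured value is not adjusted at runtime.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("shuffle-partitions-sketch")
      .getOrCreate()

    // Current value of the setting (200 unless the environment overrides it).
    println(spark.conf.get("spark.sql.shuffle.partitions"))

    // Disable adaptive execution so the shuffle partition count is used as configured.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    spark.conf.set("spark.sql.shuffle.partitions", "50")

    // A groupBy forces a shuffle, so its result picks up the configured partition count.
    val counts = spark.range(0, 100000)
      .withColumn("bucket", col("id") % 10) // hypothetical grouping column
      .groupBy("bucket")
      .count()
    println(counts.rdd.getNumPartitions)

    spark.stop()
  }
}
```

Note that this setting applies to shuffles Spark plans on its own (joins, aggregations); an explicit repartition(n) uses the number you pass in.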

Sample code using repartition and coalesce

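// Each call writes a folder named by the path, containing a single part file.
// repartition(1) performs a full shuffle before writing.
df
   .repartition(1)
   .write.format("csv") // built in since Spark 2.0; formerly "com.databricks.spark.csv"
   .option("header", "true")
   .mode("overwrite") // the default save mode fails if the path already exists
   .save("freshers_data.csv")

// coalesce(1) merges the existing partitions without a full shuffle.
df
   .coalesce(1)
   .write.format("csv")
   .option("header", "true")
   .mode("overwrite")
   .save("freshers_data.csv")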

Author: user
