What is the difference between repartition() and coalesce()?

PySpark @ Freshers.in

repartition() performs a full shuffle and creates new partitions with the data distributed roughly evenly across them. The distribution tends to be more even for larger data sets. repartition() can also increase the size of the data on disk.

coalesce() reuses existing partitions to minimize the amount of data that is shuffled, so the resulting partitions can hold different amounts of data. coalesce() may run faster than repartition(), but unequal-sized partitions are generally slower to work with than equal-sized ones. On a Spark RDD, coalesce() is mainly used to reduce the number of partitions.

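If the business needs a single output file (inside a folder), you can use repartition(1). This is the better choice when the upstream data is large, even though it requires a shuffle: coalesce(1) can pull the upstream computation into a single task, whereas repartition(1) keeps the upstream stage parallel and only then shuffles everything into one partition. With repartition() the number of partitions can be increased or decreased, but with coalesce() it can only be decreased.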

For example, when coalesce() takes you from 1000 partitions to 100, there is no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions. If a larger number of partitions is requested, the count stays at the current number. Note that coalesce(1) or repartition(1) may work for small data sets, but with a large data set all of the data ends up in one partition on one node. This is likely to throw OOM (out of memory) errors or, at best, to process slowly.
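The 1000-to-100 case can be modelled in plain Python. This is only a toy sketch of the idea, not Spark's actual algorithm (the real grouping may also take data locality into account); the function name `coalesce_model` and the sample data are hypothetical.

```python
from typing import List


def coalesce_model(partitions: List[List[int]], num_partitions: int) -> List[List[int]]:
    """Toy model of coalesce(): merge whole neighbouring partitions."""
    current = len(partitions)
    if num_partitions >= current:
        # Requesting more partitions than exist leaves the layout unchanged.
        return partitions
    merged = []
    for i in range(num_partitions):
        # Each new partition claims a contiguous slice of the old ones.
        start = i * current // num_partitions
        end = (i + 1) * current // num_partitions
        group: List[int] = []
        for part in partitions[start:end]:
            group.extend(part)
        merged.append(group)
    return merged


# 1000 partitions of uneven size: partition i holds (i % 7) + 1 rows.
parts = [[i] * ((i % 7) + 1) for i in range(1000)]

merged = coalesce_model(parts, 100)
sizes = [len(p) for p in merged]
print(len(merged), min(sizes), max(sizes))
```

Because whole partitions are merged and no rows are redistributed, the new partitions inherit the unevenness of the old ones, which is the trade-off described above.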

FYI: the default number of shuffle partitions is 200; it is controlled by the spark.sql.shuffle.partitions configuration.

Sample code using repartition and coalesce

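# Option 1: full shuffle into one partition, then write a single part file.
# "csv" is built in since Spark 2.0; Spark 1.x needed "com.databricks.spark.csv".
(df
   .repartition(1)
   .write.format("csv")
   .option("header", "true")
   .save("freshers_data.csv"))  # Spark creates a directory with this name

# Option 2 (an alternative, not a follow-up step): merge the existing
# partitions into one without a full shuffle.
(df
   .coalesce(1)
   .write.format("csv")
   .option("header", "true")
   .save("freshers_data.csv"))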
Author: user
