Apache Spark interview questions


113. What is Checkpointing?
Checkpointing truncates the RDD lineage graph and saves the RDD data to a reliable distributed file system (e.g. HDFS) or to the local file system.
There are two types of checkpointing:
reliable – in Spark Core, RDD checkpointing that saves the actual intermediate RDD data to a reliable distributed file system such as HDFS.
local – in Spark Streaming or GraphX, RDD checkpointing that only truncates the RDD lineage graph.
Before checkpointing can be used, the developer has to set the checkpoint directory with the SparkContext.setCheckpointDir(directory) method.
Calling checkpoint() marks an RDD for checkpointing: it will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir(), and all references to its parent RDDs will be removed.
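A minimal PySpark sketch of reliable checkpointing (the application name, checkpoint path, and RDD contents below are made-up placeholders; on a real cluster the directory should point to a fault-tolerant store such as HDFS):

from pyspark import SparkContext

sc = SparkContext(appName="checkpoint-demo")
# The checkpoint directory must be set before checkpoint() is called.
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
rdd.checkpoint()   # mark the RDD for checkpointing
rdd.count()        # an action triggers the actual checkpoint write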

114. How to see the number of partitions?
rdd.getNumPartitions()
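A quick illustrative usage, assuming an existing SparkContext sc (the data and partition count are arbitrary):

rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())   # prints 4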

115. Explain mapPartitions
The mapPartitions transformation can be faster than map because it calls your function once per partition rather than once per element. Use mapPartitions() whenever you have heavyweight initialization that should be done once for many RDD elements rather than once per element, for example creating objects from a third-party library that cannot be serialized (and so cannot be shipped by Spark to the worker nodes).
map applies the function at the per-element level, while mapPartitions applies it at the partition level.
Example scenario: if a particular RDD partition holds 100K elements, map will invoke the mapping function 100K times.
Conversely, mapPartitions calls the function only once for that partition, passing in all 100K records and getting back all results in a single call. The entire content of the partition is available as a sequential stream of values via the input argument (Iterator[T]), and the custom function must return another Iterator[U]. The combined result iterators are automatically converted into a new RDD.
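A hedged PySpark sketch of the once-per-partition initialization pattern (the cache dictionary stands in for a hypothetical heavyweight, non-serializable resource such as a database connection):

def lookup_partition(iterator):
    # Expensive setup done once per partition, not once per element.
    cache = {}   # placeholder for the heavyweight resource
    for record in iterator:
        # reuse the per-partition resource for every element
        cache[record] = record * 2
        yield (record, cache[record])

rdd = sc.parallelize(range(100000), numSlices=8)
result = rdd.mapPartitions(lookup_partition)
print(result.take(3))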

116. What is cache and persist?
cache is merely persist with the default storage level MEMORY_ONLY; that is, cache() is an alias for persist(StorageLevel.MEMORY_ONLY).
Note that MEMORY_ONLY keeps data only in memory, which may not be ideal for datasets larger than the available cluster memory; persist() lets you choose a different storage level, such as MEMORY_AND_DISK, in that case.
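A minimal PySpark sketch (the file name data.txt and the variable parsedData3 are illustrative placeholders):

from pyspark import StorageLevel

parsedData3 = sc.textFile("data.txt").map(lambda line: line.split(","))

# persist with MEMORY_ONLY is exactly what cache() does:
parsedData3.persist(StorageLevel.MEMORY_ONLY)
# parsedData3.cache()   # equivalent alias

# For data larger than available memory, a spill-to-disk level may be safer:
# parsedData3.persist(StorageLevel.MEMORY_AND_DISK)

parsedData3.count()     # first action materializes and caches the data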
