Apache Spark interview questions

PySpark @ Freshers.in

64. Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs, and RDDs achieve fault tolerance through lineage. An RDD always carries the information on how it was built from other datasets. If any partition of an RDD is lost due to a failure, the lineage is used to recompute only that lost partition, rather than replicating the data up front as Hadoop does.
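As a small, hedged illustration, an RDD's lineage can be inspected with toDebugString(); the exact output format varies across Spark versions:

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

# Each transformation adds a step to the lineage.
rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)

# toDebugString() shows the chain of parent RDDs Spark would use to
# recompute a lost partition (PySpark returns it as bytes).
print(rdd.toDebugString().decode("utf-8"))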

65. How can you achieve high availability in Apache Spark?
High availability for the standalone master can be achieved in two ways:
Implementing single-node recovery with the local file system
Using standby masters with Apache ZooKeeper (see the sketch below).
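A minimal sketch of the ZooKeeper-based option, set in spark-env.sh on every master node; the ZooKeeper hosts zk1:2181,zk2:2181 and the /spark directory are placeholder values:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 -Dspark.deploy.zookeeper.dir=/spark"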

66. How does Spark use Akka?
Spark uses Akka mainly for scheduling: after registering with the master, the workers request tasks, and the master simply assigns them. Spark uses Akka for this messaging between the workers and the master. (Note that this applies to older releases; Akka was replaced by Spark's own RPC framework starting with Spark 1.6.)

67. How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark In MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights. SIMR provides a quick way for Hadoop MapReduce 1 users to try out Apache Spark.

68. Does Apache Spark provide checkpointing?
Yes. Lineage graphs are always useful for recovering RDDs after a failure, but recovery is time consuming when RDDs have long lineage chains. Spark provides an API for checkpointing (setCheckpointDir() on the context plus checkpoint() on the RDD), which saves the RDD's data to reliable storage and truncates its lineage. The decision on which data to checkpoint is left to the user. Checkpoints are most useful when lineage graphs are long and have wide dependencies.
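A minimal sketch, assuming a local checkpoint directory (the path is a placeholder; on a real cluster this should be a fault-tolerant store such as HDFS):

from pyspark import SparkContext

sc = SparkContext("local[*]", "checkpoint-demo")
sc.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path

rdd = sc.parallelize(range(100)).map(lambda x: x * x)
rdd.checkpoint()             # mark the RDD for checkpointing
rdd.count()                  # an action triggers the actual save
print(rdd.isCheckpointed())  # True once the data has been written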

69. How does Spark handle monitoring and logging in standalone mode?
In standalone mode, Spark provides a web-based user interface for monitoring the cluster, showing cluster and job statistics (the master's UI is served on port 8080 by default). The log output of each job is written to the work directory of the worker nodes.

70. What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediate data from some shuffle operations; however, it is often recommended that users call the persist() method on an RDD they plan to reuse. Spark offers various persistence levels for storing RDDs on disk, in memory, or a combination of both, with different serialization and replication options (a sketch follows the list).
The storage/persistence levels in Spark are:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP
Each of these (except OFF_HEAP) also has a replicated variant with a _2 suffix, such as MEMORY_ONLY_2, which stores each partition on two nodes.
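A minimal PySpark sketch using an explicit storage level; note that PySpark always serializes data, so the _SER variants are specific to the Scala/Java API:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-demo")
rdd = sc.parallelize(range(1000)).map(lambda x: x % 7)

# Keep the RDD in memory, spilling to disk if it does not fit.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())  # the first action computes and caches the RDD
print(rdd.sum())    # subsequent actions reuse the cached partitions

rdd.unpersist()     # release the storage when no longer needed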
