Apache Spark interview questions

PySpark @ Freshers.in

22. Which file systems does Spark support?
Hadoop Distributed File System (HDFS)
Local file system
Amazon S3
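
As a rough illustration, Spark usually chooses the file system from the URI scheme of the path it is given. The sketch below only classifies paths by scheme using the Python standard library. The helper name `file_system_for` and the example paths are hypothetical, and `s3a` is assumed as the scheme for S3 (the Hadoop S3A connector).

```python
from urllib.parse import urlparse

# Assumed mapping from URI scheme to the file systems listed above.
SCHEME_TO_FS = {
    "hdfs": "HDFS",
    "file": "Local file system",
    "s3a": "Amazon S3",
}


def file_system_for(path):
    """Return the file system a path refers to, based on its URI scheme."""
    scheme = urlparse(path).scheme
    # A path with no scheme is resolved against whatever default file
    # system the cluster is configured with, so it is reported as "default".
    return SCHEME_TO_FS.get(scheme, "default")


for example in (
    "hdfs://namenode:8020/data/events.txt",  # hypothetical HDFS path
    "file:///tmp/events.txt",                # hypothetical local path
    "s3a://example-bucket/events.txt",       # hypothetical S3 path
):
    print(example, "->", file_system_for(example))
```

In a PySpark session the same strings would be handed to a reader, for example `spark.read.text("hdfs://namenode:8020/data/events.txt")`.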

23. What is ‘YARN’?
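‘YARN’ (Yet Another Resource Negotiator) is the resource management and job scheduling layer of Hadoop, sometimes described as a large-scale, distributed operating system for big data applications. It is not a feature of Spark itself. Spark can run on YARN as one of its supported cluster managers, using it as a central resource management platform to deliver scalable operations across the cluster.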

24. List the benefits of Spark over MapReduce.
Thanks to in-memory processing, Spark can run many workloads roughly 10-100x faster than Hadoop MapReduce; the gain is largest when the data fits in memory.
Unlike MapReduce, Spark provides built-in libraries that handle multiple kinds of work from the same core engine, such as batch processing, streaming, machine learning, and interactive SQL queries.
MapReduce is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is well suited to iterative computation because it can keep intermediate results in memory between steps, while MapReduce writes them to disk after every job.
Additionally, the two handle fault tolerance differently. Hadoop replicates data blocks on disk, while Spark uses a different data model, resilient distributed datasets (RDDs). Each RDD records the lineage of transformations that produced it, so a lost partition can be recomputed instead of being restored from a replica, which minimizes network input and output.

25. What is a ‘Spark Executor’?
When ‘SparkContext’ connects to a cluster manager, it acquires ‘Executors’ on the cluster nodes. ‘Executors’ are Spark processes that run computations and store data on the worker nodes. ‘SparkContext’ then sends the application's tasks to the executors to run.
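
To make the idea of acquiring executors more concrete, here is a minimal sketch of how executor resources are commonly requested when an application is submitted. It only builds the `spark-submit` argument list and does not launch anything. The helper names, the application file `app.py`, and the sizes are hypothetical; the configuration keys `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory` are standard Spark settings.

```python
def spark_submit_args(app, master, num_executors, executor_cores, executor_memory):
    """Build a spark-submit command line that sizes the executors."""
    return [
        "spark-submit",
        "--master", master,
        "--conf", f"spark.executor.instances={num_executors}",
        "--conf", f"spark.executor.cores={executor_cores}",
        "--conf", f"spark.executor.memory={executor_memory}",
        app,
    ]


def max_parallel_tasks(num_executors, executor_cores):
    """Upper bound on concurrently running tasks.

    Assumes each task uses one core, which is Spark's default.
    """
    return num_executors * executor_cores


# Hypothetical application and sizing, for illustration only.
args = spark_submit_args("app.py", "yarn", 4, 2, "4g")
print(" ".join(args))
print("task slots:", max_parallel_tasks(4, 2))
```

Each executor runs several tasks at once, one per core under the default settings, so the number of executors and cores per executor together bound how much of a job can run in parallel.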

26. List the various types of ‘Cluster Managers’ in Spark.
The Spark framework supports three major types of Cluster Managers:
a. Standalone: a simple cluster manager bundled with Spark that makes it easy to set up a cluster
b. Apache Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce and other applications
c. YARN: the resource manager in Hadoop
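
In practice, the cluster manager is selected through the master URL given to Spark. The sketch below maps each manager listed above to its master URL format. The helper `master_url` and the host names are hypothetical, and the ports shown (7077 for Standalone, 5050 for Mesos) are the usual defaults, assumed here.

```python
def master_url(manager, host=None, port=None):
    """Return the Spark master URL for a given cluster manager."""
    if manager == "standalone":
        # Standalone masters listen on port 7077 by default (assumed here).
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        # Mesos masters listen on port 5050 by default (assumed here).
        return f"mesos://{host}:{port or 5050}"
    if manager == "yarn":
        # For YARN the cluster location comes from the Hadoop configuration,
        # so the master URL carries no host or port.
        return "yarn"
    raise ValueError(f"unknown cluster manager: {manager}")


print(master_url("standalone", host="master-host"))
print(master_url("mesos", host="mesos-host"))
print(master_url("yarn"))
```

The resulting string would be passed to `spark-submit --master` or to `SparkSession.builder.master(...)` in application code.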

27. What is a ‘worker node’?
‘Worker node’ refers to any node that can run the application code in a cluster.

28. Define ‘PageRank’.
‘PageRank’ is a graph algorithm, available in Spark’s GraphX library, that measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v’s importance by u. In simple terms, if a user on Instagram has a very large number of followers, that user will rank high on the platform.
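
The endorsement idea can be sketched in a few lines of plain Python. This is the textbook iterative formulation, not the GraphX implementation; the function name, the damping factor of 0.85, the iteration count, and the small follower graph are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (source, target) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with the rank spread evenly over all vertices.
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is shared by everyone.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        new_ranks = {
            v: (1 - damping) / n + damping * dangling / n for v in vertices
        }
        # Each vertex splits its rank equally among the vertices it points to.
        for src, targets in out_links.items():
            for dst in targets:
                new_ranks[dst] += damping * ranks[src] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical "follows" graph: an edge (u, v) means u follows v.
follows = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("carol", "alice"),
]
ranks = pagerank(follows)
for user in sorted(ranks, key=ranks.get, reverse=True):
    print(user, round(ranks[user], 3))
```

In this graph carol is followed by three users and ends up with the highest rank, while alice ranks second because her single follower is the highly ranked carol.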

Author: user
