Apache Spark interview questions

PySpark @ Freshers.in

92. What is a Sparse Vector?
A sparse vector has two parallel arrays –one for indices and the other for values. These vectors are used for storing non-zero entries to save space.

93. What is RDD?
RDDs (Resilient Distributed Datasets) are basic abstraction in Apache Spark that represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters, in a fault tolerant manner. RDDs are read-only portioned, collection of records, that are –
Immutable – RDDs cannot be altered.
Resilient – If a node holding the partition fails the other node takes the data.

94. Explain about transformations and actions in the context of RDDs.
Transformations are functions executed on demand, to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.
Actions are the results of RDD computations or transformations. After an action is performed, the data from RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.

95. What are the languages supported by Apache Spark for
developing big data applications?
Scala, Java, Python, R and Clojure

96. Can you use Spark to access and analyse data stored in Cassandra databases?
Yes, it is possible if you use Spark Cassandra Connector.

97. Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

98. Explain about the different cluster managers in Apache Spark
The 3 different clusters managers supported in Apache Spark are:
YARN
Apache Mesos -Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
Standalone deployments – Well suited for new deployments which only run and are easy to set up.

Author: user

Leave a Reply