Apache Spark interview questions

57. What are broadcast variables?
Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with every task. Spark supports two types of shared variables: broadcast variables (similar to the Hadoop distributed cache) and accumulators (similar to Hadoop counters). A broadcast value is shipped once to each worker node and cached there as read-only data that tasks can access locally.
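
A minimal PySpark sketch of a broadcast variable; the lookup table and sample data below are assumed for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical lookup table, shipped once to every worker node.
country_names = {"IN": "India", "US": "United States"}
bc_names = sc.broadcast(country_names)

codes = sc.parallelize(["IN", "US", "IN"])
# Tasks read the cached value instead of receiving a copy per task.
print(codes.map(lambda c: bc_names.value.get(c, "Unknown")).collect())
# ['India', 'United States', 'India']
```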

58. What are Accumulators in Spark?
Accumulators are often described as Spark's offline debuggers. They are similar to Hadoop counters: you can use them to count events and track what is happening during a job. Only the driver program can read an accumulator's value; tasks can only add to it.
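
A minimal PySpark sketch of counting events with an accumulator; the blank-line check is an illustrative assumption:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

# Numeric accumulator, starts at zero; tasks can only add to it.
blank_lines = sc.accumulator(0)

def count_blank(line):
    if not line.strip():
        blank_lines.add(1)

sc.parallelize(["a", "", "b", ""]).foreach(count_blank)
# Only the driver can read the accumulated value.
print(blank_lines.value)  # 2
```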

59. How RDD persist the data?
There are two methods to persist the data: persist() and cache(). cache() is shorthand for persist() with the default MEMORY_ONLY storage level, while persist() lets you choose a storage level explicitly, such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY and many more, depending on the needs of the task.
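
A minimal PySpark sketch of cache() versus persist() with an explicit storage level; the datasets are illustrative:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()
sc = spark.sparkContext

squares = sc.parallelize(range(100000)).map(lambda x: x * x)
doubles = sc.parallelize(range(100000)).map(lambda x: x * 2)

squares.cache()                                # same as persist(StorageLevel.MEMORY_ONLY)
doubles.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk if memory is short

print(squares.count(), doubles.count())        # the first action materializes the stored data
doubles.unpersist()                            # release storage when no longer needed
```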

60. What do you understand by SchemaRDD?
An RDD that consists of Row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column. In other words, a SchemaRDD is composed of Row objects along with a schema that describes the data type of each column in a row.
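
In current Spark versions the SchemaRDD concept lives on as the DataFrame, which is what this minimal sketch uses; the sample rows are illustrative:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Row objects plus a schema describing each column's type.
rows = [Row(name="Alice", age=30), Row(name="Bob", age=25)]
df = spark.createDataFrame(rows)

df.printSchema()   # name: string, age: long
df.show()
```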

61. Define a worker node.
A node that can run Spark application code in a cluster is called a worker node. A worker node can run more than one worker process, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if SPARK_WORKER_INSTANCES is not defined.
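
As an illustration, a hypothetical spark-env.sh fragment that starts two worker processes per node; the values are assumptions, not recommendations:

```bash
# Hypothetical spark-env.sh fragment (illustrative values).
export SPARK_WORKER_INSTANCES=2   # run two worker processes on this node
export SPARK_WORKER_CORES=4       # cores each worker may use
export SPARK_WORKER_MEMORY=8g     # memory each worker may use
```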

62. What do you understand by Lazy Evaluation?
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it records the instruction but does nothing until the final result is asked for. When a transformation like map() is called on an RDD, the operation is not performed immediately; transformations in Spark are not evaluated until you perform an action. This lazy approach lets Spark optimize the overall data processing workflow.
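
A minimal PySpark sketch of lazy evaluation; the data is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4])
tripled = numbers.map(lambda x: x * 3)   # transformation: nothing executes yet,
                                         # Spark only records the lineage

print(tripled.collect())                 # action: the map finally runs here
```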

63. Explain about the core components of a distributed Spark application.
Driver – The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
Executor – The worker processes that run the individual tasks of a Spark job.
Cluster Manager – A pluggable component in Spark used to launch executors and drivers. The cluster manager allows Spark to run on top of external managers such as Apache Mesos or YARN; see the sketch below for how these pieces fit together.
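
A minimal PySpark sketch tying these components together; the master URL and data are illustrative:

```python
from pyspark.sql import SparkSession

# Driver program: the master URL selects the cluster manager. "local[*]" here
# is illustrative; "yarn" (or a Mesos URL) would hand scheduling to an
# external manager, which then launches the executors.
spark = (SparkSession.builder
         .appName("components-demo")
         .master("local[*]")
         .getOrCreate())

rdd = spark.sparkContext.parallelize(range(10))  # RDD created by the driver
print(rdd.map(lambda x: x * 2).sum())            # tasks run inside executors
spark.stop()
```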
