Apache Spark interview questions


85. What is the Catalyst framework?
Catalyst is the query optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries by applying optimization rules (such as predicate pushdown and column pruning), building a faster execution plan.
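For illustration, here is a minimal PySpark sketch (the app name and sample data are our own) that surfaces the plans Catalyst produces via DataFrame.explain():

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# A tiny DataFrame; the filter/select below pass through Catalyst.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])

# explain(extended=True) prints the parsed, analyzed, and optimized logical
# plans plus the physical plan that Catalyst generated for this query.
df.filter(df.id > 1).select("letter").explain(extended=True)
```

In the optimized plan you can typically see rules such as predicate pushdown already applied.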

86. When running Spark applications, is it necessary to install Spark on all the nodes of a YARN cluster?
No. Spark does not need to be installed on every node when running a job under YARN or Mesos, because Spark applications can execute on top of YARN or Mesos clusters without any change to the cluster itself; only the machine submitting the job needs a Spark distribution.
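As a sketch, assuming the launching machine's HADOOP_CONF_DIR (or YARN_CONF_DIR) points at the cluster configuration, a PySpark session can target YARN directly; the worker nodes receive the Spark runtime along with the application:

```python
from pyspark.sql import SparkSession

# "yarn-demo" is a hypothetical application name; only this client machine
# needs a Spark distribution, not every node of the YARN cluster.
spark = (SparkSession.builder
         .appName("yarn-demo")
         .master("yarn")
         .getOrCreate())
```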

87. What is a DStream?
A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data (see spark.RDD for more details on RDDs). DStreams can either be created from live data (such as data from HDFS, Kafka, or Flume) or generated by transforming existing DStreams using operations such as map, window, and reduceByKeyAndWindow. While a Spark Streaming program is running, each DStream periodically generates an RDD, either from live data or by transforming the RDD generated by a parent DStream.

DStreams support two kinds of operations:
Transformations, which produce a new DStream.
Output operations, which write data to an external system.
A DStream is represented by a continuous series of RDDs, Spark's abstraction of an immutable, distributed dataset (see the sketch below).
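A minimal sketch showing both kinds of operations, assuming a test text source on localhost:9999 (e.g. started with netcat):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")  # one thread receives, one processes
ssc = StreamingContext(sc, batchDuration=1)    # a new RDD is generated every second

lines = ssc.socketTextStream("localhost", 9999)  # DStream created from live data

# Transformations: each produces a new DStream
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()          # output operation: writes each batch to the console
ssc.start()
ssc.awaitTermination()
```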

88. What is the significance of the Sliding Window operation?
Spark Streaming provides windowed computations, in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that window are combined and operated on to produce the RDDs of the windowed DStream. This lets an application compute results over, say, the last 30 seconds of data, refreshed every 10 seconds, rather than over a single batch.
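A sketch of a windowed word count, assuming the same test socket source; the window length (30 s) and slide interval (10 s) must both be multiples of the batch interval (5 s):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "window-demo")
ssc = StreamingContext(sc, batchDuration=5)

pairs = (ssc.socketTextStream("localhost", 9999)
            .map(lambda word: (word, 1)))

# Counts over the last 30 seconds, recomputed every 10 seconds.
windowed = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # reduce function applied within the window
    None,                 # no inverse function, so no checkpointing required
    windowDuration=30,
    slideDuration=10,
)
windowed.pprint()
ssc.start()
ssc.awaitTermination()
```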

89. What are the benefits of using Spark with Apache Mesos?
It provides scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
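As a sketch, pointing a session at Mesos is just a master URL change (the host below is a placeholder); spark.mesos.coarse toggles between coarse-grained and the older fine-grained scheduling:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mesos-demo")  # hypothetical application name
         .master("mesos://mesos-master.example.com:5050")
         .config("spark.mesos.coarse", "true")  # coarse-grained mode
         .getOrCreate())
```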

90. What are the advantages of using Apache Spark over Hadoop MapReduce for big data processing?
Simplicity, flexibility, and performance are the major advantages of using Spark over Hadoop.
Spark can be up to 100 times faster than Hadoop MapReduce for big data processing, as it keeps data in memory in Resilient Distributed Datasets (RDDs).
Spark is easier to program, as it comes with an interactive mode.
It provides complete recovery using the lineage graph whenever something goes wrong (illustrated below).
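The lineage graph behind that recovery guarantee can be inspected directly; a small sketch using RDD.toDebugString():

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

rdd = (sc.parallelize(range(10))
         .map(lambda x: x * 2)
         .filter(lambda x: x > 5))

# toDebugString() prints the chain of transformations (the lineage graph);
# lost partitions are rebuilt by replaying exactly this chain.
print(rdd.toDebugString().decode("utf-8"))
```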

91. What is Shark?
Most data users know only SQL and are not good at programming. Shark was a tool developed for people from a database background to access Spark's capabilities through a Hive-like SQL interface. Shark helped data users run Hive on Spark, offering compatibility with the Hive metastore, queries, and data. It has since been superseded by Spark SQL.
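Shark itself is retired, but its Hive-style workflow lives on in Spark SQL; a sketch assuming hive-site.xml is on the classpath and a hypothetical Hive table named employees:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the Hive metastore,
# giving the same metastore/query compatibility Shark offered.
spark = (SparkSession.builder
         .appName("hive-sql-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept").show()
```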

92. List some use cases where Spark outperforms Hadoop in processing.
Sensor data processing: Apache Spark's in-memory computing works best here, as data is retrieved and combined from different sources.
Real-time querying: Spark is preferred over Hadoop when data must be queried interactively.
Stream processing: for processing logs and detecting fraud in live streams to raise alerts, Apache Spark is an excellent fit.
