8. Define ‘Transformations’ in Spark.
‘Transformations’ are functions applied to an RDD that produce a new RDD. They are lazy: nothing executes until an action occurs. map() and filter() are examples of ‘transformations’. map() applies the supplied function to each element of the RDD and returns a new RDD with the results, while filter() creates a new RDD containing only the elements of the current RDD that satisfy the supplied predicate.
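A minimal sketch of both transformations, assuming a local SparkContext (the master URL, application name and object name are illustrative, not prescribed by the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("transformations"))

    val numbers = sc.parallelize(1 to 10)

    // map() produces a new RDD by applying the function to every element.
    val squares = numbers.map(n => n * n)

    // filter() produces a new RDD containing only elements that satisfy the predicate.
    val evenSquares = squares.filter(_ % 2 == 0)

    // Nothing has run yet -- both transformations are lazy.
    // The collect() action below triggers the actual computation.
    println(evenSquares.collect().mkString(", "))

    sc.stop()
  }
}
```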
9. Define ‘Action’ in Spark.
An ‘action’ brings data back from the RDD to the local machine (the driver) and triggers the execution of all transformations defined before it. reduce() is an action that repeatedly applies the supplied function to pairs of elements until only one value is left. The take(n) action, on the other hand, returns the first n elements of the RDD to the local node.
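A short sketch of both actions, again assuming a local SparkContext for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("actions"))

    val numbers = sc.parallelize(1 to 100)

    // reduce() repeatedly applies the binary function until a single value remains.
    val sum = numbers.reduce(_ + _)

    // take(n) returns the first n elements of the RDD to the driver.
    val firstFive = numbers.take(5)

    println(s"sum = $sum, first five = ${firstFive.mkString(", ")}")

    sc.stop()
  }
}
```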
10. What are the functions of ‘Spark Core’?
‘Spark Core’ performs an array of critical functions such as memory management, job monitoring, fault tolerance, job scheduling and interaction with storage systems.
It is the foundation of the overall project and provides distributed task dispatching, scheduling, and basic input and output functionality. The RDD abstraction in Spark Core is what makes it fault tolerant. An RDD is a collection of items distributed across many nodes that can be manipulated in parallel, and Spark Core provides many APIs for building and manipulating these collections.
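As a sketch of what “a collection of items distributed across many nodes, manipulated in parallel” looks like in the Spark Core API (the data, partition count and names below are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkCoreExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("spark-core"))

    // An RDD is a collection of items split into partitions across the cluster;
    // the second argument to parallelize() controls how many partitions it gets.
    val words = sc.parallelize(Seq("spark", "core", "spark", "rdd", "core", "spark"), numSlices = 3)

    // Each partition is processed in parallel; reduceByKey() combines results across partitions.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.collect().foreach { case (word, count) => println(s"$word -> $count") }

    sc.stop()
  }
}
```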
11. What is an ‘RDD Lineage’?
Spark does not replicate data in memory. In the event of any data loss, the lost partitions are rebuilt using the ‘RDD Lineage’: the graph of all the parent RDDs from which an RDD was derived. Spark walks this graph and re-runs the recorded transformations to reconstruct the lost data partitions.
The toDebugString() method can be used to print an RDD’s lineage.
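A small sketch of inspecting a lineage with toDebugString() (the RDD chain here is an arbitrary example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("lineage"))

    val base = sc.parallelize(1 to 1000)
    val derived = base.map(_ * 2).filter(_ % 3 == 0)

    // toDebugString prints the lineage graph: every parent RDD this RDD was derived from.
    // If a partition of `derived` is lost, Spark re-runs just these steps to rebuild it.
    println(derived.toDebugString)

    sc.stop()
  }
}
```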
12. What is a ‘Spark Driver’?
‘Spark Driver’ is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. The driver also delivers the RDD graphs to the ‘Master’, where the standalone cluster manager runs. From a technical perspective, the driver is the process that creates the SparkContext.
13. What is SparkContext?
‘SparkContext’ is the main entry point for Spark functionality. A ‘SparkContext’ represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
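A minimal sketch showing the SparkContext creating an RDD, a broadcast variable and an accumulator (assumes Spark 2.x or later for longAccumulator; the names and values are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    // The SparkContext holds the connection to the cluster and is the entry point for RDD operations.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("context"))

    // Create an RDD from a local collection.
    val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // A broadcast variable ships a read-only value to every executor once.
    val factor = sc.broadcast(10)

    // An accumulator collects values back from the executors, e.g. a counter.
    val counter = sc.longAccumulator("processed")

    val scaled = data.map { n =>
      counter.add(1)
      n * factor.value
    }

    println(scaled.collect().mkString(", "))
    println(s"elements processed: ${counter.value}")

    sc.stop()
  }
}
```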
14. What is Hive on Spark?
Hive is a data warehouse component shipped with Hadoop distributions such as Hortonworks’ Data Platform (HDP), and it provides an SQL-like interface to the data stored in the cluster. With Hive on Spark, Spark users automatically get the complete set of Hive’s rich features, including any new features that Hive might introduce in the future.
The main work in implementing the Spark execution engine for Hive lies in query planning, where the Hive operator plans produced by the semantic analyzer are translated into a task plan that Spark can execute. It also includes query execution, where the generated Spark plan is actually executed on the Spark cluster.