Apache Spark interview questions

PySpark @ Freshers.in

50. What is GraphX?
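GraphX is the Spark API for graphs and graph-parallel computation. It unifies ETL, exploratory analysis, and iterative graph computation within a single system. It aims for performance comparable to specialized graph systems while keeping Spark's fault tolerance and ease of use, so no separate graph-processing toolset is needed.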

51. What is the File System API?
The File System (FS) API provides a common interface for reading data from different storage systems such as HDFS, S3, or the local file system. Spark uses this API to read data from these storage engines.

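52. Why are partitions immutable?
Every transformation generates new partitions instead of modifying existing ones. Because a partition never changes once it is created, it can be distributed across nodes and recomputed safely after a failure, which is what makes it fault tolerant. Partitions are also aware of data locality: for input read through the HDFS API, Spark tries to schedule tasks close to where the data is stored.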

53. What is a Transformation in Spark?
Spark provides two kinds of operations on RDDs: transformations and actions. Transformations are lazy: Spark only records them and does not compute any data until an action is called. Each transformation returns a new RDD. Common Spark transformations include map, flatMap, groupByKey, reduceByKey, filter, cogroup, join, sortByKey, union, distinct, and sample.
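The lazy behavior described above can be imitated in plain Python, where `map` and `filter` also build a pipeline without running it. This is only a toy model under that assumption, not Spark itself; the names `double`, `calls`, and `data` are made up for the illustration, and the PySpark calls mentioned in the comments are the rough counterparts.

```python
# Toy model of lazy evaluation in plain Python (not Spark itself).
calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


data = [1, 2, 3, 4]

# "Transformations": building the pipeline runs nothing yet.
doubled = map(double, data)                 # comparable to rdd.map(double)
large = filter(lambda v: v > 4, doubled)    # comparable to rdd.filter(...)
calls_before_action = len(calls)            # double() has not been called

# "Action": materializing the result forces the whole pipeline to run.
result = list(large)                        # comparable to rdd.collect()
calls_after_action = len(calls)             # double() ran once per element
```

The point of the sketch is that `double` is not invoked when the pipeline is defined, only when the result is requested, which mirrors how Spark defers work until an action is called.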

54. What is an Action in Spark?
An action is an RDD operation that returns a value to the Spark driver program or writes data to external storage, and it kicks off a job to execute on the cluster. The output of transformations is the input to actions. Common actions in Apache Spark include reduce, collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey, and foreach.

55. What is RDD Lineage?
Lineage is the mechanism an RDD uses to reconstruct lost partitions. By default, Spark does not replicate data in memory; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.
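The idea of rebuilding lost data from a remembered recipe can be sketched with a small toy class. Everything here is hypothetical (`ToyRDD`, its fields, and its methods are invented for the illustration and are not Spark's implementation); the sketch only assumes that the original source data is still available.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it stores its recipe, not just data."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self.source = source  # base data, only set on the root dataset
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn          # how to derive this dataset from its parent
        self.name = name      # label of the step that produced this dataset
        self.cache = None     # computed data, which may be lost

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)], name="filter")

    def compute(self):
        # Recompute from the parent chain whenever the cached data is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.fn(self.parent.compute())
        return self.cache

    def lineage(self):
        # Walk back to the root and return the steps in the order they apply.
        steps, node = [], self
        while node is not None:
            steps.append(node.name)
            node = node.parent
        return list(reversed(steps))


base = ToyRDD(source=[1, 2, 3, 4])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
first_result = evens_squared.compute()

# Simulate losing the computed data for the last two steps.
evens_squared.cache = None
evens_squared.parent.cache = None
rebuilt = evens_squared.compute()  # replayed from the recorded steps
```

No copy of the computed data was kept anywhere; the rebuilt result comes purely from replaying the recorded `filter` and `map` steps over the source.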

56. What are map and flatMap in Spark?
map and flatMap are similar in that each takes an element (for example, a line) from the input RDD and applies a function to it. They differ in what that function returns. With map, the function returns exactly one element per input. With flatMap, each input item can be mapped to zero or more output items, so the function should return a sequence (a Seq in Scala) rather than a single item, and the results are flattened into one RDD. flatMap is therefore most often used to break records into their elements, such as lines into words. In the Scala API, map is defined in the abstract class RDD; it is a transformation, which means it is a lazy operation.
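The difference can be seen with a minimal plain-Python sketch that mimics the two operations on a list of lines. It uses only the standard library, so it is an analogy rather than Spark code; `lines` and `split_words` are made-up names, and on a PySpark RDD the corresponding calls would be `rdd.map(split_words)` and `rdd.flatMap(split_words)`.

```python
from itertools import chain

lines = ["hello spark", "", "map and flatMap"]


def split_words(line):
    # Returns a list with zero or more words for each input line.
    return line.split()


# map-like: exactly one output element per input element (one list per line).
mapped = list(map(split_words, lines))

# flatMap-like: the per-line lists are flattened into a single sequence,
# so the empty line contributes nothing to the output.
flat_mapped = list(chain.from_iterable(map(split_words, lines)))
```

Here `mapped` keeps three entries, one per input line (including an empty list for the empty line), while `flat_mapped` holds the five individual words.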

Author: user
