78. What are the key features of Apache Spark that you like?
Spark provides advanced analytics options such as graph algorithms, machine learning, and streaming data processing.
It has built-in APIs in multiple languages, including Java, Scala, Python, and R.
It offers significant performance gains: applications can run up to ten times faster on disk and up to 100 times faster in memory than on a Hadoop MapReduce cluster.
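For example, here is a minimal word count in Spark's Scala API (an illustrative sketch; the application name, local master setting, and input.txt path are assumptions):

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; in production the master is set by the cluster manager
    val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
    val lines = spark.sparkContext.textFile("input.txt")
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.cache()                      // keep the result in memory for fast reuse
    counts.take(10).foreach(println)
    spark.stop()
  }
}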
79. How does Spark use Hadoop?
Spark has its own cluster management and computation engine and mainly uses Hadoop (HDFS) for storage.
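For instance, a Spark job can read its input directly from HDFS (a minimal spark-shell-style sketch; the namenode address and path are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HdfsRead").getOrCreate()
// Read text files stored in HDFS by the Hadoop cluster
val logs = spark.sparkContext.textFile("hdfs://namenode:9000/data/logs")
println(s"Number of log lines: ${logs.count()}")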
80. What are the various data sources available in SparkSQL?
Parquet file
JSON Datasets
Hive tables
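All three can be loaded through the SparkSession API, as in this minimal sketch (the file paths and Hive table name are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLSources")
  .enableHiveSupport()                                        // required for Hive tables
  .getOrCreate()

val parquetDF = spark.read.parquet("/data/events.parquet")    // Parquet file
val jsonDF    = spark.read.json("/data/users.json")           // JSON dataset
val hiveDF    = spark.sql("SELECT * FROM sales")              // Hive table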
81. What is the advantage of a Parquet file?
Parquet is a columnar (more precisely, hybrid columnar) storage format: all columns are stored together, compressed, within a single file. Its advantages are that it:
Limits I/O operations
Consumes less space
Fetches only the required columns
Offers better write performance by storing metadata at the end of the file
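Column pruning is easy to see in practice (a minimal sketch; the path and column names are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetPruning").getOrCreate()
val events = spark.read.parquet("/data/events.parquet")
// Only the userId and amount columns are read from disk, not the entire file
events.select("userId", "amount").show()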
82. How can you compare Hadoop and Spark in terms of ease of use?
Hadoop MapReduce requires programming in Java, which is difficult, although Pig and Hive make it considerably easier; learning Pig and Hive syntax still takes time. Spark has interactive APIs for different languages such as Java, Python, and Scala, and also includes Spark SQL (which grew out of Shark) for SQL users, making it comparatively easier to use than Hadoop.
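As an illustration of that interactivity, the same query can be written with the DataFrame API or plain SQL from the spark-shell, where the spark session is predefined (a minimal sketch; the file path, table, and column names are hypothetical):

val users = spark.read.json("/data/users.json")
users.createOrReplaceTempView("users")
spark.sql("SELECT country, COUNT(*) AS n FROM users GROUP BY country").show()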
83. Why is BlinkDB used?
BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It returns query results annotated with meaningful error bars, which helps users trade off query accuracy against response time.
84. Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon is a memory-centric, fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce.
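A spark-shell-style sketch of sharing data through Tachyon (the master host, port, and path are assumptions, not taken from the source):

// Write an RDD to Tachyon so another framework or job can read it at memory speed
val numbers = spark.sparkContext.parallelize(1 to 1000)
numbers.saveAsTextFile("tachyon://tachyon-master:19998/shared/numbers")
// Read it back, possibly from a different Spark application
val reread = spark.sparkContext.textFile("tachyon://tachyon-master:19998/shared/numbers")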