Apache Spark interview questions

PySpark @ Freshers.in

127. What is Apache Mesos?
Apache Mesos is a general-purpose cluster manager that can run both analytics workloads and long-running services (e.g., web applications or key/value stores) on a cluster. To use Spark on Mesos, pass a mesos:// URI to spark-submit:
spark-submit --master mesos://masternode:5050 yourapp

128. What is a stage in Apache Spark?
A stage is a set of independent tasks all computing the same function that need to run as part of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the DAGScheduler runs these stages in topological order.
Each Stage can either be a shuffle map stage, in which case its tasks’ results are input for another stage, or a result stage, in which case its tasks directly compute the action that initiated a job (e.g. count(), save(), etc). For shuffle map stages, we also track the nodes that each output partition is on.
Each Stage also has a jobId, identifying the job that first submitted the stage. When FIFO scheduling is used, this allows Stages from earlier jobs to be computed first or recovered faster on failure.

129. What are coalesce() and repartition() in Apache Spark?
Sometimes, we want to change the partitioning of an RDD outside the context of grouping and aggregation operations. For those cases, Spark provides the repartition() function, which shuffles the data across the network to create a new set of partitions. Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions. To know whether you can safely call coalesce(), you can check the size of the RDD using rdd.partitions.size in Java/Scala and rdd.getNumPartitions() in Python, and make sure that you are coalescing to fewer partitions than the RDD currently has.
coalesce() avoids a full shuffle. If it’s known that the number is decreasing then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes, onto the nodes that we kept.
So, it would go something like this:
Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12
Then coalesce down to 2 partitions:
Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

130. What is sortByKey in Apache Spark?
The sortByKey() function takes a parameter called ascending indicating whether we want the keys in ascending order (it defaults to true).
rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))

131. What are the Actions Available on Pair RDDs in Apache Spark?
In addition to the actions available on base RDDs, pair RDDs support countByKey(), which counts the number of elements for each key; collectAsMap(), which collects the result as a map to provide easy lookup; and lookup(key), which returns all values associated with the provided key.

132. What are the Operations That Benefit from Partitioning in Apache Spark?
Many of Spark’s operations involve shuffling data by key across the network. All of these will benefit from partitioning. As of Spark 1.0, the operations that benefit from partitioning are cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup().
