Apache Storm interview questions

16. What happens when a node dies in Apache Storm?
The tasks assigned to that machine will time-out and Nimbus will reassign those tasks to other machines.

17. What happens when Nimbus or Supervisor daemons die in Apache Storm?
The Nimbus and Supervisor daemons are designed to be fail-fast (process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in Zookeeper or on disk).Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or monit. So if the Nimbus or Supervisor daemons die, they restart like nothing happened. Most notably, no worker processes are affected by the death of Nimbus or the Supervisors. This is in contrast to Hadoop, where if the JobTracker dies, all the running jobs are lost.

18. Is Nimbus a single point of failure in Apache Storm?
If you lose the Nimbus node, the workers will still continue to function. Additionally, supervisors will continue to restart workers if they die. However, without Nimbus, workers won’t be reassigned to other machines when necessary (like if you lose a worker machine).

19. What is Apache Storm’s reliability API?
There are two things you have to do as a user to benefit from Storm’s reliability capabilities. First, you need to tell Storm whenever you’re creating a new link in the tree of tuples. Second, you need to tell Storm when you have finished processing an individual tuple. By doing both these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately. Storm’s API provides a concise way of doing both of these tasks.

20. What happens if a message is fully processed or fails to be fully processed in Apache Storm?
To understand this question, let’s take a look at the lifecycle of a tuple coming off of a spout. For reference, here is the interface that spouts implement (see the Javadoc for more information):

public interface ISpout extends Serializable {
void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
void close();
void nextTuple();
void ack(Object msgId);
void fail(Object msgId);
}
First, Storm requests a tuple from the Spout by calling the nextTuple method on the Spout. The Spout uses the SpoutOutputCollector provided in the open method to emit a tuple to one of its output streams. When emitting a tuple, the Spout provides a “message id” that will be used to identify the tuple later. For example, the KestrelSpout reads a message off of the kestrel queue and emits as the “message id” the id provided by Kestrel for the message. Emitting a message to the SpoutOutputCollector looks like this:
_collector.emit(new Values(“field1”, “field2”, 3) , msgId);

Next, the tuple gets sent to consuming bolts and Storm takes care of tracking the tree of messages that is created. If Storm detects that a tuple is fully processed, Storm will call the ack method on the originating Spout task with the message id that the Spout provided to Storm. Likewise, if the tuple times-out Storm will call the fail method on the Spout. Note that a tuple will be acked or failed by the exact same Spout task that created it. So if a Spout is executing as many tasks across the cluster, a tuple won’t be acked or failed by a different task than the one that created it.

Let’s use KestrelSpout again to see what a Spout needs to do to guarantee message processing. When KestrelSpout takes a message off the Kestrel queue, it “opens” the message. This means the message is not actually taken off the queue yet, but instead placed in a “pending” state waiting for acknowledgement that the message is completed. While in the pending state, a message will not be sent to other consumers of the queue. Additionally, if a client disconnects all pending messages for that client are put back on the queue. When a message is opened, Kestrel provides the client with the data for the message as well as a unique id for the message. The KestrelSpout uses that exact id as the “message id” for the tuple when emitting the tuple to the SpoutOutputCollector. Sometime later on, when ack or fail are called on the KestrelSpout, the KestrelSpout sends an ack or fail message to Kestrel with the message id to take the message off the queue or have it put back on.

Post Views: 107

Related Posts

Installing Apache Spark standalone on Linux
Installing Spark on a Linux machine can be done in a few steps. The following…

Learn how to connect Hive with Apache Spark.
HiveContext is a Spark SQL module that allows you to work with Hive data in…

When you should not use Apache Spark ? Explain with reason.
There are a few situations where it may not be appropriate to use Apache Spark,…

Apache PIG interview questions
1. What is pig? Pig is a Apache open soucre project which run on top…

How do you break a lineage in Apache Spark ? Why we need to break a lineage in Apache Spark ?
In Apache Spark, a lineage refers to the series of RDD (Resilient Distributed Dataset) operations…

AWS Glue interview questions
For Spark please visit (1) Spark Interview Questions (2) Spark Examples (3) PySpark Blogs 1.…

Algorithm interview questions
1. What is Insertion sort ? Insertion sort takes elements of the array sequentially and…

Digital Electronics interview questions
1. What is a Multiplexer ? Multiplexer, also known as a data selector, is a…

Operating system interview questions
1. What is the main purpose of an operating system ? Three main functions: a.…

Compiler interview questions
1. What is an interpreter ? An interpreter is a program that appears to execute…

Pages: 1 2 3 4 5