Monday, November 4, 2013

Hadoop Core (HDFS and YARN) Components Explained

It's critical to understand the core components of Hadoop YARN (Yet Another Resource Negotiator), also known as MapReduce 2.0, and how those components interact with each other. This tutorial explains each component, and the reference links at the bottom point to more details.

If you don't have Hadoop set up on your Linux machine, you can follow the Hadoop Setup Guide.

NameNode (Hadoop FileSystem Component)

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
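To make the split between metadata and data concrete, here is a minimal sketch of the two lookups described above. It is a toy in-memory model, not the Hadoop API: the class ToyNameNode, its methods, the file path, and the names blk_1, dn1 and so on are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical model of NameNode metadata; it holds no file contents."""

    # file path -> ordered list of block IDs (the directory tree, flattened)
    namespace: dict = field(default_factory=dict)
    # block ID -> names of the DataNodes that hold a copy of that block
    block_locations: dict = field(default_factory=dict)

    def add_file(self, path, blocks):
        """Record a file; `blocks` maps each block ID to the nodes storing it."""
        self.namespace[path] = list(blocks)
        for block_id, datanodes in blocks.items():
            self.block_locations[block_id] = list(datanodes)

    def locate(self, path):
        """Return (block ID, DataNodes) pairs telling a client where to read."""
        return [(b, self.block_locations[b]) for b in self.namespace[path]]


nn = ToyNameNode()
nn.add_file("/logs/app.log", {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]})
locations = nn.locate("/logs/app.log")
print(locations)
```

The point of the sketch is that a lookup returns only locations. The bytes of the file would still have to be fetched from the DataNodes listed.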

DataNode (Hadoop FileSystem Component)

A DataNode stores the actual data in HDFS. A functional filesystem typically has more than one DataNode in the cluster, with data replicated across them. On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for filesystem operations.
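The startup behaviour (keep trying until the NameNode is reachable) can be sketched as a plain retry loop. This is only an illustration under assumptions: connect_with_retry, FlakyNameNode and the attempt limit are hypothetical names and choices, and a real DataNode follows its own retry policy.

```python
import time


def connect_with_retry(try_connect, max_attempts=10, delay_seconds=0.0):
    """Call try_connect() until it succeeds; return the number of attempts used.

    try_connect is a hypothetical callable that returns True once the
    NameNode answers. The attempt limit only keeps this sketch from
    looping forever.
    """
    for attempt in range(1, max_attempts + 1):
        if try_connect():
            return attempt
        time.sleep(delay_seconds)  # wait before trying again
    raise ConnectionError("NameNode did not come up in time")


class FlakyNameNode:
    """Stand-in service that starts answering on the third connection attempt."""

    def __init__(self, ready_after=3):
        self.ready_after = ready_after
        self.calls = 0

    def accept(self):
        self.calls += 1
        return self.calls >= self.ready_after


name_node = FlakyNameNode()
attempts = connect_with_retry(name_node.accept)
print("connected after", attempts, "attempts")
```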

A quickstart tutorial on HDFS can be found in Hadoop FileSystem (HDFS) Tutorial 1.

Application Submission in YARN

1. The application submission client submits an application to the YARN ResourceManager. The client needs to provide enough information for the ResourceManager to launch the ApplicationMaster.

2. The ResourceManager starts the ApplicationMaster.

3. The ApplicationMaster then communicates with the ResourceManager to request resource allocation.

4. After a container is allocated to it, the ApplicationMaster communicates with the NodeManager to launch the tasks in the container.
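The four steps above can be traced with a small simulation. This is a sketch under stated assumptions, not YARN code: the Toy classes, their methods, the application name "wordcount" and the container IDs are all hypothetical, and the real protocol is RPC-based and asynchronous.

```python
events = []  # ordered trace of who talks to whom


class ToyNodeManager:
    def launch(self, container_id, what):
        events.append(f"NodeManager launches {what} in {container_id}")


class ToyResourceManager:
    def __init__(self, node_manager):
        self.node_manager = node_manager
        self.next_id = 0

    def allocate(self):
        """Hand out a fresh container ID (no real resource accounting here)."""
        self.next_id += 1
        return f"container_{self.next_id}"

    def submit(self, app_name, num_tasks):
        # Step 1: the client submits the application.
        events.append(f"client submits {app_name}")
        # Step 2: the ResourceManager starts the ApplicationMaster.
        am_container = self.allocate()
        self.node_manager.launch(am_container, "ApplicationMaster")
        ToyApplicationMaster(self, self.node_manager).run(num_tasks)


class ToyApplicationMaster:
    def __init__(self, resource_manager, node_manager):
        self.resource_manager = resource_manager
        self.node_manager = node_manager

    def run(self, num_tasks):
        for i in range(num_tasks):
            # Step 3: ask the ResourceManager for resources.
            container = self.resource_manager.allocate()
            events.append(f"ApplicationMaster is allocated {container}")
            # Step 4: ask the NodeManager to launch the task in that container.
            self.node_manager.launch(container, f"task {i}")


ToyResourceManager(ToyNodeManager()).submit("wordcount", num_tasks=2)
for line in events:
    print(line)
```

Note that the client only ever talks to the ResourceManager. Everything after step 2 is driven by the ApplicationMaster.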

Resource Manager (YARN Component)

The ResourceManager has a simple job: keeping track of the resources available in the cluster. There is one ResourceManager per cluster. It contains two main components: the Scheduler and the ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for the ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
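As a rough illustration of what "allocating resources" means for the Scheduler, here is a first-come, first-served sketch that grants memory requests against the free memory on each node. The function fifo_allocate, the node names and the sizes are assumptions made up for this example. It does not model any of the actual pluggable YARN schedulers.

```python
def fifo_allocate(free_mb_by_node, requests):
    """Grant requests in arrival order on the first node with enough memory.

    free_mb_by_node: dict of node name -> free memory in MB (updated in place).
    requests: list of (application name, memory in MB) tuples.
    Returns a list of (application, node, memory) grants; requests that fit
    nowhere are simply skipped in this sketch.
    """
    grants = []
    for app, mb in requests:
        for node, free in free_mb_by_node.items():
            if free >= mb:
                free_mb_by_node[node] = free - mb  # reserve the memory
                grants.append((app, node, mb))
                break
    return grants


free = {"node1": 4096, "node2": 2048}
requests = [("app1", 3072), ("app2", 2048), ("app3", 2048)]
grants = fifo_allocate(free, requests)
print(grants)
print(free)
```

In this run app3 gets nothing, because no node has 2048 MB left after the first two grants.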

Application Master (YARN Component)

An ApplicationMaster is created for each application running in the cluster. It provides task-level scheduling and monitoring.

Node Manager (YARN Component)

The NodeManager is the per-machine framework agent that creates a container for each task. Containers can have variable resource sizes, and a task can be any type of computation, not just a map or reduce task. The NodeManager then monitors each container's resource usage (CPU, memory, disk, network) and reports it to the ResourceManager.
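A minimal sketch of that bookkeeping follows, assuming containers are described only by memory and virtual cores. ReportingNodeManager, ToyContainer and the shape of the report are hypothetical. The real NodeManager measures actual usage and reports it through periodic heartbeats.

```python
from dataclasses import dataclass


@dataclass
class ToyContainer:
    container_id: str
    memory_mb: int
    vcores: int


class ReportingNodeManager:
    """Hypothetical per-machine agent that tracks containers of varying size."""

    def __init__(self, total_memory_mb, total_vcores):
        self.total_memory_mb = total_memory_mb
        self.total_vcores = total_vcores
        self.containers = {}

    def used_memory_mb(self):
        return sum(c.memory_mb for c in self.containers.values())

    def used_vcores(self):
        return sum(c.vcores for c in self.containers.values())

    def start_container(self, container_id, memory_mb, vcores):
        """Create a container if the machine still has room for it."""
        if (self.used_memory_mb() + memory_mb > self.total_memory_mb
                or self.used_vcores() + vcores > self.total_vcores):
            raise RuntimeError(f"not enough resources for {container_id}")
        self.containers[container_id] = ToyContainer(container_id, memory_mb, vcores)

    def report(self):
        """Build the usage summary that would be sent to the ResourceManager."""
        return {
            "used_memory_mb": self.used_memory_mb(),
            "used_vcores": self.used_vcores(),
            "free_memory_mb": self.total_memory_mb - self.used_memory_mb(),
            "containers": sorted(self.containers),
        }


nm = ReportingNodeManager(total_memory_mb=8192, total_vcores=4)
nm.start_container("c1", memory_mb=1024, vcores=1)
nm.start_container("c2", memory_mb=2048, vcores=2)
usage = nm.report()
print(usage)
```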

Reference Links

Apache Hadoop NextGen MapReduce (YARN)
Yahoo Hadoop Tutorial
More reference links to be added...

Please feel free to leave me any comments or suggestions below.