Monday, November 4, 2013

Hadoop Core (HDFS and YARN) Components Explained

It's critical to understand the core components in Hadoop YARN (Yet Another Resource Negotiator) or MapReduce 2.0, and how the components interact with each other in the system. Following tutorial will explain those components and there are reference links at the bottom you can follow to read up more details.

If you don't have Hadoop setup in your linux, you can follow Hadoop Setup Guide

NameNode (Hadoop FileSystem Component)

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

DateNode (Hadoop FileSystem Component)

A DataNode stores the actual data in the HDFS. A functional filesystem typically have more than one DataNode in the cluster, with data replicated across them. On startup, a DataNode connects to the NameNode; spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.

A quickstart tutorial on HDFS can be Hadoop FileSystem (HDFS) Tutorial 1

Application Submission in YARN

1. Application Submission Client submits an Application to the YARN Resource Manager. The client needs to provide sufficient information to the ResourceManager in order to launch ApplicationMaster

2. YARN ResourceManager starts ApplicationMaster.

3. The ApplicationMaster then communicates with the ResourceManager to request resource allocation.

4. After a container is allocated to it, the ApplicationMaster communicates with the NodeManager to launch the tasks in the container.

Resource Manager (YARN Component)

The function of the Resource Manager is simple: Keeping track of available resources. One per cluster. It contains two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications.
The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.

Application Master (YARN Component)

Application Master is created for each application running in the cluster. It provides task-level scheduling and monitoring.

Node Manager (YARN Component)

The NodeManager is the per-machine framework agent who creates container for each task. The containers can have variable resource sizes and the task can be any type of computations not just map/reduce tasks. It then monitors the resource usage (cpu, memory, disk, network) of the container and report them to the ResourceManager.

Reference Links

Apache Hadoop NextGen MapReduce (YARN)
Yahoo Hadoop Tutorial
More reference links to be added...

Please feel to leave me any comments or suggestions below.

Sunday, October 27, 2013

Hadoop FileSystem (HDFS) Tutorial 1

In this tutorial I will show some common commands for HDFS operations.
If you don't have Hadoop setup in your linux, you can follow Hadoop Setup Guide

Log into Linux, "hduser" is the login used in following examples.

Start Hadoop If it's not running
Create someFile.txt in your home directory
hduser@ubuntu:~$ vi someFile.txt

Paste any text you want in to the file and save it.
Create Home Directory In HDFS (If it doesn't exist)
hduser@ubuntu:~$ hadoop fs -mkdir -p /user/hduser
Copy file someFile.txt from local disk to the user’s directory in HDFS.
hduser@ubuntu:~$ hadoop fs -copyFromLocal someFile.txt someFile.txt
Get a directory listing of the user’s home directory in HDFS
hduser@ubuntu:~$ hadoop fs –ls

Found 1 items
-rw-r--r--   1 hduser supergroup          5 2013-10-27 17:57 someFile.txt

Display the contents of the HDFS file /user/hduser/someFile.txt
hduser@ubuntu:~$ hadoop fs –cat /user/hduser/someFile.txt
Get a directory listing of the HDFS root directory
hduser@ubuntu:~$ hadoop fs –ls /
copy that file to the local disk, named as someFile2.txt
hduser@ubuntu:~$ hadoop fs –copyToLocal /user/hduser/someFile.txt someFile2.txt
Delete the file from hadoop hdfs
hduser@ubuntu:~$ hadoop fs –rm someFile.txt

Deleted someFile.txt

For a full list of commands, Please visit HDFS FileSystem Shell Commands. Please feel free to leave me any comments or suggestions.

Monday, October 21, 2013

Setup newest Hadoop 2.x (2.2.0) on Ubuntu

In this tutorial I am going to guide you through setting up hadoop 2.2.0 environment on Ubuntu.


$ sudo apt-get install openjdk-7-jdk
$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.12) (7u25-2.3.12-4ubuntu3)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
$ cd /usr/lib/jvm
$ ln -s java-7-openjdk-amd64 jdk

$ sudo apt-get install openssh-server

Add Hadoop Group and User

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo
After user is created, re-login into ubuntu using hduser

Setup SSH Certificate

$ ssh-keygen -t rsa -P ''
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/
$ cat ~/.ssh/ >> ~/.ssh/authorized_keys
$ ssh localhost

Download Hadoop 2.2.0

$ cd ~
$ wget
$ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv hadoop-2.2.0 hadoop
$ sudo chown -R hduser:hadoop hadoop

Setup Hadoop Environment Variables

$cd ~
$vi .bashrc

paste following to the end of the file

#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
###end of paste

$ cd /usr/local/hadoop/etc/hadoop
$ vi

#modify JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/jdk/
Re-login into Ubuntu using hdser and check hadoop version
$ hadoop version
Hadoop 2.2.0
Subversion -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
At this point, hadoop is installed.

Configure Hadoop

$ cd /usr/local/hadoop/etc/hadoop
$ vi core-site.xml
#Paste following between <configuration>

$ vi yarn-site.xml
#Paste following between <configuration>



$ mv mapred-site.xml.template mapred-site.xml
$ vi mapred-site.xml
#Paste following between <configuration>

$ cd ~
$ mkdir -p mydata/hdfs/namenode
$ mkdir -p mydata/hdfs/datanode
$ cd /usr/local/hadoop/etc/hadoop
$ vi hdfs-site.xml
Paste following between <configuration> tag


Format Namenode

hduser@ubuntu40:~$ hdfs namenode -format

Start Hadoop Service


hduser@ubuntu40:~$ jps
If everything is sucessful, you should see following services running
2583 DataNode
2970 ResourceManager
3461 Jps
3177 NodeManager
2361 NameNode
2840 SecondaryNameNode

Run Hadoop Example

hduser@ubuntu: cd /usr/local/hadoop
hduser@ubuntu:/usr/local/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5

Number of Maps  = 2
Samples per Map = 5
13/10/21 18:41:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/10/21 18:41:04 INFO client.RMProxy: Connecting to ResourceManager at /
13/10/21 18:41:04 INFO input.FileInputFormat: Total input paths to process : 2
13/10/21 18:41:04 INFO mapreduce.JobSubmitter: number of splits:2
13/10/21 18:41:04 INFO Configuration.deprecation: is deprecated. Instead, use

Note: ericduq has created a shell script ( for this setup and it is available at git repo at

What to read next
Hadoop FileSystem (HDFS) Tutorial 1
Hadoop 2.x Core (HDFS and YARN) Components Explained
Hadoop Wordcount example

Feel free to leave comments below. I will have more hadoop tutorials added regularly.

Friday, October 18, 2013

Hadoop WordCount with new map reduce api

There are so many version of WordCount hadoop example flowing around the web. However, a lot of them are using the older version of hadoop api. Following are example of word count using the newest hadoop map reduce api. The new map reduce api reside in org.apache.hadoop.mapreduce package instead of org.apache.hadoop.mapred.

import java.util.StringTokenizer;

import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
 private Text word = new Text();
 private final static IntWritable one = new IntWritable(1);
 public void map(Object key, Text value,
   Context contex) throws IOException, InterruptedException {
  // Break line into words for processing
  StringTokenizer wordList = new StringTokenizer(value.toString());
  while (wordList.hasMoreTokens()) {
   contex.write(word, one);

import java.util.Iterator;

import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
 private IntWritable totalWordCount = new IntWritable();
 public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
  int wordCount = 0;
  Iterator<IntWritable> it=values.iterator();
  while (it.hasNext()) {
   wordCount +=;
  context.write(key, totalWordCount);
} (Driver)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
 public static void main(String[] args) throws Exception {
        if (args.length != 2) {
          System.out.println("usage: [input] [output]");
        Job job = Job.getInstance(new Configuration());



        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));




Monday, April 15, 2013

Java IO vs NIO

Java NIO (New IO) was introduced in JDK 1.4. While Java IO centralizes around Stream/Reader/Writer and uses decorator pattern as its main design, where you have decorate one InputStream type with each other in the right order. The NIO uses Channel/Buffer/Selector.

Read a file using IO

File file =new File("C:\\test\\test.txt");
FileInputStream  fis = new FileInputStream(file);

//decorate BufferedReader with InputStreamReader
BufferedReader d = new BufferedReader(new InputStreamReader(fis));
String line=null;
while((line = d.readLine()) != null){

Read a file using NIO

RandomAccessFile rFile = new RandomAccessFile("C:\\test\\test.txt", "rw");
//get Channel
FileChannel inChannel = rFile.getChannel();
//Intialize byte buffer
ByteBuffer buf = ByteBuffer.allocate(100);
//read data to buffer
 buf.flip(); //flip into read mode