Monday, November 4, 2013

Hadoop Core (HDFS and YARN) Components Explained

It's critical to understand the core components of Hadoop YARN (Yet Another Resource Negotiator), also known as MapReduce 2.0, and how those components interact with each other in the system. The following tutorial explains each component, and there are reference links at the bottom you can follow for more details.

If you don't have Hadoop set up on your Linux machine, you can follow the Hadoop Setup Guide

NameNode (Hadoop FileSystem Component)

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.


DataNode (Hadoop FileSystem Component)

A DataNode stores the actual data in HDFS. A functional filesystem typically has more than one DataNode in the cluster, with data replicated across them. On startup, a DataNode connects to the NameNode, waiting until that service comes up. It then responds to requests from the NameNode for filesystem operations.
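
To make the split between the two components concrete, here is a small sketch using the HDFS Java API: the client asks the NameNode (through the FileSystem class) for metadata and block locations, while the actual bytes are served by the DataNodes returned by getHosts(). The class name and the path /user/hduser/someFile.txt are illustrative assumptions, not part of any standard example.

BlockLocationSketch.java

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
 public static void main(String[] args) throws Exception {
  FileSystem fs = FileSystem.get(new Configuration());
  FileStatus status = fs.getFileStatus(new Path("/user/hduser/someFile.txt"));

  // File metadata (length, replication factor) comes from the NameNode
  System.out.println("Length: " + status.getLen()
    + ", replication: " + status.getReplication());

  // Each block of the file is stored on one or more DataNodes
  BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
  for (BlockLocation block : blocks) {
   System.out.println("Block hosts: " + Arrays.toString(block.getHosts()));
  }
  fs.close();
 }
}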



A quick-start tutorial on HDFS can be found at Hadoop FileSystem (HDFS) Tutorial 1


Application Submission in YARN

1. The Application Submission Client submits an application to the YARN ResourceManager. The client needs to provide sufficient information to the ResourceManager in order to launch the ApplicationMaster (a minimal client-side sketch follows this list).

2. The YARN ResourceManager starts the ApplicationMaster.

3. The ApplicationMaster then communicates with the ResourceManager to request resource allocation.

4. After a container is allocated to it, the ApplicationMaster communicates with the NodeManager to launch the tasks in the container.
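
To make step 1 concrete, below is a minimal client-side sketch using the YarnClient API. It only shows the skeleton of a submission; the application name and the memory/vcore numbers are illustrative assumptions, and a real client would also populate the ContainerLaunchContext with the command, jars and environment needed to start its ApplicationMaster.

SubmitAppSketch.java

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitAppSketch {
 public static void main(String[] args) throws Exception {
  YarnClient yarnClient = YarnClient.createYarnClient();
  yarnClient.init(new YarnConfiguration());
  yarnClient.start();

  // Ask the ResourceManager for a new application id
  YarnClientApplication app = yarnClient.createApplication();
  ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
  appContext.setApplicationName("my-yarn-app");

  // Describe how to launch the ApplicationMaster container
  // (command, local resources and environment are omitted in this sketch)
  ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
  appContext.setAMContainerSpec(amContainer);

  // Resources requested for the ApplicationMaster container
  Resource amResource = Records.newRecord(Resource.class);
  amResource.setMemory(512);
  amResource.setVirtualCores(1);
  appContext.setResource(amResource);

  // Hand the application over to the ResourceManager (steps 2-4 happen in the cluster)
  yarnClient.submitApplication(appContext);
 }
}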


Resource Manager (YARN Component)

The function of the ResourceManager is simple: keeping track of available resources. There is one per cluster. It contains two main components: the Scheduler and the ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for the ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.


Application Master (YARN Component)

An ApplicationMaster is created for each application running in the cluster. It provides task-level scheduling and monitoring.
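
The snippet below is a heavily simplified, hypothetical sketch of what an ApplicationMaster does with the AMRMClient API: register with the ResourceManager, request a container, heartbeat once, and unregister. The memory/vcore numbers are assumptions, and a real ApplicationMaster would loop on allocate() and launch its tasks on the NodeManagers via NMClient.

AppMasterSketch.java

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class AppMasterSketch {
 public static void main(String[] args) throws Exception {
  AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
  rmClient.init(new YarnConfiguration());
  rmClient.start();

  // Tell the ResourceManager this ApplicationMaster is up
  rmClient.registerApplicationMaster("", 0, "");

  // Ask for one container with 1024 MB and 1 vcore (illustrative numbers)
  Resource capability = Records.newRecord(Resource.class);
  capability.setMemory(1024);
  capability.setVirtualCores(1);
  Priority priority = Records.newRecord(Priority.class);
  priority.setPriority(0);
  rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

  // Heartbeat once; a real ApplicationMaster keeps calling allocate()
  // and launches a task in every container it is granted
  AllocateResponse response = rmClient.allocate(0.0f);
  System.out.println("Containers allocated: " + response.getAllocatedContainers().size());

  rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
 }
}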


Node Manager (YARN Component)

The NodeManager is the per-machine framework agent that creates a container for each task. The containers can have variable resource sizes, and the task can be any type of computation, not just map/reduce tasks. The NodeManager then monitors the resource usage (CPU, memory, disk, network) of the containers and reports it to the ResourceManager.

Reference Links

Apache Hadoop NextGen MapReduce (YARN)
Yahoo Hadoop Tutorial
More reference links to be added...


Please feel free to leave me any comments or suggestions below.

Sunday, October 27, 2013

Hadoop FileSystem (HDFS) Tutorial 1

In this tutorial I will show some common commands for HDFS operations.
If you don't have Hadoop set up on your Linux machine, you can follow the Hadoop Setup Guide

Log into Linux; "hduser" is the login used in the following examples.

Start Hadoop if it's not running
$ start-dfs.sh
....
$ start-yarn.sh
Create someFile.txt in your home directory
hduser@ubuntu:~$ vi someFile.txt

Paste any text you want into the file and save it.
Create Home Directory In HDFS (If it doesn't exist)
hduser@ubuntu:~$ hadoop fs -mkdir -p /user/hduser
Copy the file someFile.txt from the local disk to the user's home directory in HDFS.
hduser@ubuntu:~$ hadoop fs -copyFromLocal someFile.txt someFile.txt
Get a directory listing of the user’s home directory in HDFS
hduser@ubuntu:~$ hadoop fs -ls


Found 1 items
-rw-r--r--   1 hduser supergroup          5 2013-10-27 17:57 someFile.txt

Display the contents of the HDFS file /user/hduser/someFile.txt
hduser@ubuntu:~$ hadoop fs -cat /user/hduser/someFile.txt
Get a directory listing of the HDFS root directory
hduser@ubuntu:~$ hadoop fs -ls /
Copy the file back to the local disk, naming it someFile2.txt
hduser@ubuntu:~$ hadoop fs -copyToLocal /user/hduser/someFile.txt someFile2.txt
Delete the file from HDFS
hduser@ubuntu:~$ hadoop fs -rm someFile.txt

Deleted someFile.txt
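
The same operations can also be performed programmatically through the Hadoop FileSystem Java API. Below is a minimal sketch, assuming the cluster configuration is on the classpath and the file names match the ones used above; it is illustrative, not a complete program.

HdfsOperations.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOperations {
 public static void main(String[] args) throws Exception {
  FileSystem fs = FileSystem.get(new Configuration());

  // Equivalent of: hadoop fs -copyFromLocal someFile.txt someFile.txt
  fs.copyFromLocalFile(new Path("someFile.txt"), new Path("/user/hduser/someFile.txt"));

  // Equivalent of: hadoop fs -ls
  for (FileStatus status : fs.listStatus(new Path("/user/hduser"))) {
   System.out.println(status.getPath() + "  " + status.getLen());
  }

  // Equivalent of: hadoop fs -rm someFile.txt
  fs.delete(new Path("/user/hduser/someFile.txt"), false);

  fs.close();
 }
}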


For a full list of commands, please visit HDFS FileSystem Shell Commands. Please feel free to leave me any comments or suggestions.

Friday, October 18, 2013

Hadoop WordCount with the new MapReduce API

There are many versions of the WordCount Hadoop example floating around the web. However, a lot of them use the older version of the Hadoop API. The following is an example of word count using the newer Hadoop MapReduce API, which resides in the org.apache.hadoop.mapreduce package instead of org.apache.hadoop.mapred.

WordMapper.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
 private Text word = new Text();
 private final static IntWritable one = new IntWritable(1);
 
 @Override
 public void map(Object key, Text value,
   Context context) throws IOException, InterruptedException {
  // Break line into words for processing
  StringTokenizer wordList = new StringTokenizer(value.toString());
  while (wordList.hasMoreTokens()) {
   word.set(wordList.nextToken());
   context.write(word, one);
  }
 }
}

SumReducer.java

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;



public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
 
 private IntWritable totalWordCount = new IntWritable();
 
 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
  int wordCount = 0;
  Iterator<IntWritable> it=values.iterator();
  while (it.hasNext()) {
   wordCount += it.next().get();
  }
  totalWordCount.set(wordCount);
  context.write(key, totalWordCount);
 }
}

WordCount.java (Driver)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
 public static void main(String[] args) throws Exception {
        if (args.length != 2) {
          System.out.println("usage: [input] [output]");
          System.exit(-1);
        }
  
  
        Job job = Job.getInstance(new Configuration());
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordMapper.class); 
        job.setReducerClass(SumReducer.class);  

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(WordCount.class);

        // Submit the job and wait for it to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
}

Monday, April 15, 2013

Java IO vs NIO

Java NIO (New IO) was introduced in JDK 1.4. Java IO is centered around Stream/Reader/Writer classes and uses the decorator pattern as its main design, where you decorate one stream type with another in the right order. NIO, in contrast, is centered around Channel/Buffer/Selector.

Read a file using IO


File file = new File("C:\\test\\test.txt");
FileInputStream fis = new FileInputStream(file);

//wrap the InputStreamReader in a BufferedReader
BufferedReader d = new BufferedReader(new InputStreamReader(fis));
String line = null;
while ((line = d.readLine()) != null) {
         System.out.println(line);
}
d.close();

Read a file using NIO

RandomAccessFile rFile = new RandomAccessFile("C:\\test\\test.txt", "rw");
   
//get Channel
FileChannel inChannel = rFile.getChannel();
   
//Initialize byte buffer
ByteBuffer buf = ByteBuffer.allocate(100);
   
//read data to buffer
while(inChannel.read(buf)>0){
 buf.flip(); //flip into read mode
 while(buf.hasRemaining()){
  System.out.print((char)buf.get());
 }
 buf.clear();
    
}
rFile.close();