Monday, November 4, 2013

Hadoop Core (HDFS and YARN) Components Explained

It's critical to understand the core components of Hadoop YARN (Yet Another Resource Negotiator), also known as MapReduce 2.0, and how those components interact with each other. This tutorial explains each component; the reference links at the bottom provide more detail.

If you don't have Hadoop set up on your Linux machine, you can follow the Hadoop Setup Guide.

NameNode (Hadoop FileSystem Component)

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

DataNode (Hadoop FileSystem Component)

A DataNode stores the actual data in HDFS. A functional filesystem typically has more than one DataNode in the cluster, with data replicated across them. On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for filesystem operations.

For a quickstart tutorial on HDFS, see Hadoop FileSystem (HDFS) Tutorial 1.
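
To make the split between the two roles concrete, here is a minimal sketch in plain Java. It is a toy model, not Hadoop code: ToyNameNode, addBlock, locate, and the block and DataNode names are all made up for illustration. The only point is that the NameNode holds metadata (which blocks make up a file and where their replicas live), while the bytes themselves stay on the DataNodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these names are hypothetical and are not part of the Hadoop API.
class ToyNameNode {
    // File path -> ordered list of block IDs (the namespace metadata).
    private final Map<String, List<String>> fileToBlocks = new HashMap<>();
    // Block ID -> names of the DataNodes that hold a replica of that block.
    private final Map<String, List<String>> blockToDataNodes = new HashMap<>();

    // Records that a block belongs to a file and which DataNodes store it.
    void addBlock(String path, String blockId, String... dataNodes) {
        fileToBlocks.computeIfAbsent(path, p -> new ArrayList<>()).add(blockId);
        blockToDataNodes.put(blockId, new ArrayList<>(Arrays.asList(dataNodes)));
    }

    // For each block of the file, in order, returns the DataNodes a client could read it from.
    List<List<String>> locate(String path) {
        List<List<String>> locations = new ArrayList<>();
        for (String blockId : fileToBlocks.getOrDefault(path, new ArrayList<>())) {
            locations.add(blockToDataNodes.get(blockId));
        }
        return locations;
    }

    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode();
        nameNode.addBlock("/user/hduser/someFile.txt", "blk_1", "datanode1", "datanode2");
        nameNode.addBlock("/user/hduser/someFile.txt", "blk_2", "datanode2", "datanode3");
        // Prints the replica locations per block; no file data is stored in the NameNode.
        System.out.println(nameNode.locate("/user/hduser/someFile.txt"));
    }
}
```

Note that the model never stores file contents, which mirrors the statement that the NameNode does not hold the data itself.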

Application Submission in YARN

1. The application submission client submits an application to the YARN ResourceManager. The client must provide enough information for the ResourceManager to launch the ApplicationMaster.

2. The YARN ResourceManager starts the ApplicationMaster.

3. The ApplicationMaster then communicates with the ResourceManager to request resource allocation.

4. After a container is allocated to it, the ApplicationMaster communicates with the NodeManager to launch the tasks in the container.

Resource Manager (YARN Component)

The function of the ResourceManager is simple: keeping track of available resources. There is one per cluster. It contains two main components: the Scheduler and the ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for the ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
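
The resource-tracking role can be sketched in a few lines of plain Java. This is a toy model under loud assumptions: ToyResourceManager, registerNode, allocateContainer, and the first-fit policy are invented for illustration and do not reflect the real YARN scheduler or its API, which supports pluggable policies and more resource types than memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: hypothetical names and a first-fit policy, not the YARN API.
class ToyResourceManager {
    // Node name -> memory (in MB) that is still free on that node.
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    // A node reports how much memory it can offer.
    void registerNode(String node, int memoryMb) {
        freeMemoryMb.put(node, memoryMb);
    }

    // Scheduler role: reserve memory on the first node that has enough of it.
    // Returns the chosen node name, or null when no node can satisfy the request.
    String allocateContainer(int memoryMb) {
        for (Map.Entry<String, Integer> entry : freeMemoryMb.entrySet()) {
            if (entry.getValue() >= memoryMb) {
                entry.setValue(entry.getValue() - memoryMb);
                return entry.getKey();
            }
        }
        return null;
    }

    // Free memory left on a node, or 0 for an unknown node.
    int freeMemoryOn(String node) {
        return freeMemoryMb.getOrDefault(node, 0);
    }
}
```

An ApplicationMaster in this picture would call allocateContainer once per task and then ask the node that was returned to launch the task.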

Application Master (YARN Component)

An ApplicationMaster is created for each application running in the cluster. It provides task-level scheduling and monitoring.

Node Manager (YARN Component)

The NodeManager is the per-machine framework agent that creates a container for each task. Containers can have variable resource sizes, and a task can be any type of computation, not just a map or reduce task. The NodeManager then monitors each container's resource usage (CPU, memory, disk, network) and reports it to the ResourceManager.

Reference Links

Apache Hadoop NextGen MapReduce (YARN)
Yahoo Hadoop Tutorial
More reference links to be added...

Please feel free to leave me any comments or suggestions below.

Sunday, October 27, 2013

Hadoop FileSystem (HDFS) Tutorial 1

In this tutorial I will show some common commands for HDFS operations.
If you don't have Hadoop set up on your Linux machine, you can follow the Hadoop Setup Guide.

Log into Linux; "hduser" is the login used in the following examples.

Start Hadoop if it's not running

$ start-dfs.sh
....
$ start-yarn.sh

Create someFile.txt in your home directory

hduser@ubuntu:~$ vi someFile.txt

Paste any text you want into the file and save it.

Create Home Directory In HDFS (If it doesn't exist)

hduser@ubuntu:~$ hadoop fs -mkdir -p /user/hduser

Copy file someFile.txt from local disk to the user’s directory in HDFS.

hduser@ubuntu:~$ hadoop fs -copyFromLocal someFile.txt someFile.txt

Get a directory listing of the user’s home directory in HDFS

hduser@ubuntu:~$ hadoop fs -ls


Found 1 items
-rw-r--r--   1 hduser supergroup          5 2013-10-27 17:57 someFile.txt

Display the contents of the HDFS file /user/hduser/someFile.txt

hduser@ubuntu:~$ hadoop fs -cat /user/hduser/someFile.txt

Get a directory listing of the HDFS root directory

hduser@ubuntu:~$ hadoop fs -ls /

Copy the file to the local disk as someFile2.txt

hduser@ubuntu:~$ hadoop fs -copyToLocal /user/hduser/someFile.txt someFile2.txt

Delete the file from HDFS

hduser@ubuntu:~$ hadoop fs -rm someFile.txt

Deleted someFile.txt

For a full list of commands, please visit HDFS FileSystem Shell Commands. Please feel free to leave me any comments or suggestions.

Friday, October 18, 2013

Hadoop WordCount with new map reduce api

There are many versions of the Hadoop WordCount example floating around the web. However, a lot of them use the older Hadoop API. The following is a word count example that uses the newer Hadoop MapReduce API, which resides in the org.apache.hadoop.mapreduce package instead of org.apache.hadoop.mapred.

WordMapper.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
 private Text word = new Text();
 private final static IntWritable one = new IntWritable(1);
 
 @Override
 public void map(Object key, Text value,
   Context context) throws IOException, InterruptedException {
  // Break line into words for processing
  StringTokenizer wordList = new StringTokenizer(value.toString());
  while (wordList.hasMoreTokens()) {
   word.set(wordList.nextToken());
   context.write(word, one);
  }
 }
}

SumReducer.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
 
 private IntWritable totalWordCount = new IntWritable();
 
 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
  // Sum the counts emitted by the mappers for this word
  int wordCount = 0;
  for (IntWritable value : values) {
   wordCount += value.get();
  }
  totalWordCount.set(wordCount);
  context.write(key, totalWordCount);
 }
}

WordCount.java (Driver)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
 public static void main(String[] args) throws Exception {
        if (args.length != 2) {
          System.out.println("usage: [input] [output]");
          System.exit(-1);
        }
  
  
        Job job = Job.getInstance(new Configuration());
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordMapper.class); 
        job.setReducerClass(SumReducer.class);  

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

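        job.setJarByClass(WordCount.class);

        // Block until the job finishes and exit with a status that reflects success or failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);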
 }
}
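
To see what the mapper and reducer above compute without a cluster, the same flow can be imitated in plain Java. This is only a sketch: LocalWordCount and count are hypothetical names, no Hadoop classes are involved, and the grouping step stands in for the shuffle that the framework performs between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java imitation of the map -> shuffle -> reduce flow; no Hadoop classes are used.
class LocalWordCount {
    static Map<String, Integer> count(List<String> lines) {
        // Map + shuffle: emit (word, 1) for each token and group the ones by word.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                grouped.computeIfAbsent(tokens.nextToken(), w -> new ArrayList<>()).add(1);
            }
        }
        // Reduce: sum the grouped values for each word.
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(count(Arrays.asList("to be or not", "to be")));
    }
}
```

Like the StringTokenizer in WordMapper, this splits on whitespace only, so "Be" and "be," would count as different words.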

Monday, April 15, 2013

Java IO vs NIO

Java NIO (New IO) was introduced in JDK 1.4. Java IO centers on Stream/Reader/Writer and uses the decorator pattern as its main design: you wrap one stream or reader in another, in the right order, to add behavior. NIO instead uses Channel/Buffer/Selector.

Read a file using IO


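import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

File file = new File("C:\\test\\test.txt");

// Decorate the FileInputStream with an InputStreamReader, then a BufferedReader;
// try-with-resources (Java 7+) closes the whole chain when the block exits
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file)))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}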

Read a file using NIO

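import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Open read-only; "rw" is not needed to read and would create the file if it were missing
RandomAccessFile rFile = new RandomAccessFile("C:\\test\\test.txt", "r");

// Get the channel
FileChannel inChannel = rFile.getChannel();

// Initialize the byte buffer
ByteBuffer buf = ByteBuffer.allocate(100);

// Read data into the buffer until end of file (read returns -1)
while (inChannel.read(buf) > 0) {
 buf.flip(); // flip into read mode
 while (buf.hasRemaining()) {
  // The char cast is only correct for single-byte characters such as ASCII
  System.out.print((char) buf.get());
 }
 buf.clear(); // back to write mode for the next read
}
rFile.close();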
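
The flip() and clear() calls are the part of NIO that tends to confuse newcomers, so here is a small sketch that exercises them without touching a file. BufferModes, roundTrip, and stateAfterFlip are names made up for this example. It also decodes the bytes as UTF-8 in one step, which is one way to handle text that contains multi-byte characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrates the write-mode / read-mode cycle of a ByteBuffer without touching a file.
class BufferModes {
    // Writes the text into a buffer, flips it, and reads the bytes back out.
    // The text must encode to at most 100 bytes, the capacity chosen here.
    static String roundTrip(String text) {
        ByteBuffer buf = ByteBuffer.allocate(100);
        buf.put(text.getBytes(StandardCharsets.UTF_8)); // write mode: position advances
        buf.flip();                                      // limit = bytes written, position = 0
        byte[] out = new byte[buf.remaining()];
        buf.get(out);                                    // read mode: copy everything up to the limit
        buf.clear();                                     // position = 0, limit = capacity again
        return new String(out, StandardCharsets.UTF_8);  // decode all bytes at once
    }

    // Returns {position, limit} right after flip() for a buffer that holds n bytes.
    static int[] stateAfterFlip(int n) {
        ByteBuffer buf = ByteBuffer.allocate(100);
        buf.put(new byte[n]);
        buf.flip();
        return new int[] {buf.position(), buf.limit()};
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello NIO"));
    }
}
```

When a file is read in fixed-size chunks, a multi-byte character can straddle two reads, so real code would typically use a CharsetDecoder or simply a Reader rather than decoding each chunk on its own.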

Monday, October 15, 2012

Customize Spring JSON output

Spring uses Jackson JSON from http://jackson.codehaus.org/ to serialize an object for JSON output. To change the default strategy for serializing or formatting a field in a bean, you can create your own serializer. The following example customizes the date format in the JSON output.

1. Create the serializer

public class CustomDateSerializer extends JsonSerializer<Date> {
 public static Log logger=LogFactory.getLog(CustomDateSerializer.class);
 
 @Override
 public void serialize(Date value, JsonGenerator gen, SerializerProvider provider) throws IOException, JsonProcessingException {
  // SimpleDateFormat is not thread-safe, so create a new instance per call
  SimpleDateFormat formatter = new SimpleDateFormat("MM-dd-yyyy");
  String formattedDate = formatter.format(value);

  gen.writeString(formattedDate);
 }
}

2. Annotate the bean property to use the serializer


public class TransactionHistoryBean{
  private Date date;

  @JsonSerialize(using = CustomDateSerializer.class)
  public Date getDate() {
      return date;
  }
}
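
To check what the "MM-dd-yyyy" pattern produces before wiring it into Jackson, the formatting step can be run on its own with the JDK. In this sketch DatePatternDemo and format are hypothetical names, and the explicit Locale and TimeZone arguments are assumptions added here so the output does not depend on the machine's defaults; the serializer above uses the defaults.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Standalone check of the date pattern used by the serializer; no Jackson classes involved.
class DatePatternDemo {
    static String format(Date value, TimeZone zone) {
        // Locale.US and the explicit zone are assumptions that make the result reproducible.
        SimpleDateFormat formatter = new SimpleDateFormat("MM-dd-yyyy", Locale.US);
        formatter.setTimeZone(zone);
        return formatter.format(value);
    }

    public static void main(String[] args) {
        // The epoch (1 January 1970, 00:00 UTC) formats as 01-01-1970
        System.out.println(format(new Date(0L), TimeZone.getTimeZone("UTC")));
    }
}
```

The same instant can fall on a different calendar day in another time zone, which is worth keeping in mind if the JSON consumers are in different regions.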

Monday, June 11, 2012

Spring Session Scope Bean

Since HTTP is stateless, a typical way to keep a user's state across multiple requests in a web application is through HttpSession. However, doing it this way gives your application a tight dependency on HttpSession and its application container. With Spring, state can be carried across multiple requests through a session-scoped bean, which is cleaner and more flexible. A session-scoped bean is very easy to set up; I will try to explain it in a couple of steps.

Scenario: Whenever a user enters your site, they are asked to pick a theme to use when navigating your site.
Solution: Store the user's selection using a Spring session-scoped bean

1. Annotate your bean

@Component
@Scope(value = "session", proxyMode = ScopedProxyMode.TARGET_CLASS)
public class UserSelection{
   private String selectedTheme;
   //and other session values

   public void setSelectedTheme(String theme){
        this.selectedTheme=theme;
   }

   public String getSelectedTheme(){
       return this.selectedTheme;
   }

}

2. Inject your session scoped bean

@Controller
@RequestMapping("/")
public class SomeController{
   private UserSelection userSelection;

   @Resource
   public void setUserSelection(UserSelection userSelection){
         this.userSelection=userSelection;
   }

   @RequestMapping("saveThemeChoice")
   public void saveThemeChoice(@RequestParam String selectedTheme){
        userSelection.setSelectedTheme(selectedTheme);
   }

   
   public String doOperationBasedOnTheme(){
          String theme=userSelection.getSelectedTheme();
 
          //operation codes
          return theme; // placeholder return so the method compiles
   }

}

In the above code snippet, although 'SomeController' has singleton scope (the default scope of a Spring bean), each session gets its own instance of UserSelection. Spring injects a scoped proxy into the controller, and the proxy routes each call to the UserSelection instance that belongs to the current session, creating a new instance when the session is new.
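
The idea behind the proxy can be sketched with the JDK alone. Everything here is an assumption made for illustration: ScopedProxySketch, ThemeHolder, CURRENT_SESSION, and the per-session map are invented names, and the sketch uses a JDK interface proxy, whereas ScopedProxyMode.TARGET_CLASS makes Spring generate a CGLIB subclass and look the target up in the real HTTP session.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of a scoped proxy; hypothetical names, not how Spring is implemented.
class ScopedProxySketch {
    interface ThemeHolder {
        void setSelectedTheme(String theme);
        String getSelectedTheme();
    }

    static class UserSelection implements ThemeHolder {
        private String selectedTheme;
        public void setSelectedTheme(String theme) { this.selectedTheme = theme; }
        public String getSelectedTheme() { return selectedTheme; }
    }

    // Stand-in for "the session of the current request"; must be set before the proxy is used.
    static final ThreadLocal<String> CURRENT_SESSION = new ThreadLocal<>();
    // Session ID -> the instance that belongs to that session.
    private static final Map<String, ThemeHolder> PER_SESSION = new ConcurrentHashMap<>();

    // A singleton can hold this one proxy; each call is routed to the current session's instance.
    static ThemeHolder sessionScopedProxy() {
        InvocationHandler handler = (proxy, method, args) -> {
            ThemeHolder target =
                    PER_SESSION.computeIfAbsent(CURRENT_SESSION.get(), id -> new UserSelection());
            return method.invoke(target, args);
        };
        return (ThemeHolder) Proxy.newProxyInstance(
                ThemeHolder.class.getClassLoader(), new Class<?>[] {ThemeHolder.class}, handler);
    }
}
```

Setting CURRENT_SESSION to "A", storing a theme, and then switching to "B" shows the effect: the same proxy object returns null for "B" until that session stores its own value.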

Required jars

Besides the Spring library, the aopalliance and cglib jars are also required:
http://mvnrepository.com/artifact/cglib/cglib-nodep
http://mvnrepository.com/artifact/aopalliance/aopalliance

Monday, June 4, 2012

Spring Custom Converter

Recently I ran into a problem where Spring would automatically convert a comma-separated String into a String array. After a quick search on Google, I found that people have the same problem on stackoverflow.com
http://stackoverflow.com/questions/4998748/how-to-prevent-parameter-binding-from-interpreting-commas-in-spring-3-0-5

Read more about Spring Converter http://static.springsource.org/spring/docs/current/spring-framework-reference/htmlsingle/spring-framework-reference.html#core-convert

Just a summary of the solution.

1. Create Custom Converter for String to String[]

import org.springframework.core.convert.converter.Converter;
import org.springframework.util.StringUtils;

public class CustomStringToArrayConverter implements Converter<String, String[]> {
    @Override
    public String[] convert(String source) {
        // Split on ';' instead of ',' so commas stay inside each value
        return StringUtils.delimitedListToStringArray(source, ";");
    }
}
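
The effect of switching the delimiter can be checked with the JDK alone. This sketch uses String.split as a stand-in for Spring's StringUtils; SemicolonSplitDemo and split are hypothetical names, and edge cases such as empty input may behave differently from the Spring helper.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// JDK-only stand-in for the converter's splitting step; not the Spring implementation.
class SemicolonSplitDemo {
    static String[] split(String source) {
        // Pattern.quote treats ';' literally; the -1 limit keeps trailing empty values.
        return source.split(Pattern.quote(";"), -1);
    }

    public static void main(String[] args) {
        // The comma stays inside the first element: two values, "Smith, John" and "Doe"
        System.out.println(Arrays.toString(split("Smith, John;Doe")));
    }
}
```

With the default comma-based conversion, the same input would have been broken into three pieces, which is the behavior the custom converter is meant to avoid.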

2. Register the Custom Converter.

Specify this conversion service bean in your configuration:
