Thursday, July 16, 2015

create a sample wordcount jar in IntelliJ and run it against hdfs

1. Open IntelliJ and create a new command line project.
2. In WordCount.java, paste the following code:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3. Import dependencies.
'Command' + ';' (Project Structure) => click 'Modules' => select 'SDK 1.7' => click '+', select 'Libraries' => 'New Library' => 'From Maven' => search 'hadoop-core' => select 'org.apache.hadoop:hadoop-core:1.2.0'.
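As an alternative to adding the library through the Project Structure dialog, the same dependency can be declared in a Maven pom.xml. A minimal sketch, using the hadoop-core 1.2.0 coordinates found above:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.0</version>
</dependency>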
4. Build a jar.
'Command' + ';' => click "Artifacts" => click '+' => 'JAR' => 'From modules with dependences...'
5. Run the jar against HDFS.
hdfs dfs -put A-LOCAL-FOLDER input
hadoop jar out/artifacts/wordcount_jar/wordcount.jar input output
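If the job succeeds, the counts land under output on HDFS as one word-and-count pair per line, separated by a tab (assuming the default TextOutputFormat; the reducer output file is typically named part-r-00000):
hdfs dfs -cat output/part-r-00000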

set up a hadoop single node cluster on mac

Reference: Apache Hadoop - Setting up a Single Node Cluster

After the setup in the reference above is done, it is better to update the Hadoop config so that it still works after a reboot.
1. Specify directories for the namenode and datanode; otherwise they end up under /tmp and are lost after a reboot. (Absolute paths are safer here, since Hadoop does not expand '~' in config values.)
$ vi etc/hadoop/hdfs-site.xml
Add the following properties:
    <property>
       <name>dfs.namenode.name.dir</name>
       <value>~/hadoop/namenode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>~/hadoop/data</value>
    </property>
2. Disable all permission checks.
$ vi etc/hadoop/hdfs-site.xml
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
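Putting steps 1 and 2 together, the edited part of etc/hadoop/hdfs-site.xml ends up roughly like this; note that the properties have to sit inside the <configuration> root element (the directory values are simply the ones used above):
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>~/hadoop/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>~/hadoop/data</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>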
3. Format the namenode.
$ hdfs namenode -format
4. Start HDFS and YARN.
$ start-dfs.sh && start-yarn.sh
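Before opening the web UIs, jps (shipped with the JDK) is a quick way to confirm the daemons are up; on a healthy single-node setup it should list processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.
$ jps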
5. Visit the web UIs.
http://localhost:50070/ (HDFS NameNode)
http://localhost:8088/ (YARN ResourceManager)