Since SpringSource announced spring-hadoop, let's run through a quick hands-on exercise.
(http://blog.springsource.org/2012/02/29/introducing-spring-hadoop/)
This is definitely not the regular way to use spring-hadoop; I made some changes because I ran into dependency and IPC issues.
- Prerequisite: Hadoop 0.20.2+. If you don't have a cluster yet, you can set one up by following http://trac.3du.me/cloud/wiki/Hadoop_Lab1
Of course, you can get a free account from NCHC too.
http://hadoop.nchc.org.tw/
- Step1. Get spring-hadoop. You can fetch it with git or download it from the website.
/home/evanshsu/springhadoop git clone git://github.com/SpringSource/spring-hadoop.git
- Step2. Build spring-data-hadoop.jar (the project builds with Gradle) and collect it into a lib/ directory.
/home/evanshsu/springhadoop ./gradlew jar
/home/evanshsu/springhadoop mkdir lib
/home/evanshsu/springhadoop cp build/libs/spring-data-hadoop-1.0.0.BUILD-SNAPSHOT.jar lib/
- Step3. Get spring-framework and copy its jars into the same lib/ directory.
/home/evanshsu/spring unzip spring-framework-3.1.1.RELEASE.zip
/home/evanshsu/spring cp spring-framework-3.1.1.RELEASE/dist/*.jar /home/evanshsu/springhadoop/lib/
- Step4. Change the build file so that all the jars are assembled into one jar file.
description = 'Spring Hadoop Samples - WordCount'

apply plugin: 'base'
apply plugin: 'java'
apply plugin: 'idea'
apply plugin: 'eclipse'

repositories {
    flatDir(dirs: '/home/evanshsu/springhadoop/lib/')
    // Public Spring artefacts
    maven { url "http://repo.springsource.org/libs-release" }
    maven { url "http://repo.springsource.org/libs-milestone" }
    maven { url "http://repo.springsource.org/libs-snapshot" }
}

dependencies {
    compile fileTree('/home/evanshsu/springhadoop/lib/')
    compile "org.apache.hadoop:hadoop-examples:$hadoopVersion"
    // see HADOOP-7461
    runtime "org.codehaus.jackson:jackson-mapper-asl:$jacksonVersion"
    testCompile "junit:junit:$junitVersion"
    testCompile "org.springframework:spring-test:$springVersion"
}

jar {
    from configurations.compile.collect {
        it.isDirectory() ? it : zipTree(it).matching {
            exclude 'META-INF/spring.schemas'
            exclude 'META-INF/spring.handlers'
        }
    }
}
- Step5. Change the HDFS hostname and the wordcount paths (wordcount.input.path, wordcount.output.path, hd.fs). If you use NCHC, set hd.fs=hdfs://gm2.nchc.org.tw:8020
wordcount.input.path=/user/evanshsu/input.txt
wordcount.output.path=/user/evanshsu/output
hive.host=localhost
hive.port=12345
hive.url=jdbc:hive://${hive.host}:${hive.port}
hd.fs=hdfs://localhost:9000
mapred.job.tracker=localhost:9001
path.cat=bin${file.separator}stream-bin${file.separator}cat
path.wc=bin${file.separator}stream-bin${file.separator}wc
input.directory=logs
log.input=/logs/input/
log.output=/logs/output/
distcp.src=${hd.fs}/distcp/source.txt
distcp.dst=${hd.fs}/distcp/dst
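Note that placeholders such as ${hd.fs} are expanded by Spring's property-placeholder (wired up in Step6), not by Java itself. As a quick sanity check that the file parses, you can load it with plain java.util.Properties; this is just a sketch, and the `lookup` helper here is only for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class PropsCheck {
    // Load a properties string the same way Spring reads hadoop.properties.
    // Note: java.util.Properties does NOT expand ${...} references --
    // <context:property-placeholder> does that at context startup.
    static String lookup(String conf, String key) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(conf));
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        // A minimal excerpt; in the real project, load the file
        // src/main/resources/hadoop.properties with a FileReader instead.
        String conf = "wordcount.input.path=/user/evanshsu/input.txt\n"
                + "hd.fs=hdfs://localhost:9000\n";
        System.out.println(lookup(conf, "hd.fs")); // prints hdfs://localhost:9000
    }
}
```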
- Step6. Edit the application context. This is the most important part of spring-hadoop: it wires together the Hadoop configuration, the wordcount job, and a setup script.
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xmlns:p="http://www.springframework.org/schema/p"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <context:property-placeholder location="hadoop.properties"/>

    <hdp:configuration>
        fs.default.name=${hd.fs}
    </hdp:configuration>

    <hdp:job id="wordcount-job" validate-paths="false"
        input-path="${wordcount.input.path}" output-path="${wordcount.output.path}"
        mapper="org.springframework.data.hadoop.samples.wordcount.WordCountMapper"
        reducer="org.springframework.data.hadoop.samples.wordcount.WordCountReducer"
        jar-by-class="org.springframework.data.hadoop.samples.wordcount.WordCountMapper" />

    <hdp:script id="clean-script" language="javascript">
        // 'hack' default permissions to make Hadoop work on Windows
        if (java.lang.System.getProperty("os.name").startsWith("Windows")) {
            // 0655 = rw-r-xr-x
            org.apache.hadoop.mapreduce.JobSubmissionFiles.JOB_DIR_PERMISSION.fromShort(0655)
            org.apache.hadoop.mapreduce.JobSubmissionFiles.JOB_FILE_PERMISSION.fromShort(0655)
        }

        inputPath = "${wordcount.input.path}"
        outputPath = "${wordcount.output.path}"

        if (fsh.test(inputPath)) { fsh.rmr(inputPath) }
        if (fsh.test(outputPath)) { fsh.rmr(outputPath) }

        // copy using the streams directly (to be portable across envs)
        inStream = cl.getResourceAsStream("data/nietzsche-chapter-1.txt")
        org.apache.hadoop.io.IOUtils.copyBytes(inStream, fs.create(inputPath), cfg)
    </hdp:script>

    <!-- simple job runner -->
    <bean id="runner" class="org.springframework.data.hadoop.mapreduce.JobRunner" p:jobs-ref="wordcount-job"/>
</beans>
- Step7. Add your own mapper and reducer.
/home/evanshsu/spring/samples/wordcount vim src/main/java/org/springframework/data/hadoop/samples/wordcount/WordCountMapper.java
package org.springframework.data.hadoop.samples.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
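The mapper's core is just StringTokenizer, which splits a line on runs of whitespace and emits each token with a count of 1. A standalone sketch of that splitting step (the `tokens` helper is only for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapperSketch {
    // Mirrors WordCountMapper.map(): split a line on whitespace,
    // yielding one token per (word, 1) pair the mapper would emit.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Runs of whitespace collapse into a single separator.
        System.out.println(tokens("thus spoke  Zarathustra")); // prints [thus, spoke, Zarathustra]
    }
}
```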
/home/evanshsu/spring/samples/wordcount vim src/main/java/org/springframework/data/hadoop/samples/wordcount/WordCountReducer.java
package org.springframework.data.hadoop.samples.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
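Put together, the mapper and reducer compute per-word counts. The same logic in plain Java, with no Hadoop on the classpath, is handy for sanity-checking what output to expect before submitting the job (class and method names here are just for illustration):

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    // Equivalent of map -> shuffle -> reduce for one in-memory input.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();       // map: emit (word, 1)
            counts.merge(word, 1, Integer::sum); // reduce: sum the 1s per key
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be")); // prints {be=2, not=1, or=1, to=2}
    }
}
```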
- Step8. Add spring.schemas and spring.handlers.
/home/evanshsu/spring/samples/wordcount vim src/main/resources/META-INF/spring.schemas
http\://www.springframework.org/schema/context/spring-context.xsd=org/springframework/context/config/spring-context-3.1.xsd
http\://www.springframework.org/schema/hadoop/spring-hadoop.xsd=/org/springframework/data/hadoop/config/spring-hadoop-1.0.xsd
/home/evanshsu/spring/samples/wordcount vim src/main/resources/META-INF/spring.handlers
http\://www.springframework.org/schema/p=org.springframework.beans.factory.xml.SimplePropertyNamespaceHandler
http\://www.springframework.org/schema/context=org.springframework.context.config.ContextNamespaceHandler
http\://www.springframework.org/schema/hadoop=org.springframework.data.hadoop.config.HadoopNamespaceHandler
- Step9. Build the sample jar (with Gradle, as configured in Step4), then run it:
/home/evanshsu/spring/samples/wordcount hadoop jar build/libs/wordcount-1.0.0.M1.jar org.springframework.data.hadoop.samples.wordcount.Main
- Step10. Confirm it works, e.g. by listing and printing the job output on HDFS:
hadoop fs -ls /user/evanshsu/output
hadoop fs -cat /user/evanshsu/output/part-r-00000