March 2012 Archives

Spring Hadoop Quick Start

Spring has finally integrated with Hadoop and released Spring Hadoop.
When you first try Spring Hadoop, you will run into all sorts of odd problems, and people have already started reporting them.
If you just want a quick taste without having to troubleshoot those issues yourself, the steps below are a fast way to experience what Spring Hadoop can do.


  • Requirements: Hadoop 0.20.2+
  If you have not installed Hadoop yet, you can follow NCHC's tutorial to install it:
  http://trac.3du.me/cloud/wiki/Hadoop_Lab1
  You can also apply directly for NCHC's hosted Hadoop environment:
  http://hadoop.nchc.org.tw/

  Once Hadoop is ready, let's get started.

  • Step1. Download Spring Hadoop. The commands below use git; if you are not familiar with git, you can instead download an archive from the project site and unpack it.
  The examples below use my home directory; remember to replace it with your own.
  /home/evanshsu mkdir springhadoop 
  /home/evanshsu cd springhadoop
  /home/evanshsu/springhadoop git init
  /home/evanshsu/springhadoop git pull "git://github.com/SpringSource/spring-hadoop.git"

  • Step2. Build spring-hadoop.jar.
  After the build, put the resulting jar into /home/evanshsu/springhadoop/lib so that all the jars can later be bundled into a single archive.
  /home/evanshsu/springhadoop ./gradlew jar
  /home/evanshsu/springhadoop mkdir lib
  /home/evanshsu/springhadoop cp build/libs/spring-data-hadoop-1.0.0.BUILD-SNAPSHOT.jar lib/
   

  • Step3. Get spring-framework.
  Spring Hadoop depends on spring-framework, so its jars also need to go into the lib directory.
  /home/evanshsu/spring wget "http://s3.amazonaws.com/dist.springframework.org/release/SPR/spring-framework-3.1.1.RELEASE.zip"
  /home/evanshsu/spring unzip spring-framework-3.1.1.RELEASE.zip
  /home/evanshsu/spring cp spring-framework-3.1.1.RELEASE/dist/*.jar /home/evanshsu/springhadoop/lib/


  • Step4. Modify the build file so that all the jars can be packaged into a single jar.
  /home/evanshsu/spring/samples/wordcount vim build.gradle

    description = 'Spring Hadoop Samples - WordCount'

    apply plugin: 'base'
    apply plugin: 'java'
    apply plugin: 'idea'
    apply plugin: 'eclipse'

    repositories {
        flatDir(dirs: '/home/evanshsu/springhadoop/lib/')
        // Public Spring artefacts
        maven { url "http://repo.springsource.org/libs-release" }
        maven { url "http://repo.springsource.org/libs-milestone" }
        maven { url "http://repo.springsource.org/libs-snapshot" }
    }

    dependencies {
        compile fileTree('/home/evanshsu/springhadoop/lib/')
        compile "org.apache.hadoop:hadoop-examples:$hadoopVersion"
        // see HADOOP-7461
        runtime "org.codehaus.jackson:jackson-mapper-asl:$jacksonVersion"

        testCompile "junit:junit:$junitVersion"
        testCompile "org.springframework:spring-test:$springVersion"
    }

    jar {
        from configurations.compile.collect { it.isDirectory() ? it : zipTree(it).matching{
            exclude 'META-INF/spring.schemas'
            exclude 'META-INF/spring.handlers'
            } }
    }

  • Step5. hadoop.properties holds the Hadoop-related settings.
  Change wordcount.input.path and wordcount.output.path to the paths the wordcount job will use, and remember to put a few text files under wordcount.input.path.
  Also change hd.fs to match your HDFS setup.
  If you are using NCHC's Hadoop, set hd.fs=hdfs://gm2.nchc.org.tw:8020
  /home/evanshsu/spring/samples/wordcount vim src/main/resources/hadoop.properties

    wordcount.input.path=/user/evanshsu/input.txt
    wordcount.output.path=/user/evanshsu/output

    hive.host=localhost
    hive.port=12345
    hive.url=jdbc:hive://${hive.host}:${hive.port}
    hd.fs=hdfs://localhost:9000
    mapred.job.tracker=localhost:9001

    path.cat=bin${file.separator}stream-bin${file.separator}cat
    path.wc=bin${file.separator}stream-bin${file.separator}wc

    input.directory=logs
    log.input=/logs/input/
    log.output=/logs/output/

    distcp.src=${hd.fs}/distcp/source.txt
    distcp.dst=${hd.fs}/distcp/dst

  • Step6. This is the most important configuration file; anyone who has used Spring knows that this file is the soul of a Spring application.
  /home/evanshsu/spring/samples/wordcount vim src/main/resources/META-INF/spring/context.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <beans xmlns="http://www.springframework.org/schema/beans"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:context="http://www.springframework.org/schema/context"
        xmlns:hdp="http://www.springframework.org/schema/hadoop"
        xmlns:p="http://www.springframework.org/schema/p"
        xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

        <context:property-placeholder location="hadoop.properties"/>

        <hdp:configuration>
            fs.default.name=${hd.fs}
        </hdp:configuration>

        <hdp:job id="wordcount-job" validate-paths="false"
            input-path="${wordcount.input.path}" output-path="${wordcount.output.path}"
            mapper="org.springframework.data.hadoop.samples.wordcount.WordCountMapper"
            reducer="org.springframework.data.hadoop.samples.wordcount.WordCountReducer"
            jar-by-class="org.springframework.data.hadoop.samples.wordcount.WordCountMapper" />

        <!-- simple job runner -->
        <bean id="runner" class="org.springframework.data.hadoop.mapreduce.JobRunner"  p:jobs-ref="wordcount-job"/>
      
    </beans>

  • Step7. Add your own mapper and reducer.
  /home/evanshsu/spring/samples/wordcount vim src/main/java/org/springframework/data/hadoop/samples/wordcount/WordCountMapper.java

    package org.springframework.data.hadoop.samples.wordcount;
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }


  /home/evanshsu/spring/samples/wordcount vim src/main/java/org/springframework/data/hadoop/samples/wordcount/WordCountReducer.java

    package org.springframework.data.hadoop.samples.wordcount;
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

  • Step8. Add spring.schemas and spring.handlers.
  /home/evanshsu/spring/samples/wordcount vim src/main/resources/META-INF/spring.schemas
    http\://www.springframework.org/schema/context/spring-context.xsd=org/springframework/context/config/spring-context-3.1.xsd
    http\://www.springframework.org/schema/hadoop/spring-hadoop.xsd=/org/springframework/data/hadoop/config/spring-hadoop-1.0.xsd

  /home/evanshsu/spring/samples/wordcount vim src/main/resources/META-INF/spring.handlers

    http\://www.springframework.org/schema/p=org.springframework.beans.factory.xml.SimplePropertyNamespaceHandler
    http\://www.springframework.org/schema/context=org.springframework.context.config.ContextNamespaceHandler
    http\://www.springframework.org/schema/hadoop=org.springframework.data.hadoop.config.HadoopNamespaceHandler

  • Step9. Finally, the last step: package all the jars together and submit the job to Hadoop.
  /home/evanshsu/spring/samples/wordcount ../../gradlew jar
  /home/evanshsu/spring/samples/wordcount hadoop jar build/libs/wordcount-1.0.0.M1.jar org.springframework.data.hadoop.samples.wordcount.Main

  • Step10. Check whether the results came out.
  /home/evanshsu/spring/samples/wordcount hadoop fs -cat /user/evanshsu/output/*
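
  If the job succeeded, each output line is a word followed by its count. As a sanity check, the same counting logic can be reproduced locally with standard tools (sample text assumed; this only mirrors what WordCountMapper and WordCountReducer compute, without touching Hadoop):

```shell
# Split on whitespace, then count occurrences per word:
# the same tokenize-and-sum that the MapReduce job performs.
printf 'hello hadoop\nhello spring\n' | tr -s ' \t' '\n' | sort | uniq -c
```

  Here "hello" is counted twice, and "hadoop" and "spring" once each.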

Spring Hadoop Quickstart

Since Spring announced spring-hadoop, let's go through a quick exercise.
(http://blog.springsource.org/2012/02/29/introducing-spring-hadoop/)
This is definitely not the regular way to use spring-hadoop;
I made some changes because I ran into some dependency and IPC issues.

  • Prerequisite: Hadoop 0.20.2+
  If you don't have a Hadoop environment yet, please refer to this document to install one.
  http://trac.3du.me/cloud/wiki/Hadoop_Lab1

  Of course, you can get a free one from NCHC too.
  http://hadoop.nchc.org.tw/
  • Step1. Get spring-hadoop. You can fetch it with git or download it from the website.
  /home/evanshsu/springhadoop git init
  /home/evanshsu/springhadoop git pull "git://github.com/SpringSource/spring-hadoop.git"
 
  • Step2. Build spring-hadoop.jar
  /home/evanshsu/springhadoop ./gradlew jar
  /home/evanshsu/springhadoop mkdir lib
  /home/evanshsu/springhadoop cp build/libs/spring-data-hadoop-1.0.0.BUILD-SNAPSHOT.jar lib/

  • Step3. Get spring-framework.
  /home/evanshsu/spring wget "http://s3.amazonaws.com/dist.springframework.org/release/SPR/spring-framework-3.1.1.RELEASE.zip"
  /home/evanshsu/spring unzip spring-framework-3.1.1.RELEASE.zip
  /home/evanshsu/spring cp spring-framework-3.1.1.RELEASE/dist/*.jar /home/evanshsu/springhadoop/lib/

  • Step4. Change the build file. We will assemble all the jars into one jar file.
  /home/evanshsu/spring/samples/wordcount vim build.gradle

description = 'Spring Hadoop Samples - WordCount'

apply plugin: 'base'
apply plugin: 'java'
apply plugin: 'idea'
apply plugin: 'eclipse'

repositories {
    flatDir(dirs: '/home/evanshsu/springhadoop/lib/')
    // Public Spring artefacts
    maven { url "http://repo.springsource.org/libs-release" }
    maven { url "http://repo.springsource.org/libs-milestone" }
    maven { url "http://repo.springsource.org/libs-snapshot" }
}

dependencies {
    compile fileTree('/home/evanshsu/springhadoop/lib/')
    compile "org.apache.hadoop:hadoop-examples:$hadoopVersion"
    // see HADOOP-7461
    runtime "org.codehaus.jackson:jackson-mapper-asl:$jacksonVersion"

    testCompile "junit:junit:$junitVersion"
    testCompile "org.springframework:spring-test:$springVersion"
}

jar {
    from configurations.compile.collect { it.isDirectory() ? it : zipTree(it).matching{
        exclude 'META-INF/spring.schemas'
        exclude 'META-INF/spring.handlers'
        } }
}

  • Step5. Change the HDFS hostname and the wordcount paths (wordcount.input.path, wordcount.output.path, hd.fs). If you use NCHC, set hd.fs=hdfs://gm2.nchc.org.tw:8020
  /home/evanshsu/spring/samples/wordcount vim src/main/resources/hadoop.properties

wordcount.input.path=/user/evanshsu/input.txt
wordcount.output.path=/user/evanshsu/output

hive.host=localhost
hive.port=12345
hive.url=jdbc:hive://${hive.host}:${hive.port}
hd.fs=hdfs://localhost:9000
mapred.job.tracker=localhost:9001

path.cat=bin${file.separator}stream-bin${file.separator}cat
path.wc=bin${file.separator}stream-bin${file.separator}wc

input.directory=logs
log.input=/logs/input/
log.output=/logs/output/

distcp.src=${hd.fs}/distcp/source.txt
distcp.dst=${hd.fs}/distcp/dst

  • Step6. This is the most important part of spring-hadoop.
  /home/evanshsu/spring/samples/wordcount vim src/main/resources/META-INF/spring/context.xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xmlns:p="http://www.springframework.org/schema/p"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
    http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
    http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">


    <context:property-placeholder location="hadoop.properties"/>

    <hdp:configuration>
        fs.default.name=${hd.fs}
    </hdp:configuration>

    <hdp:job id="wordcount-job" validate-paths="false"
        input-path="${wordcount.input.path}" output-path="${wordcount.output.path}"
        mapper="org.springframework.data.hadoop.samples.wordcount.WordCountMapper"
        reducer="org.springframework.data.hadoop.samples.wordcount.WordCountReducer"
        jar-by-class="org.springframework.data.hadoop.samples.wordcount.WordCountMapper" />


      <hdp:script id="clean-script" language="javascript">
             // 'hack' default permissions to make Hadoop work on Windows
        if (java.lang.System.getProperty("os.name").startsWith("Windows")) {
             // 0655 = -rw-r-xr-x
             org.apache.hadoop.mapreduce.JobSubmissionFiles.JOB_DIR_PERMISSION.fromShort(0655)
             org.apache.hadoop.mapreduce.JobSubmissionFiles.JOB_FILE_PERMISSION.fromShort(0655)
        }
        
        inputPath = "${wordcount.input.path}"
        outputPath = "${wordcount.output.path}"   
        if (fsh.test(inputPath)) { fsh.rmr(inputPath) }
        if (fsh.test(outputPath)) { fsh.rmr(outputPath) }

        // copy using the streams directly (to be portable across envs)
        inStream = cl.getResourceAsStream("data/nietzsche-chapter-1.txt")
        org.apache.hadoop.io.IOUtils.copyBytes(inStream, fs.create(inputPath), cfg)
    </hdp:script>
   
    <!-- simple job runner -->
    <bean id="runner" class="org.springframework.data.hadoop.mapreduce.JobRunner"  p:jobs-ref="wordcount-job"/>
   
</beans>


  • Step7. Add your own mapper and reducer.
  /home/evanshsu/spring/samples/wordcount vim src/main/java/org/springframework/data/hadoop/samples/wordcount/WordCountMapper.java

package org.springframework.data.hadoop.samples.wordcount;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}


  /home/evanshsu/spring/samples/wordcount vim src/main/java/org/springframework/data/hadoop/samples/wordcount/WordCountReducer.java

package org.springframework.data.hadoop.samples.wordcount;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

  • Step8. Add spring.schemas and spring.handlers.
  /home/evanshsu/spring/samples/wordcount vim src/main/resources/META-INF/spring.schemas
http\://www.springframework.org/schema/context/spring-context.xsd=org/springframework/context/config/spring-context-3.1.xsd
http\://www.springframework.org/schema/hadoop/spring-hadoop.xsd=/org/springframework/data/hadoop/config/spring-hadoop-1.0.xsd

  /home/evanshsu/spring/samples/wordcount vim src/main/resources/META-INF/spring.handlers
http\://www.springframework.org/schema/p=org.springframework.beans.factory.xml.SimplePropertyNamespaceHandler
http\://www.springframework.org/schema/context=org.springframework.context.config.ContextNamespaceHandler
http\://www.springframework.org/schema/hadoop=org.springframework.data.hadoop.config.HadoopNamespaceHandler

  • Step9. Build and run.
  /home/evanshsu/spring/samples/wordcount ../../gradlew jar
  /home/evanshsu/spring/samples/wordcount hadoop jar build/libs/wordcount-1.0.0.M1.jar org.springframework.data.hadoop.samples.wordcount.Main

  • Step10. Confirm it works.
  /home/evanshsu/spring/samples/wordcount hadoop fs -cat /user/evanshsu/output/*

Thanks to founder Yi-Kai Tsai (蔡奕楷) for creating the Taiwan Hadoop User Group page on Facebook.
If you are interested in joining a community that shares Hadoop research in Taiwan,
you are welcome to visit http://www.facebook.com/groups/191142874328429

- Jazz