
Install hadoop+hbase+nutch+elasticsearch

This document is for Anyela Chavarro.
Only these versions of the frameworks work together:

Hadoop 1.2.1

HBase 0.90.4

Nutch 2.2.1

Elasticsearch 0.19.4

Linux version : Ubuntu 12.04.2 LTS

Hadoop cluster environment:

Name node/Job tracker
192.168.1.100 master

Data node/Task tracker
192.168.1.101 slave1
192.168.1.102 slave2
192.168.1.103 slave3

Install Hadoop (pseudo-distributed mode)

add user hadoop
useradd -s /bin/bash -d /home/hadoop -m hadoop
set password
passwd hadoop
login as hadoop
su hadoop
add a data folder
mkdir data
uninstall OpenJDK (CentOS only; these are rpm commands, so skip this step on Ubuntu)
[hadoop@netfox ~]$ rpm -qa | grep java
java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
[hadoop@netfox ~]$ rpm -e --nodeps java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
[hadoop@netfox ~]$ rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
install JDK 1.6 (run the apt-get commands as root or with sudo)

apt-get update
apt-get install python-software-properties
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java6-installer
get hadoop tar file

[hadoop@netfox ~]$ wget http://www.eu.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
untar tar file
[hadoop@netfox hadoop]$ tar -vxf hadoop-1.2.1.tar.gz
install ssh-server

apt-get install openssh-server
set up the SSH key (ssh-keygen is part of OpenSSH)
[hadoop@netfox hadoop]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
append the public key to the authorized_keys file
[hadoop@netfox hadoop]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
restrict the permissions of the authorized_keys file
[hadoop@netfox hadoop]$ chmod 600 ~/.ssh/authorized_keys
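If the login still asks for a password after these steps, file permissions are a common cause. The following is a minimal sketch that checks for the conventional modes (700 on the directory, 600 on authorized_keys). The helper name check_ssh_perms is hypothetical, the check is stricter than sshd strictly needs, and it assumes GNU stat as shipped with Ubuntu.

```shell
# Sketch: report whether ~/.ssh and authorized_keys carry the conventional
# modes 700 and 600. check_ssh_perms is a hypothetical helper name.
# Relies on GNU stat (-c), which is what Ubuntu ships.
check_ssh_perms() {
    ssh_dir=${1:-$HOME/.ssh}
    dir_mode=$(stat -c '%a' "$ssh_dir") || return 1
    file_mode=$(stat -c '%a' "$ssh_dir/authorized_keys") || return 1
    if [ "$dir_mode" = "700" ] && [ "$file_mode" = "600" ]; then
        echo "ok"
    else
        echo "fix: $ssh_dir is $dir_mode (want 700), authorized_keys is $file_mode (want 600)"
        return 1
    fi
}
```

Call it as check_ssh_perms with no argument to inspect the current user's ~/.ssh. Once it prints ok, "ssh localhost" should log in without a password prompt.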
find the IP address of the local machine

[hadoop@netfox hadoop]$ ifconfig
the IP address appears in the output after "inet addr:"
inet addr:192.168.1.100
add the master entry to /etc/hosts (editing it requires root); it must be the first line of the file

[hadoop@netfox hadoop]$ vi /etc/hosts
192.168.1.100 master
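To confirm the ordering this step asks for, a small check can print the first real entry of the hosts file. This is only a sketch; first_host_entry is a hypothetical helper name, and it treats lines starting with # as comments.

```shell
# Sketch: print "IP NAME" for the first line of a hosts file that is neither
# blank nor a comment. first_host_entry is a hypothetical helper name.
first_host_entry() {
    awk '!/^[ \t]*(#|$)/ { print $1, $2; exit }' "${1:-/etc/hosts}"
}
```

With the file edited as above, first_host_entry should print "192.168.1.100 master".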
add to /etc/profile
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

export HADOOP_HOME=/home/hadoop/hadoop-1.2.1

export HBASE_HOME=/home/hadoop/hbase-0.90.4

export PATH=$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
reload the profile in the current shell
[hadoop@netfox hadoop]$ source /etc/profile
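A quick way to catch a mistyped path in /etc/profile is to test that each variable points at a real directory. The sketch below assumes nothing beyond the variable names exported above; check_home is a hypothetical helper name.

```shell
# Sketch: verify that an exported *_HOME variable points at a directory.
# $1 = variable name (used only in the message), $2 = its value.
# check_home is a hypothetical helper name.
check_home() {
    if [ -n "$2" ] && [ -d "$2" ]; then
        echo "$1 ok: $2"
    else
        echo "$1 missing or not a directory: '$2'"
        return 1
    fi
}
```

For example: check_home JAVA_HOME "$JAVA_HOME" and check_home HADOOP_HOME "$HADOOP_HOME". The same check on HBASE_HOME is expected to fail until HBase has been unpacked later in this document.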
create the data folder (if it does not exist yet)
hadoop@netfox:~$ mkdir -p /home/hadoop/data
edit /home/hadoop/hadoop-1.2.1/conf/hdfs-site.xml as below
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

</configuration>
edit /home/hadoop/hadoop-1.2.1/conf/mapred-site.xml as below
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>master:9002</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

</configuration>
edit /home/hadoop/hadoop-1.2.1/conf/core-site.xml as below
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/data</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9001</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

</configuration>
add to /home/hadoop/hadoop-1.2.1/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
put the following single line in both /home/hadoop/hadoop-1.2.1/conf/slaves and /home/hadoop/hadoop-1.2.1/conf/masters
master
format the Hadoop namenode
[hadoop@netfox ~]$ hadoop namenode -format
start hadoop
[hadoop@netfox hadoop]$ start-all.sh
check that Hadoop is installed correctly
[hadoop@netfox hadoop]$ hadoop dfs -ls /
for example, it should print a listing like the following, without any error message.

Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2013-08-28 14:02 /chukwa
drwxr-xr-x   - hadoop supergroup          0 2013-08-29 09:53 /hbase
drwxr-xr-x   - hadoop supergroup          0 2013-08-27 10:36 /opt
drwxr-xr-x   - hadoop supergroup          0 2013-09-01 15:22 /tmp
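Listing the root directory only shows that the namenode answers. A slightly stronger check writes a small file and reads it back, which also exercises a datanode. The sketch below uses only the "hadoop dfs" subcommands -put, -cat and -rm; the helper name hdfs_smoke_test and the HDFS path are hypothetical, and it assumes /tmp exists in HDFS as in the listing above.

```shell
# Sketch: round-trip a small file through HDFS.
# hdfs_smoke_test and the remote path are hypothetical names.
# Assumes /tmp exists in HDFS, as in the listing above.
hdfs_smoke_test() {
    tmp_local=$(mktemp) || return 1
    remote="/tmp/smoke-$$.txt"
    echo "hello hdfs" > "$tmp_local"
    # Upload the file; give up early if the upload fails.
    hadoop dfs -put "$tmp_local" "$remote" || { rm -f "$tmp_local"; return 1; }
    # Read it back, then clean up both copies.
    read_back=$(hadoop dfs -cat "$remote")
    hadoop dfs -rm "$remote" >/dev/null
    rm -f "$tmp_local"
    # Succeed only if the content survived the round trip.
    [ "$read_back" = "hello hdfs" ]
}
```

Run it as: hdfs_smoke_test && echo "HDFS works". The jps command that comes with the JDK is another quick check; it lists the running Hadoop daemons.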

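Install Hadoop (fully-distributed mode)
repeat the pseudo-distributed steps above on slave1, slave2 and slave3, with the following changes:

change the SSH key steps as below:
do not generate a key pair on the slaves; copy the public key from the master to each slave instead and append it to ~/.ssh/authorized_keys there (shown for slave1; repeat for slave2 and slave3).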

[hadoop@netfox hadoop]$ scp ~/.ssh/id_dsa.pub hadoop@slave1:/home/hadoop
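The copied key only takes effect once it is listed in authorized_keys on the slave. The following sketch, meant to be run on each slave, appends the key only if it is not already present, so it is safe to run twice. The helper name install_pubkey is hypothetical; the default paths follow the ones used in this document.

```shell
# Sketch: append a public key to authorized_keys exactly once.
# $1 = public key file, $2 = authorized_keys file (optional).
# install_pubkey is a hypothetical helper name.
install_pubkey() {
    pub=$1
    auth=${2:-$HOME/.ssh/authorized_keys}
    auth_dir=$(dirname "$auth")
    mkdir -p "$auth_dir" && chmod 700 "$auth_dir" || return 1
    touch "$auth" && chmod 600 "$auth" || return 1
    # -x matches whole lines, -F treats the key as a fixed string:
    # append only when the key is not listed yet.
    grep -qxF "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
}
```

On slave1, after the scp above: install_pubkey /home/hadoop/id_dsa.pub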
change the /etc/hosts step as below:
add all four nodes to /etc/hosts

[hadoop@netfox hadoop]$ vi /etc/hosts
192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2
192.168.1.103 slave3
in the masters/slaves step, put the following in /home/hadoop/hadoop-1.2.1/conf/masters

master

add to /home/hadoop/hadoop-1.2.1/conf/slaves

slave1
slave2
slave3
in the start step, run start-all.sh on the master only

[hadoop@netfox hadoop]$ start-all.sh
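start-all.sh reaches the slaves over SSH, so it helps to confirm from the master that every host in conf/slaves accepts a passwordless login. A minimal sketch follows; check_slaves is a hypothetical helper name, and the 5 second timeout is an arbitrary choice.

```shell
# Sketch: try a passwordless SSH login to every host listed in conf/slaves.
# check_slaves is a hypothetical helper name.
check_slaves() {
    slaves_file=${1:-$HADOOP_HOME/conf/slaves}
    failed=0
    while read -r host; do
        [ -z "$host" ] && continue
        # -n keeps ssh from swallowing the rest of the slaves file on stdin;
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "$host ok"
        else
            echo "$host FAILED"
            failed=1
        fi
    done < "$slaves_file"
    return "$failed"
}
```

Run check_slaves on the master before start-all.sh; any host reported as FAILED needs its key or /etc/hosts entry fixed first.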

Install HBase

get hbase tar file
[hadoop@netfox ~]$ wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
untar the file
[hadoop@netfox ~]$ tar -vxf hbase-0.90.4.tar.gz
change /home/hadoop/hbase-0.90.4/conf/hbase-site.xml as below
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9001/hbase</value>
  </property>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>

</configuration>
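The hbase.rootdir value has to point into the same HDFS that fs.default.name names in core-site.xml (hdfs://master:9001 in this document). The sketch below compares the two. It is not a real XML parser: get_prop and check_rootdir are hypothetical helper names, and the awk extraction assumes each name and value tag sits on a single line, as in the listings here.

```shell
# Sketch: read a property from a Hadoop-style XML config and check that
# hbase.rootdir lives under fs.default.name. Both function names are
# hypothetical. Assumes <name> and <value> each fit on one line.
get_prop() {
    # $1 = config file, $2 = property name
    awk -v want="$2" '
        /<name>/  { n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n) }
        /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                    if (n == want) { print v; exit } }
    ' "$1"
}

check_rootdir() {
    # $1 = core-site.xml, $2 = hbase-site.xml
    fs=$(get_prop "$1" fs.default.name)
    root=$(get_prop "$2" hbase.rootdir)
    [ -n "$fs" ] || { echo "fs.default.name not found in $1"; return 1; }
    case "$root" in
        "$fs"/*) echo "ok: $root is under $fs" ;;
        *) echo "mismatch: hbase.rootdir=$root, fs.default.name=$fs"; return 1 ;;
    esac
}
```

Example: check_rootdir $HADOOP_HOME/conf/core-site.xml $HBASE_HOME/conf/hbase-site.xml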
change /home/hadoop/hbase-0.90.4/conf/regionservers as below
master
add JAVA_HOME to /home/hadoop/hbase-0.90.4/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
replace the Hadoop jar bundled with HBase with the one from Hadoop 1.2.1, and copy the two libraries it depends on
[hadoop@netfox ~]$ rm /home/hadoop/hbase-0.90.4/lib/hadoop-core-0.20-append-r1056497.jar
[hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/hadoop-core-1.2.1.jar /home/hadoop/hbase-0.90.4/lib
[hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-collections-3.2.1.jar /home/hadoop/hbase-0.90.4/lib
[hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-configuration-1.6.jar /home/hadoop/hbase-0.90.4/lib
start HBase
[hadoop@netfox ~]$ start-hbase.sh
check that HBase is installed correctly
[hadoop@netfox ~]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

hbase(main):001:0> list
TABLE
webpage
1 row(s) in 0.5270 seconds

Install Nutch

install ant
[root@netfox ~]# apt-get install ant
switch to the hadoop user and create the working folder
[root@netfox ~]# su hadoop
[hadoop@netfox root]$ mkdir -p ~/webcrawer
[hadoop@netfox root]$ cd ~/webcrawer
get nutch tar file
[hadoop@netfox webcrawer]$ wget http://www.eu.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
untar this file
[hadoop@netfox webcrawer]$ tar -vxf apache-nutch-2.2.1-src.tar.gz
add to /etc/profile

export NUTCH_HOME=/home/hadoop/webcrawer/apache-nutch-2.2.1
export PATH=$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/hbase-site.xml as below
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9001/hbase</value>
  </property>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>

</configuration>
change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/nutch-site.xml as below
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
    </property>
    <property>
        <name>http.agent.name</name>
        <value>NutchCrawler</value>
    </property>
    <property>
        <name>http.robots.agents</name>
        <value>NutchCrawler,*</value>
    </property>

</configuration>
Uncomment the gora-hbase dependency in the /home/hadoop/webcrawer/apache-nutch-2.2.1/ivy/ivy.xml file (keep the rev value that ships in your copy of the file)
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2"
conf="*->default" />
add to /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/gora.properties file
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
go to nutch installation folder(/home/hadoop/webcrawer/apache-nutch-2.2.1) and run
ant clean
ant runtime
Create a directory in HDFS to upload the seed urls.
[hadoop@netfox ~]$ hadoop dfs -mkdir urls
Create a text file with the seed URLs for the crawl. Upload the seed URLs file to the directory created in the above step
[hadoop@netfox ~]$ hadoop dfs -put seed.txt urls
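The seed file is plain text with one URL per line. As a small sketch, the helper below writes such a file and skips anything that is not an http or https URL. The name make_seed_file is hypothetical, and the URL in the usage line is only an example.

```shell
# Sketch: write a seed file with one URL per line, keeping only http(s) URLs.
# $1 = output file, remaining arguments = URLs.
# make_seed_file is a hypothetical helper name.
make_seed_file() {
    out=$1
    shift
    : > "$out"
    for url in "$@"; do
        case "$url" in
            http://*|https://*) printf '%s\n' "$url" >> "$out" ;;
            *) echo "skipping non-http URL: $url" >&2 ;;
        esac
    done
}
```

Example: make_seed_file seed.txt http://nutch.apache.org/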
Run the following commands from the deploy directory ($NUTCH_HOME/runtime/deploy) on the
JobTracker node to inject the seed URLs into the Nutch database and to generate the
initial fetch list (-topN <N> is the number of top URLs to select; the default is Long.MAX_VALUE)

[hadoop@netfox ~]$ nutch inject urls
[hadoop@netfox ~]$ nutch generate -topN 3
Then run the following commands from the same directory on the JobTracker node
to fetch and parse the pages, update the database and generate the next fetch list
[hadoop@netfox ~]$ nutch fetch -all
[hadoop@netfox ~]$ nutch parse -all
[hadoop@netfox ~]$ nutch updatedb
[hadoop@netfox ~]$ nutch generate -topN 10
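Each crawl round repeats the same generate, fetch, parse and updatedb sequence, so it can be wrapped in a loop. This is only a sketch built from the commands above: crawl_rounds is a hypothetical helper name, it expects "nutch inject urls" to have been run once beforehand, and it assumes each nutch command exits non-zero only on failure.

```shell
# Sketch: run several crawl rounds with the commands shown above.
# $1 = number of rounds, $2 = -topN value for generate.
# crawl_rounds is a hypothetical helper name; run "nutch inject urls" first.
crawl_rounds() {
    rounds=$1
    top_n=$2
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "round $i of $rounds"
        nutch generate -topN "$top_n" || return 1
        nutch fetch -all || return 1
        nutch parse -all || return 1
        nutch updatedb || return 1
        i=$((i + 1))
    done
}
```

Example: crawl_rounds 3 10 runs three rounds that each select at most 10 URLs.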

Install ElasticSearch

get the tar file (in /home/hadoop/webcrawer, to match ELAST_HOME below)
[hadoop@netfox webcrawer]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.19.4.tar.gz
untar file
[hadoop@netfox webcrawer]$ tar -vxf elasticsearch-0.19.4.tar.gz
add to /etc/profile

export ELAST_HOME=/home/hadoop/webcrawer/elasticsearch-0.19.4

export PATH=$ELAST_HOME/bin:$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
Go to the extracted ElasticSearch directory and execute the following command to
start the ElasticSearch server in the foreground
> bin/elasticsearch -f
Go to the $NUTCH_HOME/runtime/deploy directory (or $NUTCH_HOME/runtime/local
if you are running Nutch in local mode). Execute the following
command to index the data crawled by Nutch into the ElasticSearch server.
> bin/nutch elasticindex elasticsearch -all
install curl
[hadoop@netfox ~]$ sudo apt-get install curl
check that Elasticsearch is running
[hadoop@netfox ~]$ curl master:9200
run a test query
[hadoop@netfox ~]$ curl -XGET 'http://master:9200/_search?q=hadoop'
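Elasticsearch takes a few seconds to start, so a curl issued right after launching it can fail even when the install is fine. The sketch below polls the HTTP port until it answers. The helper name wait_for_http and the default of 30 tries are hypothetical choices; the curl flags used are -s (silent), -f (fail on HTTP errors) and -o (discard the body).

```shell
# Sketch: poll a URL once per second until it answers or the tries run out.
# $1 = URL, $2 = maximum number of tries (default 30).
# wait_for_http is a hypothetical helper name.
wait_for_http() {
    url=$1
    tries=${2:-30}
    n=0
    while [ "$n" -lt "$tries" ]; do
        if curl -s -f -o /dev/null "$url"; then
            echo "up after $n retries"
            return 0
        fi
        n=$((n + 1))
        sleep 1
    done
    echo "no answer from $url after $tries tries" >&2
    return 1
}
```

Example: wait_for_http http://master:9200 && curl -XGET 'http://master:9200/_search?q=hadoop'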

posted on 2013-08-31 01:17 paulwong. Categories: Distributed, HADOOP, Cloud Computing, Distributed Search

Feedback

# re: Install hadoop+hbase+nutch+elasticsearch 2013-09-23 14:19 ap

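Nutch 2.2.1 supports HBase 0.90.4 and Elasticsearch 0.19.4 by default. Can it be made to support Elasticsearch 0.90.x or later? (I tried replacing the elasticsearch 0.19.4 jar in the Nutch 2.2.1 lib directory with the elasticsearch 0.90.x jar, but nutch elasticindex then fails with an error.)
Nutch 1.7 supports Elasticsearch 0.90.1 by default.

# re: Install hadoop+hbase+nutch+elasticsearch 2013-09-24 18:27 paulwong

@ap
I tried switching to version 0.90 and above, and it does not work.
Nutch 1.7 does not integrate with HBase, so I did not try it.

# re: Install hadoop+hbase+nutch+elasticsearch 2013-09-25 15:34 ap

@paulwong
Indeed, I could not find the HBase jar in the lib directory of Nutch 1.7. It would be great if it were integrated. Thanks.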
