
Installing nutch on Windows, step by step

Posted on 2006-10-18 19:52 by 天霽 | Views: 7304 | Comments: 4 | Category: nutch

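As an open-source project, nutch gives developers who love search engines an excellent platform to learn from. Starting with version 0.8 it uses Hadoop as its distributed file system, which widens the gap between nutch and other open-source search engines even further.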

Official nutch site: http://lucene.apache.org/nutch/
nutch introductory tutorial: http://lucene.apache.org/nutch/tutorial8.html

The rest of this post walks through installing nutch 0.8 in detail:

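I. Environment:
   1. Operating system: Windows XP, Windows 2000 or later
   2. Java VM: Java 1.5.x, with JAVA_HOME set as an environment variable
   3. cygwin. Strictly speaking this is not required, but the scripts that ship with nutch only run in a shell environment, so cygwin is used to provide the shell commands.
   4. nutch version: 0.8
   5. Tomcat: 5.0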

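II. Installing cygwin:

The article "Nutch在Windows中安裝之細解" (a detailed guide to installing Nutch on Windows) covers the cygwin installation in some detail, so the steps are not repeated here. To check that the installation is usable, look for x:\cygwin\cygwin\bin\sh.exe under the cygwin installation directory; if that executable exists, cygwin is ready to use.
If cygwin cannot be reinstalled after it has been removed, search the registry and delete every key that contains "cygwin".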
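A quick way to confirm the prerequisites from inside the cygwin shell can be sketched as below. This is only a minimal sketch: the function name check_env is made up for illustration, and it merely checks that sh is on the PATH and that JAVA_HOME is non-empty, not that either points at a working installation.

```shell
# Hypothetical helper: report whether the prerequisites are visible from the shell.
check_env() {
    # sh must be on the PATH; inside cygwin this is the sh.exe mentioned above.
    command -v sh >/dev/null 2>&1 || { echo "sh not found"; return 1; }
    # JAVA_HOME must be set, as required in the environment list.
    [ -n "${JAVA_HOME:-}" ] || { echo "JAVA_HOME is not set"; return 1; }
    echo "ok"
}

# Usage: check_env
```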

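III. Installing and configuring nutch:

1. Download version 0.8 or later from http://lucene.apache.org/nutch/release/, unpack it, and place it in the cygwin root directory, as shown in the figure:

The figure shows the nutch directory sitting in the cygwin root directory.

2. Under nutch/bin, create a urls directory, then create a url.txt file in it containing a URL you want to crawl, for example www.sina.com.cn. The directory layout is shown in the figure:

3. Open nutch\conf\crawl-urlfilter.txt and replace the string MY.DOMAIN.NAME with the domain of the URL in url.txt. An even simpler option is to delete MY.DOMAIN.NAME altogether, leaving only +^http://([a-z0-9]*\.)* on that line, which allows every http site to be crawled.

4. Open nutch\conf\nutch-site.xml and insert the following inside <configuration></configuration>: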

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

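Replace the content between <value> and </value> in each property with text of your own. Leaving them unchanged does not matter much either, although the description above does say that http.agent.name must not be empty. These settings exist because nutch honors the robots protocol: when it sends requests, it passes this information to the site being crawled so that the site can identify the crawler.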
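Whether http.agent.name has actually been filled in can be checked with a small sketch like the one below. The helper names property_value and agent_name_is_set are hypothetical, and the awk script assumes every tag sits on its own line, as in the snippet above; it is not a general XML parser.

```shell
# Print the <value> that follows <name>$2</name> in file $1.
# Assumes one tag per line; this is a sketch, not a real XML parser.
property_value() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
            print; exit
        }
    ' "$1"
}

# Succeeds only when http.agent.name has a non-empty value.
agent_name_is_set() {
    [ -n "$(property_value "$1" http.agent.name)" ]
}

# Usage (path is a placeholder):
#   agent_name_is_set ../conf/nutch-site.xml || echo "http.agent.name is empty"
```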

The configuration above is the setup for crawling an intranet.
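Steps 2 and 3 can be sketched in shell as follows. The helper names make_seed and url_allowed are hypothetical. The seed is written with an explicit http:// prefix on the assumption that it has to match the ^http:// pattern from step 3 (the post itself writes the address without it), and the sketch also assumes one URL per line in url.txt.

```shell
# Step 2: create the seed directory and write one URL per line into url.txt.
# $1 is the directory; the remaining arguments are the URLs.
make_seed() {
    dir=$1
    shift
    mkdir -p "$dir" && printf '%s\n' "$@" > "$dir/url.txt"
}

# Step 3: test a URL against the regex left after deleting MY.DOMAIN.NAME.
# The leading "+" in crawl-urlfilter.txt marks an include rule and is not
# part of the regular expression itself.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*'
}

# Usage, from nutch/bin:
#   make_seed urls http://www.sina.com.cn/
#   url_allowed http://www.sina.com.cn/ && echo "would be crawled"
```

With this pattern, url_allowed accepts any address that starts with http:// and rejects one written without the scheme.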

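IV. Running nutch

Because nutch was configured above for a single site, it is also run here as a single-site crawl; whole-web crawling will be covered in a later post.

First, look at the command nutch provides: nutch crawl urls -dir crawl -depth 3 -topN 50
crawl: tells nutch.jar to run the main method of crawl.
urls: the directory that holds the url.txt file listing the URLs to crawl. This name must match your actual directory name; if your directory is called search, write search here instead.
-dir crawl: where the crawl output is saved; it can be found under nutch/bin.
-depth 3: the number of crawl rounds, also called the depth, though "rounds" seems the more fitting word. Setting it to 1 is recommended for testing.
-topN 50: the maximum number of pages fetched in each round.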

Steps to run the command:
1. Open the cygwin shell.
2. Use cd to change to the nutch\bin directory.
3. Run: sh nutch crawl urls -dir crawl -depth 3 -topN 50
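To make the four settings easier to vary while testing, the command can be assembled by a small wrapper. This is just a sketch: crawl_command is a hypothetical name, and it only prints the command so that it can be reviewed before running it.

```shell
# Build the crawl command from the four settings explained above.
# Arguments (all optional): seed directory, output directory, depth, topN.
crawl_command() {
    urls_dir=${1:-urls}
    out_dir=${2:-crawl}
    depth=${3:-3}
    top_n=${4:-50}
    echo "sh nutch crawl $urls_dir -dir $out_dir -depth $depth -topN $top_n"
}

# Usage, from nutch/bin (depth 1 as suggested for testing):
#   crawl_command urls crawl 1 50          # prints the command
#   eval "$(crawl_command urls crawl 1 50)"  # runs it
```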

The detailed crawl log is written under the nutch/logs directory. Look for lines such as "INFO  fetcher.Fetcher - fetching http://XXXXXXX"; these record the fetching process.
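Pulling the fetched URLs out of the log can be sketched with grep and sed. The function name fetched_urls is hypothetical, and the log file name in the usage comment (hadoop.log) is an assumption; use whatever file actually appears under nutch/logs.

```shell
# List the URLs from the "fetcher.Fetcher - fetching" lines of a log file.
fetched_urls() {
    grep 'fetcher.Fetcher - fetching' "$1" | sed 's/.*fetching //'
}

# Usage (file name is an assumption):
#   fetched_urls ../logs/hadoop.log
#   fetched_urls ../logs/hadoop.log | wc -l   # how many pages were fetched
```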

V. Searching:
nutch provides a search page similar to those of google and baidu. Find the nutch-0.8.war file in the nutch archive, put it in the tomcat/webapps directory, and edit webapps/nutch/WEB-INF/classes/nutch-site.xml as follows:

<property>
<name>searcher.dir</name>
<value>E:\\software\\splider\\nutch\\nutch-0.8\\nutch-0.8\\crawl</value>
</property>

The content of <value/> is the location of the crawl directory produced by the crawl above; the web client searches against it.
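Copying the web application into Tomcat can be sketched as below. The helper name deploy_war and both paths in the usage comment are placeholders. Tomcat derives the context path from the WAR file name, so the sketch copies the file as nutch.war, on the assumption that this is how the webapps/nutch directory and the /nutch URL used in this section come about.

```shell
# Copy the nutch WAR into Tomcat under the name nutch.war.
# $1 is the path to nutch-0.8.war, $2 is the Tomcat installation directory.
deploy_war() {
    cp "$1" "$2/webapps/nutch.war"
}

# Usage (paths are placeholders):
#   deploy_war /nutch/nutch-0.8.war /path/to/tomcat
```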

Once the configuration is done, start Tomcat, open http://localhost:8080/nutch, and enter a keyword to see the results. The figure below shows my test results from crawling a WAP site:

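VI. Summary:
nutch is an efficient, open-source, easy-to-use search engine, and many of its internal details are worth borrowing from: Hadoop's distributed file system, a plugin mechanism similar to eclipse's, apache's httpclient for accessing sites, the HTML parser from org.cyberneko.html for parsing pages, and so on. Later posts will cover these one by one.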

Reposting is welcome; please credit the source!

# re: Installing nutch on Windows, step by step

2006-10-18 20:16 by 壞男孩

Bump~~~
The images won't display.

# re: Installing nutch on Windows, step by step

2006-11-15 11:59 by help

<property>
<name>http.proxy.host</name>
<value>？</value>
<description></description>
</property>
<property>
<name>http.proxy.port</name>
<value>？</value>
<description></description>
</property>

What should go where the question marks are?! Many thanks!

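# re: Installing nutch on Windows, step by step

2006-11-15 15:09 by 天霽 [anonymous]

Anything will do. This is information needed to comply with the robots protocol; typically you fill in some details about your company or yourself, so that the crawled site can identify you and get in touch with you.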

# re: Installing nutch on Windows, step by step

2010-03-09 22:56 by 路人a

If I want to search 100 websites, what should I use to separate the URLs in urls.txt?
