泰仔在線

Java learning, mood diary, colorful moments


Parsing HTML pages in Nutch

Posted on 2010-04-23 17:38 by 泰仔在線. Category: Cloud computing

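Today I looked into how Nutch parses HTML pages. My task is to extract specific text from a page, so the first step was to find out how Nutch pulls the text out of the HTML. Nutch ships with two HTML parsers, nekohtml and tagsoup; I used the neko parser. After reading the code, I found that the text extraction lives in the DOMContentUtils file in org.apache.nutch.parse.html, and the main function is getTextHelper. It is shown below with explanatory comments.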

// Uses org.w3c.dom.Node and org.apache.nutch.util.NodeWalker.
private boolean getTextHelper(StringBuffer sb, Node node,
                              boolean abortOnNestedAnchors,
                              int anchorDepth) {
    boolean abort = false;
    // NodeWalker traverses the DOM tree without recursion
    NodeWalker walker = new NodeWalker(node);

    while (walker.hasNext()) { // while there are nodes left to visit

      Node currentNode = walker.nextNode();        // fetch the next node
      String nodeName = currentNode.getNodeName(); // node name
      short nodeType = currentNode.getNodeType();  // node type

      if ("script".equalsIgnoreCase(nodeName)) { // skip scripts
        walker.skipChildren();
      }
      if ("style".equalsIgnoreCase(nodeName)) { // skip style blocks
        walker.skipChildren();
      }
      if (abortOnNestedAnchors && "a".equalsIgnoreCase(nodeName)) { // detect nested anchors
        anchorDepth++;
        if (anchorDepth > 1) {
          abort = true;
          break;
        }
      }
      if (nodeType == Node.COMMENT_NODE) { // skip comments
        walker.skipChildren();
      }
      if (nodeType == Node.TEXT_NODE) {
        // cleanup and trim the value
        String text = currentNode.getNodeValue(); // raw text content
        text = text.replaceAll("\\s+", " ");      // collapse runs of whitespace (spaces, newlines) into one space
        text = text.trim();
        if (text.length() > 0) {
          if (sb.length() > 0) sb.append(' ');
          sb.append(text);
        }
      }
    }

    return abort; // true if the walk stopped at a nested anchor
}

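This function is used by the HtmlParser class. Because getTextHelper is private, the call goes through the public getText methods of DOMContentUtils. If you want to write your own text extraction function, you can modify it accordingly.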
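As a starting point for such a custom extractor, here is a minimal sketch of the same walk-and-collect idea that runs outside Nutch. It is not Nutch's code: the class name SimpleTextExtractor is hypothetical, an explicit stack stands in for NodeWalker, the nested-anchor check is left out, and the JDK's XML parser stands in for NekoHTML, so the input has to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone class; not part of Nutch.
public class SimpleTextExtractor {

    /** Collects text under root, skipping script, style, and comment nodes. */
    public static String extractText(Node root) {
        StringBuilder sb = new StringBuilder();
        Deque<Node> stack = new ArrayDeque<>(); // explicit stack instead of recursion
        stack.push(root);

        while (!stack.isEmpty()) {
            Node current = stack.pop();
            String name = current.getNodeName();
            short type = current.getNodeType();

            // Do not descend into scripts or style blocks, and ignore comments.
            if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)
                    || type == Node.COMMENT_NODE) {
                continue;
            }
            if (type == Node.TEXT_NODE) {
                // Collapse whitespace runs into one space, then trim.
                String text = current.getNodeValue().replaceAll("\\s+", " ").trim();
                if (!text.isEmpty()) {
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(text);
                }
            }
            // Push children in reverse so they are popped in document order.
            NodeList children = current.getChildNodes();
            for (int i = children.getLength() - 1; i >= 0; i--) {
                stack.push(children.item(i));
            }
        }
        return sb.toString();
    }

    /** Convenience wrapper: parses a well-formed XHTML string, then extracts its text. */
    public static String extractText(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            return extractText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title><style>p { color: red; }</style></head>"
            + "<body><p>Hello,\n   world</p><script>var x = 1;</script><!-- hidden --></body></html>";
        System.out.println(extractText(page)); // prints "Demo Hello, world"
    }
}
```

To extract only specific text, one option is to add a condition on the node name or on the parent element before appending to the buffer. Inside Nutch itself, the same change would go into the TEXT_NODE branch of getTextHelper.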

Reposted from: Internship Diary (6)

Feedback

# re: Parsing HTML pages in Nutch

2013-03-19 16:53 by gongshijun

How do I change it? Nutch 1.6 doesn't have any of the things you describe; I can't find them.
