Using HTMLParser

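       Source: http://blog.csdn.net/redez/archive/2005/11/21/534277.aspx
       Note: This article draws on the post "HTMLParser使用" (Using HTMLParser) and makes some modifications to it.
1. Introduction
       htmlparser is a Java library for parsing HTML pages, and it is a fairly powerful tool.
       Project home page: http://htmlparser.sourceforge.net/
       Download: http://sourceforge.net/project/showfiles.php?group_id=24399
2. Usage Example
       The simple example below shows how to use htmlparser. The code is as follows: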

package com.amigo.htmlparser;

import java.io.*;
import java.net.URL;
import java.net.URLConnection;

import org.htmlparser.*;
import org.htmlparser.filters.*;
import org.htmlparser.nodes.*;
import org.htmlparser.tags.*;
import org.htmlparser.util.*;
import org.htmlparser.visitors.*;

/**
 * Tests the use of HTMLParser.
 *
 * @author <a href="mailto:xiexingxing1121@126.com">AmigoXie</a>
 * Creation date: 2008-1-18 - 11:44:22 AM
 */
public class HTMLParserTest {

    /**
     * Entry point.
     *
     * @param args command-line arguments (unused)
     * @throws Exception if the page cannot be fetched or parsed
     */
    public static void main(String args[]) throws Exception {
        String path = "http://www.tkk7.com/amigoxie";
        URL url = new URL(path);
        URLConnection conn = url.openConnection();
        conn.setDoOutput(true);
        InputStream inputStream = conn.getInputStream();
        InputStreamReader isr = new InputStreamReader(inputStream, "utf8");
        StringBuffer sb = new StringBuffer();
        BufferedReader in = new BufferedReader(isr);
        String inputLine;
        // Read the whole page into memory, one line at a time.
        while ((inputLine = in.readLine()) != null) {
            sb.append(inputLine);
            sb.append("\n");
        }
        in.close();
        String result = sb.toString();
        readByHtml(result);
        readTextAndLinkAndTitle(result);
    }

    /**
     * Processes the content as a page: parses a standard HTML page.
     *
     * @param content the content of the web page
     * @throws Exception if parsing fails
     */
    public static void readByHtml(String content) throws Exception {
        Parser myParser;
        myParser = Parser.createParser(content, "utf8");
        HtmlPage visitor = new HtmlPage(myParser);
        myParser.visitAllNodesWith(visitor);
        // Print the page title, then the text of the body.
        String textInPage = visitor.getTitle();
        System.out.println(textInPage);
        NodeList nodelist;
        nodelist = visitor.getBody();
        System.out.print(nodelist.asString().trim());
    }

    /**
     * Reads the plain text, the links, and the title separately.
     *
     * @param result the content of the web page
     * @throws Exception if parsing fails
     */
    public static void readTextAndLinkAndTitle(String result) throws Exception {
        Parser parser;
        NodeList nodelist;
        parser = Parser.createParser(result, "utf8");
        NodeFilter textFilter = new NodeClassFilter(TextNode.class);
        NodeFilter linkFilter = new NodeClassFilter(LinkTag.class);
        NodeFilter titleFilter = new NodeClassFilter(TitleTag.class);
        // Accept a node if it matches any one of the three filters.
        OrFilter lastFilter = new OrFilter();
        lastFilter.setPredicates(new NodeFilter[] { textFilter, linkFilter, titleFilter });
        nodelist = parser.parse(lastFilter);
        Node[] nodes = nodelist.toNodeArray();
        String line = "";
        for (int i = 0; i < nodes.length; i++) {
            Node node = nodes[i];
            if (node instanceof TextNode) {
                TextNode textnode = (TextNode) node;
                line = textnode.getText();
            } else if (node instanceof LinkTag) {
                LinkTag link = (LinkTag) node;
                line = link.getLink();
            } else if (node instanceof TitleTag) {
                TitleTag titlenode = (TitleTag) node;
                line = titlenode.getTitle();
            }
            // Skip nodes that contain only whitespace.
            if (isTrimEmpty(line)) {
                continue;
            }
            System.out.println(line);
        }
    }

    /**
     * Returns whether the string is empty after trimming leading and trailing whitespace.
     */
    public static boolean isTrimEmpty(String astr) {
        if ((null == astr) || (astr.length() == 0)) {
            return true;
        }
        if (isBlank(astr.trim())) {
            return true;
        }
        return false;
    }

    /**
     * Returns whether the string is empty: null or of length 0.
     */
    public static boolean isBlank(String astr) {
        if ((null == astr) || (astr.length() == 0)) {
            return true;
        } else {
            return false;
        }
    }
}
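
The read loop in main can also be written with try-with-resources, so that the stream is closed even if an exception is thrown partway through. The following is only a sketch that uses JDK classes; the class name PageReader and the methods readAll and fetch are hypothetical names chosen for illustration and are not part of htmlparser or of the listing above. Decoding as UTF-8 is an assumption carried over from the listing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of htmlparser.
final class PageReader {

    // Reads everything from the reader, ending each line with "\n"
    // as the loop in main does. The reader is closed when done.
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Fetches a page and decodes it as UTF-8 (an assumption: the page
    // itself may declare a different encoding).
    static String fetch(String address) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        return readAll(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
    }
}
```

With such a helper, the body of main could reduce to a call like PageReader.fetch(path) followed by the two read methods.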

Posted on 2008-01-18 14:18 by 阿蜜果 | Category: Java

FeedBack:

# re: Using HTMLParser

2008-04-17 18:20 | zzz

How can I save the modified HTML to a file?
The code is as follows:
parser = new Parser(getContentByLocalFile(file));
NodeFilter nt = new NodeClassFilter(ImageTag.class);
NodeList tmpImageList = (NodeList) parser.parse(nt);

/*linkTmpHash = new Hashtable();
for (int i = 0; i < length; i++) {
    Element tmpElement = (Element) tmpNodeList.item(i);
    String href = tmpElement.getAttribute("href");
    if (href != null && !href.equals("")) {
        linkTmpHash.put(href, "");
    }
}
data.setHrefs((String[]) linkTmpHash.keySet().toArray(new String[linkTmpHash.size()]));*/
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)));
linkTmpHash = new Hashtable();
for (int i = 0; i < tmpImageList.size(); i++) {
    imgnode = (ImageTag) tmpImageList.elementAt(i);
    String src = imgnode.getImageURL();
    if (URLPathNameUtil.isAbsolutePath(src)) {
        if (testAbsolutePath) {
            testImagetag(file, src);
        }
    } else {
        if (testRelativePath) {
            testImagetag(file, src);
        }
    }
    if (getRealPath() != null) {
        imgnode.setImageURL(getRealPath());
        writer.write(tmpImageList.toHtml());
    }
    /*if (src != null && !src.equals("")) {
        linkTmpHash.put(src, "");
    }*/
}
writer.flush();
writer.close();

Thanks.

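# re: Using HTMLParser

2009-03-02 13:20 | 黃金礦工

The efficiency seems rather low. Also, the character-encoding handling is somewhat problematic, and JavaScript code is not stripped cleanly when extracting the body text.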
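
On the encoding point: the listing decodes every page as "utf8", which garbles pages served in another charset. One possible mitigation, sketched here with JDK classes only, is to read the charset from the HTTP Content-Type header and fall back to a default when none is given. The class CharsetSniffer and its method fromContentType are hypothetical names for illustration, not part of htmlparser, and this is not necessarily how the library itself resolves encodings.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper for illustration; not part of htmlparser.
final class CharsetSniffer {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback when the header
    // is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

Given a URLConnection conn, the result of CharsetSniffer.fromContentType(conn.getContentType(), StandardCharsets.UTF_8) could be passed to new InputStreamReader(conn.getInputStream(), charset). Pages that declare their encoding only in a meta tag are not covered by this sketch.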
