
Fetching a web page with HttpClient and parsing it with jsoup

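A project of mine needed to scrape a bit of data from a website, so I am recording how I did it here; it may well come in handy again. Enough preamble, on to the topic: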

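HttpClient began as a subproject of Apache Jakarta Commons (it now lives under Apache HttpComponents). It provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol and supports the latest versions and recommendations of HTTP. This post gives a brief introduction and then walks through an example from my own work.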

HttpClient home page: http://hc.apache.org/httpcomponents-client-dev/index.html

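jsoup is a Java HTML parser. It can parse HTML from a URL, a file, or a string; find and extract data using DOM traversal or CSS selectors; manipulate HTML elements, attributes, and text; and clean user-submitted content against a whitelist.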

jsoup home page: http://jsoup.org/

I won't explain further; look the rest up on Baidu or Google.

How about starting with an example?

Take the www.iteye.com home page: I want to periodically scrape the data in the Featured Articles (“精華文章”) section of the iteye home page.

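The idea: request the www.iteye.com home page in code, get its HTML, parse the HTML to extract the links to the articles in the Featured Articles (“精華文章”) section, then request each of those addresses to fetch the article itself. Here is the process: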
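Since the goal is to scrape on a schedule, here is a minimal sketch of the periodic part using only the JDK's `ScheduledExecutorService`. The class name `CrawlScheduler` and the method `startEvery` are hypothetical names introduced for illustration; the crawl itself is left as a placeholder `Runnable`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not from the original post): runs a crawl task at a fixed rate.
public class CrawlScheduler {

    /** Starts the task immediately and repeats it every `period` units. */
    public static ScheduledExecutorService startEvery(Runnable crawlTask, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // scheduleAtFixedRate cancels all later runs if one run throws,
        // so exceptions are caught and logged here instead.
        Runnable guarded = () -> {
            try {
                crawlTask.run();
            } catch (RuntimeException e) {
                e.printStackTrace();
            }
        };
        scheduler.scheduleAtFixedRate(guarded, 0, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder task; a real one would fetch and parse the home page.
        ScheduledExecutorService scheduler =
                startEvery(() -> System.out.println("crawl www.iteye.com here"), 1, TimeUnit.SECONDS);
        TimeUnit.SECONDS.sleep(3);   // let it run a few times
        scheduler.shutdownNow();     // stop the background thread
    }
}
```

In a real job the period would more likely be minutes or hours, and the returned scheduler should be shut down when the program exits.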

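First open www.iteye.com in a browser and inspect it with a debugging tool: in Firefox, install Firebug; in Chrome, right-click and choose Inspect Element.

Using Firefox as the example: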

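There you can see that the article links in “精華文章” sit in the following structure:

inside the div with id="page",

inside the div with id="content",

inside the div with id="main",

inside the div with class="left",

inside the div with id="recommend",

the a tags under the li items of the ul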
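The nesting above maps one-to-one onto a CSS selector: each level becomes one step, and the steps are joined with the child combinator `>`. The sketch below only builds that string with the JDK; `SelectorPath` and `childPath` are hypothetical names used for illustration, and the resulting string is the kind of argument one would pass to jsoup's `Document.select`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not from the original post): turns a DOM path into a CSS selector.
public class SelectorPath {

    /** Joins the steps with ">", the CSS child combinator. */
    public static String childPath(List<String> steps) {
        return String.join(">", steps);
    }

    public static void main(String[] args) {
        // "#" marks an id and "." marks a class, as in jQuery.
        List<String> steps = Arrays.asList(
                "div#page", "div#content", "div#main",
                "div.left", "div#recommend", "ul", "li", "a");
        System.out.println(childPath(steps));
        // prints div#page>div#content>div#main>div.left>div#recommend>ul>li>a
    }
}
```

Because `>` matches direct children only, the selector breaks if the site inserts an extra wrapper element; a shorter selector with descendant (space) combinators would be more tolerant, at the cost of possibly matching more links.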

First, fetch the home page's HTML with HttpClient. I used HttpClient 4.1.2 (the jar is in the attachment) and jsoup-1.6.1.jar.

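Java code
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

    /**
     * Fetches the full HTML of the page at the given URL.
     * This method goes inside the JustTest class.
     * @param url the address to request
     * @return the HTML source, or null if the request failed or did not return 200
     */
    public static String getHtmlByUrl(String url) {
        String html = null;
        HttpClient httpClient = new DefaultHttpClient(); // create the HttpClient
        HttpGet httpGet = new HttpGet(url);              // request the URL with GET
        try {
            HttpResponse response = httpClient.execute(httpGet);
            int status = response.getStatusLine().getStatusCode();
            if (status == HttpStatus.SC_OK) { // 200 is success; anything else is treated as failure
                HttpEntity entity = response.getEntity(); // the response body
                if (entity != null) {
                    // UTF-8 is only the fallback when the response declares no charset
                    html = EntityUtils.toString(entity, "UTF-8");
                }
            }
        } catch (Exception e) {
            System.out.println("Error while requesting [" + url + "]");
            e.printStackTrace();
        } finally {
            httpClient.getConnectionManager().shutdown(); // release the connection
        }
        return html;
    }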

The code above fetches the HTML source with HttpClient.

Next, parse that HTML page to get the links we want.

Here is the jsoup code that processes the HTML obtained above:

Java code
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JustTest {
    // getHtmlByUrl(String) from the previous listing goes inside this class

    public static void main(String[] args) {
        String html = getHtmlByUrl("http://www.iteye.com/");
        if (html != null && !"".equals(html)) {
            Document doc = Jsoup.parse(html);
            // Select the a tags under ul > li inside the div with id "recommend",
            // which is inside the div with class "left", inside the divs with
            // ids "main", "content", and "page"
            Elements linksElements = doc.select("div#page>div#content>div#main>div.left>div#recommend>ul>li>a");
            for (Element ele : linksElements) {
                String href = ele.attr("href"); // link address
                String title = ele.text();      // link text, i.e. the article title
                System.out.println(href + "," + title);
            }
        }
    }
}
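The `href` values printed above may be relative, so before requesting an article they may need to be resolved against the page address. Below is a minimal sketch using the JDK's `java.net.URI`; `LinkResolver`, `toAbsolute`, and the sample paths are hypothetical and only illustrate the idea.

```java
import java.net.URI;

// Hypothetical helper (not from the original post): makes an href absolute.
public class LinkResolver {

    /**
     * Resolves href against the URL of the page it was found on.
     * An href that is already absolute is returned unchanged.
     * Throws IllegalArgumentException if either string is not a valid URI.
     */
    public static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.iteye.com/";
        // The paths below are made-up examples, not real article addresses.
        System.out.println(toAbsolute(page, "/topic/123"));           // http://www.iteye.com/topic/123
        System.out.println(toAbsolute(page, "news/456"));             // http://www.iteye.com/news/456
        System.out.println(toAbsolute(page, "http://example.com/a")); // http://example.com/a
    }
}
```

jsoup can also do this itself: if the page address is passed as the base URI in `Jsoup.parse(html, baseUri)`, reading the attribute as `abs:href` returns the absolute form.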

jsoup's syntax is really simple: just as in jQuery, “#” selects by id and “.” selects by class.

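None of this is hard. Of course, the cleaner a page's markup, the easier it is to parse; if the page is messy, you will just have to write a bit more code.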

-----------------------------------------------------
Silence, the way to avoid many problems;
Smile, the way to solve many problems;

posted on 2012-08-17 14:47 by Chan Chen
