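URLs are kept in three separate lists.
One is boring; the URLs in this list are downloaded last.
The other two are interesting and average.
When a URL is found, check whether it contains any word configured as boring; if it does, put it in boring.
The user can choose "depth-first search": each newly found URL is placed at the front of its list.
Breadth-first is also available: new URLs are appended at the end.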
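The three-list scheme can be sketched roughly as follows. This is only an illustration: the class name UrlFrontier, its methods, and the keyword lists are hypothetical. The notes above only describe how boring is detected, so classifying a URL as interesting by keyword, and serving interesting before average, are assumptions.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch; none of these names come from the original spider.
class UrlFrontier {
    private final LinkedList<String> interesting = new LinkedList<>();
    private final LinkedList<String> average = new LinkedList<>();
    private final LinkedList<String> boring = new LinkedList<>();
    private final List<String> interestingWords; // assumption: keyword-based
    private final List<String> boringWords;
    private final boolean depthFirst;

    UrlFrontier(List<String> interestingWords, List<String> boringWords, boolean depthFirst) {
        this.interestingWords = interestingWords;
        this.boringWords = boringWords;
        this.depthFirst = depthFirst;
    }

    // Classify the URL, then insert at the front (depth-first)
    // or at the back (breadth-first) of the chosen list.
    void add(String url) {
        LinkedList<String> target = pick(url);
        if (depthFirst) {
            target.addFirst(url);
        } else {
            target.addLast(url);
        }
    }

    // Returns the next URL to download, or null when all lists are empty.
    // The boring list is only consulted once the other two are drained.
    String next() {
        if (!interesting.isEmpty()) return interesting.removeFirst();
        if (!average.isEmpty()) return average.removeFirst();
        if (!boring.isEmpty()) return boring.removeFirst();
        return null;
    }

    private LinkedList<String> pick(String url) {
        if (containsAny(url, boringWords)) return boring;
        if (containsAny(url, interestingWords)) return interesting;
        return average;
    }

    private static boolean containsAny(String url, List<String> words) {
        for (String word : words) {
            if (url.contains(word)) return true;
        }
        return false;
    }
}
```

With depthFirst set to true, the most recently found URL in a list is downloaded first; with false, each list behaves as a plain FIFO queue.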
Some page links need special handling:
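// "?" becomes "%3F" and "&" becomes "%26".
// The one-argument URLEncoder.encode(String) is deprecated; the two-argument
// form names the charset and declares UnsupportedEncodingException.
url = textReplace("?", URLEncoder.encode("?", "UTF-8"), url);
url = textReplace("&", URLEncoder.encode("&", "UTF-8"), url);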
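/**
 * Replaces every occurrence of find in input with replace.
 * Text that has just been inserted is not scanned again.
 */
private String textReplace(String find, String replace, String input)
{
    // An empty search string matches at every position and would loop forever.
    if(find.length() == 0)
    {
        return input;
    }
    int startPos = 0;
    while(true)
    {
        int textPos = input.indexOf(find, startPos);
        if(textPos < 0)
        {
            break;
        }
        input = input.substring(0, textPos) + replace + input.substring(textPos + find.length());
        // Continue searching after the inserted replacement.
        startPos = textPos + replace.length();
    }
    return input;
}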
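For comparison, since Java 5 the standard String.replace(CharSequence, CharSequence) performs the same literal, replace-all substitution as textReplace. A minimal sketch, assuming a hypothetical helper class QueryEscaper and UTF-8 as the encoding:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the class and method names are illustrative only.
class QueryEscaper {
    // Percent-encodes the "?" and "&" characters of a URL.
    static String escapeQueryChars(String url) {
        try {
            String escaped = url.replace("?", URLEncoder.encode("?", "UTF-8"));
            return escaped.replace("&", URLEncoder.encode("&", "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not occur.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/page%3Fa=1%26b=2
        System.out.println(escapeQueryChars("http://example.com/page?a=1&b=2"));
    }
}
```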
Code for reading a resource:
// conn is the connection to the remote resource (for example a java.net.URLConnection)
BufferedInputStream remoteBIS = new BufferedInputStream(conn.getInputStream());
ByteArrayOutputStream baos = new ByteArrayOutputStream(10240);
byte[] buf = new byte[1024];
int bytesRead;
// read() returns -1 once the end of the stream is reached
while((bytesRead = remoteBIS.read(buf)) >= 0)
{
    baos.write(buf, 0, bytesRead);
}
remoteBIS.close();
byte[] content = baos.toByteArray();
Creating multiple levels of directories:
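File f = new File(fileName);
File parent = f.getParentFile();
if(parent != null) // null when fileName has no directory part
{
    parent.mkdirs(); // creates every missing directory on the path
}
FileOutputStream out = new FileOutputStream(f);
out.write(content);
out.flush();
out.close();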
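On Java 7 or later, the same step can also be written with java.nio.file. The following is a sketch only; the FileSaver class and its save method are hypothetical names, not part of the original spider.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; names are illustrative only.
class FileSaver {
    // Writes content to fileName, creating missing parent directories first.
    static void save(String fileName, byte[] content) throws IOException {
        Path target = Paths.get(fileName);
        Path parent = target.getParent();
        if (parent != null) {
            // Does nothing if the directories already exist.
            Files.createDirectories(parent);
        }
        // Creates the file or overwrites an existing one, then closes it.
        Files.write(target, content);
    }
}
```

Files.write opens, writes, and closes the file in one call, so there is no stream left to close by hand.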
Documenting a variable: (in Eclipse, the comment is shown when the mouse hovers over it)
/**
* Set of URLs downloaded or scheduled, so we don't download a
* URL more than once.
* Thread safety: To access the set, first synchronize on it.
*/
private Set urlsDownloadedOrScheduled;
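The thread-safety rule in that comment could be followed like this. A minimal sketch, assuming a hypothetical DownloadRegistry class and scheduleIfNew method; only the field name comes from the notes above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper; only the field name is taken from the original.
class DownloadRegistry {
    private final Set<String> urlsDownloadedOrScheduled = new HashSet<>();

    // Returns true if the URL was not seen before and is now recorded.
    // The check and the insert happen under one lock, so two threads
    // cannot both schedule the same URL.
    boolean scheduleIfNew(String url) {
        synchronized (urlsDownloadedOrScheduled) {
            return urlsDownloadedOrScheduled.add(url);
        }
    }
}
```

Set.add already reports whether the element was new, so no separate contains call is needed inside the synchronized block.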
This style of logging is quite nice: (Apache log4j)
private final static Category _logClass = Category.getInstance(TextSpider.class);
/*
Output: 2005-05-01 11:40:44,250 [main] INFO  TextSpider.java:105 - Starting Spider...
*/
_logClass.info("Starting Spider...");
Copyright 羅明 (Luo Ming)