-- Focused on search engine development

May 17, 2006

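Microsoft's New Search Engine

Microsoft has never given up on the search engine race and has been quietly competing with Google all along. Live Search may be something of a joke among internal employees, but the boss keeps pouring money into it without hesitation.

To be honest, I try to use Microsoft products wherever I can. I dropped Linux as my operating system and Perl and Java as my development tools, though of course that was dictated by work. For maps I used to use MapQuest and have switched to Live Maps; for browsing I gave up Firefox for IE8. Wherever a Microsoft product is usable, I switch to it. The search engine, however, felt just too poor: the results were never what I wanted, and paging through ten or so pages turned up nothing useful. So I quietly set Google as my default engine. One colleague went even further and used Google Desktop to search his Outlook mail.

Then, in early March, a new search engine called Kumo (酷摸 in Chinese, perhaps?) was released internally. The story goes that the name "live" was a bad one: try reading it backwards and see what you get. I thought a mere name change was pointless, but eventually I could not resist trying it and found that it really was somewhat better than the old one. I now give Kumo a spin from time to time when I have nothing else to do.

Today, boss Ballmer announced yet another new search engine, called Bing. How does that sound? To my ear it reads like the Chinese word "病" (bing, "sick"). And it is not called a search engine but a "Decision Engine", a trendy concept indeed. I am not sure why this name was chosen (according to Ballmer, because it is short and easy to remember), but going from a Japanese name to a Chinese-sounding one strikes me as a win for Qi Lu after he took over as head of Search. I recall that a couple of days ago the Search home page started using a landscape photo of Yangshuo, China, taken by an employee. Whether or not that guess is right, the new engine deserved a try. Some mischievous colleague immediately searched for "六四" (June Fourth), and the results were all about the college CET-4/CET-6 English exams, which was rather embarrassing. Not even publicly released, and the PR work is already this thorough.

Even more awkward: to celebrate the release, everyone in the Search group got a T-shirt, reportedly with "I Bing" on the front and "U Bing" on the back, which sounds like "I'm sick, and so are you". The Search people do not see it that way, though, because they have given Bing a Chinese name, "必应" (roughly "sure to respond"). Is that any better than "谷歌" (Google's Chinese name)?

Mischief-makers in other groups were less kind: after testing it for a while, they affectionately nicknamed the engine Mr. Bean.

Of course, we should keep a positive attitude toward new things. Since it is still in testing, I prefer to believe the weak results are a temporary growing pain caused by a lack of user behavior data. Bing will probably be officially released next week. Let's wait and see.

posted @ 2009-05-29 13:20 Dedian | Views (3634) | Comments (14)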

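What Kind of Applications Do We Need?

I said earlier that "turning a lot of software into web-based software is a trend of web 3.0". From a technical point of view, these web-based applications differ somewhat from the software we used to install on a local hard disk. More precisely, websites or applications that offer services can be understood as objects hosted by the browser, while the browser is just a container that can support many kinds of objects; the back-end service behind each object is software deployed on various web servers.

The so-called scripting languages are simply the languages the container and its objects use to communicate.

Containers and back-end services have been improving all along. What we notice most, though, is one vivid object after another appearing in the browser and quietly changing our lives.

In fact, saying that web-based software means turning software into object models that a browser can host covers only part of the picture: it speaks only to how software is presented. That makes it easy to overlook how data is stored, and to assume that web-based services mostly let us enjoy data on the network or in search engines. We no longer need to download software that takes up our own disk; with Internet TV we need not download movies, or even music. Our own data, such as email, blogs, subscribed magazines and bookmarks, also lives on the servers of various websites and need not be downloaded.

We seem to have grown used to being online and to have forgotten the offline era. Yet Google, always the maverick, seems to have found a need for going back: the recently launched Google Gears. It gives people a browser plug-in through which data can be downloaded to the local disk, and it provides a small database engine (SQLite) on the local disk to help store, index and search that data. It also provides an interface for synchronizing data in the background without tying up the browser.

At present the Google Gears API is used in Google Reader: users can download the feeds they subscribe to onto the local disk, which makes them easy to organize and keep.

In short, software is moving onto the web, but people care just as much about collecting and storing their personal data. For example, I have long used Del.icio.us to bookmark technical websites and articles, but one day, when I went to look up a technical article and clicked the link, the page was gone. I wished the article had been downloaded automatically to my own disk and filed neatly at the time I bookmarked it. Manual copy and paste does not count, of course; what I want is a one-click operation like Del.icio.us.

posted @ 2007-05-31 14:27 Dedian | Views (1924) | Comments (1)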

What comparison function does Linux sort use?

I have a question. When I use the sort command on Linux to sort some domain names in dictionary order, no matter which option I use, it orders some domains like this:

...
abca.com
abc-d.com
abce.com
...

I am curious what comparison function sort applies. I assumed it would be a plain string comparison such as strcmp, but it is not: strcmp compares the ASCII codes of the characters one by one, so the result would be:

abc-d.com
abca.com
abce.com

One guess is that special characters such as "." and "-" are skipped when comparing names. But that does not explain how the following names are sorted:

abc---d.com
abc--d.com
abc-d.com

Why does sort keep this order? If it skipped special characters, these names would compare as equal and could come out in any order.

Confused. Has anybody thought about this?

-----
p.s.

I haven't updated this blog for quite a long time, because I am back to programming in C under Linux and I believe this is a place for Java programmers.

-----

update:

Linux sort compares strings using Unicode (locale-aware) collation rather than raw character codes … more about Unicode is here
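
The difference between the two orderings can be reproduced in Java, the usual language of this blog. GNU sort honours the locale's collation rules (LC_COLLATE), and running it with LC_ALL=C falls back to plain byte comparison, which gives the strcmp-style order. The sketch below is only an illustration: the class and method names are made up for this post, and Java's collation tables are not the ones glibc uses, so the locale-aware output may not match what sort prints on a given machine.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class SortOrderDemo {
    /** Sorts by character code, which matches strcmp for plain ASCII names. */
    static List<String> byCharCode(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        Collections.sort(copy); // uses String.compareTo
        return copy;
    }

    /** Sorts with the collation rules of the given locale. */
    static List<String> byLocale(List<String> names, Locale locale) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("abca.com", "abc-d.com", "abce.com");
        System.out.println("char codes : " + byCharCode(names));
        System.out.println("en_US rules: " + byLocale(names, Locale.US));
    }
}
```

Locale-aware collation typically compares letters first and uses punctuation only to break ties, which would also explain why abc---d.com, abc--d.com and abc-d.com still come out in a stable order instead of comparing as equal.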

posted @ 2007-02-02 07:10 Dedian | Views (1417) | Comments (1)

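Create Your Own Search Engine

As the amount of information on the web grows day by day, people's study and work depend more and more on web search engines (some everyday examples are mentioned in the post "Google Turns 8 Today").

On the other hand, we are helpless before the thousands of results a search returns, so in practice we stay interested only in the first page. That has given rise to all kinds of techniques (such as SEO) and tricks (spamdexing) for getting onto the first page of a search engine; the worst cases turn into open fights, and there are even behind-the-scenes manipulators inside search engine companies.

Users, for their part, need a bit of cleverness to get what they are searching for quickly.

Google, the leader in search, has clearly noticed this fact and the customer need it creates: a search engine should be customizable.

Google's recently launched custom search service addresses this need.

The idea is simple: users pick, according to their own interests, some information-rich websites that they visit often or care about. They can then tell Google's search engine to search only those sites, or to give those sites priority.

In addition, this simple product has a web 2.0 flavor: several people with similar interests can edit the site list together, forming a kind of search circle (search community) that finds what they are all interested in.

Anyone interested can try it out. As a first attempt I have defined a custom search engine related to blogs.

Click here, or use this link:
http://www.google.com/coop/cse?cx=006688650489436466578%3Ac7-4rxi0jf4

Or use this simple domain name:

http://blogdigger.info

If you are interested, come and play along; all you need is a Gmail account.

Joining is easy: just click this link on the home page:

Volunteer to contribute to this search engine.

You do need a Google account, of course (if you don't have one, it doesn't matter: you can simply register one with your email address).

You then become a member of this search engine. Whenever you find a website that is good and rich in information, you can add it to Blog Digger's site list. You can also add search entries for searches you are interested in.

If you come to enjoy this customized Google, remember this link: http://blogdigger.info

posted @ 2006-10-27 06:04 Dedian | Views (2392) | Comments (3)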

Again: a problem or a bug in URLConnection?

I am not sure whether this is a bug in (Http)URLConnection, but for some URLs it sometimes hangs when I call any method that reads information from the connection (getResponseCode, getInputStream, getContent, getContentLength, getHeaderField and so on) after the connection has been established, even though I have set both the read timeout and the connect timeout.

The calls to openConnection() and connect() are fine, so I am curious about this problem.

Has anybody seen the same or a similar problem with URLConnection?
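
For reference, here is a minimal sketch of how the two timeouts are normally set and what a timeout looks like when it does fire. The class and method names are invented for this example, and the "bad link" is simulated with a local socket that accepts connections but never answers. It does not reproduce the hang described above; it only shows the expected behaviour. Two caveats that may matter here: the read timeout applies to each blocking read rather than to the whole response, and the connect timeout does not cover the DNS lookup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URL;

class TimeoutProbe {
    /** Returns the HTTP status, or -1 if the server did not answer in time. */
    static int statusOrTimeout(String address, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setConnectTimeout(timeoutMillis); // limit for establishing the TCP connection
        conn.setReadTimeout(timeoutMillis);    // limit for each blocking read of the response
        try {
            return conn.getResponseCode();     // the request is actually sent here
        } catch (SocketTimeoutException e) {
            return -1;
        } finally {
            conn.disconnect();
        }
    }

    /** Probes a local socket that accepts connections but never replies. */
    static int probeSilentServer(int timeoutMillis) {
        try (ServerSocket silent = new ServerSocket(0)) {
            return statusOrTimeout("http://127.0.0.1:" + silent.getLocalPort() + "/", timeoutMillis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probeSilentServer(500));
    }
}
```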

posted @ 2006-10-21 07:20 Dedian | Views (1313) | Comments (0)

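A Brief Introduction to Ajax

--- Happy Mid-Autumn Festival, everyone ---

Ajax (Asynchronous JavaScript and XML) is a web technique that has become popular in recent years. I have seen people on BlogJava starting to introduce Ajax, but the posts seem to stay at the level of concepts and theory, which does not make much sense to beginners who want to actually use it. I think that when learning any new technology, examples and steps are the two things that make the most sense.

Drawing on my own learning experience, I would like to walk through the basic steps of using Ajax and give a few practical examples. Since I mainly do back-end development, I am not very good at scripting. My Ajax experience is limited to a university course and some recent simple implementations for testing back-end programs. So this is only a modest starting point; comments and exchange are welcome.

0. Guide

    1. The basic flow of using Ajax
    2. The basic steps of using Ajax (simple example --> Demo)
    3. One more example (Google Suggest) (Demo)
    4. Homework :)

1. The basic flow of using Ajax

As I see it, Ajax is more like a simple web framework: it describes how to make the data shown at the front end interact efficiently with the data at the back end. Basically, the browser provides an XMLHttpRequest object (an ActiveXObject in IE, of course) that sends an HTTP request to a back-end script or servlet class, takes the text data from the response (in XML, for example, or the JSON format that people have been discussing recently), and embeds it into the front-end page or script.

[Figure: a simple flow chart of the Ajax process]

2. The basic steps of using Ajax

Below, following the flow above and a simple example (see this article), we go through the basic steps. (Code in blue is the standard idiom.)

Step 1: Form code. It accepts the front-end input and, through the action handler (which contains the code that creates the XMLHttpRequest object), sends the request to the back end.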

<input id="username" name="username" type="text"
  onblur="checkName(this.value,'')" />
<span class="hidden" id="nameCheckFailed">
  This name is in use, please try another.
</span>

<script language="javascript">
function checkName(input, response)
{
  if (response != ''){
    // Response mode
    message = document.getElementById('nameCheckFailed');
    if (response == '1'){
      message.className = 'error';
    }else{
      message.className = 'hidden';
    }
  }else{
    // Input mode
    url = 'http://localhost/xml/checkUserName.php?q=' + input;
    loadXMLDoc(url);
  }
}

var req;

function loadXMLDoc(url)
{
    // branch for native XMLHttpRequest object
    if (window.XMLHttpRequest) {
        req = new XMLHttpRequest();
        req.onreadystatechange = processReqChange;
        req.open("GET", url, true);
        req.send(null);
    // branch for IE/Windows ActiveX version
    } else if (window.ActiveXObject) {
        req = new ActiveXObject("Microsoft.XMLHTTP");
        if (req) {
            req.onreadystatechange = processReqChange;
            req.open("GET", url, true);
            req.send();
        }
    }
}
</script>

Notes:
1. The form here is just an input box. The action is onblur, i.e. it responds to the loss of focus by calling the function checkName, which uses XMLHttpRequest to send a GET request to the PHP server script (as you can see, the script file is named checkUserName.php and its only parameter is q).
2. The function loadXMLDoc contains general-purpose code for creating the XMLHttpRequest object. The standard version, tidied up, looks like this:

    var req;
    function foo()
    {
        req = false;

        // branch for native XMLHttpRequest object
        if (window.XMLHttpRequest)
        {
            try
            {
                req = new XMLHttpRequest();
            }
            catch (e)
            {
                req = false;
            }
        }
        else if (window.ActiveXObject) // branch for IE/Windows ActiveX version
        {
            try
            {
                req = new ActiveXObject("Msxml2.XMLHTTP");
            }
            catch (e)
            {
                try
                {
                    req = new ActiveXObject("Microsoft.XMLHTTP");
                }
                catch (e)
                {
                    req = false;
                }
            }
        }
        if (req)
        {
            // do something here
        }
    }

Step 2: Response handling code. The XMLHttpRequest object has a property that works like a message handler: setting req.onreadystatechange tells XMLHttpRequest which function should process the text returned by the server.
In the example above:

req.onreadystatechange = processReqChange;

So next we need a processReqChange function:

function processReqChange()
{
    // only if req shows "complete"
    if (req.readyState == 4) {
        // only if "OK"
        if (req.status == 200) {
            // ...processing statements go here...
            processResponse();
        } else {
            alert("There was a problem retrieving the XML data:\n" + req.statusText);
        }
    }
}

function processResponse()
{
    // read <method> and <result> from the XML reply, then call method('', result)
    response  = req.responseXML.documentElement;
    method    = response.getElementsByTagName('method')[0].firstChild.data;
    result    = response.getElementsByTagName('result')[0].firstChild.data;
    eval(method + '(\'\', result)');
}

Notes:
1. The processReqChange function is basically the standard idiom.
2. It uses the global variable req (the XMLHttpRequest object) defined earlier.

Step 3: Back-end code (a PHP server script in this example). It accepts the front-end request, processes its parameters and returns the corresponding result.

File name: checkUserName.php

<?php
header('Content-Type: text/xml');

// returns '1' if the name is already taken, '0' otherwise
function nameInUse($q)
{
  if (isset($q)){
    switch(strtolower($q))
    {
      case 'drew' :
          return '1';
      case 'fred' :
          return '1';
      default:
          return '0';
    }
  }else{
    return '0';
  }
}
?>
<?php echo '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'; ?>
<response>
  <method>checkName</method>
  <result><?php echo nameInUse(isset($_GET['q']) ? $_GET['q'] : null); ?></result>
</response>
Note: the code is simple enough not to need explanation. What it returns is a string in XML format.

See the overall effect here.
Enter the name "fred" or "drew"; once the input box loses focus, a message saying the name already exists is shown.
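
Since this blog is mostly about Java, the logic of checkUserName.php can also be sketched in Java. This is only an illustration under assumptions: the class and method names (NameCheck, nameInUse, responseXml) are invented here, and the code covers just the part that decides the result and builds the XML reply. Wiring it into a servlet or another HTTP handler is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class NameCheck {
    // The two names the PHP example treats as already taken.
    private static final Set<String> TAKEN = new HashSet<>(Arrays.asList("drew", "fred"));

    /** Returns "1" if the name is in use (case-insensitive), otherwise "0". */
    static String nameInUse(String q) {
        if (q == null) {
            return "0";
        }
        return TAKEN.contains(q.toLowerCase(Locale.ROOT)) ? "1" : "0";
    }

    /** Builds the same XML reply shape as the PHP script. */
    static String responseXml(String q) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
             + "<response><method>checkName</method><result>"
             + nameInUse(q)
             + "</result></response>";
    }

    public static void main(String[] args) {
        System.out.println(responseXml("Fred"));
    }
}
```

Keeping the result element free of surrounding whitespace matters, because the front end compares the text of that element with '1' exactly.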

3. One more example

Here is another practical example, a class assignment from a course I took, and quite a representative one. It is about using Google Suggest (the new Google Toolbar seems to use this feature). Here is the finished DEMO. More and more websites now provide APIs similar to web services; using the API URLs they provide, we can fetch data that is useful to us and put it on our own pages. This is where Ajax comes in. Some of the returned text is in XML format and can be handled as in the simple example above, but many services, like Google Suggest, return text that looks like a piece of code. In that case we use JavaScript's eval function to run the text as code embedded in our own page. If the embedded code calls a function, we have to write a function of the same name ourselves as its implementation. (This is the optional func 3 in the flow chart.)

I will not paste the complete code here, only the key parts (the back end was originally written as a Java servlet, but the hosting space for the demo has no Tomcat and does not support servlets, so I switched to PHP; you can rewrite it in Java yourself as homework :) ):

1) Form code:

<form name="QForm" method="POST" action="google_suggest.php">
    <table bgcolor="8080C0" width="90%">
    <tr>
        <td nowrap>Search Term:</td>
        <td><input type="text" name="qtext" onkeyup="return GetSuggestion()" size="60"></td>
    </tr>
    <tr>
        <th colspan="2" align="left" bgcolor="#A8A8FF"><DIV id=google_suggest_target>results go here . . . </DIV></th>
    </tr>
    </table>
</form>

Notes:
a. As you can see, the query string is posted to google_suggest.php.
b. The action function is GetSuggestion(); the string it gets back is shown in the space reserved on the page.

2) Back-end code (PHP): it receives the front-end request, turns it into a request to the Google Suggest API URL, and returns the text it receives to the front end. The code is simple:

File name: google_suggest.php

<?php
function getGoogleSuggest($q)
{
    // urlencode so that spaces and '&' in the query do not break the URL
    $url = "http://www.google.com/complete/search?hl=en&js=true&qu=" . urlencode($q);
    return file_get_contents($url);
}
?>

<?php echo getGoogleSuggest(isset($_POST['q']) ? $_POST['q'] : ''); ?>

Notes:
a. The Google Suggest API returns text in the form of code, like this:
sendRPCDone(frameElement, "", new Array(), new Array(), new Array(""));
So after the front end receives this text, we should write a sendRPCDone function to process the information further (for example, to list the suggestions).
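
For anyone attempting the Java version of this back end, one detail is worth getting right: the query has to be URL-encoded before it is appended to the upstream URL. Below is a small sketch of just that step. The class and method names are invented for illustration; the base URL is the one used in the PHP script above and may no longer be served by Google.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SuggestUrl {
    // Same endpoint and parameters as in google_suggest.php above.
    static final String BASE = "http://www.google.com/complete/search?hl=en&js=true&qu=";

    /** Builds the upstream URL with the user's query safely encoded. */
    static String build(String query) {
        try {
            return BASE + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("java & ajax"));
    }
}
```

Without the encoding, a query such as "java & ajax" would be cut off at the "&" and the rest would be read as a separate parameter.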

3) Front-end response handling code:

    <script type="text/javascript">
        var req;

        function GetSuggestion()
        {
            req = false;

            // branch for native XMLHttpRequest object
            if (window.XMLHttpRequest)
            {
                try { req = new XMLHttpRequest(); }
                catch (e) { req = false; }
            }
            else if (window.ActiveXObject) // branch for IE/Windows ActiveX version
            {
                try { req = new ActiveXObject("Msxml2.XMLHTTP"); }
                catch (e)
                {
                    try { req = new ActiveXObject("Microsoft.XMLHTTP"); }
                    catch (e) { req = false; }
                }
            }
            if (req)
            {
                var url = "google_suggest.php";

                req.onreadystatechange = processReqChange;
                req.open("POST", url, true);
                req.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
                // encode the query so that '&', '+' and non-ASCII text survive the POST body
                req.send("q=" + encodeURIComponent(document.QForm.qtext.value));
            }
        }

        function processReqChange()
        {
            if (req.readyState == 4) // only if req shows "loaded"
            {
                if (req.status == 200) // only if "OK"
                {
                    // the reply is a call to sendRPCDone(...); run it
                    eval(req.responseText);
                }
                else
                {
                    alert("There was a problem retrieving the data:\n" + req.statusText);
                }
            }
        }

        // called by the evaluated reply: arr1 holds the suggestions, arr2 the result counts
        function sendRPCDone(frameElement, qString, arr1, arr2, arr3)
        {
            var suggest_results = arr1;
            var counts = arr2;
            var htmlstr = "<TABLE cellspacing=4 border=0>";
            for (var i = 0; i < suggest_results.length; i++)
            {
                htmlstr += "<tr><td><a href=\"javascript:self.location=\'http://www.google.com/search?hl=en&q=" + suggest_results[i] + "&btnG=Google+Search\'\">" + suggest_results[i] + "</a></td>";
                htmlstr += "<TD width=200><font color=228b22>" + counts[i] + "</font></TD></TR>";
            }
            htmlstr += "</TABLE>";
            document.getElementById("google_suggest_target").innerHTML = htmlstr;
        }
    </script>

4. Homework :)

You have to write some code yourself for the knowledge to stick :)
Task:
We often use del.icio.us to bookmark websites or articles we like and to add comments that work like reading notes. How can we use the API provided by del.icio.us to access those notes and show them in our own blog?
Hints:
1. You need a del.icio.us account that already has some bookmarks to serve as test data :)
2. The API URL is "http://del.icio.us/feeds/json/" + "your account name"; take a look yourself to see what format of text it returns. To limit the number of records returned, you can add a parameter such as "?count=10".

Finally, happy Mid-Autumn Festival to everyone!

--------------------------- The End ----------------------------

posted @ 2006-10-07 07:05 Dedian | Views (2247) | Comments (2)

PHP/Java Integration on Windows

reference: http://us3.php.net/java
help doc: http://php-java-bridge.sourceforge.net/


1- Make sure you have installed Apache 2, PHP 5 and Java (J2EE 1.5)

2- Download pecl-5.0.5-Win32.zip and php-java-bridge_2.0.8.zip, which include the extra DLLs

    - unpack the PECL package to your extensions folder; in PHP 5 it is ext.

    - unpack the Java bridge to the root PHP folder; in my case it is simply C:\PHP

Note:
1. The Java bridge includes new versions of certain files such as php_java.dll, so it is wise to rename the old files that came with the PECL package (for example to file_old) so that you can roll back at any time.
2. Don't run the batch file under php-java-bridge after unpacking it to the PHP root folder; just add the following lines to the php.ini configuration file (the paths depend on the installation folder of J2EE):

extension=php_java.dll
extension_dir = "C:\php\ext" 
[java]
java.java_home=C:\Program Files\Java\jre1.5.0_06
java.java=C:\Program Files\Java\jre1.5.0_06\bin\javaw.exe
java.log_level=2
;java.log_file=ext/JavaBridge.log

posted @ 2006-10-06 09:05 Dedian | Views (1135) | Comments (0)

Install Apache2 & PHP5 on Windows XP

http://www.apachelounge.com/forum/viewtopic.php?t=570

http://www.webmasterstop.com/86.html

posted @ 2006-09-29 05:44 Dedian | Views (1026) | Comments (0)

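Google Turns 8 Today

You have probably already seen the new logo on Google's home page. That's right: today is Google's 8th birthday.

I cannot remember when I first used Google. Today this one search engine has changed people's online lives and brought a revolution to the Internet. And while everyone is busy talking about online communities and social networking, search engines are stepping up to a new level again.

In eight years Google has grown from a single search product into a whole range of products that change or influence people's lives, and it keeps driving changes in web concepts and technology. Products we use often include Google Talk, Google AdSense, Gmail, Google Calendar, Google Maps, Google Video, Google Store, Google Earth, Google Toolbar and Google Desktop. And there are many more products that Google is still thinking about.

In short, if the web has become part of your life, then Google is increasingly part of your life too. Google's culture, along with its products, is also increasingly the model that many other web companies imitate.

So let's look at what we ordinary users typically search for on Google:

1. If you have not seen a friend for years, try searching for his name.
2. If you are writing and cannot recall an idiom or an old poem, search for the fragment you can remember.
3. If you are looking for a picture, try a search as well.
4. If you are doing homework or writing an article or a thesis, nothing could be better: you can find plenty of interesting, relevant material.
5. If you don't know how to translate your transcript, use Google's translation feature.
6. If you run into an unfamiliar word, sentence, piece of slang or bit of cultural background, use Google; the wiki entry is usually on the first page of results.
7. If you hear a good song but don't know its title or who sings it, and would like the lyrics too, search for the few lines you heard.
8. If you get a strange phone call, search for the number; you may find out which company called.
9. If you think a person, a website or an article is cool, search for it as well; lots of interesting things turn up.
10. If everyone is talking about something, or a topic or term has recently become popular, search for it and see what they are actually talking about.
11. If you see an English abbreviation that seems famous, search for it and find out what it stands for.
12. Computer trouble? Don't panic. Search first and see whether anyone has had the same problem and whether there is a solution.
13. This guy's web page looks cool; how was it done? Search for it, and you are sure to learn something.
14. If you have a question, search for it; there may well be an answer.

All right, there must be many more. Please keep adding to the list...

posted @ 2006-09-28 07:55 Dedian | Views (1047) | Comments (1)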

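About Zhuaxia

When you have a good idea, you may feel a flash of excitement. But if you find that you cannot realize the idea on your own and cannot find like-minded partners either, the excitement quickly turns into frustration. A few days later you discover that someone on the web has already done almost the same thing, and done it better than your original idea, and the frustration turns into a sense of loss.

Many ordinary but reasonably bright IT people have to live with this kind of loss to one degree or another.

Zhuaxia (抓虾) once gave me exactly that feeling, so much so that for a long time I did not register an account. But having pulled myself together, I am happy to embrace such an excellent home-grown web 2.0 site.

The idea behind Zhuaxia is simple. It is a new Chinese subscription site that combines the web 2.0 concept with the currently popular RSS syndication format. Online RSS subscription sites such as Bloglines have existed abroad for a long time, but they do not integrate the web 2.0 concept as organically as Zhuaxia does; Bloglines is just a simple subscription system with simple sharing.

As for web 2.0, a concept that rose from the ruins of the last Internet bubble, most Internet users are by now closely acquainted with it. Blogs and wikis, popular in China since 2005, are typical products of web 2.0.

Websites used to be more like platforms for publishing information. If a website is a cinema, then we users are at most the audience; even if we register as VIPs and watch from a private box, that is all we are. You can even take the film home to watch, but you cannot control what the cinema shows, nor can you casually release a film you made yourself.

The web 2.0 concept, by contrast, is to give users a platform on which to enjoy all kinds of web services.

Users are no longer the audience; they can be actors, directors, distributors, even resellers. Technically speaking, web 2.0 lets users start to control the data. From the user's point of view, web 2.0 turns the Internet into a virtual community where everyone can communicate and share. (In this sense, the early BBSes and P2P download programs were all web 2.0.)

As for RSS syndication, I have always regarded it as just an XML-based data structure. Long ago, when I started developing with .NET, I accepted one idea behind XML Schema: separating data from its presentation. That is also how I overcame the urge to laugh at XML for being such a simple web standard. Even then I had the idea of using RSS as a standard structure for the messy information on the Internet, which would make search engines simpler (I once wrote a small program, a kind of material collector, for this purpose). The idea grew stronger after I took a course called Distributed Multimedia Information Management, which talked at length about ontologies and RDF for the web. That is really just a technique for describing web entities and their relationships with XML data structures. Compared with simple RSS, though, RDF seems somewhat ahead of its time in practice.

With the web 2.0 concept, a standard data structure and some concrete implementation technology (such as the currently popular Ruby), you can put together a web 2.0 site of your own. Zhuaxia has clearly done rather well here. On the one hand, there are still few successful sites of this kind in China (the ones I visit often are just Zhuaxia and Douban); on the other hand, RSS (blogs, for example) is in its boom season in China right now.

Of course, there is now a lot of talk that web 2.0 has no future. That is nothing new. The web is like that: everyone has ideas and can get the technology to build them, but only a few survive and grow big. Web 2.0 is still in its cash-burning stage, because the services are all free (everyone is used to the free lunch on the Internet); the only way is to burn money to grab users, then sell the traffic, then build a monopoly. Without money, you end up like diglog.com (奇客发现), whose idea resembles the famous digg.com but which is clearly still in the incubation stage. In this respect web 2.0 is no different from web 1.0. And that is why most IT people remain frustrated, pursuing their own dreams under the shelter of the companies, large and small, that are still alive.

posted @ 2006-09-26 08:51 Dedian | Views (1945) | Comments (2)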

Understand Java Map Collection

http://www.oracle.com/technology/pub/articles/maps1.html

posted @ 2006-09-23 02:52 Dedian | Views (1073) | Comments (1)

HttpURLConnection Problem

When I try to get information about an HTTP connection to some websites (say http://linuxbyte.net) with HttpURLConnection.getResponseCode(), the JVM seems to hang for quite a while. Somebody says the cause may be the HTTP server, which is supposedly a Microsoft web server. Here and here are bug reports for Java 1.3 and earlier. Although the problem is said to have been fixed since Java 1.4, I still get an undesirably long wait before a SocketException (Connection reset) is thrown. By the way, I have already set conn.setConnectTimeout and conn.setReadTimeout for this connection. I am not sure whether there is any way to save time by skipping those bad links.
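
One possible way to skip bad links, offered only as a sketch and not as a fix for the underlying bug, is to run the blocking call on a worker thread and give up after a hard deadline. The helper below is hypothetical (the class and method names are made up). It relies on Future.get with a timeout, and it uses a daemon thread so that a call that never returns cannot keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class DeadlineProbe {
    /** Runs the task, returning the fallback if it fails or exceeds the deadline. */
    static <T> T callWithDeadline(Callable<T> task, long millis, T fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable);
            worker.setDaemon(true); // a stuck call must not block JVM exit
            return worker;
        });
        Future<T> future = pool.submit(task);
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A sleeping task stands in for a getResponseCode() call that hangs.
        Integer status = callWithDeadline(() -> { Thread.sleep(5000); return 200; }, 200, -1);
        System.out.println(status);
    }
}
```

In a crawler, the task would wrap the call to getResponseCode(), and a fallback value such as -1 would mark the link as bad. A blocked socket read does not react to thread interruption, so the worker thread may linger until the socket itself gives up.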

posted @ 2006-09-21 06:32 Dedian | Views (1139) | Comments (0)

The Ruby Programming Language

Here is a good article that introduces Ruby ... why choose Ruby instead of Perl or Python?

posted @ 2006-09-19 05:51 Dedian | Views (949) | Comments (0)

Reader and InputStream

-- Scenario:
    The purpose of a reader is to interpret a low-level byte stream (ByteArrayInputStream, StringBufferInputStream, FileInputStream and so on) as a character stream and provide character input to whatever class needs it. Converting an InputStream to a Reader is very simple:

Reader reader = new InputStreamReader( in ); // in is an instance of InputStream or of a derived class

The issue is that sometimes we need to convert a Reader to an InputStream. Consider the following scenarios:
1.  The original InputStream has been filtered by a certain reader, and now we need to save the filtered content back into a database through an InputStream. We cannot use the original InputStream, only the filtered stream, which is available only from the reader.
2.  A class holds a reader that gives access to streaming content after complex parsing or downloading. We want to use that content without repeating the complex operations, so we need a wrapper that produces an InputStream from the reader.

-- Solution:
1. Write your own InputStream implementation, such as the following:

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

class MyInputStream extends InputStream
{
    private Reader rd;

    public MyInputStream(Reader rd)
    {
        super();
        this.rd = rd;
    }

    // implement the read() method to make this all work
    public int read() throws IOException
    {
        int t = rd.read();
        // you can do your processing on the character here,
        // fiddle with the value and return it
        // note: rd.read() yields a char (0..65535), so values above 255
        // must be encoded into bytes before this is a correct byte stream
        return t;
    }
}

Note: Applications that need to define a subclass of InputStream must always provide a method that returns the next byte of input.
(refer to http://java.sun.com/j2se/1.4.2/docs/api/java/io/InputStream.html)

-- Anything else? BTW, for parsing an XML-based input stream with SAX, I am glad to see that the InputSource constructor can take either an InputStream or a Reader (refer to http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/InputSource.html)
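
The wrapper above hands characters through unchanged, which only works for characters below 256. A fuller sketch, assuming the consumer expects bytes in a known charset such as UTF-8, encodes each character before serving it byte by byte. The class name ReaderInputStream and the helper drain are invented for this example; the code is deliberately simple (one character at a time) rather than fast, and it is not suitable for charsets that emit a byte-order mark on every encode call, such as "UTF-16".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderInputStream extends InputStream {
    private final Reader reader;
    private final Charset charset;
    private byte[] pending = new byte[0]; // encoded bytes of the current character
    private int pos = 0;

    ReaderInputStream(Reader reader, Charset charset) {
        this.reader = reader;
        this.charset = charset;
    }

    @Override
    public int read() throws IOException {
        while (pos >= pending.length) {
            int c = reader.read();
            if (c < 0) {
                return -1; // end of the underlying reader
            }
            StringBuilder chars = new StringBuilder().append((char) c);
            if (Character.isHighSurrogate((char) c)) {
                int low = reader.read(); // keep surrogate pairs together
                if (low >= 0) {
                    chars.append((char) low);
                }
            }
            pending = chars.toString().getBytes(charset);
            pos = 0;
        }
        return pending[pos++] & 0xFF;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    /** Reads a stream to the end; handy for trying the wrapper out. */
    static byte[] drain(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            for (int b = in.read(); b >= 0; b = in.read()) {
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        InputStream in = new ReaderInputStream(new StringReader("h\u00e9llo"), StandardCharsets.UTF_8);
        System.out.println(drain(in).length); // 6: the accented letter takes two bytes in UTF-8
    }
}
```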

posted @ 2006-08-29 09:46 Dedian | Views (1338) | Comments (0)

About Hash Functions

For general-purpose hash functions:

http://www.partow.net/programming/hashfunctions/

For cryptography and hash functions:

http://www.x5.net/faqs/crypto/

For a faster and better hash function (a comparison of several hash functions):

http://burtleburtle.net/bob/hash/doobs.html

----> for further reading...

posted @ 2006-08-19 03:01 Dedian | Views (983) | Comments (0)

Getting the IP Address and Hostname

1. Getting the IP Address of a Hostname

    try 
    {
        InetAddress addr = InetAddress.getByName("yahoo.com");
        byte[] ipAddr = addr.getAddress();

        // Convert to dot representation
        String ipAddrStr = "";
        for (int i=0; i<ipAddr.length; i++) {
            if (i > 0) {
                ipAddrStr += ".";
            }
            ipAddrStr += ipAddr[i]&0xFF;
        }
    } 
    catch (UnknownHostException e) {
    }

2. Getting the Hostname of an IP Address

This example attempts to retrieve the hostname for an IP address. Note that getHostName() may not succeed, in which case it simply returns the IP address.

try {
        // Get hostname by textual representation of IP address
        InetAddress addr = InetAddress.getByName("127.0.0.1");

        // Get hostname by a byte array containing the IP address
        byte[] ipAddr = new byte[]{127, 0, 0, 1};
        addr = InetAddress.getByAddress(ipAddr);

        // Get the host name
        String hostname = addr.getHostName();

        // Get canonical host name
        String hostnameCanonical = addr.getCanonicalHostName();
    } catch (UnknownHostException e) {
    }

3. Getting the IP Address and Hostname of the Local Machine

    try {
        InetAddress addr = InetAddress.getLocalHost();

        // Get IP Address
        byte[] ipAddr = addr.getAddress();

        // Get hostname
        String hostname = addr.getHostName();
    } catch (UnknownHostException e) {
    }
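
As a small addition to example 1: the manual loop that builds the dotted string can usually be replaced by InetAddress.getHostAddress(), which returns the textual form directly. The helper class below is only an illustration (its name is made up); getByAddress does no DNS lookup and only rejects arrays of the wrong length.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DottedAddress {
    /** Converts a raw 4-byte (IPv4) or 16-byte (IPv6) address to its textual form. */
    static String dotted(byte[] rawAddress) {
        try {
            return InetAddress.getByAddress(rawAddress).getHostAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("address must be 4 or 16 bytes long", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dotted(new byte[]{127, 0, 0, 1})); // 127.0.0.1
    }
}
```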

posted @ 2006-08-18 06:53 Dedian | Views (559) | Comments (0)

How does Alexa work?

http://forums.seochat.com/alexa-ranking-49/how-does-alexa-work-140.html

posted @ 2006-08-16 07:24 Dedian | Views (310) | Comments (1)

Robert Tappan Morris

In the last digest, about the greatest software ever written, I noticed a worm named Morris, which the author ranks 12th. After finishing the development of my clustered search engine based on Lucene, I have been studying P2P architectures for my distributed search engine (more precisely, its web crawler part). While reading papers on P2P lookup protocols such as Chord, I noticed that one of the developers is also named Morris. Hmm, it is the same Morris: from the wiki I learned that he is now an associate professor at MIT and was once indicted because of the damage caused by his Morris worm. Anyway, it is very interesting to learn some stories about these geeks.

posted @ 2006-08-15 05:53 Dedian | Views (449) | Comments (0)

What's The Greatest Software Ever Written?

http://www.informationweek.com/shared/printableArticle.jhtml?articleID=191901844

12. The Morris worm
11. Google search rank
10. Apollo guidance system
9. Excel spreadsheet
8. Macintosh OS
7. Sabre system
6. Mosaic browser
5. Java language
4. IBM System 360 OS
3. gene-sequencing software at the Institute for Genomic Research
2. IBM's System R
1. Unix System III

What do you think?

posted @ 2006-08-15 02:22 Dedian | Views (346) | Comments (0)

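Google: The Godfather of Open Source?

Interested readers can refer to the original article.

Below is my rough translation:
------------------------------------------------------------

As everyone knows, Google runs on a great many GNU/Linux servers, and that is only one aspect of its support for free software. Others include Summer of Code, which has become an incubator producing a lot of excellent code and projects, and the recently opened code repository, which looks set to replace sourceforge.net (translator's note: a major home of open source). On the one hand, Google has contributed through its Picasa for GNU/Linux (translator's note: a photo management program) to Wine (translator's note: a Windows layer for Linux/Unix, built on X Window); on the other hand, Google also sponsors open source projects, for example in Sri Lanka, to the tune of about $25,000.

Google also funds open source in quieter ways, of course. For example, to everyone's surprise, the Mozilla Foundation (translator's note: the maker of Firefox, the other browser we all know) earned as much as 72 million last year, simply by making Google the default search engine in Firefox.

In January 2005 Google hired Ben Goodger, the lead engineer of Firefox and one of the major open source coders. At the end of the year Guido van Rossum, the creator of Python, joined Google as well. Most recently Andrew Morton, the maintainer of the Linux 2.6 kernel, announced that he is about to leave OSDL for Google.

All of this points to a major shift in the open source world.

In the early years, people wrote their code out of personal interest in their spare time, working and learning as they went. Then the first dot-com era arrived, and many early open source companies began hiring top programmers: kernel hackers such as Alan Cox, David Miller and Stephen Tweedie went to Red Hat, and others went to Linuxcare.

When the first dot-com bubble burst, these experts were forced to look for new jobs, and many went to the rising OSDL. Against this background, the rise of Google and its sweeping recruitment mean a return to the early model of companies gathering talent. This time, of course, their work is all indirectly related to Google's main market strategy.

Google's strategy is shrewd. Look at the recent hires, Goodger and Morton: one works on a browser, the other on an operating system. Both show its determination to compete quietly with Microsoft.

There is another, perhaps less obvious, reason: the recent debate over whether Google can keep its original promises to the open source community. The criticism centers on whether Google should open its own source code, since Google uses a lot of open source software.

So, from a certain angle, bringing open source hackers on board is far better than releasing code all over the place.

The debate over whether companies that use open source code should also open their own code does not concern Google alone. Other major beneficiaries include Yahoo, which has recently been busy acquiring web 2.0 companies such as Flickr and Del.icio.us, all of which clearly bear the mark of open source. Its ties to open source do not go back as far as Google's, but Yahoo has also begun to attract open source talent.

posted @ 2006-08-11 06:39 Dedian | Views (913) | Comments (0)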

Web standards or web trends?

People are still talking about web 2.0, and I am not sure it is a purely technical term. In my understanding, most of the meaning of web 2.0 may be its marketing meaning: the web is becoming a commons, and people generate the web's content. Again, I am not sure what place web services have in web 2.0. As I see it, the web is not merely a client-server marketing model (I am not talking about web structure here) but an interactive community. The question is who is going to be the operator or administrator of this community, and whether there are any rules of the game to follow. Will that be another utopia?

Well, on the technical layer, I'd like to shed some light on the so-called web standards trends:

1. front end --
         CSS ----> layout
         XML ----> data
         XHTML ----> markup
         JavaScript & DOM ----> behavior + XMLHttpRequest --> AJAX

2. back end --
         some open source projects such as Ruby on Rails...

Let me know what you think...

posted @ 2006-08-09 09:21 Dedian | Views (816) | Comments (0)

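Doug Cutting Interview -- On Search Engine Development

As the founder of two major Apache open source projects, Lucene and Nutch (plus related subprojects such as Lucy, Lucene4C, and Hadoop), Doug Cutting has long been followed by search engine developers. After working for Yahoo as a contractor for 4 years, he formally joined Yahoo as an employee this year.

Below is my spare-time translation of an interview he gave two years ago. The original (Doug Cutting Interview) is easy to find by searching Google. I hope it serves as a starting point for beginners in search engine development.

(Note: my translation skills are limited; I aim for accuracy and clarity rather than elegance. Please bear with me.)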

1. What do you do for a living? How did you get started in search engine development?

I mainly work from home on two search-related open source projects: Lucene and Nutch. The money comes mostly from contracts related to these projects. Yahoo! Labs currently sponsors part of the work on Nutch, and there are some other short-term contracts around both projects.

2. Can you give us a rough overview of Nutch? And where do you plan to use it?

Let me start with Lucene. Lucene is a library that provides full-text search; it is not an application. It offers an API that you can use in all sorts of real applications. It is now an Apache project and is widely used. Here is a list of some systems that already use Lucene.

Nutch is an implementation of web search built on top of the Lucene core, and it is a real application: you can download it and use it directly. It adds a web crawler and other web-specific pieces on top of Lucene. The goal is to scale from simple site indexing and search up to search of the whole web, like Google and Yahoo. Of course, to compete with those giants you have to be clever. We have tested it on 100M pages, and its design should handle more than 1B pages. It also runs fine on a single machine, searching a handful of servers.

3. In your view, what are the core elements of a search engine? That is, what main parts or modules can typical search engine software be divided into?

Let me think; roughly these:

 -- Fetching: downloading the pages that links point to.
 -- Database: storing information about fetched pages, such as which pages have been fetched, when they were fetched, and which pages they link to.
 -- Link analysis: analyzing the database to assign each page a weight (PageRank, WebRank, and the like) so as to estimate its importance. In my opinion, though, indexing the text inside anchors is more important. (This is why things like Google bombing are so effective.)
 -- Indexing: indexing the fetched page content, together with incoming links, link analysis scores, and so on, so that it can be queried quickly.
 -- Searching: running a query against an index and displaying the results in rank order.

Of course, for a search engine to handle hundreds of millions of pages, all of the modules above should be distributed, that is, able to run in parallel on multiple machines.
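
As a rough illustration of the indexing and searching modules in this answer, here is a toy in-memory inverted index. It is only a sketch under simple assumptions (whitespace tokenization, AND semantics, results sorted by document id, no ranking); the class and method names are made up for illustration and are not the Lucene or Nutch API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy inverted index: hypothetical names, not the Lucene or Nutch API.
class TinyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // Indexing: split the text into lowercase terms and record the document id.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            Set<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new HashSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
    }

    // Searching: return the ids of documents that contain every query term.
    List<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            Set<Integer> ids = postings.get(term);
            if (ids == null) return new ArrayList<Integer>(); // unknown term: no match
            if (result == null) result = new HashSet<Integer>(ids);
            else result.retainAll(ids);
        }
        List<Integer> out = new ArrayList<Integer>();
        if (result != null) out.addAll(result);
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "open source search engine");
        index.add(2, "web search with a crawler");
        index.add(3, "open source web crawler");
        System.out.println(index.search("open source"));
        System.out.println(index.search("web crawler"));
    }
}
```

A real engine would add the fetching, link analysis, and ranking stages around this core and would partition the postings across machines.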

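4. You said anyone can download Nutch right away and run it on their own machine. Does that mean even webmasters who have no control over their Apache server can start using Nutch quickly?

Unfortunately, most of them are probably out of luck, because Nutch needs a Java servlet container (translator's note: such as Tomcat). Some ISPs support that, but most do not. (Translator's note: only if you control the Apache server can you install something like Tomcat on it.)

5. Can I combine Lucene with the Google Web API? Or with other applications I have written before?

A few people have written Google-like APIs for Nutch, but none has been merged into the current system yet. I expect that will happen before long.

6. What do you think is the biggest obstacle to implementing a search engine today? Hardware, storage, or ranking algorithms? Also, can you tell me roughly how much space a search engine needs to work properly, say if I only want to write an engine that searches thousands to millions of RSS feeds?

Nutch needs about 10 KB of space per page in total. RSS feed pages are generally fairly small (translator's note: RSS feeds are XML-based text, so they are not large), so they should be easier to handle. Nutch does not yet have RSS-specific support, though. (Translator's note: in fact, the API does include data structures and parsing for RSS.)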

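7. Is it easy to get funding from Yahoo! Labs? Who can apply? What do you have to do in return?

I was invited; I did not apply. So I am not very clear about the process.

8. Has Google shown any interest in Nutch?

I have talked with some folks there, including Larry Page (translator's note: one of Google's two founders). They would all like to help, but they cannot find a suitable way to do so that would not also help their competitors.

9. Have you implemented your own PageRank or WebRank algorithm in Nutch? What are your considerations for ranking pages?

Yes, Nutch has a link analysis module. It is optional, because page ranking is not needed for site search.

10. I imagine you have heard this before: doesn't an open source search engine also hand an opening to the hackers who do search engine optimization (SEO)?

Well, possibly.
Say it takes about six months to reverse engineer the latest anti-spam detection algorithm in a closed-source search engine. For an open source engine it will be faster. Either way, spammers eventually find a way around it; the only difference is how fast. So the best anti-spam techniques, open or closed source, are those that keep working even after others know how they work.

Also, if during those six months you remove detected spam from your index, there is nothing spammers can do except change their sites. If your spam detection is based on statistical analysis of good and bad example sites, you can watch for new spam patterns overnight and remove them before spammers have a chance to react.

Open source makes the job of stopping spam a bit harder, but not impossible. Besides, closed-source search engines have not secretly solved these problems either. I think the advantage of closed source is that it keeps us from seeing that it is not as good as we imagine.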

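11. How does Nutch compare with the distributed crawler Grub? What do you think about this?

What I can say is that Grub is a project that lets people donate a bit of their hardware and bandwidth to LookSmart's huge crawling effort. Only the client is open source; the server is not. So people cannot set up their own Grub service, and cannot access the data Grub collects.

What about distributed crawling more generally? When a search engine gets large, the cost of crawling is dwarfed by the cost of searching. So a distributed crawler does not significantly reduce cost; instead it complicates something that is already not very expensive (translator's note: hardware such as PCs and disks). So it is not a good deal.

Widely distributed search is interesting, but I am not sure it can be done and kept fast enough. A faster search engine is a better search engine. When people can quickly revise their queries, they more often find what they need before losing patience. But building a widely distributed search engine that can search hundreds of millions of pages in under a second is hard, because network latency is high. Most of the half second or so that Google reports for a query is network latency within a single data center. If you ran the same system on PCs in thousands of homes, even over DSL and cable connections, latency would be much higher, and a query would likely take several seconds or longer. So it could not be a good search engine.

12. You keep stressing how important speed is to a search engine. I am often puzzled by how Google returns results so quickly. How do you think they do it? And what is your experience with Nutch?

I believe Google works roughly the way Nutch does: it broadcasts the query to a set of nodes, and each node returns the top results for its own pages. Each node holds a few million pages, which avoids disk access for most queries, and each node can handle tens to hundreds of queries per second. If you want to cover hundreds of millions of pages, you broadcast the query to thousands of nodes. Of course this generates a fair amount of network traffic.

The details are described in this article (www.computer.org/micro/mi2003/m2022.pdf).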
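
The broadcast-and-merge scheme in answer 12 can be sketched as follows. This is a simplified, single-process illustration, not how Nutch or Google is actually implemented: each "node" is simulated by a list of locally scored hits, and the names `Hit` and `mergeTopK` are hypothetical. A real system would send the query to the nodes in parallel over the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ScatterGather {
    // A scored result; hypothetical type for illustration only.
    static class Hit {
        final String url;
        final double score;
        Hit(String url, double score) { this.url = url; this.score = score; }
    }

    // Merge each node's local top results and keep the k best overall.
    static List<Hit> mergeTopK(List<List<Hit>> perNodeResults, int k) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> nodeHits : perNodeResults) {
            all.addAll(nodeHits);
        }
        Collections.sort(all, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Double.compare(b.score, a.score); // highest score first
            }
        });
        return new ArrayList<Hit>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<List<Hit>> results = new ArrayList<List<Hit>>();
        results.add(Arrays.asList(new Hit("a.example/1", 0.9), new Hit("a.example/2", 0.4)));
        results.add(Arrays.asList(new Hit("b.example/1", 0.7), new Hit("b.example/2", 0.6)));
        for (Hit h : mergeTopK(results, 3)) {
            System.out.println(h.score + " " + h.url);
        }
    }
}
```

Because every node returns only its own best few hits, the merge step handles a small list no matter how many pages each node holds.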

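13. You mentioned spam. Does Nutch have similar algorithms? How do you distinguish spam patterns such as link farms (translator's note: groups of pages that all link to each other, a spamdexing method devised by SEO people in 1999 to raise page rankings on the Inktomi search engine) from links to legitimately popular sites?

We have not had time to work on this yet, but it is clearly an important area. Before we get to link farms, we need to do some simpler things: look for word stuffing (translator's note: embedding certain words in a page many times, even hundreds of times, sometimes invisibly, for example as white text on a white background; this is also a spamdexing method), white-on-white text, and so on.

I think that in general (spam detection being one subproblem), the key to search quality is having a supply of reliable hand evaluations of query results. With those, we can train a ranking algorithm to produce better results (spam results are one kind of bad result). Commercial search engines typically employ people to do reliable evaluations. Nutch will do this too, but clearly we cannot just accept whatever evaluations are volunteered, since spammers could easily spoof them. So we need a way to build trust in a pool of volunteer evaluators. I think a peer-review system, something like Slashdot's karma system, should help a lot here.

14. Where do you think search engines are heading in the near future? From a developer's point of view, where will the biggest obstacles be?

Sorry, I am not an imaginative person. My prediction is that web search engines ten years from now will look much like today's. We are in a plateau. In the first few years, web search engines did develop very rapidly. The web crawlers that appeared in 1994 used standard information retrieval methods. Between then and Google's arrival in 1998, more web-specific methods were developed. Since then, the introduction of new methods has slowed considerably. The low-hanging fruit has been picked. Innovation is easy only while a field is young; the more it matures, the harder innovation becomes. Web search began in the 1990s, has now become a cash cow, and will soon be part of everyday life.

As for development challenges, I think operational reliability will be a big one. We are currently developing something like GFS (the Google File System). It is an indispensable foundation for a very large search engine: you cannot let the failure of one small component bring down the whole thing. It should be easy to scale the system just by adding more hardware to the pool, without tedious reconfiguration. And you should not need a large crowd of operators; almost everything should take care of itself.

----------------End----------------------

posted @ 2006-08-02 06:07 Dedian | Reads (14474) | Comments (199)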

CVS Tutorial

-- Getting Ready to Use CVS

First set the variable CVSROOT to /class/`username`/cvsroot
[Or any other directory you wish]
[For csh/tcsh: setenv CVSROOT ~/cvsroot]
[For bash/ksh: CVSROOT=~/cvsroot;export CVSROOT]

Next run cvsinit. It will create this directory along with the subdirectory CVSROOT and put several files into CVSROOT.

-- How to put a project under CVS

A simple program consisting of multiple files is in /workspaces/project.

To put this program under cvs first

cd to /workspaces/project

Next

cvs import -m "Sample Program" project sample start

CVS should respond with
N project/Makefile
N project/main.c
N project/bar.c
N project/foo.c

No conflicts created by this import

If you were importing your own program, you could now delete the original source.
(Of course, keeping a backup is always a good idea)

-- Basic CVS Usage

Now that you have added 'project' to your CVS repository, you will want to be able to modify the code.

To do this you want to check out the source. You will want to cd to your home directory before you do this.

cd

cvs checkout project

CVS should respond with
cvs checkout: Updating project
U project/Makefile
U project/bar.c
U project/foo.c
U project/main.c

This creates the project directory in your home directory and puts the files: Makefile, bar.c, foo.c, and main.c into the directory along with a CVS directory which stores some information about the files.

You can now make changes to any of the files in the source tree.
Let's say you add a printf("DONE\n"); after the function call to bar()
[Or just cp /class/bfennema/project_other/main2.c to main.c]

Now you have to check in the new copy

cvs commit -m "Added a DONE message." main.c

CVS should respond with
Checking in main.c;
/class/'username'/cvsroot/project/main.c,v <-- main.c
new revision: 1.2; previous revision: 1.1
done

Note: the -m option lets you give the check-in message on the command line. If you omit it, you will be placed in an editor where you can type the check-in message.

-- Using CVS with Multiple Developers

To simulate multiple developers, first create a directory for your second developer.
Call it devel2 (Create it in your home directory).
Next check out another copy of project.

HINT: cvs checkout project

Next, in the devel2/project directory, add a printf("YOU\n"); after the printf("BAR\n");
[Or copy /class/bfennema/project_other/bar2.c to bar.c]

Next, check in bar.c as developer two.

HINT: cvs commit -m "Added a YOU" bar.c

Now, go back to the original developer directory.
[Probably /class/'username'/project]

Now look at bar.c. As you can see, the change made by developer two has not been integrated into your version. For that to happen you must

cvs update bar.c

CVS should respond with
U bar.c

Now look at bar.c. It should now be the same as developer two's.
Next, edit foo.c as the original developer and add printf("YOU\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo2.c to foo.c]

Then check in foo.c

HINT: cvs commit -m "Added YOU" foo.c

Next, cd back to developer two's directory.
Add printf("TOO\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo3.c to foo.c]

Now type

cvs status foo.c

CVS should respond with

===================================================================
File: foo.c             Status: Needs Merge

   Working revision:    1.1.1.1 'Some Date'
   Repository revision: 1.2     /class/'username'/cvsroot/project/foo.c,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)

The various statuses of a file are:

Up-to-date
    The file is identical with the latest revision in the repository.
Locally Modified
    You have edited the file, and not yet committed your changes.
Needing Patch
    Someone else has committed a newer revision to the repository.
Needs Merge
    Someone else has committed a newer revision to the repository, and you have also made modifications to the file.

Therefore, this is telling us we need to merge our changes with the changes made by developer one. To do this

cvs update foo.c

CVS should respond with
RCS file: /class/'username'/cvsroot/project/foo.c,v
retrieving revision 1.1.1.1
retrieving revision 1.2
Merging differences between 1.1.1.1 and 1.2 into foo.c
rcsmerge: warning: conflicts during merge
cvs update: conflicts found in foo.c
C foo.c

Since the changes we made to each version were so close together, we must manually adjust foo.c to look the way we want it to look. Looking at foo.c we see:

void foo()
{
  printf("FOO\n");
<<<<<<< foo.c
  printf("TOO\n");
=======
  printf("YOU\n");
>>>>>>> 1.2
}

We see that the text we added as developer one is between the ======= and the >>>>>>> 1.2.
The text we just added is between the ======= and the <<<<<<< foo.c

To fix this, move the printf("TOO\n"); line to after the printf("YOU\n"); line and delete the additional lines that CVS inserted. [Or copy /class/bfennema/project_other/foo4.c to foo.c]
Next, commit foo.c

cvs commit -m "Added TOO" foo.c

Since you issued a cvs update command and integrated the changes made by developer one, the integrated changes are committed to the source tree.

-- Additional CVS Commands

To add a new file to a module:

Get a working copy of the module.
Create the new file inside your working copy.
Use cvs add filename to tell CVS to version control the file.
Use cvs commit filename to check in the file to the repository.

Removing files from a module:

Make sure you haven't made any uncommitted modifications to the file.
Remove the file from the working copy of the module: rm filename.
Use cvs remove filename to tell CVS you want to delete the file.
Use cvs commit filename to actually perform the removal from the repository.

For more information see the cvs man pages or the cvs.ps file in cvs-1.7/doc.

---------------
copied from http://www.csc.calpoly.edu/~dbutler/tutorials/winter96/cvs/

posted @ 2006-07-20 07:06 Dedian | Reads (511) | Comments (0)

Java Logging mechanism

reference:

http://java.sun.com/j2se/1.4.2/docs/guide/util/logging/overview.html

posted @ 2006-06-27 02:49 Dedian | Reads (279) | Comments (0)

Generics in the Java Programming Language

When reading the GData source code, you will find a lot of generic-style code in it. Generics are one of several language extensions in JDK 1.5. If you are using the Java 1.5 compiler, it is well worth getting familiar with them. Note that Java generics look like C++ templates but are quite different.

1. What is the idea of generics?
Simply put, generics are a way of parameterizing a type, such as a class or interface, by other types.

2. Examples
-- We are familiar with container types such as Collection. Here is typical usage from Java 1.4 or earlier:
Vector myList = new Vector();
myList.add(new Integer(100));
Integer value = (Integer)myList.get(0);

It is now better to write this in a type-safe way (under the Java 1.5 compiler setting, the Eclipse IDE displays type safety warnings for the code above):
  Vector<Integer> myList = new Vector<Integer>();
  myList.add(new Integer(100));
  Integer value = myList.get(0);

-- We can write code like this because the class Vector is declared as a generic (abbreviated declaration):
public class Vector<E>
{
      public boolean add(E e);
      ......
}

-- When we see angle brackets in a declaration, it is a generic. The names inside the brackets (E above) are formal type parameters. To use the generic we supply an actual type argument (such as Integer above); the resulting type, such as Vector<Integer>, is called a parameterized type, or an invocation of the generic.

3. Tricky points in generics

-- Generics make types such as containers more flexible about the entries they accept, but they can also be tricky. Taking containers as an example, one tricky question is: can we assign one container to a variable of another container type? If you want to do it in the following style, the answer is no.
List<String> ls = new ArrayList<String>();
List<Object> lo = ls; //compile time error!

-- We know that String is a subtype of Object, so we can assign a String value to an Object variable. But we cannot assign a List of String to a List of Object. The reason is that lo and ls would refer to the same list: through lo we could add an arbitrary Object, and ls would then no longer contain only Strings. That would make List unsafe, so the Java 1.5 compiler does not allow it.
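
To see why the compiler is this strict, it may help to compare with arrays, which do allow the equivalent assignment and pay for it at run time. The following is a minimal sketch; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class InvarianceDemo {
    // Arrays are covariant, so this compiles but fails at run time.
    static boolean arrayStoreFails() {
        Object[] objects = new String[1];      // allowed: a String[] is an Object[]
        try {
            objects[0] = Integer.valueOf(42);  // not a String
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    // Reading through a wildcard view is safe: every element is at least an Object.
    static int countElements(List<?> list) {
        int n = 0;
        for (Object ignored : list) n++;
        return n;
    }

    public static void main(String[] args) {
        List<String> ls = new ArrayList<String>();
        ls.add("hello");
        // List<Object> lo = ls;        // compile-time error, as described above
        // lo.add(Integer.valueOf(1));  // if it were allowed, ls would now hold an Integer
        System.out.println(arrayStoreFails());
        System.out.println(countElements(ls));
    }
}
```

With generics the mistake is rejected at compile time instead of surfacing later as an ArrayStoreException.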

-- Comparing the two styles of code in the examples in section 2, we might say the older style looks more flexible, because myList can accept other types besides Integer, while the new 1.5 style accepts only Integer values. If we need more flexibility, we can use wildcards.

4. Wildcards and bounded wildcards

-- If we see something like Collection<?> c, with a question mark in the angle brackets, that is a wildcard. It means the element type is unknown, so a collection of any element type matches.
-- If we see something like Collection<? extends Number> c, that is a bounded wildcard: the unknown element type has an upper bound, so it must be Number or a subtype of Number. A collection whose element type is not a subtype of Number does not match.
-- But with either kind of wildcard above, we cannot add a value of a specific type (only null): the wildcard means the element type is unknown, and the compiler cannot verify that a value belongs to an unknown type.
-- So what on earth is a wildcard good for? Go back to the idea of flexibility mentioned before. We use wildcards to express flexibility in a definition or declaration, typically for code that only reads elements.
For example, we can define a method like this:
void printCollection(Collection<?> c)
{
      for(Object e : c){System.out.println(e);}
}
See? That is flexible. You can call this method on any Collection. You can read elements from a Collection<?>; just don't try to put something into it.
-- So the question is: what if we want that flexibility for our method and also need to put something into the collection inside the method? For that, we need a generic method.

5. Generic methods
-- A method declaration can also be parameterized by types.
-- Example:
    public <T> void addCollection(List<T> objs, T obj)
    {
        objs.add(obj);
    }

6. When to use a generic method and when to use a wildcard?
-- If the type parameter is used only once, or has no relationship to the method's other argument types or its return type, a wildcard is the better choice because it is clearer and more concise.
-- Otherwise, use a generic method.
Example:
class Collections
{
      public static <T, S extends T> void copy(List<T> dest, List<S> src){...}
}
can be better rewritten as:
class Collections
{
      public static <T> void copy(List<T> dest, List<? extends T> src){...}
}
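
Here is a small runnable sketch of that last signature together with a read-only bounded wildcard. The class name and the `sum` helper are made up for illustration, and this `copy` simply appends to dest, which is an assumption of the sketch and differs from the overwrite behavior of the JDK's own Collections.copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CopyDemo {
    // T ties dest to src; the wildcard says src may hold any subtype of T.
    static <T> void copy(List<T> dest, List<? extends T> src) {
        for (T item : src) {
            dest.add(item);
        }
    }

    // Bounded wildcard used read-only: works for List<Integer>, List<Double>, ...
    static double sum(List<? extends Number> numbers) {
        double total = 0;
        for (Number n : numbers) {
            total += n.doubleValue();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Number> dest = new ArrayList<Number>();
        List<Integer> ints = Arrays.asList(1, 2, 3);
        copy(dest, ints);               // T is inferred as Number
        System.out.println(dest);
        System.out.println(sum(ints));
    }
}
```

The type parameter T is needed in `copy` because it relates the two arguments, while `sum` mentions its element type only once and so reads more cleanly with a wildcard.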

reference: http://java.sun.com/j2se/1.5/pdf/generics-tutorial.pdf

posted @ 2006-06-23 09:39 Dedian | Reads (1395) | Comments (0)

something about standard of Syndication Format

http://dsonline.computer.org/portal/site/dsonline/menuitem.9ed3d9924aeb0dcd82ccc6716bbe36ec/index.jsp?&pName=dso_level1&path=dsonline/0507&file=w4sta.xml&xsl=article.xsl&;jsessionid=GZQWvln9z4JY2dXX8HyQ5f5KtRptqHRWvh17tjCXVbxHnGyzvTm2!554406865

posted @ 2006-06-22 06:06 Dedian | Reads (212) | Comments (0)

Enhancements in JDK 5

http://java.sun.com/j2se/1.5.0/docs/guide/language/index.html

posted @ 2006-06-21 09:51 Dedian | Reads (205) | Comments (0)

a bug in Java ?

While debugging my webcrawler by crawling the Yahoo website, I found that connecting to a URL such as http://www.youtube.com/w/Kak%E1?v=PIBe_V9PBIA&search=kak%C3%A1 throws the following exception:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 12
	at java.lang.String.substring(Unknown Source)
	at sun.net.www.ParseUtil.unescape(Unknown Source)
	at sun.net.www.ParseUtil.decode(Unknown Source)
	at sun.net.www.ParseUtil.toURI(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
	at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)

The following is simple test code:

private static final String urlstring = "
   URL url = new URL(urlstring);

   URLConnection con = url.openConnection();

   con.connect();

Since no exceptions other than MalformedURLException and IOException are declared for this code, I am not sure whether this is a bug in Java's URL parsing...

Does anybody have any idea about that?

P.S. OK, somebody has pointed out that runtime exceptions such as java.lang.StringIndexOutOfBoundsException do not have to be declared but can still be thrown, so I need to catch StringIndexOutOfBoundsException in my code. But in my understanding, a function should catch all exceptions from the functions it calls and rethrow those it cannot handle as declared exceptions, so that callers can catch failures from deep inside. I am not sure runtime exceptions should be an exception to that...
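
A defensive pattern for a crawler, sketched below, is to treat both checked and unchecked exceptions as a failure of that one URL and move on. The `Fetcher` interface and `tryFetch` method are hypothetical names introduced only for this illustration; the failing fetcher simulates the exception from the stack trace above rather than reproducing the JDK behavior.

```java
import java.io.IOException;

class SafeFetch {
    // Hypothetical callback standing in for the code that opens the connection.
    interface Fetcher {
        void fetch(String url) throws IOException;
    }

    // Returns true on success; a failure on one URL does not stop the crawl.
    static boolean tryFetch(String url, Fetcher fetcher) {
        try {
            fetcher.fetch(url);
            return true;
        } catch (IOException e) {
            // Checked: declared by fetch(), so the compiler forces us to handle it.
            System.err.println("I/O problem for " + url + ": " + e.getMessage());
            return false;
        } catch (RuntimeException e) {
            // Unchecked: never declared, but still catchable like any other exception.
            System.err.println("Skipping " + url + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        Fetcher failing = new Fetcher() {
            public void fetch(String url) {
                throw new StringIndexOutOfBoundsException("simulated parser failure");
            }
        };
        System.out.println(tryFetch("http://example.com/", failing));
    }
}
```

Catching RuntimeException this broadly is a judgment call: it keeps a long crawl alive, at the cost of possibly hiding genuine programming errors, so logging the exception is important.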

posted @ 2006-06-15 07:48 Dedian | Reads (505) | Comments (0)

Something is in progress

Still working on the webcrawler part; the URL collection strategies are still being thought through. A URL frontier, which stores the list of active URLs to be parsed or downloaded, will handle synchronized I/O operations with the URL collection/inventory. I am stuck on some issues:

1. Duplicate URL elimination:
    a. Host name aliases --> DNS resolver
    b. Omitted port numbers
    c. Alternative paths on the same host
    d. Replication across different hosts
    e. Nonsense links or session IDs embedded in URLs
2. Reachability of URLs
3. Distributed storage of the URL inventory and the related synchronization problem
4. Fetch strategies for the URL frontier or Fetcher to get active links for parsing
5. Scheduler for fetching and updating the URL collection: multi-threaded or single-threaded on each PC, and when to decide to re-parse a page
6. URL-seen test: has the page been parsed, and should it be re-parsed? This should be done before a URL enters the URL frontier...
7. Extensibility issues for these modules: Fetcher, Extractor/Filters, Collector...
8. Checkpointing for interrupted crawls: how to resume a crawl job, how to split crawl jobs and distribute them to different machines
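
Part of the duplicate elimination and the URL-seen test can be sketched with java.net.URI. This is a minimal illustration that assumes absolute http/https URLs and handles only case differences, default ports, "." and ".." path segments, and fragments; host aliases, mirrors, and session IDs need more work. The class name `UrlSeen` and its methods are made up for this example.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class UrlSeen {
    private final Set<String> seen = new HashSet<String>();

    // Canonical form: lowercase scheme and host, no default port, no fragment.
    static String normalize(String url) {
        URI u = URI.create(url).normalize();   // also resolves "." and ".." segments
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("absolute URL expected: " + url);
        }
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        int port = u.getPort();
        if ((scheme.equals("http") && port == 80) || (scheme.equals("https") && port == 443)) {
            port = -1;                          // the default port is the same as no port
        }
        String path = u.getRawPath();
        if (path == null || path.isEmpty()) path = "/";
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1) sb.append(':').append(port);
        sb.append(path);
        if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
        return sb.toString();
    }

    // Returns true the first time a URL (in canonical form) is offered.
    boolean markSeen(String url) {
        return seen.add(normalize(url));
    }

    public static void main(String[] args) {
        UrlSeen frontierFilter = new UrlSeen();
        System.out.println(frontierFilter.markSeen("http://example.com:80/x"));
        System.out.println(frontierFilter.markSeen("http://EXAMPLE.com/x"));
    }
}
```

An in-memory HashSet only works for small crawls; for a distributed URL inventory the same canonical string could serve as the key in whatever shared store is chosen.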

It seems I need a couple of days to refine my system architecture design...

posted @ 2006-06-09 08:57 Dedian | Reads (847) | Comments (0)

I/O Design Patterns

Here is an article on effective I/O programming. I am marking it so that I can later re-check the I/O design of my distributed search engine system. My current system uses non-blocking synchronous mode. I need to check later whether anything can be done to improve performance and scalability.

posted @ 2006-06-09 08:56 Dedian | Reads (204) | Comments (0)

Good or Bad, Check your OO Design

A PhD student at the University of Auckland has proposed an idea for checking your OO design in Java. The key point is to use a directed graph to analyze the dependencies among all the Java classes: the more classes are involved in a cycle, the worse the design.

Several open source Java programs are examined in his research report...
Though it is not the only metric for checking your OO design, I'd say it is an interesting idea.
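
The core of such a check can be sketched in a few lines: build a directed graph of "class A depends on class B" edges and report every class that can reach itself. This is only my illustration of the general idea, not the method from the research report; the class names and the hand-built graph are assumptions, and a real tool would extract the edges from bytecode or source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DependencyCycles {
    private final Map<String, List<String>> deps = new HashMap<String, List<String>>();

    void addDependency(String from, String to) {
        if (!deps.containsKey(from)) deps.put(from, new ArrayList<String>());
        if (!deps.containsKey(to)) deps.put(to, new ArrayList<String>());
        deps.get(from).add(to);
    }

    // All classes reachable from start by following one or more edges.
    private Set<String> reachableFrom(String start) {
        Set<String> visited = new HashSet<String>();
        List<String> stack = new ArrayList<String>(deps.get(start));
        while (!stack.isEmpty()) {
            String current = stack.remove(stack.size() - 1);
            if (visited.add(current)) {
                stack.addAll(deps.get(current));
            }
        }
        return visited;
    }

    // A class takes part in a dependency cycle if it can reach itself.
    Set<String> classesInCycles() {
        Set<String> result = new HashSet<String>();
        for (String name : deps.keySet()) {
            if (reachableFrom(name).contains(name)) result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        DependencyCycles graph = new DependencyCycles();
        graph.addDependency("Crawler", "Parser");
        graph.addDependency("Parser", "Indexer");
        graph.addDependency("Indexer", "Crawler");  // closes a cycle
        graph.addDependency("Indexer", "Logger");
        System.out.println(graph.classesInCycles().size() + " classes are in cycles");
    }
}
```

Running one search per class is quadratic in the worst case; a strongly connected components algorithm would do the same job in linear time for large code bases.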

posted @ 2006-06-08 03:05 Dedian | Reads (986) | Comments (0)

Retrieve values in Hashtable or HashMap

Unlike collection types such as Vector or List, a Map (Hashtable or HashMap) accesses a value by its key. If we want to retrieve all the values that have been put into a Map, one simple way is to use a Collection, optionally with an Iterator. Here is sample code (it retrieves only the values and skips the keys), assuming there is a variable: HashMap<String, ComplexDataType> links

Collection<ComplexDataType> c = links.values();
Vector<ComplexDataType> v = new Vector<ComplexDataType>(c);
for(int i = 0; i< v.size(); i++)
{
    ComplexDataType tempData = v.get(i);
    dosomethingwith(tempData);
}

P.S. Map provides three views of its contents: keySet, entrySet and the values collection. We can use any of them.
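
For comparison, here is a short sketch that walks the entrySet and values views directly, without copying into a Vector. The class and method names are illustrative, and a TreeMap is used in main only so that the printed order is predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapViews {
    // Builds "key=value" strings from the entrySet view.
    static List<String> describe(Map<String, Integer> map) {
        List<String> lines = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            lines.add(entry.getKey() + "=" + entry.getValue());
        }
        return lines;
    }

    // Sums the values() view without touching the keys.
    static int total(Map<String, Integer> map) {
        int sum = 0;
        for (int value : map.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> links = new TreeMap<String, Integer>();
        links.put("a", 1);
        links.put("b", 2);
        System.out.println(links.keySet());
        System.out.println(describe(links));
        System.out.println(total(links));
    }
}
```

The views are backed by the map, so iterating them avoids the extra copy; copying is still useful when the map may be modified during the loop.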

posted @ 2006-06-02 07:16 Dedian | Reads (342) | Comments (0)

Java Interview Questions

These questions are very useful for Java newbies and for anyone preparing for interviews for Java programming positions, which is really cool.

reference:
http://www.allapplabs.com/interview_questions/java_interview_questions.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_2.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_3.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_4.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_5.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_6.htm

posted @ 2006-06-02 06:14 Dedian | Reads (388) | Comments (0)

Java Reading & Writing file

1. Reading text from Standard Input

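try 
{
       BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
       String str;
       while (true) 
       {
          System.out.print("> some prompt ");
          str = in.readLine();
          if (str == null) break;   // end of input: stop before processing a null line
          dosomethingwith(str);
       }
} 
catch (IOException e) 
{
       // Report the failure instead of silently ignoring it
       e.printStackTrace();
}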

2. Reading text from a file

try 
{
     BufferedReader in = new BufferedReader(new FileReader("filename"));
     String str;
     while ((str = in.readLine()) != null) 
     {
	dosomethingwith(str);
     }
     in.close();
} 
catch (IOException e) 
{
}

3. Reading a file into a byte array

    // Returns the contents of the file in a byte array.
    public static byte[] getBytesFromFile(File file) throws IOException 
    {
        InputStream is = new FileInputStream(file);

        // Get the size of the file
        long length = file.length();

        // You cannot create an array using a long type.
        // It needs to be an int type.
        // Before converting to an int type, check
        // to ensure that file is not larger than Integer.MAX_VALUE.
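        if (length > Integer.MAX_VALUE) 
	{
            // File is too large to fit in a single byte array
            is.close();
            throw new IOException("File is too large: "+file.getName());
        }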

        // Create the byte array to hold the data
        byte[] bytes = new byte[(int)length];

        // Read in the bytes
        int offset = 0;
        int numRead = 0;
        while (offset < bytes.length
               && (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) 
	{
            offset += numRead;
        }

        // Ensure all the bytes have been read in
        if (offset < bytes.length) 
	{
            throw new IOException("Could not completely read file "+file.getName());
        }

        // Close the input stream and return bytes
        is.close();
        return bytes;

    }

4. Writing to a file

try 
{
    BufferedWriter out = new BufferedWriter(new FileWriter("filename"));
    out.write("some string");
    out.close();
} 
catch (IOException e) 
{
}

Note: If the file does not already exist, it is automatically created.

5. Appending to a file

try 
{
     BufferedWriter out = new BufferedWriter(new FileWriter("filename", true));
     out.write("appending String");
     out.close();
} 
catch (IOException e) 
{
}

6. Using a Random Access File

try 
{
     File f = new File("filename");
     RandomAccessFile raf = new RandomAccessFile(f, "rw");

     // Read a character
     char ch = raf.readChar();

     // Seek to end of file
     raf.seek(f.length());

     // Append to the end
     raf.writeChars("aString");
     raf.close();
} 
catch (IOException e) 
{
}
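
On Java 7 or later (an assumption, since the snippets above predate it), the writing, appending, and reading examples can be written with try-with-resources so that the streams are closed even when an exception is thrown. The class and method names below are made up for this sketch.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class FileRoundTrip {
    // Writes one line; append=true adds to the end instead of overwriting.
    static void writeLine(File file, String line, boolean append) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(file, append))) {
            out.write(line);
            out.newLine();
        }   // out is closed here, even if write() threw
    }

    // Reads the whole file as a list of lines.
    static List<String> readLines(File file) throws IOException {
        List<String> lines = new ArrayList<String>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String str;
            while ((str = in.readLine()) != null) {
                lines.add(str);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("demo", ".txt");
        tmp.deleteOnExit();
        writeLine(tmp, "some string", false);
        writeLine(tmp, "appending String", true);
        System.out.println(readLines(tmp));
    }
}
```

Letting IOException propagate to the caller, as done here, is usually preferable to an empty catch block, because the caller can decide how to report or recover from the failure.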

reference:
http://javaalmanac.com/egs/java.io/pkg.html

posted @ 2006-05-31 08:12 Dedian | Reads (563) | Comments (1)

Java Glossary -- Volatile

volatile

The volatile keyword is used on variables that may be modified simultaneously by other threads. This warns the compiler to fetch them fresh each time, rather than caching them in registers. This also inhibits certain optimisations that assume no other thread will change the values unexpectedly. Since other threads cannot see local variables, there is never any need to mark local variables volatile.

quote from:

http://mindprod.com/jgloss/volatile.html
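
A common use of volatile is a stop flag shared between threads. The sketch below is a minimal example with made-up names: a worker spins until another thread clears the flag. Without volatile, the worker would not be guaranteed to ever see the change.

```java
class StopFlag {
    // volatile: a write by one thread is visible to later reads by other threads.
    private volatile boolean running = true;
    private long iterations = 0;   // only touched by the worker thread

    void stop() {
        running = false;
    }

    // Starts a worker that spins until stop() is called; returns true if it exited in time.
    static boolean runAndStop(long timeoutMillis) {
        final StopFlag flag = new StopFlag();
        Thread worker = new Thread(new Runnable() {
            public void run() {
                while (flag.running) {   // re-read from memory on every pass
                    flag.iterations++;
                }
            }
        });
        worker.setDaemon(true);          // do not keep the JVM alive if something goes wrong
        worker.start();
        flag.stop();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + runAndStop(2000));
    }
}
```

Note that volatile only guarantees visibility; compound updates such as counters shared between several threads still need synchronization or the java.util.concurrent.atomic classes.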

posted @ 2006-05-25 04:45 Dedian | Reads (306) | Comments (1)

Lucene 2.0 release most likely this Friday

Though the vote is still open, the release was proposed by Doug Cutting himself and has received only positive votes so far, so we will very likely get a 2.0 release this Friday. Some bugs have been fixed and deprecated code has been removed in this upcoming version.

posted @ 2006-05-24 09:00 Dedian | Reads (226) | Comments (0)

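Musings on the Years

Twenty years ago

I was collecting every kind of praise from teachers and parents, wearing every kind of little red flower, holding every kind of competition certificate.

My current boss was perhaps catching fish in a pond, catching cicadas in trees, and pestering his parents for lollipops.

Ten years ago

I began dating, began walking in the moonlight along paths nobody walked, began hesitantly learning to write poetry.

My current boss was perhaps gloomily cramming high school textbooks, or perhaps starting to pass little notes to the girl at the next desk.

Today, ten years later

My sweetheart finally became my wife, and I am grinding away, writing baffling code in the small corner my current boss provides.

The girl at the next desk finally became a memory, and my current boss sits in a bright, spacious room less than 10 meters from me, watching me and 100 others who look much the same to him slave away writing code for him, while he relaxes and nods his head to music that may or may not be rock.

Tomorrow, ten years from now

?

Ending 1:

My wife is still my wife, and I am still grinding away at code, but beside me there is now a child who looks a bit like me, tugging at my arm and demanding to play games on my computer.

Countless pretty girls drift through the building, and my current boss is in what may or may not be a room more than 100 meters from me, holding a meeting with a few fat-headed shareholders to discuss the survival of me and 1000 similar human beings.

Ending 2:

My wife is still my wife; by scrimping and saving, she and I have finally started the first company we can call our own, and I sit in my own bright office watching 100 young fellows outside, as young as I was 20 years ago, carrying on the revolution with great enthusiasm.

The pretty girls still drift by, and my current boss, in a taller and bigger tower, is discussing with a few fat-headed shareholders the important matter of acquiring the company of his former subordinate, now a small boss: me.

Ending 3:

My wife is still my wife, but I own a company whose office is filled with a crowd, my former colleagues with my current boss mixed in among them, offering me advice or grinding away in an air-conditioned room at code that is different from the code of 10 years ago.

One pretty girl has finally become a pretty young wife; my current boss, having managed his company badly, sold it to me, who once ground away writing code under him, and I gave him a decent position so that he could support a family, marry, and have children.

P.S. The function Likely(Ending n) (1<=n<=3) is strictly monotonically decreasing, with an upper bound of 0.0001.

P.S.

The musings above are pure daydreaming. My boss is not Chinese and had nothing like the childhood and youth I imagined for him. Since he does not read Chinese, daydreaming here in Chinese carries no risk of handing him something to use against me. The purpose of writing this is to express my admiration for what he has achieved so young (I hope he can read this one sentence of Chinese), and the bit of ambition that a happy life has not yet extinguished in me.

posted @ 2006-05-20 13:28 Dedian | Reads (277) | Comments (0)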

Ooops! my laptop not working...

Oops! My laptop, a Compaq Presario R3230, is not working now (it worked fine yesterday evening): blue screen, and it hangs at disk checking. When I reboot in safe mode, it still hangs at multi(0)disk(0)rdisk(0)partition(1)\windows\system32\drivers\atisgkaf.sys. I guess something is wrong with my video driver, but how can I fix the problem without wiping out the documents on my hard drive?

I tried googling it, and it seems some other people have had this problem too. These steps are suggested:

1.  Insert the QuickRestore CD into the CD drive and restart the
    system.
2.  When the red Compaq logo appears, press and hold the Caps
    Lock key.  Next screen will be a blinking QuickRestore screen.
3.  When the QuickRestore text stops blinking, press and hold the
    Num Lock key.

But where can I get a QuickRestore CD? The included CD does not seem to be in my room any more... Does anybody have any ideas?

posted @ 2006-05-20 04:32 Dedian | Reads (186) | Comments (0)

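Some Recent Notes -- On Search Engines

Because of my work, I have recently become interested in search engines. Here are some notes:

1. Making your blog's hit count soar is actually simple: write the most basic webcrawler program that keeps requesting your own home page (sending HTTP requests). Many counters simply count requests and do not check IP addresses; if you do not believe it, press F5 to refresh your own page and see. At that rate, passing Lao Xu's hit count is certainly no problem. But Sina plays games with hit counts anyway, since they can edit the counters themselves, so there is no point playing this game with them.

2. A high hit count does not mean your page ranks highly (PageRank). PageRank is a fairly technical term. Google's two young founders, Larry Page (by a nice coincidence his surname really is Page, so he could hardly avoid becoming the boss of pages) and Sergey Brin, made their fortune from the PageRank research they did at Stanford, and while still young they can now challenge MS. Of course, the details of Google's PageRank algorithm are a trade secret. But the web has no shortage of clever people: someone used Google's observed search behavior, probabilistic modeling, and other mathematics to work out an explanation of PageRank that became very popular online. You can find the paper by searching Google for PageRank Uncovered (by Chris Ridings and Mike Shishigin). It is said that someone used the mechanisms in it to thoroughly fool Google's search engine, but that can no longer be verified, because Google keeps improving itself.

3. Simply put, PageRank is a key measure of how important your site or page is. The core idea is to look at how many pages link to your page, and especially how many important pages link to it. In other words, if Lao Xu's blog, through its hit count or its influence in the national blogosphere, reaches a PageRank of 10, making it a very important page, and you are lucky enough to be added to her blogroll, so that her important page links to yours, then your PageRank will rise. This is a very simple example; the actual formula is not that simple, and you can look it up online if you are interested. Even then, it is only one factor. Still, it makes it easy to understand why so many people race to post the first comment on celebrity blogs, or deliberately say outrageous things to attract attention, and why advertising has found its way onto blogs.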
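
To make the idea in note 3 concrete, here is a small sketch of the textbook PageRank iteration, in the normalized form PR(p) = (1 - d)/N + d * sum over pages q linking to p of PR(q)/outlinks(q). This follows the published formula only; how Google actually ranks pages is not public. The damping factor 0.85, the class name, and the requirement that every page has at least one outgoing link are assumptions of this sketch.

```java
import java.util.Arrays;

class PageRankSketch {
    // links[i] lists the pages that page i links to.
    // Assumption: every page has at least one outgoing link.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                  // start with equal importance
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);  // baseline share for every page
            for (int page = 0; page < n; page++) {
                // A page splits its current rank evenly among the pages it links to.
                double share = damping * rank[page] / links[page].length;
                for (int target : links[page]) {
                    next[target] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Page 0 links to 2, page 1 links to 2, page 2 links to 0.
        int[][] links = { {2}, {2}, {0} };
        double[] rank = pageRank(links, 0.85, 100);
        for (int i = 0; i < rank.length; i++) {
            System.out.println("page " + i + ": " + rank[i]);
        }
    }
}
```

In this tiny graph page 2 ends up most important because two pages link to it, and page 0 comes next because the important page 2 links to it, which is exactly the "a link from an important page counts for more" effect described in note 3.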

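4. In fact, the idea behind PageRank comes from everyday life. Say I want to buy a computer, and I want someone who knows computers to tell me which one to buy. I know Xiao Wang knows his stuff, so I ask him. Xiao Wang says: "Hmm, the dedian brand is good, just buy a dedian." I say: "OK, I'll get that one. But how do you know? Where is it reviewed? What are its strong points?" Xiao Wang says: "Well..., I'm not too sure myself. I heard it from Xiao Li, go ask him." At that moment, even though I have never met Xiao Li, his standing in my mind shoots up: even Xiao Wang listens to that guy...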

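5. So, to make your page or site influential, you have to do everything you can to get other people to link to you and cite you. There is also another way: keep "citing" other people's articles. Citing here does not mean embedding other people's links in your own page; it means getting your link embedded in their pages. How? With the Trackback feature that many blogs have. If you look closely, whenever you trackback someone else's blog post, your blog's address is left on their page (the same goes for comments). These days, however, most blogs have a setting to disallow trackbacks or comments. Sina seems to have started tampering with this too; celebrity blogs apparently can no longer be trackbacked. Sina's blogs are unfriendly to many search engines anyway, so there is no point targeting them. MSN Spaces, on the other hand, seems workable: you could write some code that automatically visits each page, fetches every blog's permalink, and then executes the JavaScript that MSN itself provides in order to send a trackback. This is only something I thought of recently, and I have not written any code for it. If it works there, it should work on many other blogs as well. The idea came to me because I keep seeing all sorts of junk sites showing up in my own trackbacks.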

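6. That said, more and more services are now available online to shut out this kind of unfriendly attack. For example, if you are fed up with people leaving junk trackbacks and comments on your blog, you can sign up for a hosted service: another site collects the messages and comments first, filters them, and only then puts them on your blog. In short, the web is a big forest, and it has every kind of bird in it.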
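
To make point 3 concrete, here is a minimal sketch of the textbook PageRank iteration on a tiny made-up web. It is not Google's actual ranking code. The damping factor 0.85 is the value commonly quoted from the published PageRank description, the treatment of pages without outgoing links is one common convention, and the six pages and their links are hypothetical. In this toy graph your blog and another blog each have exactly one incoming link, but yours comes from the celebrity page, so it ends up with the higher rank.

```java
import java.util.Arrays;

class PageRankSketch {
    // links[i] holds the indices of the pages that page i links to.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // every page starts with equal importance
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int page = 0; page < n; page++) {
                int[] out = links[page];
                if (out.length == 0) {
                    // Assumption: a page with no outgoing links shares its rank with every page.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * rank[page] / n;
                    }
                } else {
                    // A page splits its rank evenly among the pages it links to.
                    for (int target : out) {
                        next[target] += damping * rank[page] / out.length;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    // Hypothetical web: 0 = celebrity blog, 1 = your blog, 2 = another blog, 3..5 = fans.
    static double[] demoRanks() {
        int[][] links = {
            {1},     // the celebrity links to you
            {},      // you link to nobody
            {},      // the other blog links to nobody
            {0, 2},  // fan 3 links to the celebrity and to the other blog
            {0},     // fan 4 links to the celebrity
            {0}      // fan 5 links to the celebrity
        };
        return pageRank(links, 0.85, 50);
    }

    public static void main(String[] args) {
        double[] r = demoRanks();
        System.out.println("celebrity blog: " + r[0]);
        System.out.println("your blog:      " + r[1]);
        System.out.println("other blog:     " + r[2]);
    }
}
```

Note that hit counts appear nowhere in the computation: only the link structure matters, which is the point made in item 2.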

posted @ 2006-05-19 16:15 Dedian | Views (1530) | Comments (3)

Notes for exploration of Search Engine (keep updating...)

+ Webcrawler

    -- study open source code
          purpose: analyze code structure and basic components
          focus on: Nutch (http://lucene.apache.org/nutch/)
                    & HTMLParser (http://htmlparser.sourceforge.net/)
                    & GData (http://code.google.com/apis/gdata/overview.html)

    -- understand PageRank idea
       related articles:
       http://en.wikipedia.org/wiki/PageRank
       http://www.thesitewizard.com/archive/google.shtml
       paper: "PageRank Uncovered" by Chris Ridings and Mike Shishigin
       http://www.rankforsales.com/n-aa/095-seo-may-31-03.html (about Chris Ridings & SEO)
       http://en.wikipedia.org/wiki/Web_crawler (basic idea about crawlers)

    -- get familiar with the RSS & Atom protocols

    -- sample coding:
       Interface: Scheduler for fetching web links
       Interface: Web page parser/Analyzer --> to deal with XML-based websites (weblogs or news sites, RSS & Atom) --> parser classes based on a SAX parser
       Interface: Retractor/Fetcher --> to get links from a page
       Interface: Collector --> checks whether a URL is a duplicate and saves it in the URL database with a certain data structure
       Interface: InformationProcessor --> PageRank should be one important factor --> (still thinking)
       Interface: Policies (Filter) --> will serve Collector and InformationProcessor --> (still thinking)
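
A rough sketch of how three of the interfaces listed above (Scheduler, Collector, Policies) might fit together. All method names, the FIFO ordering, the in-memory HashSet used as the "URL database", and the example URLs are assumptions made for illustration; the notes do not specify any of them.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class CrawlerSketch {
    public static void main(String[] args) {
        Scheduler scheduler = new FifoScheduler();
        Collector collector = new InMemoryCollector(new FeedOnlyPolicy(), scheduler);
        System.out.println(collector.collect("http://example.org/feed.xml"));  // new feed URL
        System.out.println(collector.collect("http://example.org/feed.xml"));  // duplicate
        System.out.println(collector.collect("http://example.org/page.html")); // rejected by policy
        System.out.println(scheduler.next());
    }
}

// Decides which link is fetched next.
interface Scheduler {
    void schedule(String url);
    String next(); // returns null when nothing is left to fetch
}

// Filter consulted by the Collector before a URL is stored.
interface Policy {
    boolean accept(String url);
}

// Checks for duplicates and stores accepted URLs.
interface Collector {
    boolean collect(String url); // true if the URL was new and accepted
}

// Simplest possible scheduler: first in, first out.
class FifoScheduler implements Scheduler {
    private final Queue<String> queue = new ArrayDeque<String>();
    public void schedule(String url) { queue.add(url); }
    public String next() { return queue.poll(); }
}

// Hypothetical policy: only follow links that look like XML feeds.
class FeedOnlyPolicy implements Policy {
    public boolean accept(String url) { return url.endsWith(".xml"); }
}

// Uses a HashSet as a stand-in for the URL database.
class InMemoryCollector implements Collector {
    private final Set<String> seen = new HashSet<String>();
    private final Policy policy;
    private final Scheduler scheduler;

    InMemoryCollector(Policy policy, Scheduler scheduler) {
        this.policy = policy;
        this.scheduler = scheduler;
    }

    public boolean collect(String url) {
        if (!policy.accept(url)) return false; // filtered out
        if (!seen.add(url)) return false;      // already known
        scheduler.schedule(url);
        return true;
    }
}
```

Keeping the policy behind its own interface means the duplicate check and the filtering rules can change independently, which seems to be the intent of listing Policies as a separate item.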

+ Indexer/Searcher (almost done, based on Lucene)

posted @ 2006-05-19 09:40 Dedian | Views (297) | Comments (1)

My favorite way to load a Java project

Motivation:

Often, when you want to read or analyze the source code of an open source project, or contribute to it, you download the source and load (import) it into your own IDE (assuming you don't want to check it out via CVS or SVN).

Here is my favorite way to do that in Eclipse:

1. Create a new blank Java project:

File -> New -> Project... -> Java Project -> Next -> enter the project name (project layout: Create separate source and output folders) -> click Finish

2. Right-click the source folder "src" -> Import... -> select File system -> click the top "Browse..." button and choose the folder that holds the downloaded source code (this should be the root source folder, so that the folder structure is kept as the package structure) -> Finish

3. If you imported the wrong source folder, delete the whole project and start over (deleting only the badly imported packages does not help).

Note:

If an Ant build file (such as build.xml) is included in the source package, things are even easier: just use File -> New -> Project... -> Java Project from Existing Ant Buildfile.

posted @ 2006-05-19 02:58 Dedian | Views (250) | Comments (0)

Crawling policies

The behavior of a web crawler is the outcome of a combination of policies:

A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading websites.
A parallelization policy that states how to coordinate distributed web crawlers.

Cited from:

http://en.wikipedia.org/wiki/Web_crawler
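
As an illustration of the politeness policy in the list above, here is a minimal sketch that enforces a minimum delay between two requests to the same host. The class name, the method name, the per-host delay rule and the one-second value are all assumptions for illustration; real crawlers typically also honor robots.txt, which this sketch ignores. Time is passed in explicitly so the logic can be checked without sleeping.

```java
import java.util.HashMap;
import java.util.Map;

class PolitenessPolicy {
    private final long minDelayMillis;
    // Earliest time (in ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<String, Long>();

    PolitenessPolicy(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Reserves the next request slot for the host and returns how many
    // milliseconds the caller should wait before sending the request.
    long reserve(String host, long nowMillis) {
        Long allowed = nextAllowed.get(host);
        long start = (allowed == null || allowed <= nowMillis) ? nowMillis : allowed;
        nextAllowed.put(host, start + minDelayMillis);
        return start - nowMillis;
    }

    public static void main(String[] args) {
        PolitenessPolicy policy = new PolitenessPolicy(1000);
        System.out.println(policy.reserve("example.org", 0));   // first request to this host
        System.out.println(policy.reserve("example.org", 200)); // too soon, must wait
        System.out.println(policy.reserve("example.net", 200)); // different host
    }
}
```

A crawler thread would call reserve, sleep for the returned number of milliseconds, and then fetch the page.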

posted @ 2006-05-18 06:34 Dedian | Views (183) | Comments (0)

Compiler problem in Eclipse

Problem Description:

I want to build the GData source code in Eclipse. The code contains type-specific (parameterized) maps, and the Eclipse IDE complains with something like: Syntax error, parameterized types are only available if source level is 5.0

Reason:

Parameterized types (generics), the feature that makes a type-specific map possible, are only supported at source level 5.0 and above.

Solution:

Change the IDE compiler configuration:
Window > Preferences > Java > Compiler > Compiler compliance level => 5.0

Note:
1. Type-specific map: a map that holds only objects of a certain type
    example:

    import java.util.HashMap;
    import java.util.Map;

    Map<Integer, String> map = new HashMap<Integer, String>();
    map.put(1, "first");
    map.put(2, "second");

2. Once source level 5.0 is applied, the compiler warns about type safety wherever a raw collection type such as Vector, List, Stack or Map is used.
That means code that was fine under level 1.4, like this:

private Vector MyList = new Vector();
...
MyList.add(str);

is better written like this under level 5.0:

private Vector<String> MyList = new Vector<String>();
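
A small sketch of why the parameterized form is safer. The class and method names are made up for illustration, and ArrayList stands in for Vector; the behavior is the same for any collection type. With the raw list, a wrong element type only shows up as a ClassCastException at run time; with the parameterized list the compiler rejects the mistake and no cast is needed.

```java
import java.util.ArrayList;
import java.util.List;

class RawVersusGeneric {
    // Raw list (the 1.4 style): the compiler accepts any Object, so a wrong
    // element type is only discovered when the value is cast on the way out.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static boolean rawListFailsAtRuntime() {
        List names = new ArrayList();
        names.add("first");
        names.add(Integer.valueOf(2)); // compiles, but is not a String
        try {
            for (Object o : names) {
                String s = (String) o; // throws ClassCastException on the Integer
                s.length();
            }
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Parameterized list (the 5.0 style): names.add(Integer.valueOf(2)) would
    // not compile here, and reading the elements needs no cast.
    static int totalLength() {
        List<String> names = new ArrayList<String>();
        names.add("first");
        names.add("second");
        int total = 0;
        for (String s : names) {
            total += s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(rawListFailsAtRuntime());
        System.out.println(totalLength());
    }
}
```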

posted @ 2006-05-17 09:41 Dedian | Views (400) | Comments (0)

Planning for next job

1. Develop a search engine just for weblogs (the main work will be on the WebCrawler; the Indexer and Searcher parts are already done for XML-based information retrieval)

Motivation:
    a. Weblogs have become more and more popular recently
    b. Although there are some weblog search engines such as Technorati and Blogdigger, there still seems to be a lot of work to do
    c. The weblog feed formats (RSS 2.0 & Atom) are XML-based and fairly standardized, which is very close to my current work on XML-based information retrieval
    d. Easily extensible to crawling XML-based information websites other than weblogs

HOWTO:
        a. Use GData for feeding XML-based information
or      b. Use some open source crawlers + Lucene (similar idea in this article)
or      c. Develop my own simple crawler package and merge it into my Shemy project, a cluster-structured search engine design based on Lucene

        Likely preference: c > a > b (because most open source crawlers are designed to deal with much more complex web pages/links, while weblog feeds are simpler, so a crawler for them should be lighter)

Requirement/Functionality Analysis : (in progress)

Schedule: (in progress)

2. Explore performance tuning of search to improve the Shemy kernel

posted @ 2006-05-17 06:36 Dedian | Views (243) | Comments (0)


Copyright © Dedian | Powered by: Cnblogs | Template provided by: Hujiang Blog