-- Focusing on search engine development

May 5, 2006

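Microsoft's New Search Engine

Microsoft has never given up on competing in search and has been quietly going head to head with Google all along. Although Live Search is treated as something of a joke among internal employees, the boss keeps pouring money into it without hesitation.

To be honest, I try to use Microsoft products wherever I can. I gave up Linux as my operating system, and Perl and Java as my development tools, though admittedly that was because of work. For maps I used to use MapQuest and now use Live Maps; for the browser I dropped Firefox in favor of IE8. Wherever a Microsoft product is usable, I switch to it. The search engine, however, felt really bad: the results were never what I wanted, and even after paging through ten or so pages I found nothing useful. So I quietly set Google as my default engine. One colleague went even further and replaced Outlook's search with Google Desktop.

Then, in early March, a new search engine called Kumo (酷摸, perhaps?) was released internally. Supposedly the name "live" was the problem; try reading it backwards. I thought a mere name change meant little, but I could not resist trying it, and it was indeed somewhat better than the old one. I now give Kumo a spin from time to time when I have nothing else to do.

Today Ballmer announced yet another new search engine, called Bing. How does that sound? To me it reads like the Chinese word "病" (bing, "sick"). And it is no longer called a Search Engine but a Decision Engine, quite a trendy concept. I am not sure why this name was chosen (according to Ballmer, because it is short and easy to remember), but going from a Japanese name to a Chinese-sounding one feels to me like a win for Qi Lu (陸奇) since he took over as head of Search. I recall that a couple of days ago the Search home page started using a landscape photo of Yangshuo, China, taken by an employee. Whether or not that guess is right, the new engine deserves a try. One busybody immediately searched for "六四" (June Fourth), and the results were all about the college CET-4/CET-6 English exams, which left people in a cold sweat. Not even publicly released, and the PR work is already this thorough.

More embarrassing still, to celebrate the release everyone in the Search group got a T-shirt, reportedly with "I Bing" on the front and "U Bing" on the back, which sounds like "I am sick, and so are you". The Search group does not mind, though, because they gave "Bing" the Chinese name "必應" ("always responds"). Is that any better than "谷歌" (Google's Chinese name)?

Busybodies in other groups were less kind: after testing it for a while, they affectionately nicknamed the "Bing" engine Mr. Bean.

Of course, we should keep a positive attitude toward new things. Since it is still in testing, I prefer to believe this is a temporary growing pain caused by a lack of user-behavior data. "必應" will probably be officially released next week. Let's wait and see.

posted @ 2009-05-29 13:20 Dedian | Views (3634) | Comments (14)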

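What Kind of Applications Do We Need?

I said earlier that "turning a lot of software into web-based applications is a trend of web 3.0". Technically, these web-based applications differ from software installed on a local disk. More precisely, a website or application that provides a service can be understood as an object hosted by the browser; the browser is merely a container that supports many kinds of objects, and the service behind each object is software deployed on web servers.

The so-called scripting languages are simply the languages in which the container and its objects communicate.

Both the container and the back-end services have kept improving. But what we see more of is one lively object after another appearing in the browser and quietly changing our lives.

In fact, saying that software going web-based means becoming an object model the browser can host covers only part of the picture: the form in which software is presented. It makes it easy to overlook how data is stored, and to assume that web-based services mostly let us enjoy data on the network or in search engines. We no longer download software that fills our disks; with Internet TV we need not download movies, or even music. Our own data, such as email, blogs, magazine subscriptions, and bookmarks, also lives on the servers of various websites and never needs to be downloaded.

We seem to have grown used to being online and have forgotten the offline era. Google, always unconventional, seems to have found a need to go back: the recently launched Google Gears. It gives users a browser plug-in through which data can be downloaded to the local disk, with a small database engine (SQLite) to store, index, and search it locally. It also provides an interface for synchronizing data in the background without tying up the browser.

At present the Google Gears API is used in Google Reader: users can download their subscribed feeds to the local disk, which makes them easy to organize and keep.

In short, software is moving onto the web, but people still care about collecting and storing their personal data. For example, I have long used Del.icio.us to bookmark technical sites and articles, but one day when I went back to look something up and clicked the link, the page was gone. I wished then that the article had been automatically downloaded to my disk and filed away. Manual copy and paste does not count; what I want is a one-click operation like Del.icio.us.

posted @ 2007-05-31 14:27 Dedian | Views (1924) | Comments (1)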

What comparison function does Linux sort use?

I have a question. When I use the sort command in Linux to sort some domain names in dictionary order, no matter which option I use, it sorts some domains like this:

...
abca.com
abc-d.com
abce.com
...

I am curious which comparison function its sorting applies. I supposed it would be a string comparison like strcmp, but it is not: strcmp compares the ASCII codes of the characters one by one, so the names above would be sorted like this:

abc-d.com
abca.com
abce.com

One guess is that special characters like "." and "-" are skipped when sorting names. But that still leaves a problem when sorting the following names:

abc---d.com
abc--d.com
abc-d.com

Why does Linux sort keep this order? If it skipped the special characters, these names would compare as equal and might come out in any order.

I am confused. Has anybody thought about this?

-----
p.s.

I haven't updated this blog for quite a long time, because I am back to programming in C under Linux and I believe this is a place for Java programmers.

-----

update:

Linux sort compares strings according to the current locale's collation rules (Unicode-based in a UTF-8 locale) rather than raw byte values … more about Unicode is here
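
A minimal Java sketch of the difference, offered as an illustration rather than as what sort runs internally. String.compareTo compares code units much like strcmp, while java.text.Collator applies locale rules, which is the kind of comparison a locale-aware sort performs. The class and method names are made up for this example, and the Collator order depends on the rules shipped with the Java runtime, so no output is claimed for it.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SortOrderDemo {
    // Plain code-unit comparison, the Java analogue of C's strcmp().
    static List<String> byCodeUnit(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(String::compareTo);
        return copy;
    }

    // Locale-aware comparison; the result depends on the runtime's collation rules.
    static List<String> byLocale(List<String> names, Locale locale) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("abce.com", "abca.com", "abc-d.com");
        System.out.println("code units: " + byCodeUnit(names));
        System.out.println("en_US collation: " + byLocale(names, Locale.US));
    }
}
```

On the command line, forcing the C locale (LC_ALL=C sort) is the usual way to get the byte-wise strcmp-style order back.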

posted @ 2007-02-02 07:10 Dedian | Views (1417) | Comments (1)

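Create Your Own Search Engine

As the amount of information on the web keeps growing, people increasingly depend on web search engines for study and work (a few everyday examples are mentioned in the post "Google Turns 8 Today" (Google 今天8歲)).

On the other hand, we are helpless in the face of thousands of results and essentially only pay attention to the first page. This has given rise to all kinds of techniques (such as SEO) and tricks (spamdexing) for getting onto the first page; at worst it turns into open fights, and even shady dealings behind the scenes at search engine companies.

Users, for their part, need a bit of cleverness to reach what they are searching for quickly.

Google, the leader in search, has clearly noticed this fact and the customer need behind it: a search engine should be customizable.

Google's recently launched custom search service answers this need.

The idea is simple: users list sites they visit often or find interesting and information-rich, and then tell Google's engine to search only those sites, or to give them priority.

In addition, this simple product has a web 2.0 flavor: several like-minded people can edit the site list together, forming a kind of search circle (search community) that finds what they are all interested in.

If you are interested, try it yourself. I have set up a first custom search engine related to blogs.

Click here, or use this link:
http://www.google.com/coop/cse?cx=006688650489436466578%3Ac7-4rxi0jf4

Or use this simple domain name:

http://blogdigger.info

Anyone interested can join in, as long as you have a Gmail account.

Joining is easy: click this link on the home page:

Volunteer to contribute to this search engine.

You do need a Google account (if you do not have one, just register with your email address; it is simple).

You then become a contributor to this search engine. Whenever you find a good, information-rich site, you can add it to Blog Digger's site list. You can also add refinements for searches you are interested in.

If you come to like this custom Google, remember this link: http://blogdigger.info

posted @ 2006-10-27 06:04 Dedian | Views (2392) | Comments (3)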

Again: Problem or Bug in URLConnection?

I am not sure whether it is a bug in (Http)URLConnection, but for some URLs it sometimes hangs when I call any method that reads information from the connection (getResponseCode, getInputStream, getContent, getContentLength, getHeaderField, and so on) after the connection has been established, even though I have set both the read timeout and the connect timeout.

openConnection() and connect() themselves work fine, so I am curious about the cause.

Has anybody had the same or a similar problem with URLConnection?
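
For reference, a hedged sketch of how the two timeouts are normally set on an HttpURLConnection. It does not reproduce the hang described above; it only shows the expected behavior when a server accepts the connection and then stays silent. TimeoutDemo, statusOrTimeout, and demo are names invented for this example, and the silent server is a local ServerSocket that never calls accept(), which assumes the operating system completes the TCP handshake from the listen backlog.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

class TimeoutDemo {
    // Returns the HTTP status code, or -1 if a timeout fired first.
    static int statusOrTimeout(String address, int connectMillis, int readMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(connectMillis); // limit for establishing the TCP connection
        conn.setReadTimeout(readMillis);       // limit for each blocking read on the socket
        try {
            return conn.getResponseCode();
        } catch (SocketTimeoutException e) {
            return -1;
        } finally {
            conn.disconnect();
        }
    }

    // Hypothetical fixture: a listening socket that never accepts or replies.
    static int demo() {
        try (ServerSocket silent = new ServerSocket(0)) {
            String url = "http://127.0.0.1:" + silent.getLocalPort() + "/";
            return statusOrTimeout(url, 1000, 1000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that the read timeout bounds each individual read, not the whole request, so a server that trickles data slowly can still keep a call busy for longer than the configured value.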

posted @ 2006-10-21 07:20 Dedian | Views (1313) | Comments (0)

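A Brief Look at Ajax

--- Happy Mid-Autumn Festival, everyone ---

Ajax (Asynchronous JavaScript and XML) is a web technique that has become popular in recent years. I have seen people on BlogJava begin introducing AJAX, but the posts seem to stay at the level of concepts and theory, which does not help beginners who actually want to use it. When learning any new technology, I think examples and step-by-step instructions are the two most useful things.

Drawing on my own learning experience, I will walk through the basic steps of using Ajax and give a few practical examples. I mainly work on the back end, so I am not very good at scripting; my Ajax experience is limited to university coursework and some simple test pages I wrote recently for back-end programs. Treat this as a starting point for discussion; comments and exchanges are welcome.

0. Guide

    1. The basic flow of using Ajax
    2. The basic steps of using Ajax (simple example --> Demo)
    3. One more example (Google Suggest) (Demo)
    4. Homework :)

1. The basic flow of using Ajax

As I see it, Ajax is more like a simple web framework: it describes how the data shown in the front end and the data on the back end can interact efficiently. Basically, the browser provides an XMLHttpRequest object (an ActiveXObject in IE) that sends an HTTP request to a back-end script or servlet class, takes the text data from the response (for example XML, or the JSON format people have been discussing lately), and embeds it in the front-end page or script.

The figure below is a simple flowchart:

2. The basic steps of using Ajax

Next, following the flow above and a simple example (see this article), we go through the basic steps. (Code in blue is the standard boilerplate.)

Step 1: Form code. It accepts input from the front end and, through an action handler (whose function creates the XMLHttpRequest object), sends the request to the back end.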

<input id="username" name="username" type="text"
  onblur="checkName(this.value,'')" />
<span class="hidden" id="nameCheckFailed">
  This name is in use, please try another.
</span>

<script language="javascript">
function checkName(input, response)
{
  if (response != ''){
    // Response mode
    message   = document.getElementById('nameCheckFailed');
    if (response == '1'){
      message.className = 'error';
    }else{
      message.className = 'hidden';
    }
  }else{
    // Input mode
    url  = 'http://localhost/xml/checkUserName.php?q=' + input;
    loadXMLDoc(url);
  }
}

var req;

function loadXMLDoc(url)
{
    // branch for native XMLHttpRequest object
    if (window.XMLHttpRequest) {
        req = new XMLHttpRequest();
        req.onreadystatechange = processReqChange;
        req.open("GET", url, true);
        req.send(null);
    // branch for IE/Windows ActiveX version
    } else if (window.ActiveXObject) {
        req = new ActiveXObject("Microsoft.XMLHTTP");
        if (req) {
            req.onreadystatechange = processReqChange;
            req.open("GET", url, true);
            req.send();
        }
    }
}
</script>

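Notes:
1. The form here is just an input box. The action is onblur, which responds to the loss of focus and calls the function checkName. That function uses XMLHttpRequest to send a GET request to the PHP server script (as you can see, the script file is checkUserName.php and its only parameter is q).
2. The function loadXMLDoc contains generic code for creating the XMLHttpRequest object. The standard code, tidied up, is as follows: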
        var req;
        function foo()
        {
            req = false;

            // branch for native XMLHttpRequest object
            if(window.XMLHttpRequest)
            {
                try
                {
                    req = new XMLHttpRequest();
                }
                catch(e)
                {
                    req = false;
                }
            }
            else if(window.ActiveXObject) // branch for IE/Windows ActiveX version
            {
                try
                {
                    req = new ActiveXObject("Msxml2.XMLHTTP");
                }
                catch(e)
                {
                    try
                    {
                        req = new ActiveXObject("Microsoft.XMLHTTP");
                    }
                    catch(e)
                    {
                        req = false;
                    }
                }
            }
            if(req)
            {
                //do something here
            }
        }

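Step 2: Response-handling code. The XMLHttpRequest object has a property that works like a message callback: setting req.onreadystatechange tells XMLHttpRequest which function handles the text returned by the server.
In the example above:

req.onreadystatechange = processReqChange;

So next we need a processReqChange function: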

function processReqChange()
{
    // only if req shows "complete"
    if (req.readyState == 4) {
        // only if "OK"
        if (req.status == 200)
        {
            // ...processing statements go here...
            processResponse();
        } else {
            // the message must stay on one line: a string literal cannot span lines
            alert("There was a problem retrieving the XML data:\n" + req.statusText);
        }
    }
}

function processResponse()
{
    response  = req.responseXML.documentElement;
    method    = response.getElementsByTagName('method')[0].firstChild.data;
    result    = response.getElementsByTagName('result')[0].firstChild.data;
    // calls the function named in <method>, e.g. checkName('', result)
    eval(method + '(\'\', result)');
}

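Notes:
1. The processReqChange function is essentially standard boilerplate.
2. It uses the global variable req (the XMLHttpRequest object) defined earlier.

Step 3: Back-end code (a PHP server script in this example). It accepts the request from the front end, processes its parameters, and returns the result.

File name: checkUserName.php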

<?php
header('Content-Type: text/xml');

function nameInUse($q)
{
  if (isset($q)){
    switch(strtolower($q))
    {
      case  'drew' :
          return '1';
          break;
      case  'fred' :
          return '1';
          break;
      default:
          return '0';
    }
  }else{
    return '0';
  }
}
?>
<?php echo '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'; ?>
<response>
  <method>checkName</method>
  <result><?php
    echo nameInUse($_GET['q']) ?>
  </result>
</response>
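Note: the code is simple enough to need no explanation. It returns a string in XML format.

See the overall effect here.
Enter the name "fred" or "drew"; when the field loses focus, a message saying the name is already taken appears.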
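
A Java counterpart of checkUserName.php might look like the sketch below. It covers only the lookup and the XML string; wiring it into a servlet (setting the Content-Type to text/xml and writing the string from doGet) is assumed rather than shown. The class name, the method names, and the TAKEN set are invented for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CheckUserName {
    // Names treated as already registered, mirroring the PHP switch statement.
    private static final Set<String> TAKEN = new HashSet<>(Arrays.asList("drew", "fred"));

    // Returns "1" when the name is in use and "0" otherwise, like nameInUse() in PHP.
    static String nameInUse(String q) {
        return q != null && TAKEN.contains(q.toLowerCase(Locale.ROOT)) ? "1" : "0";
    }

    // Builds the XML document that the front-end processResponse() expects.
    static String responseXml(String q) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<response><method>checkName</method><result>"
                + nameInUse(q)
                + "</result></response>";
    }

    public static void main(String[] args) {
        System.out.println(responseXml("Fred"));
    }
}
```

Writing the result element without surrounding whitespace keeps firstChild.data in the browser equal to exactly "1" or "0".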

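3. One more example

Here is another practical example. It was once a class assignment of mine and is quite representative. It concerns Google Suggest (the new Google Toolbar seems to use this feature). Here is the finished DEMO. More and more websites offer Web Service-like APIs; using the API URLs they provide, we can fetch data we need and put it on our own pages. This is where Ajax comes in. Some of the returned text is XML and can be handled as in the simple example above, but many services, like Google Suggest, return text that looks like a piece of code. We then use JavaScript's eval function to run that text as code inside our own page. If the evaluated code calls a function, we must write a function of the same name ourselves to implement it. (This is the optional func 3 in the flowchart.)

I will not paste the full code, only the key parts. (The back end was originally a Java Servlet, but the hosting space for the demo has no Tomcat and does not support servlets, so I switched to PHP. You can rewrite it in Java yourself as homework :) )

1) Form code: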

<form name = "QForm" method="POST" action="google_suggest.php">
    <table bgcolor="8080C0" width="90%" >
    <tr>
        <td nowrap>Search Term:</td>
        <td ><input type="text" name="qtext" onkeyup="return GetSuggestion()" size="60"></td>
    </tr>
    <tr>
        <th colspan="2" align="left" bgcolor="#A8A8FF"><DIV id=google_suggest_target>results go here . . . </DIV></th>
    </tr>
    </table>
    </form>

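Notes:
a. As you can see, the query string is posted to google_suggest.php.
b. The action function is GetSuggestion(); the string it gets back is displayed in the reserved area of the page.

2) Back-end code (PHP): it receives the request from the front end, turns it into a request to the Google Suggest API URL, and returns the received text to the front end. The code is simple:

File name: google_suggest.php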

<?php
function getGoogleSuggest($q)
{
    // URL-encode the query so spaces and special characters survive
    $url = "http://www.google.com/complete/search?hl=en&js=true&qu=" . urlencode($q);
    return file_get_contents($url);
}
?>

<?php echo getGoogleSuggest($_POST['q']) ?>

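Notes:
a. The Google Suggest API returns text in the form of code, like this:
sendRPCDone(frameElement, "", new Array(), new Array(), new Array(""));
So after receiving this text in the front end, we need to write a sendRPCDone function to process it further (for example, to list the suggestions).

3) Front-end response-handling code: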

    <script type="text/javascript">
        var req;
        function GetSuggestion()
        {
            req = false;
            var f = document.QForm;

            // branch for native XMLHttpRequest object
            if(window.XMLHttpRequest)
            {
                try
                {
                    req = new XMLHttpRequest();
                }
                catch(e)
                {
                    req = false;
                }
            }
            else if(window.ActiveXObject) // branch for IE/Windows ActiveX version
            {
                try
                {
                    req = new ActiveXObject("Msxml2.XMLHTTP");
                }
                catch(e)
                {
                    try
                    {
                        req = new ActiveXObject("Microsoft.XMLHTTP");
                    }
                    catch(e)
                    {
                        req = false;
                    }
                }
            }
            if(req)
            {
                var url = "google_suggest.php";

                req.onreadystatechange = processReqChange;
                req.open("POST", url, true);

                // open() already sets the request method; only the body encoding needs a header
                req.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
                req.send("q=" + encodeURIComponent(f.qtext.value));
            }
        }

        function processReqChange()
        {
            if(req.readyState == 4) // only if req shows "loaded"
            {
                if (req.status == 200) // only if "OK"
                {
                    // the response is a sendRPCDone(...) call; eval runs it
                    x = req.responseText;
                    eval(x);
                }
                else
                {
                    alert("There was a problem retrieving the XML data:\n" + req.statusText);
                }
            }
            else if(req.readyState == 2)
            {
            }
        }

        // called by the evaluated response: arr1 holds suggestions, arr2 their result counts
        function sendRPCDone(frameElement, qString, arr1, arr2, arr3)
        {
            var suggest_results = eval(arr1);
            var counts = eval(arr2);
            var htmlstr = "<TABLE cellspacing=4 border=0>";
            for (var i=0; i < suggest_results.length; i++)
            {
                htmlstr += "<tr><td><a href=\"javascript:self.location=\'http://www.google.com/search?hl=en&q=" + suggest_results[i] + "&btnG=Google+Search\'\">" + suggest_results[i] + "</a></td>";
                htmlstr += "<TD width=200><font color= 228b22>" + counts[i] + "</font></TD></TR>";
            }
            htmlstr += "</TABLE>";
            document.getElementById("google_suggest_target").innerHTML = htmlstr;
        }

        </script>

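4. Homework :)

You have to write some code yourself to make the knowledge stick :)
Task:
We often use del.icio.us to bookmark sites or articles we like and add notes to them, a bit like reading notes. How can we use the API provided by del.icio.us to access those notes and display them in our own blog?
Hints:
1. You need a del.icio.us account with some bookmarks already saved as test data :)
2. The API URL is "http://del.icio.us/feeds/json/" + "your account name". Try it yourself and see what format it returns. To limit the number of records returned, add a parameter such as "?count=10".

Finally, happy Mid-Autumn Festival, everyone!

---------------------------End----------------------------

posted @ 2006-10-07 07:05 Dedian | Views (2247) | Comments (2)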

PHP/Java Integration on Windows

reference: http://us3.php.net/java
help doc: http://php-java-bridge.sourceforge.net/


1- Make sure you have installed Apache 2, PHP 5, and Java 1.5.

2- Download pecl-5.0.5-Win32.zip and php-java-bridge_2.0.8.zip, which include the extra DLLs.

    - Unpack the PECL package to your extensions folder; in PHP 5 it is ext.

    - Unpack the Java bridge to the root PHP folder; in my case it is simply C:\PHP.

Note:
1. The Java bridge includes new versions of certain files such as php_java.dll, so it is wise to rename the old files that came with the PECL package (for example to file_old) so that you can roll back at any time.
2. Do not run the batch file under php-java-bridge after unpacking it to the PHP root folder. Just add the following lines to the php.ini configuration file (the paths depend on your Java installation folder):

extension=php_java.dll
extension_dir = "C:\php\ext" 
[java]
java.java_home=C:\Program Files\Java\jre1.5.0_06
java.java=C:\Program Files\Java\jre1.5.0_06\bin\javaw.exe
java.log_level=2
;java.log_file=ext/JavaBridge.log

posted @ 2006-10-06 09:05 Dedian | Views (1135) | Comments (0)

install Apache2 & PHP5 on Windows XP

http://www.apachelounge.com/forum/viewtopic.php?t=570

http://www.webmasterstop.com/86.html

posted @ 2006-09-29 05:44 Dedian | Views (1026) | Comments (0)

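Google Turns 8 Today

You have probably already seen the new logo on Google's home page. Yes, today is Google's 8th birthday.

I cannot remember when I first used Google. Today a search engine has changed people's online lives and brought a revolution to the Internet. While people talk at length about online communities and socialization, search engines are stepping up to a new level.

In 8 years Google has grown from a single search product into a range of products that change or influence people's lives, and it keeps driving changes in web concepts and technology. Products we often use include Google Talk, Google AdSense, Gmail, Google Calendar, Google Maps, Google Video, Google Store, Google Earth, Google Toolbar, and Google Desktop, and there are many more products Google is still thinking about.

In short, if the web has become part of your life, Google is increasingly part of your life too. Google's culture, along with its products, is increasingly imitated by other web companies.

So what do we ordinary users typically search for with Google?

1. If you have not seen a friend in years, try searching for their name.
2. If you start writing and forget an idiom or a classical poem, search for the fragment you can remember.
3. If you want to find a picture, try searching for that too.
4. If you are doing homework, writing an article, or writing a thesis, nothing is better: you can find plenty of interesting, relevant material.
5. If you do not know how to translate your transcript, use Google's translation feature.
6. If you come across an unfamiliar word, sentence, piece of slang, or cultural reference, use Google; the wiki result is usually on the first page.
7. If you hear a good song and do not know its title or singer, and want the lyrics, search for the few lines you caught.
8. If you get a mysterious phone call, search for the number; you may find out which company called.
9. If you think a person, a website, or an article is cool, search for it too; plenty of interesting things will turn up.
10. If everyone is talking about something, or a topic or term has recently become popular, search for it and see what they are actually talking about.
11. If there is a seemingly famous English abbreviation, search for it to find out what it stands for.
12. Your computer has a problem. What to do? Do not panic; search first to see whether someone has had the same problem and found a solution.
13. Someone's web page looks cool. How did they do it? Search for it; you are guaranteed to learn something.
14. If you have a question to ask, search for it; the answer may already be out there.

Well, there are probably many more; feel free to add to the list...

posted @ 2006-09-28 07:55 Dedian | Views (1047) | Comments (1)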

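About Zhuaxia (抓蝦)

When you have a really good idea, you may feel a flicker of excitement. But if you find you cannot realize it on your own and cannot find like-minded people to work with, the excitement soon turns to gloom. And when, a few days later, you discover that someone on the web has already done almost the same thing, and done it better than your original idea, the gloom deepens into a sense of loss.

Many ordinary but reasonably clever IT people have to bear this sense of loss to some degree.

Zhuaxia gave me exactly that feeling, so much so that for a long time I did not register an account. But having pulled myself together, I am happy to accept this excellent Chinese web 2.0 site.

The idea behind Zhuaxia is simple. It is a new Chinese subscription site that combines the web 2.0 concept with the currently popular RSS syndication format. Online RSS subscription sites such as Bloglines have existed abroad for a long time, but they do not integrate web 2.0 concepts as organically as Zhuaxia does: Bloglines is just a simple subscription system with simple sharing.

As for web 2.0, a concept that rose from the ruins of the last Internet bubble, most users today are in close contact with it. Blogs and wikis, popular in China since 2005, are representative web 2.0 products.

Earlier websites were more like platforms for publishing information. If a website is a cinema, then we users are at best the audience, even if we register as VIPs and watch from a private box. You can even take the film home to watch, but you cannot control what the cinema shows, nor casually release a film you made yourself.

The web 2.0 concept, by contrast, gives users a platform for enjoying all kinds of web services.

Users are no longer the audience; they can be actors, directors, distributors, even resellers. Technically, web 2.0 lets users begin to control data. From the user's point of view, web 2.0 turns the Internet into a virtual community where people communicate and share. (In this sense, early BBSes and P2P download software were already web 2.0.)

As for RSS syndication, I have always regarded it as just an XML-based data structure. Long ago, when I started developing with .Net, I accepted the idea behind XML schema: separating data from its presentation. That is also what got me over the urge to sneer at XML for being such a simple web standard. Even then I had the idea of using RSS as a standard structure for the messy information on the Internet, which would make search engines simpler (I once wrote a small program along those lines, a kind of information collector), especially after I took a course called Distributed Multimedia Information Management. It dwelt heavily on web ontologies and RDF, which is essentially a technique for describing web entities and their relationships with XML data structures. Compared with simple RSS, though, RDF seems somewhat ahead of its time in practice.

With the web 2.0 concept, a standard data structure, and some concrete web technology (such as the currently popular Ruby), you can put together a web 2.0 site of your own. Zhuaxia has clearly done well here. On the one hand, there are still few successful sites of this kind in China (the ones I visit regularly are just Zhuaxia and Douban (豆瓣)); on the other, RSS (for example through blogs) is currently booming in China.

Of course, there is now plenty of talk that web 2.0 has no future. That is nothing new. This is how the web works: everyone has ideas and can have the technology to build them, but only a few survive and grow big. Web 2.0 is still in the money-burning stage: the services are free (everyone is used to free lunches on the web), so companies can only burn money to win users, then sell traffic, then build a monopoly. Without money you end up like diglog.com (奇客發現), whose idea resembles the well-known digg.com but which is clearly still in incubation. In this respect web 2.0 is no different from web 1.0. This is also why most IT people remain frustrated, living under the shelter of companies large and small that have survived so far while pursuing their own dreams.

posted @ 2006-09-26 08:51 Dedian | Views (1945) | Comments (2)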

Understand Java Map Collection

http://www.oracle.com/technology/pub/articles/maps1.html

posted @ 2006-09-23 02:52 Dedian | Views (1073) | Comments (1)

HttpURLConnection Problem

When I try to get information about an HTTP connection to some websites (say http://linuxbyte.net) with HttpURLConnection.getResponseCode(), the JVM seems to hang for quite a while. Some say the cause may be the HTTP server, which would have to be a Microsoft web server. Here and here are the bug reports for Java 1.3 and earlier. Although the problem is said to have been fixed since Java 1.4, I still get an undesirably long wait before a SocketException (Connection reset) is thrown. By the way, conn.setConnectTimeout and conn.setReadTimeout are the settings involved in this problem. I am not sure whether there is any way to save time by skipping those bad links.

posted @ 2006-09-21 06:32 Dedian | Views (1139) | Comments (0)

The Ruby Programming Language

Here is a good article introducing Ruby ... why would we choose Ruby instead of Perl and Python?

posted @ 2006-09-19 05:51 Dedian | Views (949) | Comments (0)

Reader and InputStream

-- Scenario:
    The purpose of a reader is to interpret a low-level byte stream (ByteArrayInputStream, StringBufferInputStream, FileInputStream, and so on) as a character stream and provide character input to whatever class needs it. Converting an InputStream to a Reader is very simple:

Reader reader = new InputStreamReader( in ); //in is an instance of class InputStream or derived classes

But sometimes we need to convert a Reader back to an InputStream. Consider the following scenarios:
1.  The original input stream has been filtered by some reader, and we now need to save the filtered content to a database through an InputStream. We cannot use the original input stream, only the filtered stream, which is available only from the reader.
2.  Given a class that holds a reader over streaming content obtained after complex parsing or downloading, we want to use that content without repeating the complex analysis, so we need a wrapper that exposes an InputStream over the reader.

-- Solution:
1. Write your own InputStream implementation, such as the following:

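import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

class MyInputStream extends InputStream
{
    private Reader rd;

    public MyInputStream(Reader rd)
    {
        super();
        this.rd = rd;
    }

    // implement the read() method to make this all work
    public int read() throws IOException
    {
        int t = rd.read();
        // you can do your processing on the character here;
        // t is a char value (0-65535) or -1 at end of stream, so passing it
        // through as a byte is only safe for single-byte characters
        return t;
    }
}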

Note: Applications that need to define a subclass of InputStream must always provide a method that returns the next byte of input.
(refer to http://java.sun.com/j2se/1.4.2/docs/api/java/io/InputStream.html)

-- Anything else? By the way, for parsing XML-based input streams with SAX, I am glad to see that the InputSource constructor can take either an InputStream or a Reader (refer to http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/InputSource.html)
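
The read() above hands each char back as if it were a byte, which only works for single-byte characters. Below is a sketch of an adapter that encodes the characters before returning them. It is one possible approach, not a standard library class: the name Utf8ReaderInputStream and the drain helper are invented here, and UTF-8 is hard-coded as an assumption about the target encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ReaderInputStream extends InputStream {
    private final Reader reader;
    private byte[] pending = new byte[0]; // encoded bytes of the current character
    private int pos = 0;

    Utf8ReaderInputStream(Reader reader) {
        this.reader = reader;
    }

    @Override
    public int read() throws IOException {
        while (pos >= pending.length) {
            int c = reader.read();
            if (c < 0) {
                return -1; // end of the underlying reader
            }
            String s = String.valueOf((char) c);
            if (Character.isHighSurrogate((char) c)) {
                int low = reader.read(); // keep surrogate pairs together
                if (low >= 0) {
                    s += (char) low;
                }
            }
            pending = s.getBytes(StandardCharsets.UTF_8);
            pos = 0;
        }
        return pending[pos++] & 0xFF; // InputStream.read() must return 0..255
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    // Demo helper: pushes a string through the adapter and collects the bytes.
    static byte[] drain(String text) {
        try (InputStream in = new Utf8ReaderInputStream(new StringReader(text))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int b = in.read(); b >= 0; b = in.read()) {
                out.write(b);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(drain("h\u00e9llo").length);
    }
}
```

Reading one character at a time keeps the sketch short; wrapping the adapter in a BufferedInputStream is a reasonable choice when throughput matters.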

posted @ 2006-08-29 09:46 Dedian | Views (1338) | Comments (0)

About Hash function

for general purpose hash function:

http://www.partow.net/programming/hashfunctions/

for cryptography & hash function

http://www.x5.net/faqs/crypto/

for a faster and better hash function (comparison of several hash function):

http://burtleburtle.net/bob/hash/doobs.html

----> for further reading...

posted @ 2006-08-19 03:01 Dedian | Views (983) | Comments (0)

Getting the IP Address and Hostname

1. Getting the IP Address of a Hostname

    try 
    {
        InetAddress addr = InetAddress.getByName("yahoo.com");
        byte[] ipAddr = addr.getAddress();

        // Convert to dot representation
        String ipAddrStr = "";
        for (int i=0; i<ipAddr.length; i++) {
            if (i > 0) {
                ipAddrStr += ".";
            }
            ipAddrStr += ipAddr[i]&0xFF;
        }
    } 
    catch (UnknownHostException e) {
    }
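
As a shorter alternative to the manual loop, InetAddress.getHostAddress() returns the textual form directly. The sketch below uses InetAddress.getByAddress so that no DNS lookup is involved; the class name DottedQuad and its format method are invented for this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DottedQuad {
    // Formats raw address bytes as text without doing any name lookup.
    static String format(byte[] raw) {
        try {
            return InetAddress.getByAddress(raw).getHostAddress();
        } catch (UnknownHostException e) {
            // thrown only when the array length is not 4 (IPv4) or 16 (IPv6)
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format(new byte[] {(byte) 192, (byte) 168, 0, 1}));
    }
}
```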

2. Getting the Hostname of an IP Address

This example attempts to retrieve the hostname for an IP address. Note that getHostName() may not succeed, in which case it simply returns the IP address.

try {
        // Get hostname by textual representation of IP address
        InetAddress addr = InetAddress.getByName("127.0.0.1");

        // Get hostname by a byte array containing the IP address
        byte[] ipAddr = new byte[]{127, 0, 0, 1};
        addr = InetAddress.getByAddress(ipAddr);

        // Get the host name
        String hostname = addr.getHostName();

        // Get canonical host name
        String hostnameCanonical = addr.getCanonicalHostName();
    } catch (UnknownHostException e) {
    }

3. Getting the IP Address and Hostname of the Local Machine

    try {
        InetAddress addr = InetAddress.getLocalHost();

        // Get IP Address
        byte[] ipAddr = addr.getAddress();

        // Get hostname
        String hostname = addr.getHostName();
    } catch (UnknownHostException e) {
    }

posted @ 2006-08-18 06:53 Dedian | Views (559) | Comments (0)

How does Alexa work?

http://forums.seochat.com/alexa-ranking-49/how-does-alexa-work-140.html

posted @ 2006-08-16 07:24 Dedian | Views (310) | Comments (1)

Robert Tappan Morris

In the last digest, about the greatest software ever written, I noted a worm named Morris, which the author ranks 12th. Having finished developing my clustered search engine based on Lucene, I am now studying P2P architecture for my distributed search engine (more precisely, its web crawler part). While reading papers on P2P lookup protocols such as Chord, I noticed that one of the developers is also named Morris. Hmm, this is the same Morris: from the wiki I learned that he is now an associate professor at MIT and was once indicted for the damage caused by his Morris worm. Anyway, I find it very interesting to learn the stories behind these geeks.

posted @ 2006-08-15 05:53 Dedian | Views (449) | Comments (0)

What's The Greatest Software Ever Written?

http://www.informationweek.com/shared/printableArticle.jhtml?articleID=191901844

12. The Morris worm
11. Google search rank
10. Apollo guidance system
9. Excel spreadsheet
8. Macintosh OS
7. Sabre system
6. Mosaic browser
5. Java language
4. IBM System 360 OS
3. gene-sequencing software at the Institute for Genomic Research
2. IBM's System R
1. BSD 4.3 Unix

What do you think?

posted @ 2006-08-15 02:22 Dedian | Views (346) | Comments (0)

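Google, the Godfather of Open Source?

Interested readers can consult the original article.

Below is my rough translation:
------------------------------------------------------------

Everyone knows that Google runs on a great many Linux (GNU) servers, but that is only one aspect of its support for free software. There is also Summer of Code, now an incubator that produces a lot of excellent code and projects, and the recently opened Code Repository, which looks set to replace sourceforge.net (translator's note: the stronghold of open source). On the one hand, Google released its Picasa for the Linux (GNU) platform (translator's note: a photo management program), which works together with Wine (translator's note: Windows on Linux/Unix, built on top of X Window); on the other, Google sponsors open-source projects, for example in Sri Lanka, to the tune of about $25,000.

Of course, Google also funds open source more quietly. For example, to everyone's surprise, the Mozilla Foundation (translator's note: the organization behind Firefox, the other browser everyone knows) reportedly made 72 million last year, simply by making Google the default search engine in Firefox.

In January 2005 Google hired Ben Goodger, the lead engineer of Firefox and one of its main open-source coders. By the end of the year Guido van Rossum, the creator of Python, had joined Google as well. Most recently Andrew Morton, maintainer of the Linux 2.6 kernel, announced that he is leaving OSDL for Google.

All of this signals a major shift in the open-source world.

In the early years, people wrote their code in their spare time out of personal interest, learning as they worked. Then the first dot-com era arrived, and a number of early open-source companies began hiring top programmers: kernel hackers such as Alan Cox, David Miller, and Stephen Tweedie went to Red Hat, and others went to Linuxcare.

When the first dot-com bubble burst, these experts had to look for new jobs, and many went to the up-and-coming OSDL. Against this background, Google's rise and its recruiting drive mean a return to the early model of companies gathering a wide range of talent. This time, of course, their work relates indirectly to Google's main market strategy.

Google's strategy is shrewd. Look at the recent hires, Goodger and Morton: one works on the browser, the other on the operating system. Both show its determination to compete quietly with Microsoft.

There is another, less obvious reason: the recent debate over whether Google is keeping its original promises to the open-source community. The criticism is about whether Google should open its own source code, since Google uses a great deal of open-source software.

So, from a certain angle, hiring open-source hackers is far better than releasing code all over the place.

The debate over whether companies that use open-source code should also open their own code does not concern Google alone. Another major beneficiary is Yahoo, which has recently been busy acquiring Web 2.0 companies such as Flickr and Del.icio.us, all of which clearly bear the mark of open source. Yahoo's relationship with open source does not go back as far as Google's, but Yahoo too has begun trying to attract open-source talent.

posted @ 2006-08-11 06:39 Dedian | Views (913) | Comments (0)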

Web Standards or Web Trends?

People are still talking about web 2.0, and I am not sure it is a purely technical term. As I understand it, most of the meaning of web 2.0 may be its marketing meaning: the web is becoming a commons, and people generate its content. Again, I am not sure what place web services have in web 2.0. In my understanding the web is not merely a client-server marketing model (I am not talking about web structure here) but an interactive community. The question is who will be the operator or administrator of this community, and whether there are any rules of the game to follow. Will it be another utopia?

Well, on the technical level, I'd like to shed some light on the so-called web standards trends:

1. Front end --
         CSS ----> layout
         XML ----> data
         XHTML ----> markup
         JavaScript & DOM ----> behavior + XMLHttpRequest --> AJAX

2. Back end --
         some open source projects such as Ruby on Rails...

Let me know what you think...

posted @ 2006-08-09 09:21 Dedian | Views (816) | Comments (0)

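Doug Cutting Interview -- On Search Engine Development

As the founder of two major Apache open source projects, Lucene and Nutch (and of related subprojects such as Lucy, Lucene4C, and Hadoop), Doug Cutting has long been followed by search engine developers. After working for Yahoo as a contractor for 4 years, he has finally joined Yahoo this year as an employee.

Below is my translation, done in my spare time, of an interview from 2 years ago. The original (Doug Cutting Interview) is easy to find with Google. I hope it serves as a starting point for beginners in search engine development.

(Note: my translation skills are limited; I aim for faithfulness and clarity rather than elegance. Please bear with me.)

1. What do you do for a living? How did you get started in search engine development?

I mainly work from home on two search-related open source projects: Lucene and Nutch. The money comes mostly from contracts related to these projects. Currently Yahoo! Labs sponsors part of the work on Nutch, and the two projects have a few other short-term contracts.

2. Can you give us an overview of Nutch, and where you plan to use it?

Let me start with Lucene. Lucene is a library that provides full-text search; it is not an application. It offers many API functions that you can use in all kinds of real applications. It is now an Apache project and is widely used. Here is a list of systems that already use Lucene.

Nutch is an implementation of web search built on the Lucene core, and it is a real application: you can download it and use it directly. On top of Lucene it adds a web crawler and some web-related pieces. The goal is to scale from simple intranet indexing and search to searching the global web, like Google and Yahoo. Of course, to compete with those giants you have to use your head and find some tricks. We have tested it with 100M web pages, and its design should have no problem with more than 1B pages. Of course, it also runs well on a single machine searching a few servers.

3. In your view, what are the core elements of a search engine? That is, what main parts or modules does a typical search engine break down into?

Let me think; roughly the following:

 -- Fetching: downloading the pages that links point to.
 -- Database: storing information about the fetched pages, such as which pages have been fetched, when they were fetched, which pages they link to, and so on.
 -- Link analysis: analyzing the information in that database and assigning each page a weight (such as PageRank, WebRank, and so on) to estimate its importance. In my view, though, indexing the text in each page's anchors is more important. (This is why things like Google bombing are so effective.)
 -- Indexing: indexing the content of the fetched pages, together with incoming links, link analysis scores, and so on, so that they can be queried quickly.
 -- Searching: running a query against an index and displaying the results by page rank.

Of course, for a search engine to handle billions of pages, all of these modules should be distributed, that is, able to run in parallel on many machines.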
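
To make the indexing and searching modules concrete, here is a toy inverted index in Java. It is only a sketch of the general idea and not how Lucene or Nutch are implemented: the class TinyIndex and its methods are invented for this example, ranking is left out entirely, and the tokenizer simply splits on non-word characters, so it handles only ASCII letters, digits, and underscores.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyIndex {
    // term -> ids of the documents that contain it (the "postings")
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexing: record every term of the document under its id.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Searching: return the documents that contain all query terms (AND query).
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "Nutch builds web search on Lucene");
        System.out.println(index.search("lucene search"));
    }
}
```

A real engine would also store term positions and weights in the postings so that results can be ranked instead of merely matched.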

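4. You said people can download Nutch right away and run it on their own machines. Does that mean even webmasters who have no control over their Apache server can start using Nutch quickly?

Unfortunately, most of them are probably out of luck, because Nutch needs a Java servlet container (translator's note: such as Tomcat). Some ISPs support this, but most do not. (Translator's note: only with control over the Apache server can you install something like Tomcat on it.)

5. Can I combine Lucene with the Google Web API, or with other applications I have written before?

A few people have written Google-like APIs for Nutch, but none has been merged into the current system yet. That will probably happen before long.

6. What do you think is the biggest obstacle to building a search engine today: hardware, storage, or ranking algorithms? Also, can you tell me roughly how much space a search engine needs to work properly, say if I only want to write one that searches thousands to millions of RSS feeds?

Nutch needs about 10 KB in total per page. RSS feed pages are generally small (translator's note: RSS feeds are XML-based text pages, so they are not large), so they should be easier to handle. Of course, Nutch does not yet have RSS support. (Translator's note: in fact, the API does contain data structures and parsing for RSS.)

7. Was it easy to get funding from Yahoo! Labs? Who can apply? What do you have to give in return?

I was invited; I did not apply. So I am not very familiar with the process.

8. Has Google shown any interest in Nutch?

I have talked with some of the people there, including Larry Page (translator's note: one of Google's two founders). They would all like to help, but they cannot find a suitable way that would not also help their competitors.

9. Have you implemented your own PageRank or WebRank system in Nutch? What are your considerations for page ranking?

Yes, Nutch has a link analysis module. It is optional, because for intranet search page ranking is not needed.

10. I think you have heard this before: does an open source search engine mean giving an opening to hackers doing search engine optimization (SEO)?

Well, possibly.
Say it takes about 6 months to reverse engineer the latest anti-spam detection algorithm in a closed-source search engine. For an open source engine, cracking it will be faster. Either way, the spammers will eventually find a way around; the only difference is how fast. So the best anti-spam techniques, open or closed source, are the ones that keep working even after others know how they work.

Also, if during those six months you remove the detected spam from your index, the spammers can do nothing about it; they can only change their sites. If your spam detection is based on statistical analysis of good and bad example sites, you can watch overnight for new spam patterns and remove them before the spammers have a chance to react.

Open source makes the task of stopping spam a little harder, but not impossible. Besides, closed-source search engines have not secretly solved these problems either. I think the advantage of closed source is that it keeps us from seeing that it is not as good as we imagine.

11. How does Nutch compare with the distributed web crawler Grub? What is your view on this?

What I can say is that Grub is a project that lets users donate a little of their hardware and bandwidth to LookSmart's huge crawling task. Only its client is open source; the server is not. So people cannot set up their own Grub service or access the data Grub collects.

What about distributed crawling more generally? When a search engine gets large, the cost of crawling is dwarfed by the cost of searching. So a distributed crawler does not significantly reduce costs; instead it makes something that is already not very expensive more complicated (translator's note: referring to hardware such as PCs and disks). So it is not a good bargain.

Widely distributed search is an interesting idea, but I am not sure it can be done while staying fast enough. A faster search engine is a better search engine. When people can revise their queries quickly and freely, they are more likely to find what they need before they run out of patience. But building a widely distributed search engine that can search billions of pages in under a second is very hard, because network latency is high. Most of the half second or so that Google reports for a query is network latency within a single data center. If you ran the same system on PCs in thousands of homes, even with DSL and cable connections, the latency would be much higher and a query could easily take several seconds or longer. So it could not be a good search engine.

12. You keep stressing the importance of speed for a search engine. I am often puzzled by how Google returns results so quickly. How do you think they do it? And what is your experience with Nutch?

I believe Google works much the same way as Nutch: the query is broadcast to a number of nodes, and each node returns the top results from its own pages. Each node holds a few million pages, which avoids disk access for most queries, and each node can handle tens to hundreds of queries per second at the same time. If you want to cover billions of pages, you broadcast the query to thousands of nodes. Of course, this involves a fair amount of network traffic.

The details are described in this article (www.computer.org/micro/mi2003/m2022.pdf).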
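
The broadcast-and-merge pattern described in this answer can be sketched in a few lines of Java. This is an illustration under stated assumptions, not code from Nutch or Google: each "node" is modeled as an in-process function that returns its local top hits, and the names ScatterGather, Hit, search, and demo are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class ScatterGather {
    static final class Hit {
        final String url;
        final double score;

        Hit(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Broadcasts the query to every node in parallel, then keeps the k best hits overall.
    static List<Hit> search(List<Function<String, List<Hit>>> nodes, String query, int k) {
        List<CompletableFuture<List<Hit>>> pending = new ArrayList<>();
        for (Function<String, List<Hit>> node : nodes) {
            pending.add(CompletableFuture.supplyAsync(() -> node.apply(query)));
        }
        List<Hit> all = new ArrayList<>();
        for (CompletableFuture<List<Hit>> future : pending) {
            all.addAll(future.join()); // wait for each node's local top results
        }
        all.sort((a, b) -> Double.compare(b.score, a.score)); // best score first
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    // Two fake nodes with fixed answers, standing in for index servers.
    static List<String> demo() {
        List<Function<String, List<Hit>>> nodes = new ArrayList<>();
        nodes.add(q -> Arrays.asList(new Hit("a", 0.4), new Hit("b", 0.9)));
        nodes.add(q -> Arrays.asList(new Hit("c", 0.7)));
        List<String> urls = new ArrayList<>();
        for (Hit hit : search(nodes, "lucene", 2)) {
            urls.add(hit.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because every node returns only its own best few hits, the merge step stays cheap even when the query fans out to many nodes; the slowest node still determines the overall response time.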

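13. You mentioned spam earlier. Does Nutch have algorithms for that? How do you distinguish spam patterns such as link farms (translator's note: groups of pages that all link to one another, a spamdexing method devised in 1999 by SEO people to boost page rankings on the Inktomi search engine) from links to normal, popular sites?

We have not yet found time for this, but it is clearly an important area. Before we get to link farms, we need to do some simple things: look for word stuffing (translator's note: embedding particular words in a page many times, even hundreds of times, sometimes invisibly, for example white text on a white background; this is also a form of spamdexing), white-on-white text, and so on.

I think that, in general (spam detection being one sub-problem), the key to search quality is having a reliable supporting process of manual evaluation of query results. With that we can train a ranking algorithm to produce better results (spam results are one kind of bad result). Commercial search engines usually hire people to do reliable evaluations. Nutch will do this too, but obviously we cannot simply accept evaluations from any volunteer, because spammers could easily forge such evaluations. So we need a way to build a system of trust among volunteer evaluators. I think a peer-review system, somewhat like Slashdot's karma system, should be very helpful here.

14. Where do you think search engines are headed in the near future? From a developer's point of view, where will the biggest obstacles be?

Sorry, I am not a very imaginative person. My prediction is that in ten years web search engines will look much like today's. We are now in a plateau. In the first few years web search engines did develop very rapidly. The web crawlers that appeared in 1994 used standard information retrieval methods. Up to the arrival of Google in 1998, more web-specific methods were developed. Since then the introduction of new methods has slowed considerably. The low-hanging fruit has been picked. Innovation is easy only when a field is young; the more mature it becomes, the harder innovation gets. Web search was born in the 1990s, has now become a cash cow, and will soon be part of people's everyday lives.

As for development challenges, I think operational reliability will be a big one. We are currently developing something like GFS (Google's file system). It is an indispensable foundation for a giant search engine: you cannot let a failure in a small component cause a major outage. You should be able to scale the system easily, just by adding more hardware to the pool without tedious reconfiguration. And you should not need a big crowd of operators; everything should mostly take care of itself.

----------------End----------------------

posted @ 2006-08-02 06:07 Dedian | Views (14474) | Comments (199)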

CVS Tutorial

-- Getting Ready to Use CVS

First set the variable CVSROOT to /class/`username`/cvsroot
[Or any other directory you wish]
[For csh/tcsh: setenv CVSROOT ~/cvsroot]
[For bash/ksh: CVSROOT=~/cvsroot;export CVSROOT]

Next run cvsinit. It will create this directory along with the subdirectory CVSROOT and put several files into CVSROOT.

-- How to put a project under CVS

A simple program consisting of multiple files is in /workspaces/project.

To put this program under cvs first

cd to /workspaces/project

Next

cvs import -m "Sample Program" project sample start

CVS should respond with
N project/Makefile
N project/main.c
N project/bar.c
N project/foo.c

No conflicts created by this import

If you were importing your own program, you could now delete the original source.
(Of course, keeping a backup is always a good idea)

-- Basic CVS Usage

Now that you have added 'project' to your CVS repository, you will want to be able to modify the code.

To do this you want to check out the source. You will want to cd to your home directory before you do this.

cd

cvs checkout project

CVS should respond with
cvs checkout: Updating project
U project/Makefile
U project/bar.c
U project/foo.c
U project/main.c

This creates the project directory in your home directory and puts the files: Makefile, bar.c, foo.c, and main.c into the directory along with a CVS directory which stores some information about the files.

You can now make changes to any of the files in the source tree.
Let's say you add a printf("DONE\n"); after the function call to bar()
[Or just cp /class/bfennema/project_other/main2.c to main.c]

Now you have to check in the new copy

cvs commit -m "Added a DONE message." main.c

CVS should respond with
Checking in main.c;
/class/'username'/cvsroot/project/main.c,v <-- main.c
new revision: 1.2; previous revision: 1.1
done

Note: the -m option lets you give the check-in message on the command line. If you omit it, you will be placed into an editor where you can type in the check-in message.

-- Using CVS with Multiple Developers

To simulate multiple developers, first create a directory for your second developer.
Call it devel2 (Create it in your home directory).
Next check out another copy of project.

HINT: cvs checkout project

Next, in the devel2/project directory, add a printf("YOU\n"); after the printf("BAR\n");
[Or copy /class/bfennema/project_other/bar2.c to bar.c]

Next, check in bar.c as developer two.

HINT: cvs commit -m "Added a YOU" bar.c

Now, go back to the original developer directory.
[Probably /class/'username'/project]

Now look at bar.c. As you can see, the change made by developer two has not been integrated into your version. For that to happen you must

cvs update bar.c

CVS should respond with
U bar.c

Now look at bar.c. It should now be the same as developer two's.
Next, edit foo.c as the original developer and add printf("YOU\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo2.c to foo.c]

Then check in foo.c

HINT: cvs commit -m "Added YOU" foo.c

Next, cd back to developer two's directory.
Add printf("TOO\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo3.c to foo.c]

Now type

cvs status foo.c

CVS should respond with

===================================================================
File: foo.c             Status: Needs Merge

   Working revision:    1.1.1.1 'Some Date'
   Repository revision: 1.2     /class/'username'/cvsroot/project/foo.c,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)

The various statuses of a file are:

Up-to-date
The file is identical with the latest revision in the repository.

Locally Modified
You have edited the file, and not yet committed your changes.

Needing Patch
Someone else has committed a newer revision to the repository.

Needs Merge
Someone else has committed a newer revision to the repository, and you have also made modifications to the file.

Therefore, this is telling us we need to merge our changes with the changes made by developer one. To do this

cvs update foo.c

CVS should respond with
RCS file: /class/'username'/cvsroot/project/foo.c,v
retrieving revision 1.1.1.1
retrieving revision 1.2
Merging differences between 1.1.1.1 and 1.2 into foo.c
rcsmerge: warning: conflicts during merge
cvs update: conflicts found in foo.c
C foo.c

Since the changes we made to each version were so close together, we must manually adjust foo.c to look the way we want it to look. Looking at foo.c we see:

void foo()
{
  printf("FOO\n");
<<<<<<< foo.c
  printf("TOO\n");
=======
  printf("YOU\n");
>>>>>>> 1.2
}

We see that the text we added as developer one is between the ======= and the >>>>>>> 1.2.
The text we just added is between the ======= and the <<<<<<< foo.c

To fix this, move the printf("TOO\n"); line to after the printf("YOU\n"); line and delete the additional lines that CVS inserted. [Or copy /class/bfennema/project_other/foo4.c to foo.c]
Next, commit foo.c

cvs commit -m "Added TOO" foo.c

Since you issued a cvs update command and integrated the changes made by developer one, the integrated changes are committed to the source tree.

-- Additional CVS Commands

To add a new file to a module:

Get a working copy of the module.
Create the new file inside your working copy.
Use cvs add filename to tell CVS to version control the file.
Use cvs commit filename to check in the file to the repository.

Removing files from a module:

Make sure you haven't made any uncommitted modifications to the file.
Remove the file from the working copy of the module: rm filename.
Use cvs remove filename to tell CVS you want to delete the file.
Use cvs commit filename to actually perform the removal from the repository.

For more information see the cvs man pages or the cvs.ps file in cvs-1.7/doc.

---------------
copied from http://www.csc.calpoly.edu/~dbutler/tutorials/winter96/cvs/

posted @ 2006-07-20 07:06 Dedian Views(511) | Comments (0)

Java Logging mechanism

reference:

http://java.sun.com/j2se/1.4.2/docs/guide/util/logging/overview.html

posted @ 2006-06-27 02:49 Dedian Views(279) | Comments (0)

Generics in the Java Programming Language

When reading the GData source code, you will find a lot of generic-style code in it; generics are one of several language extensions in JDK 1.5. If you are using a Java 1.5 compiler, it is well worth getting some idea of how generics work. Note that Java generics look like C++ templates but are quite different.

1. What is the idea of generics?
Put simply, generics are a way of parameterizing types: a class, interface or method is declared with type parameters, and actual types are supplied where it is used.

2. Examples
-- We are familiar with container types such as Collection. Here is our typical usage from Java 1.4 or earlier:
Vector myList = new Vector();
myList.add(new Integer(100));
Integer value = (Integer)myList.get(0);

Now it is better to write it like this for type safety (the Eclipse IDE displays type-safety warnings for the code above when the compiler option is set to Java 1.5):
Vector<Integer> myList = new Vector<Integer>();
myList.add(new Integer(100));
Integer value = myList.get(0);

-- We can write code like this because class Vector has been defined as a generic type:
public class Vector<E>
{
    public boolean add(E x);   // abridged: declaration only
    ......
}

-- When we see angle brackets in a declaration, that is a generic type. To use it, we need to specify an actual type argument (such as Integer above). Using a generic type with an actual type argument is called an invocation of the generic type, and the result is a parameterized type.

3. Tricks in generics

-- We know that generics make types such as containers more flexible about the entries they accept. But they can also be tricky. Taking containers as an example, one trick question is: can we assign one container to a variable of another container type? If you try it in the following style, the answer is no.
List<String> ls = new ArrayList<String>();
List<Object> lo = ls; //compile time error!

-- We know String is a subtype of Object, and we can assign a String value to an Object variable. But we cannot assign a List of String to a List of Object. The reason is that both variables would refer to the same list: through lo we could add an arbitrary Object to it, and ls would then contain something that is not a String. That would make the List type unsafe, so the Java 1.5 compiler will not let you do it.
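
To see what the compiler is protecting us from, here is a small sketch under my own assumptions (the class and method names are made up for illustration). Arrays do allow the equivalent assignment and pay for it with a run-time check, and if we force the List assignment through a raw type, the damage shows up later when the list is read.

```java
import java.util.ArrayList;
import java.util.List;

class InvarianceSketch {

    /** Arrays are covariant, so a bad store compiles and fails only at run time. */
    static boolean arrayStoreFails() {
        Object[] objects = new String[1]; // legal: String[] is a subtype of Object[]
        try {
            objects[0] = Integer.valueOf(42); // compiles, but the array only holds Strings
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    /** Forces the forbidden assignment through a raw type to show what it would break. */
    static boolean pollutedListFails() {
        List<String> ls = new ArrayList<String>();
        @SuppressWarnings({"unchecked", "rawtypes"})
        List<Object> lo = (List) ls;  // what "List<Object> lo = ls;" would amount to
        lo.add(Integer.valueOf(42));  // an Integer sneaks into the List<String>
        try {
            String first = ls.get(0); // the compiler-inserted cast to String fails here
            return first == null;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("array store rejected at run time: " + arrayStoreFails());
        System.out.println("polluted list fails on read: " + pollutedListFails());
    }
}
```

With generics the mistake is rejected at compile time instead, which is the whole point of the rule.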

-- Comparing the two styles of code in the examples of section 2, we might say that the older style looks more flexible, because myList can accept other types besides Integer, while the new 1.5 style can take only Integer values. Well, if we need more flexibility, we can apply wildcards.

4. Wildcards and bounded wildcards

-- If we see something like Collection<?> c, with a question mark in the angle brackets, that is a wildcard: the element type is unknown, and a collection of any element type matches.
-- If we see something like Collection<? extends Number> c, that is a bounded wildcard: the unknown element type has Number as its upper bound, so it is Number or some subtype of Number. A collection whose element type is not a subtype of Number does not match.
-- But with either kind of wildcard, we cannot add a value (other than null) to the collection. The element type is unknown, so the compiler cannot check that the value belongs there.
-- So what on earth can wildcards be used for? Go back to the flexibility we mentioned before. Wildcards describe flexibility in a definition or declaration; they are not for putting elements in.
For example, we can define a method like this:
void printCollection(Collection<?> c)
{
    for(Object e : c){System.out.println(e);}
}
See? That is flexible. You can call this method for any Collection. You can read the elements of a Collection<?>; just don't try to put something in it.
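
A bounded wildcard works the same way, except that the elements can be read as the bound type. A minimal sketch, with a class name and sum method that are my own illustration:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class WildcardSketch {

    /** Accepts a collection of Number or of any subtype of Number. */
    static double sum(Collection<? extends Number> values) {
        double total = 0.0;
        for (Number n : values) { // reading elements as Number is always safe
            total += n.doubleValue();
        }
        // values.add(Integer.valueOf(1)); // would not compile: element type is unknown
        return total;
    }

    public static void main(String[] args) {
        List<Integer> ints = Arrays.asList(1, 2, 3);
        List<Double> doubles = Arrays.asList(0.5, 1.5);
        System.out.println(sum(ints));
        System.out.println(sum(doubles));
    }
}
```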
-- So the question is: what if we want that flexibility for our method and also need to put something into the collection inside it? Then we need a generic method.

5. Generic method
-- That means a method declaration can also be parameterized.
-- Example:
public <T> void addCollection(List<T> objs, T obj)
{
    objs.add(obj);
}

6. When to use a generic method and when to use a wildcard?
-- If the type parameter is used only once, or has no relationship to the method's other arguments or its return type, a wildcard is better: it describes the intent more clearly and concisely.
-- Otherwise, a generic method should be used.
Example:
class Collections
{
    public static <T, S extends T> void copy(List<T> dest, List<S> src){...}
}
can be better rewritten as:
class Collections
{
    public static <T> void copy(List<T> dest, List<? extends T> src){...}
}
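
Here is the second signature with a body filled in, as a rough sketch. The class name CopySketch and the append-style body are my own assumptions; the real java.util.Collections.copy has a slightly different signature and overwrites elements by index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CopySketch {

    /** Appends every element of src to dest; src may hold any subtype of T. */
    static <T> void copy(List<T> dest, List<? extends T> src) {
        for (T item : src) {
            dest.add(item); // safe: every element of src is a T
        }
    }

    public static void main(String[] args) {
        List<Number> numbers = new ArrayList<Number>();
        List<Integer> ints = Arrays.asList(1, 2, 3);
        copy(numbers, ints); // T is inferred as Number
        System.out.println(numbers);
    }
}
```

T is needed here because it ties dest to src; the element type of src is only mentioned once, so a wildcard is enough for it.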

reference: http://java.sun.com/j2se/1.5/pdf/generics-tutorial.pdf

posted @ 2006-06-23 09:39 Dedian Views(1395) | Comments (0)

something about standard of Syndication Format

http://dsonline.computer.org/portal/site/dsonline/menuitem.9ed3d9924aeb0dcd82ccc6716bbe36ec/index.jsp?&pName=dso_level1&path=dsonline/0507&file=w4sta.xml&xsl=article.xsl&;jsessionid=GZQWvln9z4JY2dXX8HyQ5f5KtRptqHRWvh17tjCXVbxHnGyzvTm2!554406865

posted @ 2006-06-22 06:06 Dedian Views(212) | Comments (0)

Enhancements in JDK 5

http://java.sun.com/j2se/1.5.0/docs/guide/language/index.html

posted @ 2006-06-21 09:51 Dedian Views(205) | Comments (0)

A bug in Java?

While debugging my web crawler by crawling the Yahoo website, I found that connecting to a URL such as http://www.youtube.com/w/Kak%E1?v=PIBe_V9PBIA&search=kak%C3%A1 throws the following exception:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 12
    at java.lang.String.substring(Unknown Source)
    at sun.net.www.ParseUtil.unescape(Unknown Source)
    at sun.net.www.ParseUtil.decode(Unknown Source)
    at sun.net.www.ParseUtil.toURI(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)

Here is some simple test code:

private static final String urlstring = "
   URL url = new URL(urlstring);

   URLConnection con = url.openConnection();

   con.connect();

Since the API mentions no exceptions to catch for this code other than MalformedURLException and IOException, I am not sure whether this is a bug in Java's URL parsing...

Does anybody have an idea about that?

P.S. OK, somebody has pointed out that runtime exceptions such as java.lang.StringIndexOutOfBoundsException do not have to be declared, but they can still be thrown. So I need to catch StringIndexOutOfBoundsException in my code. But in my understanding, a function should catch all the exceptions coming from the functions it calls and rethrow the ones it cannot handle, so that callers can see which exceptions come up from deep inside. I am not sure runtime exceptions should be an exception to that...
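
To illustrate the point about undeclared runtime exceptions, here is a small sketch. The unescape method below is a hypothetical, deliberately naive percent-decoder of my own; it is not the JDK's sun.net.www.ParseUtil code, it just fails in a similar way when a '%' is not followed by two characters.

```java
class UncheckedSketch {

    /** Naive percent-decoding with no bounds check (hypothetical, not the JDK code). */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                // Throws StringIndexOutOfBoundsException if fewer than two chars follow.
                String hex = s.substring(i + 1, i + 3);
                out.append((char) Integer.parseInt(hex, 16));
                i += 2;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** The caller has to guard against unchecked exceptions that are not declared. */
    static String safeUnescape(String s) {
        try {
            return unescape(s);
        } catch (RuntimeException e) { // covers StringIndexOutOfBounds and NumberFormat
            return null;               // treat the input as malformed
        }
    }

    public static void main(String[] args) {
        System.out.println(safeUnescape("a%41")); // well-formed escape
        System.out.println(safeUnescape("a%4"));  // truncated escape
    }
}
```

In a crawler it seems reasonable to wrap each fetch in a guard like safeUnescape, so one odd URL cannot kill the whole run.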

posted @ 2006-06-15 07:48 Dedian Views(505) | Comments (0)

Something is in progress

Still working on the web crawler part; the URL collection strategies are still being thought through. A URL frontier, which stores the list of active URLs to be parsed or downloaded, will be used to handle synchronized I/O operations with the URL collection/inventory. I am stuck on some issues:

1. Duplicate URL elimination:
    a. Host name aliases --> DNS resolver
    b. Omitted port numbers
    c. Alternative paths on the same host
    d. Replication across different hosts
    e. Nonsense links or session IDs embedded in URLs
2. Reachability of URLs
3. Distributed storage of the URL inventory and the related synchronization problem
4. Fetch strategies for the URL frontier or Fetcher to get active links for parsing
5. Scheduler for fetching and updating the URL collection: multi-threaded or single-threaded on each PC, and when to decide to re-parse a page
7. URL-seen test: has the page been parsed, and should it be re-parsed? This should be done before a URL enters the URL frontier...
8. Extensibility issues for these modules: Fetcher, Extractor/Filters, Collector...
9. Checkpointing for interrupted crawls: how to resume a crawler job, and how to split crawler jobs and distribute them to different machines
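
For issues 1b, 1c and 7, a first cut could be to normalize every URL before the seen-test. The sketch below is only my assumption of how that might look (class and method names are made up); it handles omitted ports, case differences and "." / ".." path segments, drops fragments, and does nothing about host aliases, mirrors or session IDs.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class UrlSeenSketch {

    private final Set<String> seen = new HashSet<String>();

    /** Builds a canonical string for a URL; the fragment is dropped on purpose. */
    static String normalize(String url) {
        URI uri = URI.create(url).normalize(); // resolves "." and ".." segments
        String scheme = uri.getScheme() == null
                ? "http" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null
                ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        if (port == -1) { // omitted port: fall back to the scheme default
            port = scheme.equals("https") ? 443 : 80;
        }
        String rawPath = uri.getRawPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return scheme + "://" + host + ":" + port + path + query;
    }

    /** URL-seen test: returns true only the first time a URL is offered. */
    boolean addIfNew(String url) {
        return seen.add(normalize(url));
    }

    public static void main(String[] args) {
        UrlSeenSketch frontier = new UrlSeenSketch();
        System.out.println(frontier.addIfNew("http://example.com/a/../b"));
        System.out.println(frontier.addIfNew("HTTP://Example.com:80/b"));
    }
}
```

A plain in-memory HashSet obviously will not scale to a distributed inventory (issue 3); it only shows where the test sits.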

Seems that I need a couple of days to refine my system architecture design...

posted @ 2006-06-09 08:57 Dedian Views(847) | Comments (0)

I/O Design Patterns

Here is an article on effective I/O programming; I am marking it so that I can re-check the I/O design of my distributed search engine system in the future. My current system uses non-blocking synchronous mode. Later I need to check whether anything can be done to improve its performance and scalability.

posted @ 2006-06-09 08:56 Dedian Views(204) | Comments (0)

Good or Bad, Check your OO Design

A PhD student at the University of Auckland has proposed an idea for checking your OO design in Java. The key point is to use a directed graph to analyze the dependencies between all Java classes: the more classes are involved in a cycle, the worse the design.

Several open source Java projects have been examined in his research report...
Though it is not the only metric for checking your OO design, I'd like to say that it is an interesting thought.
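
As a rough sketch of the graph idea (my own illustration, not the method from the research report): store the dependencies as a map from class name to the classes it uses, and run a depth-first search that reports a cycle when it meets a class that is still on the current path. This version only says whether some cycle exists; it does not count how many classes take part in it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DependencyCycleSketch {

    /** Returns true if the dependency graph contains at least one cycle. */
    static boolean hasCycle(Map<String, List<String>> deps) {
        Set<String> done = new HashSet<String>();   // fully explored classes
        Set<String> onPath = new HashSet<String>(); // classes on the current DFS path
        for (String node : deps.keySet()) {
            if (visit(node, deps, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String node, Map<String, List<String>> deps,
                                 Set<String> done, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true;  // came back to a class we are still exploring
        }
        if (done.contains(node)) {
            return false; // already known to be cycle-free from here
        }
        onPath.add(node);
        for (String next : deps.getOrDefault(node, Collections.<String>emptyList())) {
            if (visit(next, deps, done, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new HashMap<String, List<String>>();
        deps.put("Crawler", Arrays.asList("Fetcher"));
        deps.put("Fetcher", Arrays.asList("Crawler")); // Fetcher depends back on Crawler
        System.out.println(hasCycle(deps));
    }
}
```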

posted @ 2006-06-08 03:05 Dedian Views(986) | Comments (0)

Retrieve values in Hashtable or HashMap

Unlike collection types such as Vector or List, a Map (Hashtable or HashMap) accesses a value by a key. If we want to retrieve all the values that have been put in a Map, one simple way is to use a Collection, optionally with an Iterator. Here is the sample code (it retrieves only the values and skips the keys), assuming there is a variable: HashMap<String, ComplexDataType> links

Collection<ComplexDataType> c = links.values();   // values(), not value()
Vector<ComplexDataType> v = new Vector<ComplexDataType>(c);
for(int i = 0; i < v.size(); i++)
{
    ComplexDataType tempData = v.get(i);   // no cast needed with generics
    dosomethingwith(tempData);
}

P.S. Map provides three collection views of its contents: keySet(), entrySet() and values(); we can use any of them.
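
A short sketch of walking the views directly, without copying into a Vector first. The class and method names are mine, and String stands in for ComplexDataType to keep it self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MapViewsSketch {

    /** Uses the values() view: only the values are needed. */
    static int totalLength(Map<String, String> links) {
        int total = 0;
        for (String value : links.values()) {
            total += value.length();
        }
        return total;
    }

    /** Uses the entrySet() view: key and value together, no second lookup. */
    static String describe(Map<String, String> links) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : links.entrySet()) {
            sb.append(entry.getKey()).append('=').append(entry.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, String> links = new LinkedHashMap<String, String>();
        links.put("home", "index.html");
        links.put("about", "about.html");
        System.out.println(totalLength(links));
        System.out.println(describe(links));
    }
}
```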

posted @ 2006-06-02 07:16 Dedian Views(342) | Comments (0)

Java Interview Questions

These questions are very useful for Java newbies and for anyone who wants to prepare for interviews for Java programming positions, which is really cool.

reference:
http://www.allapplabs.com/interview_questions/java_interview_questions.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_2.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_3.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_4.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_5.htm
http://www.allapplabs.com/interview_questions/java_interview_questions_6.htm

posted @ 2006-06-02 06:14 Dedian Views(388) | Comments (0)

Java Reading & Writing file

(All snippets below assume: import java.io.*;)

1. Reading text from Standard Input

try 
{
       BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
       String str;
       while (true) 
       {
          System.out.print("> some prompt ");
          str = in.readLine();
          if (str == null) break;   // end of input: don't pass null on
          dosomethingwith(str);
       }
} 
catch (IOException e) 
{
       // handle or report the error here
}

2. Reading text from a file

try 
{
     BufferedReader in = new BufferedReader(new FileReader("filename"));
     String str;
     while ((str = in.readLine()) != null) 
     {
	dosomethingwith(str);
     }
     in.close();
} 
catch (IOException e) 
{
}

3. Reading a file into a byte array

    // Returns the contents of the file in a byte array.
    public static byte[] getBytesFromFile(File file) throws IOException 
    {
        // Get the size of the file
        long length = file.length();

        // You cannot create an array using a long type.
        // It needs to be an int type.
        // Before converting to an int type, check
        // to ensure that file is not larger than Integer.MAX_VALUE.
        if (length > Integer.MAX_VALUE) 
        {
            throw new IOException("File is too large: " + file.getName());
        }

        // Create the byte array to hold the data
        byte[] bytes = new byte[(int)length];

        // Open the stream only after the size check, so nothing leaks on failure
        InputStream is = new FileInputStream(file);

        // Read in the bytes
        int offset = 0;
        int numRead = 0;
        try
        {
            while (offset < bytes.length
                   && (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) 
            {
                offset += numRead;
            }
        }
        finally
        {
            // Close the input stream even if reading fails
            is.close();
        }

        // Ensure all the bytes have been read in
        if (offset < bytes.length) 
        {
            throw new IOException("Could not completely read file "+file.getName());
        }

        return bytes;

    }

4. Writing to a file

try 
{
    BufferedWriter out = new BufferedWriter(new FileWriter("filename"));
    out.write("some string");
    out.close();
} 
catch (IOException e) 
{
}

Note: If the file does not already exist, it is automatically created.

5. Appending to a file

try 
{
     BufferedWriter out = new BufferedWriter(new FileWriter("filename", true));
     out.write("appending String");
     out.close();
} 
catch (IOException e) 
{
}

6. Using a Random Access File

try 
{
     File f = new File("filename");
     RandomAccessFile raf = new RandomAccessFile(f, "rw");

     // Read a character
     char ch = raf.readChar();

     // Seek to end of file
     raf.seek(f.length());

     // Append to the end
     raf.writeChars("aString");
     raf.close();
} 
catch (IOException e) 
{
}

reference:
http://javaalmanac.com/egs/java.io/pkg.html

posted @ 2006-05-31 08:12 Dedian Views(563) | Comments (1)

Java Glossary -- Volatile

volatile

The volatile keyword is used on variables that may be modified simultaneously by other threads. This warns the compiler to fetch them fresh each time, rather than caching them in registers. This also inhibits certain optimisations that assume no other thread will change the values unexpectedly. Since other threads cannot see local variables, there is never any need to mark local variables volatile.

quote from:

http://mindprod.com/jgloss/volatile.html
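
A typical use is a stop flag shared between two threads. The sketch below is my own illustration (the class and method names are assumptions): without volatile on stopRequested, the spinning thread is not guaranteed to ever see the update made by the other thread.

```java
class VolatileFlagSketch {

    // volatile: a write by one thread becomes visible to reads in other threads.
    private volatile boolean stopRequested = false;
    private long iterations = 0; // touched only by the spinning thread

    void requestStop() {
        stopRequested = true;
    }

    /** Busy-loops until another thread calls requestStop(). */
    void spin() {
        while (!stopRequested) {
            iterations++;
        }
    }

    /** Starts a spinning worker, asks it to stop, and reports whether it ended. */
    static boolean runAndStop() {
        final VolatileFlagSketch worker = new VolatileFlagSketch();
        Thread thread = new Thread(worker::spin);
        thread.start();
        worker.requestStop();
        try {
            thread.join(5000); // wait up to five seconds for the worker to notice
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !thread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + runAndStop());
    }
}
```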

posted @ 2006-05-25 04:45 Dedian Views(306) | Comments (1)

Lucene 2.0 release most likely this Friday

Though it is still under voting, the release was originally proposed by Doug Cutting and has received only positive votes. So it is very likely we will get a 2.0 release this Friday. Some bugs have been fixed and deprecated code has been removed in this upcoming version.

posted @ 2006-05-24 09:00 Dedian Views(226) | Comments (0)

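Musings on the Years

Twenty years ago

I was collecting all kinds of praise from teachers and parents, wearing all kinds of little red flowers, holding all kinds of competition certificates

My current boss was perhaps catching fish in a pond, catching cicadas in the trees, pestering his parents for lollipops

Ten years ago

I started dating, started walking in the moonlight along paths nobody else walked, started hesitantly learning to write poetry

My current boss was perhaps gloomily cramming his high-school textbooks, or perhaps also starting to pass little notes to the girl at the next desk

Today, ten years later

My sweetheart finally became my wife, and I am huffing and puffing away, writing baffling code in the little patch of space my current boss provides

The girl at the next desk finally became a memory, and my current boss sits in a bright, spacious room less than 10 meters from me, watching me and 100 or so people who look much the same to him slave away writing code for him, while he nods along, relaxed, to music that may or may not be rock.

Tomorrow, ten years from now

?

Ending 1:

My wife is still my wife and I am still huffing and puffing over my code, but now beside me there is a child who looks a little like me, tugging at my arm and clamoring to play games on my computer

Countless pretty girls drift through the building, and my current boss is more than 100 meters away from me, in what may or may not be a room, holding a big meeting with a few fat-headed, big-eared shareholders to discuss the survival of me and 1000 or so people like me

Ending 2:

My wife is still my wife; by scrimping and saving, she and I have finally started the first company that is truly our own, and I sit in my own bright office watching 100 or so young fellows outside, as young as I was 20 years ago, working away at the revolution in full swing

The pretty girls still drift past, and my current boss is in a taller, grander building with a few fat-headed, big-eared shareholders, discussing the big matter of how to acquire the company of me, once his subordinate and now a small boss.

Ending 3:

My wife is still my wife, but I own a company of my own, and the office is full of people, former colleagues with my current boss mixed in among them, sitting in the air-conditioned room advising me or huffing and puffing over code unlike the code of 10 years ago

A pretty girl has finally become a pretty young wife; my current boss, through poor management, has sold his company to me, who once huffed and puffed over code under him, and I have given him a decent position so that he can support a family, marry and have children.

P.S. The function Likely(ending n) (1<=n<=3) is strictly monotonically decreasing, with an upper bound of 0.0001

P.S.

The musings above are pure daydreaming (yy). My boss is not Chinese and had none of the boyhood or youth I imagined for him. Since he does not read Chinese, there is no danger that daydreaming here in Chinese will hand him anything to use against me. The purpose of writing this is to express my admiration for him, young as he is (I hope he can read this sentence in Chinese), and the bit of ambition that a happy life has not yet extinguished in me.

posted @ 2006-05-20 13:28 Dedian Views(277) | Comments (0)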

Ooops! my laptop not working...

Oops! My laptop, a Compaq Presario R3230, is not working now (it worked just yesterday evening): blue screen, then it hangs at disk checking... When I reboot in safe mode, it still hangs at multi(0)disk(0)rdisk(0)partition(1)\windows\system32\drivers\atisgkaf.sys. I guess there is something wrong with my video driver, but how can I fix that problem without wiping out the documents on my hard drive?

I tried googling it; it seems some other people have had the same problem, and these steps are suggested:

1. Insert the QuickRestore CD into the CD drive and restart the
   system.
2. When the red Compaq logo appears, press and hold the Caps
   Lock key. Next screen will be a blinking QuickRestore screen.
3. When the QuickRestore text stops blinking, press and hold the
   Num Lock key.

But where can I get a QuickRestore CD? The included CD no longer seems to be in my room... Does anybody have any thoughts on that?

posted @ 2006-05-20 04:32 Dedian Views(186) | Comments (0)

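Some recent thoughts -- on search engines

Because of my work, I have recently become interested in search engines. Here are some thoughts:

1. There is actually a very simple way to make your blog's hit count soar: write the simplest possible web crawler program that keeps visiting your own home page (sending HTTP requests). Many counters work by counting exactly that, without checking IP addresses; if you don't believe it, just press F5 to refresh your own page and see. At that rate, overtaking Lao Xu's hit count would certainly be no problem. However, Sina plays games with hit counts anyway, since they can edit the counters themselves, so there is no point playing this game with them.

2. A high hit count does not mean your page ranks highly (PageRank). PageRank is a fairly technical term. Back in the day, Google's two young founders, Larry Page (a real coincidence that his surname is Page; with that name he could hardly avoid becoming the boss of pages) and Sergey Brin, made their fortune from the PageRank research they did at Stanford, and now, still young, they can take on MS. Of course, the exact ranking algorithm Google runs is a trade secret. But there is no shortage of experts online: some people, working from Google's search behavior and using probabilistic modeling and other mathematics, managed to work out an explanation of PageRank that became very popular online. You can find that paper by googling PageRank Uncovered (by Chris Ridings and Mike Shishigin). Reportedly, someone also used the mechanisms in it to thoroughly game Google's search engine. That can no longer be verified, though, because Google keeps improving itself.

3. Put simply, PageRank is a key measure of how important your website or page is. The core of the concept is how many pages link to your page, and especially how many important pages link to it. In other words, if Lao Xu's blog, thanks to its hit count or its influence across the nation's blogosphere, has a PageRank of 10 and is thus a very important page, and you are lucky enough to win her favor and be added to her blogroll, so that her important page links to yours, then your PageRank is bound to rise. Of course, this is only a very simple example. The actual formula is not that simple; if you are interested you can look it up online, and even so it is only one factor. Still, it makes it easy to see why so many people race to post the first comment on celebrity blogs, or even make outrageous remarks on purpose to attract attention. And it is easy to see why advertising has moved onto blogs.

4. In fact, the idea of PageRank comes from everyday life. Say I want to buy a computer, and I want someone who knows computers to tell me which one to buy. I know Xiao Wang knows a fair bit, so I ask him. Xiao Wang says, "Hmm, dedian-brand computers are good; buy a dedian." I say, "OK, I'll buy it. But how do you know? Where is it reviewed? What are its advantages?" Xiao Wang says, "Well... I'm not too sure myself; I heard it from that guy Xiao Li. Go ask him." At that point, even though I don't know Xiao Li, his image in my mind suddenly grows a lot taller: even Xiao Wang defers to him...

5. So, to make your page or site influential, you have to do everything you can to get others to link to you and cite you. There is another way, which is to keep citing other people's posts. Citing here does not mean embedding other people's links in your own page; it means getting your page embedded in other people's pages. How? It is the Trackback feature that many blogs have: if you look closely, whenever you trackback someone's blog, your blog's address is left on that blog's page (the same goes for comments). Nowadays, though, most blogs can be set to disallow trackbacks or comments. Sina seems to have started tampering too; celebrity blogs apparently can no longer be cited. But Sina's blogs are unfriendly to many search engines anyway, so forget about them. MSN Spaces, on the other hand, seems workable: one could write some code that automatically visits pages, fetches each blog's permalink, and then runs a piece of JavaScript that MSN itself provides to send the trackback. This is only something I thought of recently, and I have not written code to implement it. If it worked, it would work on many other blogs too. The idea came from seeing a bunch of junk sites show up in my trackbacks lately.

6. But more and more services are appearing online to stop this kind of unfriendly attack. For example, if you hate people citing your blog indiscriminately or leaving junk comments, you can sign up for a kind of hosted service, in which another site first collects the messages or comments, filters them, and then posts them to your blog. In short, when the forest of the net gets big enough, it has every kind of bird.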
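
To make point 3 a bit more concrete, here is a minimal sketch of the simplified, textbook form of PageRank (power iteration with a damping factor). This is only my illustration of the published idea, not whatever Google actually runs, and all the names and the tiny three-page graph are assumptions.

```java
import java.util.Arrays;

class PageRankSketch {

    /**
     * links[i] lists the pages that page i links to.
     * Returns one score per page; the scores sum to 1.
     */
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start with every page equally important
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no outgoing links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                    continue;
                }
                double share = rank[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share; // a page passes its rank to what it links to
                }
            }
            for (int i = 0; i < n; i++) {
                next[i] = (1 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Page 0 links to 1; pages 1 and 2 both link to 0; nobody links to 2.
        int[][] links = { {1}, {0}, {0} };
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}
```

In this toy graph page 2 ends up with the lowest score, since no page links to it, which is the "who points at you" intuition from the Xiao Wang story.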

posted @ 2006-05-19 16:15 Dedian Views(1530) | Comments (3)

Notes for exploration of Search Engine (keep updating...)

+ Webcrawler

    -- study open source code
       purpose: analyze code structure and basic components
       focus on: Nutch (http://lucene.apache.org/nutch/)
                 & HTMLParser (http://htmlparser.sourceforge.net/)
                 & GData (http://code.google.com/apis/gdata/overview.html)

    -- understand PageRank idea
       related articles:
       http://en.wikipedia.org/wiki/PageRank
       http://www.thesitewizard.com/archive/google.shtml
       paper: "PageRank Uncovered" by Chris Ridings and Mike Shishigin
       http://www.rankforsales.com/n-aa/095-seo-may-31-03.html (about Chris Ridings & SEO)
       http://en.wikipedia.org/wiki/Web_crawler (basic idea about crawler)

    -- get familiar with RSS & Atom protocols

    -- sample coding:
       Interface: Scheduler for fetching web links
       Interface: Web page parser/Analyzer --> to deal with XML-based websites (weblogs or news sites, RSS & Atom) --> Parser classes based on SAX parser
       Interface: Retractor/Fetcher --> to get links from page
       Interface: Collector --> check whether a URL is duplicated and save it in the URL database with a certain data structure
       Interface: InformationProcessor --> PageRank should be one important factor --> (under thinking)
       Interface: Policies (Filter) --> will serve Collector and InformationProcessor --> (under thinking)

+ Indexer/Searcher (almost done, based on Lucene)

posted @ 2006-05-19 09:40 Dedian Views(297) | Comments (1)

my favorite way to load a Java project

Motivation:

If you want to check or analyze source code, or contribute to open source communities, you will often want to download the source code of a project and load (or import) it into your own IDE (if you don't want to use CVS or SVN).

Following is my favorite way to do that under Eclipse:

1. Create a new blank Java project:

File -> New -> Project ... -> Java Project --> Next >> -> input the project name (project layout: Create separate source and output folders) --> click Finish

2. Right click Source Folder "src" --> import ... -> select File system -> choose the folder where you put the downloaded source code by clicking the top "Browse..." button (the source code folder means the root folder, so that the folder structure is kept as the package structure) --> Finish

3. If you import the wrong source code folder, you can delete the whole project and redo it. (Merely deleting some failed packages is no use.)

Note:

If there is an Ant build file (something like build.xml) included in the source code package, that is even better: just use File -> New -> Project... -> Java Project from Existing Ant Buildfile.

posted @ 2006-05-19 02:58 Dedian Views(250) | Comments (0)

Crawling policies

The behavior of a web crawler is the outcome of a combination of policies:

A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading websites.
A parallelization policy that states how to coordinate distributed web crawlers.

cite from:

http://en.wikipedia.org/wiki/Web_crawler
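
As a small illustration of the politeness policy from the list above, here is a sketch of a per-host minimum delay. Everything in it (the class name, the fixed delay, passing the clock in as a parameter) is my own assumption; real crawlers usually also honor robots.txt and adapt the delay to the server.

```java
import java.util.HashMap;
import java.util.Map;

class PolitenessSketch {

    private final long minDelayMillis;
    private final Map<String, Long> lastFetch = new HashMap<String, Long>();

    PolitenessSketch(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Returns true and records the fetch if the host may be contacted at
     * nowMillis; returns false if the previous fetch was too recent.
     */
    boolean tryAcquire(String host, long nowMillis) {
        Long last = lastFetch.get(host);
        if (last != null && nowMillis - last < minDelayMillis) {
            return false; // too soon: leave this host alone for now
        }
        lastFetch.put(host, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        PolitenessSketch policy = new PolitenessSketch(1000L);
        long now = System.currentTimeMillis();
        System.out.println(policy.tryAcquire("example.com", now));
        System.out.println(policy.tryAcquire("example.com", now + 10));
    }
}
```

Taking the time as a parameter keeps the class easy to test; a scheduler would simply pass System.currentTimeMillis().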

posted @ 2006-05-18 06:34 Dedian Views(183) | Comments (0)

Compiler problem in Eclipse

Problem Description:

I want to build the GData source code under Eclipse. It contains code that creates type-specific maps, and the Eclipse IDE complains with something like: Syntax error, parameterized types are only available if source level is 5.0

Reason:

The new feature to create a type-specific map can only be supported at source level 5.0

Solution:

Do some IDE compiler configuration:
Window > Preferences > Java > Compiler > Compiler compliance level => 5.0

Note:
1. type-specific map: create a map that will hold only objects of a certain type
    example:

Map<Integer, String> map = new HashMap<Integer, String>();

    map.put(1, "first");
    map.put(2, "second");

2. If source level 5.0 is applied, watch out for type-safety warnings on collection types such as Vector, List, Stack or Map.
That means that where you could write code like this under level 1.4:

private Vector MyList = new Vector();
...
MyList.add(str);

you had better change it to something like this under level 5.0:

private Vector<String> MyList = new Vector<String>();

posted @ 2006-05-17 09:41 Dedian Views(400) | Comments (0)

Planning for next job

1. Develop a search engine purely for weblogs (the main work will be on the WebCrawler; the Indexer and Searcher parts have already been done for xml-based information retrieval)

Motivation:
    a. Weblogs have become more and more popular recently
    b. Though there are some weblog search engines such as Technorati and Blogdigger, there still seems to be a lot of work to do.
    c. The weblog feed formats (RSS2.0 & Atom) are xml-based and more standardized, which is very close to my current job on xml-based information retrieval
    d. Easily extensible to crawling other xml-based information websites besides weblogs

HOWTO:
        a. Utilize GData for feeding xml-based information
or      b. use some Open Source Crawlers + Lucene (similar idea in this article)
or      c. develop my own simple Crawler package and merge it into my Shemy project, which is a clustered search engine design based on Lucene

        likely: c > a > b (because most open source crawlers are meant to deal with much more complex web pages/links, while a weblog feed is simpler, so the crawler for it should be lighter)

Requirement/Functionality Analysis : (in progress)

Schedule: (in progress)

2. Exploration of performance tuning for searching, to improve the Shemy kernel

posted @ 2006-05-17 06:36 Dedian Views(243) | Comments (0)

Java Glossary -- Nested Class

Definition:

A class within another class

Example:

class EnclosingClass 
{
    ...
    class ANestedClass 
    {
        ...
    }
}

Purpose:

Reflect and enforce the relationship between two classes (especially in scenarios where the nested class makes sense only in the context of its enclosing class, or where it relies on the enclosing class for its function).

Interesting features:

1. An instance of an inner class (a non-static nested class, such as ANestedClass above) can exist only within an instance of EnclosingClass

2. An inner class instance has direct access to the instance variables and methods of its enclosing instance.
3. two special kinds of inner classes: local classes and anonymous classes

reference:
http://java.sun.com/docs/books/tutorial/java/javaOO/nested.html
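
A small sketch of features 1 and 2, with class names that are my own for illustration: the inner object is created through an enclosing instance and touches that instance's private field directly.

```java
class EnclosingCounter {

    private int count = 0;

    /** Inner (non-static nested) class: every instance is tied to one EnclosingCounter. */
    class Incrementer {
        void bump() {
            count++; // direct access to the enclosing instance's private field
        }
    }

    int value() {
        return count;
    }

    public static void main(String[] args) {
        EnclosingCounter counter = new EnclosingCounter();
        // An inner instance can only be created through an enclosing instance.
        EnclosingCounter.Incrementer inc = counter.new Incrementer();
        inc.bump();
        inc.bump();
        System.out.println(counter.value());
    }
}
```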

posted @ 2006-05-16 08:22 Dedian Views(325) | Comments (0)

Google Data API

GData, Google data API, provides a simple standard protocol for reading and writing data on the web, which supports two common XML-based syndication formats (Atom and RSS).

Briefly browse the documents and javadoc of GData, I found that it is very similiar to my thought on my current clustering searching engine system, designing an interface of data structure for all the request to searching engine servers, and employing an other interface to handle external/internal requests and fill up the data structure.

Although my first version of searching engine has been deployed on LiveDigital, I am still working on the new version?with clustering?design. source code has been packaged and implementation document is almost done,?I still hope I can learn something from GData, more professional and structrual design.?

posted @ 2006-05-16 06:01 Dedian | Views (256) | Comments (0)

Google official blogs

Google Blog - googleblog.blogspot.com
Google Talkabout - googletalk.blogspot.com
Google Base Blog - googlebase.blogspot.com
Google Video - googlevideo.blogspot.com
Inside Google Desktop - googledesktop.blogspot.com
Google Code - code.google.com
Inside AdWords - adwords.blogspot.com
Inside AdSense - adsense.blogspot.com
Google Reader Blog - googlereader.blogspot.com
Blogger Buzz - buzz.blogger.com
AdWords API Blog - adwordsapi.blogspot.com
Google Enterprise Blog - googleenterprise.blogspot.com
Google Research - googleresearch.blogspot.com
Google Maps API Blog - googlemapsapi.blogspot.com
Google Writely - writely.blogspot.com
Inside Google Book Search - booksearch.blogspot.com

posted @ 2006-05-16 03:57 Dedian | Views (215) | Comments (0)

Design Patterns - 8 - Proxy

Purpose:

Provide a surrogate or placeholder (a proxy) for an object in order to control access to it. In one of its most common uses, the proxy defers the creation and initialization of the object until it is actually needed.

Structure:

As with Adapter, the client holds a reference to the proxy, and the proxy holds a reference to the real object it represents.
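
A minimal sketch of this structure with deferred creation, written by me rather than taken from the references below. Picture, RealPicture and LazyPictureProxy are hypothetical names, and the string returned by render() stands in for some expensive resource.

```java
// Hypothetical example: the interface and class names are illustrative only.
interface Picture {
    String render();
}

// The real subject; imagine its constructor is expensive (e.g. loads a file).
class RealPicture implements Picture {
    private final String path;

    RealPicture(String path) {
        this.path = path;
    }

    @Override
    public String render() {
        return "pixels of " + path;
    }
}

// The proxy implements the same interface and creates the subject on first use.
class LazyPictureProxy implements Picture {
    private final String path;
    private RealPicture real; // stays null until render() is first called

    LazyPictureProxy(String path) {
        this.path = path;
    }

    boolean isLoaded() {
        return real != null;
    }

    @Override
    public String render() {
        if (real == null) {
            real = new RealPicture(path); // deferred creation
        }
        return real.render();
    }
}

class LazyProxyDemo {
    public static void main(String[] args) {
        LazyPictureProxy proxy = new LazyPictureProxy("photo.png");
        System.out.println(proxy.isLoaded()); // false
        System.out.println(proxy.render());   // pixels of photo.png
        System.out.println(proxy.isLoaded()); // true
    }
}
```

This version is not thread-safe; a real one would need synchronization around the lazy creation if several threads share the proxy.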

Difference from Adapter:

Adapter provides a different interface to the object it adapts. In contrast, Proxy provides the same interface as its subject. When it is used to protect the real object, however, a proxy can refuse to perform an operation that the subject itself would perform.
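
Here is a rough sketch of the protection case, again with names I made up (Report, PlainReport, ReadOnlyReportProxy). The proxy keeps the subject's interface but refuses one of its operations.

```java
// Hypothetical example: the interface and class names are illustrative only.
interface Report {
    String read();

    void write(String text);
}

// The real subject supports both operations.
class PlainReport implements Report {
    private String text = "";

    @Override
    public String read() {
        return text;
    }

    @Override
    public void write(String newText) {
        text = newText;
    }
}

// Same interface as the subject, but write() is refused.
class ReadOnlyReportProxy implements Report {
    private final Report subject;

    ReadOnlyReportProxy(Report subject) {
        this.subject = subject;
    }

    @Override
    public String read() {
        return subject.read(); // allowed: forwarded to the real object
    }

    @Override
    public void write(String newText) {
        throw new UnsupportedOperationException("this report is read-only");
    }
}

class ProtectionProxyDemo {
    public static void main(String[] args) {
        PlainReport real = new PlainReport();
        real.write("draft");
        Report guarded = new ReadOnlyReportProxy(real);
        System.out.println(guarded.read()); // draft
        try {
            guarded.write("tampered");
        } catch (UnsupportedOperationException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

Client code written against Report works with either object, which is the contrast with Adapter: the interface does not change, only what the caller is allowed to do through it.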

Example:
http://www.javaworld.com/javaworld/jw-02-2002/jw-0222-designpatterns.html

Reference:
Book: (GoF) Design Patterns
http://en.wikipedia.org/wiki/Proxy_design_pattern
http://www.inf.bme.hu/ooret/1999osz/DesignPatterns/Proxy4/
http://alumni.media.mit.edu/~tpminka/patterns/Proxy.html

posted @ 2006-05-12 04:10 Dedian | Views (282) | Comments (0)

Design Patterns - 7 - Mediator

Purpose:

Encapsulate how a set of objects interact with each other. The benefit is that the objects no longer communicate with each other directly: every message between them goes to the mediator first, and the mediator controls and coordinates the interaction. The mediator can also handle external requests and decide which object should respond to them.

Structure:

Star topology: the Mediator class acts as a hub connected to a set of colleague classes.
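
A small sketch of the hub idea, using a chat room as the mediator. ChatRoom and Participant are hypothetical names of mine, not taken from the references; the point is only that colleagues hold a reference to the hub and never to each other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: ChatRoom is the mediator (hub), Participant a colleague.
class ChatRoom {
    private final List<Participant> participants = new ArrayList<>();

    // Colleagues are created through the hub so they are always registered.
    Participant join(String name) {
        Participant p = new Participant(name, this);
        participants.add(p);
        return p;
    }

    // All traffic passes through here; the mediator decides who receives it.
    void relay(Participant sender, String message) {
        for (Participant p : participants) {
            if (p != sender) {
                p.receive(sender.name + ": " + message);
            }
        }
    }
}

class Participant {
    final String name;
    final List<String> inbox = new ArrayList<>();
    private final ChatRoom room; // the only object a colleague talks to

    Participant(String name, ChatRoom room) {
        this.name = name;
        this.room = room;
    }

    void send(String message) {
        room.relay(this, message);
    }

    void receive(String message) {
        inbox.add(message);
    }
}

class MediatorDemo {
    public static void main(String[] args) {
        ChatRoom room = new ChatRoom();
        Participant alice = room.join("alice");
        Participant bob = room.join("bob");
        Participant carol = room.join("carol");
        alice.send("hi");
        System.out.println(bob.inbox);   // [alice: hi]
        System.out.println(carol.inbox); // [alice: hi]
        System.out.println(alice.inbox); // []
    }
}
```

Changing the routing rule (for example, private messages) only touches ChatRoom.relay; the Participant class stays the same.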

Difference from Facade Pattern:

Facade differs from Mediator in that it abstracts a subsystem of objects to provide a more convenient interface. Its protocol is unidirectional. That is, Facade objects make requests of the subsystem classes but not vice versa. In contrast, Mediator enables cooperative behavior that colleague objects don't or can't provide, and the protocol is multidirectional.

reference:

Gamma, E., R. Helm, R. Johnson, J. Vlissides (1995). Design Patterns. Addison Wesley. ISBN 0-201-63361-2
http://sern.ucalgary.ca/courses/seng/443/W02/assignments/Mediator/
http://my.execpc.com/~gopalan/design/behavioral/mediator/mediator.html

posted @ 2006-05-12 03:24 Dedian | Views (302) | Comments (0)

To buy a new monitor

Just want to add a second screen to work more comfortably. An LCD monitor is my preference; here is a good buying guide:

http://reviews.cnet.com/4520-7610_7-5084364-3.html

I am also interested in this ViewSonic model:

http://www.viewsonic.com/products/desktopdisplays/lcddisplays/valueseries/va902b/

But I am not sure if it is the best deal :( because PRINCETON also has a decent deal that is cheaper and comes with a DVI connector...

Still thinking and hunting; it should be settled by the end of this week.

posted @ 2006-05-11 05:39 Dedian | Views (211) | Comments (2)

Saw an interesting blog ...

http://www.metanotes.com/

posted @ 2006-05-06 10:00 Dedian | Views (181) | Comments (0)

Play with Eclipse 3.2RC2

Just installed it on my XP machine today and updated it from the Callisto Discovery Site; it seems lots of plugins have been integrated. I'd like to see the C++, TPTP, WST and UML plugin tools in it. But I still don't know how to use the Visual Editor ... I can't get rid of the crutch habits I picked up from the Microsoft kits..

posted @ 2006-05-05 11:29 Dedian | Views (248) | Comments (0)

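How a J2EE programmer and a .Net programmer die differently

Q:
Both die in a car crash. What is the difference between the J2EE programmer and the .Net programmer?

A:
The J2EE programmer leaves brake marks before dying: an operation that breaks with convention.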

posted @ 2006-05-05 11:19 Dedian | Views (400) | Comments (1)


Copyright © Dedian | Powered by: 博客園 | Template provided by: 滬江博客
