
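Lucene: Origins, Current Status, and First Steps

Author: Chen Guang (ZDNet reader)
2004-09-09 03:11 PM

This is the first article in a series of studies on Lucene. It covers Lucene's origins, development, and current status, then walks through a first application, and can serve as introductory material for anyone learning Lucene.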

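1. Origins and Development

Lucene is a high-performance full-text search engine written in pure Java, and it is free and open source. It suits almost any application that needs full-text search, cross-platform applications in particular.

Lucene's author, Doug Cutting, is a veteran full-text search expert. He first published Lucene on his own home page, moved it to SourceForge in March 2000, and donated it to Apache in October 2001, where it became a subproject of Jakarta.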

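2. Current Usage

After years of development, Lucene has many success stories in full-text search and has earned a solid reputation.

Full-text search products built on Lucene (Lucene itself is a component, not a complete application) and projects that use Lucene are now found all over the world. Well-known examples include: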

• Eclipse: a mainstream Java development tool whose help system uses Lucene as its search engine
• Jive: a well-known forum system whose search feature is based on Lucene
• Ifinder: a website search system from Germany, based on Lucene (http://ifinder.intrafind.org/)
• MIT DSpace Federation: a document management system (http://www.dspace.org/)

Many websites, in China and abroad, also use Lucene as their full-text search engine. Well-known ones include:

• http://www.blogchina.com/weblucene/
• http://www.ioffer.com/
• http://search.soufun.com/
• http://www.taminn.com/

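(For more examples, see http://wiki.apache.org/jakarta-lucene/PoweredBy)

Open-source applications account for a large share of these cases, but commercial products and websites are more numerous still. It is no exaggeration to say that Lucene has greatly advanced the in-depth use of full-text search technology across industries and fields.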

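3. First Steps

As noted above, Lucene is a component rather than a complete application, so getting it running requires some development work on top of it.

Download and Installation

First, download a copy from the official Lucene site, http://jakarta.apache.org/lucene/. The latest version is 1.4. The download is an archive named lucene-1.4-final.zip. Extract it and you will find a file named lucene-1.4-final.jar, which is the Lucene library itself. To use Lucene in a project, you only need to put lucene-1.4-final.jar on the classpath; the other extracted files are for reference.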
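
If you are unsure whether the jar really ended up on the classpath, a quick check along the following lines can help. This is only a sketch: the class ClasspathCheck and its method isOnClasspath are names I made up for illustration, and the only Lucene-specific detail is the class name org.apache.lucene.index.IndexWriter, which the source files in this article import.

```java
// Sketch: report whether a given class is visible on the current classpath.
// ClasspathCheck and isOnClasspath are hypothetical names, not part of Lucene.
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // IndexWriter is imported by the source files discussed in this article.
        String lucene = "org.apache.lucene.index.IndexWriter";
        System.out.println(lucene + " on classpath: " + isOnClasspath(lucene));
    }
}
```

Running it with and without lucene-1.4-final.jar on the classpath should print true and false respectively.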

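Next, I create a project in Eclipse that implements index creation, record loading, and record querying on top of Lucene.

[Figure: the finished project in Eclipse]

The finished project contains three source files, CreateDataBase.java, InsertRecords.java, and QueryRecords.java, which implement index creation, loading, and searching respectively.

The three source files are analyzed below.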

CreateDataBase.java

package com.holen.part1;

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

/**
 * @author Holen Chen
 * Initializes the search index.
 */
public class CreateDataBase {

    public CreateDataBase() {
    }

    public int createDataBase(File file) {
        int returnValue = 0;
        // Create the index directory if it does not exist yet.
        if (!file.isDirectory()) {
            file.mkdirs();
        }
        try {
            // create = true: build a new index, overwriting any existing one.
            IndexWriter indexWriter = new IndexWriter(file, new StandardAnalyzer(), true);
            indexWriter.close();
            returnValue = 1;
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return returnValue;
    }

    /**
     * Takes the index path and initializes the index.
     * @param file path of the index directory
     * @return 1 on success, 0 on failure
     */
    public int createDataBase(String file) {
        return this.createDataBase(new File(file));
    }

    public static void main(String[] args) {
        CreateDataBase temp = new CreateDataBase();
        if (temp.createDataBase("e:\\lucene\\holendb") == 1) {
            System.out.println("db init succ");
        }
    }
}

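Note: the key statement here is IndexWriter indexWriter = new IndexWriter(file, new StandardAnalyzer(), true).

The first argument is the index path, that is, where you want the full-text index to be stored, such as "e:\\lucene\\holendb" in the main method. Lucene supports multiple indexes, and each one may live in a different location.

The second argument is the analyzer. Here it is the standard analyzer that ships with Lucene. An analyzer breaks a whole document into tokens. The standard analyzer handles English (or any text made of letters separated by spaces, such as other Latin-script languages) and, roughly speaking, cuts the text at the spaces into individual words. In full-text search this is called tokenization, and it is one of the core techniques of the field. By default Lucene can tokenize only English and other Latin-script text, not double-byte scripts such as Chinese, Japanese, and Korean; Chinese tokenization will be examined in detail in later installments.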
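
To make the idea of cutting text at spaces concrete, here is a small sketch of a whitespace tokenizer. It is not Lucene's StandardAnalyzer, which is more elaborate; the class WhitespaceTokenizerSketch and its tokenize method are my own names, and the lowercasing and punctuation stripping are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of space-based tokenization. NOT Lucene's StandardAnalyzer;
// all names here are hypothetical and exist only for illustration.
final class WhitespaceTokenizerSketch {

    /** Splits text at whitespace, trims ASCII punctuation, and lowercases each token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.trim().split("\\s+")) {
            // Remove leading and trailing ASCII punctuation such as "." or ",".
            String cleaned = piece.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "")
                                  .toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // English text falls apart into words at the spaces.
        System.out.println(tokenize("Lucene is a search engine."));
        // Four Chinese characters with no spaces between them stay one single token.
        System.out.println(tokenize("\u5168\u6587\u68c0\u7d22").size());
    }
}
```

The second call shows why a space-based approach is not enough for Chinese: a sentence without spaces comes back as one long token, so a search for any single word inside it would find nothing.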

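The third argument says whether to initialize the index. I pass true here. true means create a new index or overwrite an existing one; false means append to an existing index. Since we are creating a new index, it has to be initialized, and afterwards the index directory contains a single file named segments, 1 KB in size. But if you initialize an index that already holds records, all of its contents are lost and the index reverts to its initial state, exactly as if it had just been created. In a real project, use this call with great care.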
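
One way to guard against wiping an index by accident is to decide the third argument from what is already on disk. The sketch below assumes, based only on the observation above, that an initialized index directory contains a file named segments. IndexDirGuard, looksLikeIndex, and shouldCreate are hypothetical names of mine, not Lucene API.

```java
import java.io.File;

// Sketch: choose the IndexWriter "create" flag from the state of the directory.
// Assumption: an existing index is recognizable by a file named "segments".
// All names in this class are hypothetical.
final class IndexDirGuard {

    /** True if the directory already appears to hold an index. */
    static boolean looksLikeIndex(File dir) {
        return new File(dir, "segments").isFile();
    }

    /** Suggested value for the third IndexWriter argument: create only if nothing exists. */
    static boolean shouldCreate(File dir) {
        return !looksLikeIndex(dir);
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "holendb");
        System.out.println("create flag for " + dir + ": " + shouldCreate(dir));
    }
}
```

With such a guard, the call in createDataBase could pass shouldCreate(file) instead of a hard-coded true, so that an index which already holds records is appended to rather than overwritten.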


Loading Records: Source and Notes

InsertRecords.java

package com.holen.part1;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * @author Holen Chen
 * Loads records into the index.
 */
public class InsertRecords {

    public InsertRecords() {
    }

    public int insertRecords(String dbpath, File file) {
        int returnValue = 0;
        try {
            // create = false: append to the existing index.
            IndexWriter indexWriter
                = new IndexWriter(dbpath, new StandardAnalyzer(), false);
            this.addFiles(indexWriter, file);
            returnValue = 1;
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return returnValue;
    }

    /**
     * Takes the name of the file to load.
     * @param dbpath path of the index directory
     * @param file   name of the file to load
     * @return 1 on success, 0 on failure
     */
    public int insertRecords(String dbpath, String file) {
        return this.insertRecords(dbpath, new File(file));
    }

    public void addFiles(IndexWriter indexWriter, File file) throws Exception {
        Document doc = new Document();
        try {
            doc.add(Field.Keyword("filename", file.getName()));

            // Use only one of the following two statements:
            // the first indexes without storing, the second indexes and stores.
            //doc.add(Field.Text("content", new FileReader(file)));
            doc.add(Field.Text("content", this.chgFileToString(file)));

            indexWriter.addDocument(doc);
        } finally {
            // Close the writer even if adding fails; a failure now propagates
            // to insertRecords, which then returns 0 instead of 1.
            indexWriter.close();
        }
    }

    /**
     * Reads the content of a text file.
     * @param file the file to read
     * @return the file content as a String
     */
    public String chgFileToString(File file) {
        String returnValue = null;
        StringBuffer sb = new StringBuffer();
        char[] c = new char[4096];
        try {
            // FileReader decodes with the platform default encoding.
            Reader reader = new FileReader(file);
            int n = 0;
            while (true) {
                n = reader.read(c);
                if (n > 0) {
                    sb.append(c, 0, n);
                } else {
                    break;
                }
            }
            reader.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        returnValue = sb.toString();
        return returnValue;
    }

    public static void main(String[] args) {
        InsertRecords temp = new InsertRecords();
        String dbpath = "e:\\lucene\\holendb";
        // holen1.txt contains the keywords "holen" and "java"
        if (temp.insertRecords(dbpath, "e:\\lucene\\holen1.txt") == 1) {
            System.out.println("add file1 succ");
        }
        // holen2.txt contains the keywords "holen" and "chen"
        if (temp.insertRecords(dbpath, "e:\\lucene\\holen2.txt") == 1) {
            System.out.println("add file2 succ");
        }
    }
}

Note: this class has three main methods: insertRecords(String dbpath, File file), addFiles(IndexWriter indexWriter, File file), and chgFileToString(File file).

chgFileToString reads a text file into a String variable.

insertRecords loads one record, here a single file, into the full-text index. The first argument is the index path and the second is the file to load.

insertRecords calls addFiles, which does the actual work of adding the file. The key line in addFiles is:

doc.add(Field.Keyword("filename", file.getName()));

Note that Lucene has no tables in the strict sense. A Lucene "table" is built dynamically through the methods of the Field class. For example, Field.Keyword("filename", file.getName()) adds one field to a record: the field is named filename and its content is file.getName().

Commonly used Field methods:

Method                                        Tokenized  Indexed  Stored  Typical use
Field.Text(String name, String value)         Y          Y        Y       Title, article body
Field.Text(String name, Reader value)         Y          Y        N       META information
Field.Keyword(String name, String value)      N          Y        Y       Author
Field.UnIndexed(String name, String value)    N          N        Y       File path
Field.UnStored(String name, String value)     Y          Y        N       Similar to the second row
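
The three Y/N columns can be read as three independent flags. The following sketch models the table rows as an enum so the consequences are easy to query: an indexed field can be searched, and a stored field can be read back from a search result. The enum FieldKindSketch.Kind and its method names are my own invention for illustration; only the flag values come from the table.

```java
// Sketch: the rows of the Field table as an enum of three flags.
// FieldKindSketch, Kind, and the method names are hypothetical, not Lucene API.
final class FieldKindSketch {

    enum Kind {
        TEXT_STRING(true, true, true),    // Field.Text(String, String)
        TEXT_READER(true, true, false),   // Field.Text(String, Reader)
        KEYWORD(false, true, true),       // Field.Keyword(String, String)
        UNINDEXED(false, false, true),    // Field.UnIndexed(String, String)
        UNSTORED(true, true, false);      // Field.UnStored(String, String)

        private final boolean tokenized;
        private final boolean indexed;
        private final boolean stored;

        Kind(boolean tokenized, boolean indexed, boolean stored) {
            this.tokenized = tokenized;
            this.indexed = indexed;
            this.stored = stored;
        }

        /** Tokenized fields are split into words before indexing. */
        boolean isTokenized() { return tokenized; }

        /** Indexed fields can be matched by a query. */
        boolean isSearchable() { return indexed; }

        /** Stored fields can be read back from a search result. */
        boolean isRetrievable() { return stored; }
    }

    public static void main(String[] args) {
        for (Kind k : Kind.values()) {
            System.out.println(k + " searchable=" + k.isSearchable()
                    + " retrievable=" + k.isRetrievable());
        }
    }
}
```

This is why filename is added with Field.Keyword in InsertRecords.java: the file name should be kept whole rather than tokenized, and it has to be stored so that QueryRecords.java can read it back with doc.get("filename").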

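To understand the full-text index more deeply, we can compare it with an ordinary relational database (such as Oracle or MySQL).

Full-text index vs. relational database

Core function
    Full-text index (Lucene): Mainly text search. Insert, delete, and update are relatively awkward. Well suited to querying large blocks of text.
    Relational database (Oracle): Insert, delete, and update are very convenient and have dedicated SQL commands, but searching large text blocks (such as CLOB columns) is inefficient.

Database
    Full-text index (Lucene): As with Oracle, you can create several indexes, and each may be stored in a different location.
    Relational database (Oracle): You can create several databases. Each usually has control files, data files, and so on, which makes it fairly complex.

Table
    Full-text index (Lucene): No strict notion of a table. A Lucene "table" is just a loose collection of the fields defined when records are loaded.
    Relational database (Oracle): Strict table structure, with primary keys, column types, and so on.

Record
    Full-text index (Lucene): Since there is no strict table, a record is simply an object. In Lucene the class for a record is Document.
    Relational database (Oracle): Record, matching the table structure.

Field
    Full-text index (Lucene): Only two field types, text and date. Fields generally do not support arithmetic, let alone functions. In Lucene the class for a field is Field, as in document(field1, field2...).
    Relational database (Oracle): A rich set of column types with powerful features, as in record(field1, field2...).

Query result set
    Full-text index (Lucene): In Lucene the class for a result set is Hits, as in hits(doc1, doc2, doc3...).
    Relational database (Oracle): Taking JDBC as an example, ResultSet(record1, record2, record3...).

[Figure: diagram comparing the two kinds of store]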

Searching: Source and Notes

QueryRecords.java

package com.holen.part1;

import java.util.ArrayList;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

/**
 * @author Holen Chen
 * Runs search queries.
 */
public class QueryRecords {

    public QueryRecords() {
    }

    /**
     * Runs a query and returns the result set.
     * @param searchkey   the query string
     * @param dbpath      path of the index directory
     * @param searchfield name of the field to search
     * @return file names of the matching documents (empty if none or on error)
     */
    public ArrayList queryRecords(String searchkey, String dbpath, String searchfield) {
        // Start with an empty list so that callers never receive null.
        ArrayList list = new ArrayList();
        Searcher searcher = null;
        try {
            searcher = new IndexSearcher(dbpath);
            Query query
                = QueryParser.parse(searchkey, searchfield, new StandardAnalyzer());
            Hits hits = searcher.search(query);
            if (hits != null) {
                int temp_hitslength = hits.length();
                Document doc = null;
                for (int i = 0; i < temp_hitslength; i++) {
                    doc = hits.doc(i);
                    list.add(doc.get("filename"));
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            // Release the index files held by the searcher.
            if (searcher != null) {
                try {
                    searcher.close();
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        }
        return list;
    }

    public static void main(String[] args) {
        QueryRecords temp = new QueryRecords();
        ArrayList list = null;
        list = temp.queryRecords("holen", "e:\\lucene\\holendb", "content");
        for (int i = 0; i < list.size(); i++) {
            System.out.println((String) list.get(i));
        }
    }
}

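Note: in this class, Searcher runs the query and returns the results as a Hits object. Hits plays the role of ResultSet in JDBC: it is a collection of Documents, and each Document corresponds to one record. A Document contains one or more fields, and the content of each field can be obtained with Document.get("field name").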
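
For readers who want a feel for what happens behind a search, here is a toy in-memory inverted index: a map from each word to the set of documents that contain it. It is a sketch of the general idea only and says nothing about how Lucene stores or ranks anything. TinyIndexSketch and its methods are hypothetical names, and unlike Hits the result here carries no relevance score.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: word -> names of the documents containing it.
// A sketch of the general technique, NOT Lucene's implementation; no scoring.
final class TinyIndexSketch {

    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    /** Splits the content into lowercase words and records the document under each word. */
    void add(String docName, String content) {
        for (String term : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<String>(); // sorted, so results are deterministic
                postings.put(term, docs);
            }
            docs.add(docName);
        }
    }

    /** Returns the names of all documents containing the word, in alphabetical order. */
    List<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new ArrayList<String>() : new ArrayList<String>(docs);
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        // Same keywords as the two sample files loaded by InsertRecords.
        index.add("holen1.txt", "holen java");
        index.add("holen2.txt", "holen chen");
        System.out.println(index.search("holen")); // [holen1.txt, holen2.txt]
        System.out.println(index.search("java"));  // [holen1.txt]
    }
}
```

The lookup is a single map access no matter how much text was indexed, which is the basic reason an index-based search scales better than scanning every document for the word.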

With these three classes, we have a simple full-text search application built on Lucene.

4. Summary

Lucene is lean and focused. It is a single jar: add it to your project, call its API, and your application gains full-text search.

As the previous section shows, Lucene is simple to use and somewhat resembles JDBC. In practice, the classes to master are IndexWriter, Document, Field, and Searcher.

Lucene's structure is clear, with each package responsible for one job: org.apache.lucene.search handles searching, org.apache.lucene.index handles indexing, org.apache.lucene.analysis handles tokenization, and so on. Lucene's main operations are all defined through abstract classes, which makes it easy to extend.

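Compared with some commercial full-text search products, Lucene loads documents into the index faster. The reason is that its storage uses step-by-step merging: it builds small indexes first and merges them into the large index only when the time is right. As a result, we can update the full-text index in step with operations on the application's own data with little or no impact on system performance.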
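
The merging idea can be illustrated with a toy model. In the sketch below every new document starts as its own small segment, and whenever the newest segments reach a fixed count and all have the same size, they are merged into one larger segment. This is my simplified illustration of step-by-step merging, not Lucene's actual code; SegmentMergeSketch and the parameter mergeFactor are names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of step-by-step merging: small segments are combined into
// larger ones only when enough equally sized segments have piled up.
// A simplified illustration, NOT Lucene's implementation.
final class SegmentMergeSketch {

    private final int mergeFactor; // how many equal-sized segments trigger a merge
    private final List<List<String>> segments = new ArrayList<List<String>>();

    SegmentMergeSketch(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        this.mergeFactor = mergeFactor;
    }

    /** Adds a document as a new one-document segment, then merges if due. */
    void addDocument(String doc) {
        List<String> segment = new ArrayList<String>();
        segment.add(doc);
        segments.add(segment);
        maybeMerge();
    }

    /** Merges the newest mergeFactor segments while they all have the same size. */
    private void maybeMerge() {
        while (segments.size() >= mergeFactor) {
            int from = segments.size() - mergeFactor;
            int size = segments.get(from).size();
            for (int i = from + 1; i < segments.size(); i++) {
                if (segments.get(i).size() != size) {
                    return; // sizes differ: nothing to merge yet
                }
            }
            List<String> merged = new ArrayList<String>();
            for (int i = from; i < segments.size(); i++) {
                merged.addAll(segments.get(i));
            }
            segments.subList(from, segments.size()).clear();
            segments.add(merged);
        }
    }

    /** Sizes of the current segments, oldest first. */
    List<Integer> segmentSizes() {
        List<Integer> sizes = new ArrayList<Integer>();
        for (List<String> segment : segments) {
            sizes.add(segment.size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        SegmentMergeSketch index = new SegmentMergeSketch(3);
        for (int i = 1; i <= 10; i++) {
            index.addDocument("doc" + i);
        }
        System.out.println(index.segmentSizes()); // [9, 1]
    }
}
```

Most additions in this model touch only a tiny segment, and the expensive rewrite of a large segment happens rarely, which is the intuition behind the fast loading described above.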

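Lucene is stable, simple to use, open source, and free. It is backed by the Apache Software Foundation, which brings strong funding and technical resources. It has been updated steadily over the past two years, and each new release is widely reported in the industry.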


About the Author

Chen Guang is a J2EE project manager familiar with EJB and XML who works on applying and promoting Apache Jakarta projects. He can be reached at holen@263.net.

Posted on 2006-04-26 15:42 by chenhui
