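1. Lucene's structure:
Note: some of Lucene's more complex lexical analysis is generated with JavaCC (Java Compiler Compiler, a pure-Java lexer and parser generator). If you build from source, need to modify the QueryParser, or want to customize your own lexical analyzer, you also need to download JavaCC from https://javacc.dev.java.net/.
Lucene's components: for external applications, the index module (index) and the search module (search) are the main entry points.
org.apache.lucene.search/ search entry point
org.apache.lucene.index/ indexing entry point
org.apache.lucene.analysis/ language analyzers
org.apache.lucene.queryParser/ query parser
org.apache.lucene.document/ storage structures
org.apache.lucene.store/ low-level I/O and storage
org.apache.lucene.util/ shared utility data structures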
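2. On the differences between dictionary-based segmentation, unigram segmentation, and bigram segmentation. Please note that noise.chs is the part of the word list that serves as the stop words.
A detailed description follows:
Sunday, January 22, 2006, 2:39 am · Posted in: Default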
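As Lucene is used more and more widely, Chinese word segmentation becomes increasingly important when indexing Chinese text.
Among the existing segmentation approaches, three are the most common and general: unigram segmentation, bigram segmentation, and dictionary-based segmentation. The Java version of unigram segmentation was implemented by yysun and has been accepted into Apache. Its implementation is simple: every Chinese character becomes one Token. For example, unigram segmentation of "這是中文字" yields five Tokens: 這, 是, 中, 文, 字. Bigram segmentation instead treats every two adjacent Chinese characters as one Token. For example, bigram segmentation of "這是中文字" yields: 這是, 是中, 中文, 文字.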
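The two schemes above can be sketched in a few lines of plain Java. This is only an illustration, not the Apache implementation: the class name NGramDemo and its methods are hypothetical, and the sketch assumes the input consists solely of Chinese characters (no punctuation or Latin text).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NGramDemo and its methods are hypothetical names,
// not part of Lucene.
class NGramDemo {

    // Unigram segmentation: every character becomes its own token.
    static List<String> unigrams(String text) {
        return ngrams(text, 1);
    }

    // Bigram segmentation: every pair of adjacent characters becomes a token.
    static List<String> bigrams(String text) {
        return ngrams(text, 2);
    }

    // Slides a window of n chars over the text, advancing one char at a time.
    // Assumption: the input holds only BMP Chinese characters.
    static List<String> ngrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            tokens.add(text.substring(i, i + n));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(unigrams("這是中文字")); // [這, 是, 中, 文, 字]
        System.out.println(bigrams("這是中文字"));  // [這是, 是中, 中文, 文字]
    }
}
```

A string of n characters gives n unigrams but only n - 1 bigrams, and each bigram overlaps its neighbour by one character.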
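Unigram and bigram segmentation are simple in principle and support essentially all East Asian languages, but both have obvious flaws. Unigram segmentation looks only at individual characters and takes no account of Chinese words: in the example above, the two obvious words "中文" and "文字" are not recognized. Bigram segmentation, conversely, produces too many redundant words: as shown above, meaningless combinations such as "這是" and "是中" are each treated as a word. For the same reason its hits are not very precise: in "這是中文字", the word "中文字" should take priority, and bigram segmentation cannot produce it.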
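Dictionary-based segmentation is harder to implement and comes in many forms, such as the Chinese segmentation Microsoft uses in its own software, the research edition of 海量 (Hailiang) Chinese segmentation, the widely used .NET implementation 獵兔 (Lietu), and various tools written independently by others. Each has its own analysis scheme. Their accuracy is high, but they are hard to implement and take a long time to build. For typical small and medium-sized applications, where accuracy requirements are not strict, the system resources this approach consumes are a luxury.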
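After weighing unigram, bigram, and dictionary-based segmentation, I venture to propose a segmentation approach based on splitting at StopWords. The idea is as follows: the paragraph to be segmented is first split at punctuation into standard short sentences. Then, using a configured StopWord list, each short sentence is split at every StopWord occurrence, and the pieces that remain are the words. For example, given the short sentence "這是中文字" and the StopWord list "這", "是", the final result is "中文字".
That example is fairly simple, so here is a slightly longer one. Given the short sentence "中文軟件需要具有對中文文本的輸入、顯示、編輯、輸出等基本功能" and the StopWord list "這", "是", "的", "對", "等", "需要", "具有", the resulting list is: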
====================
中文軟件
中文文本
輸入
顯示
編輯
輸出
基本功能
====================
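This mostly achieves the desired result, but it has shortcomings: in the result above, "中文軟件" and "中文文本" should have been split into three separate words, "中文", "軟件", and "文本", rather than kept as shown.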
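In addition, configuring the StopWord list is a relatively complicated step, because there is no definite rule for choosing StopWords. My idea is to use as StopWords some meaningless subjects such as "我", "你", "他", "我們", "他們", verbs such as "是", "對", "有", and words of various other parts of speech such as "的", "啊", "一", "不", "在", "人" (the contents of the noise.chs file in the System32 directory can serve as a reference).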
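The StopWord-based splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original author's implementation: the class name StopWordSegmenter and its methods are hypothetical, any character that is not a letter or digit is treated as punctuation, and when several StopWords match at the same position the longest one is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; StopWordSegmenter is a hypothetical name and this
// is not the original author's implementation.
class StopWordSegmenter {

    // Splits text at punctuation and at every stop word, keeping what remains.
    static List<String> segment(String text, List<String> stopWords) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int skip = boundaryLength(text, i, stopWords);
            if (skip > 0) {
                flush(current, words); // a boundary ends the current word
                i += skip;
            } else {
                current.append(text.charAt(i));
                i++;
            }
        }
        flush(current, words);
        return words;
    }

    // Returns how many chars to skip if a boundary starts at pos, else 0.
    // Assumption: any non-letter, non-digit char is a boundary of length 1;
    // among the stop words matching at pos, the longest one wins.
    static int boundaryLength(String text, int pos, List<String> stopWords) {
        if (!Character.isLetterOrDigit(text.charAt(pos))) {
            return 1;
        }
        int longest = 0;
        for (String stop : stopWords) {
            if (stop.length() > longest && text.startsWith(stop, pos)) {
                longest = stop.length();
            }
        }
        return longest;
    }

    // Moves the buffered characters, if any, into the result list.
    static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> stopWords =
                Arrays.asList("這", "是", "的", "對", "等", "需要", "具有");
        System.out.println(segment(
                "中文軟件需要具有對中文文本的輸入、顯示、編輯、輸出等基本功能",
                stopWords));
    }
}
```

Because the sketch has no dictionary, it shares the weakness noted above: "中文軟件" stays one token unless "中文" or "軟件" is itself listed as a boundary.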
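3. On word segmentation, this post is also worth following:
http://lucene-group.group.javaeye.com/group/blog/58701
"A dictionary-based Lucene segmenter I wrote myself: ThesaurusAnalyzer"
I have tested it and it works reasonably well. Its dictionary has 180,000 entries.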
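4. Tests of Lucene's built-in analyzers:
Lucene itself provides several analyzers, and I later wrote one more.
In order of increasing capability:
WhitespaceAnalyzer: only splits on whitespace; does not lowercase; no Chinese support.
SimpleAnalyzer: more capable than WhitespaceAnalyzer; filters out everything except letters and lowercases all characters; no Chinese support.
StopAnalyzer: goes beyond SimpleAnalyzer by additionally removing StopWords; no Chinese support.
StandardAnalyzer: handles English as well as StopAnalyzer does; supports Chinese by single-character splitting.
ChineseAnalyzer: from Lucene's sandbox; behaves similarly to StandardAnalyzer; its drawback is that it does not support mixed Chinese and English text.
CJKAnalyzer: written by chedong; handles English the same way StandardAnalyzer does, but for Chinese it uses bigram splitting and cannot filter out punctuation.
TjuChineseAnalyzer: written by me, and the most capable of the group. For Chinese it calls the Java interface of ICTCLAS, so its Chinese segmentation quality matches ICTCLAS. For English it uses Lucene's StopAnalyzer, so it removes stop words, ignores case, and filters out all kinds of punctuation.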
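To make the first three steps of this progression concrete, here is a plain-Java imitation of what WhitespaceAnalyzer, SimpleAnalyzer, and StopAnalyzer do to English text. It is a sketch for illustration, not Lucene code: the class AnalyzerLadder, its method names, and the three-word stop list are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java imitation for illustration; these are not Lucene classes.
class AnalyzerLadder {

    // Hypothetical stop list; Lucene's StopAnalyzer ships its own English list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "such", "these"));

    // Like WhitespaceAnalyzer: split on whitespace only, keep case and symbols.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split at every non-letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Like StopAnalyzer: SimpleAnalyzer's tokens minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String input = "foo a bar . FOO <> THESE BAR";
        System.out.println(whitespace(input)); // [foo, a, bar, ., FOO, <>, THESE, BAR]
        System.out.println(simple(input));     // [foo, a, bar, foo, these, bar]
        System.out.println(stop(input));       // [foo, bar, foo, bar]
    }
}
```

Each step builds on the one before it, which matches the ordering in the list above: the stop filter runs over the already lowercased, letters-only tokens.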
The program was debugged in JBuilder 2005.
package org.apache.lucene.analysis;

// Author: zhangbufeng
// TjuAILab (Tianjin University Artificial Intelligence Laboratory)
// 2005.9.22.11:00
import java.io.*;
import junit.framework.*;
import org.apache.lucene.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.cn.*;
import org.apache.lucene.analysis.cjk.*;
import org.apache.lucene.analysis.tjucn.*;
import com.xjt.nlp.word.*;

public class TestAnalyzers extends TestCase {

    public TestAnalyzers(String name) {
        super(name);
    }

    // Checks that analyzer a turns input into exactly the tokens in output.
    public void assertAnalyzesTo(Analyzer a,
                                 String input,
                                 String[] output) throws Exception {
        // The field name "dummy" does not appear to be used
        TokenStream ts = a.tokenStream("dummy", new StringReader(input));
        for (int i = 0; i < output.length; i++) {
            Token t = ts.next();
            // System.out.println(t);
            assertNotNull(t);
            // Print the text of each Token, separated by spaces
            System.out.print(t.termText());
            System.out.print(" ");
            assertEquals(output[i], t.termText());
        }
        System.out.println(" ");
        // No tokens may remain once the expected ones are consumed
        assertNull(ts.next());
        ts.close();
    }

    // Prints every token that analyzer a produces for input.
    public void outputAnalyzer(Analyzer a, String input) throws Exception {
        TokenStream ts = a.tokenStream("dummy", new StringReader(input));
        while (true) {
            Token t = ts.next();
            if (t == null) {
                break;
            }
            System.out.print(t.termText());
            System.out.print(" ");
        }
        System.out.println(" ");
        ts.close();
    }

    public void testSimpleAnalyzer() throws Exception {
        // Learning to use SimpleAnalyzer.
        // SimpleAnalyzer filters out everything except letters
        // and lowercases all characters.
        Analyzer a = new SimpleAnalyzer();
        assertAnalyzesTo(a, "foo bar FOO BAR",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo bar . FOO <> BAR",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo.bar.FOO.BAR",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "U.S.A.",
                new String[] { "u", "s", "a" });
        assertAnalyzesTo(a, "C++",
                new String[] { "c" });
        assertAnalyzesTo(a, "B2B",
                new String[] { "b", "b" });
        assertAnalyzesTo(a, "2B",
                new String[] { "b" });
        assertAnalyzesTo(a, "\"QUOTED\" word",
                new String[] { "quoted", "word" });
        assertAnalyzesTo(a, "zhang ./ bu <> feng",
                new String[] { "zhang", "bu", "feng" });
        // Pre-segment the Chinese text with ICTCLAS, then analyze the result
        ICTCLAS splitWord = new ICTCLAS();
        String result = splitWord.paragraphProcess("我愛大家 i LOVE chanchan");
        assertAnalyzesTo(a, result,
                new String[] { "我", "愛", "大家", "i", "love", "chanchan" });
    }

    public void testWhiteSpaceAnalyzer() throws Exception {
        // WhitespaceAnalyzer only splits on whitespace and does not lowercase
        Analyzer a = new WhitespaceAnalyzer();
        assertAnalyzesTo(a, "foo bar FOO BAR",
                new String[] { "foo", "bar", "FOO", "BAR" });
        assertAnalyzesTo(a, "foo bar . FOO <> BAR",
                new String[] { "foo", "bar", ".", "FOO", "<>", "BAR" });
        assertAnalyzesTo(a, "foo.bar.FOO.BAR",
                new String[] { "foo.bar.FOO.BAR" });
        assertAnalyzesTo(a, "U.S.A.",
                new String[] { "U.S.A." });
        assertAnalyzesTo(a, "C++",
                new String[] { "C++" });
        assertAnalyzesTo(a, "B2B",
                new String[] { "B2B" });
        assertAnalyzesTo(a, "2B",
                new String[] { "2B" });
        assertAnalyzesTo(a, "\"QUOTED\" word",
                new String[] { "\"QUOTED\"", "word" });
        assertAnalyzesTo(a, "zhang bu feng",
                new String[] { "zhang", "bu", "feng" });
        // Pre-segment the Chinese text with ICTCLAS, then analyze the result
        ICTCLAS splitWord = new ICTCLAS();
        String result = splitWord.paragraphProcess("我愛大家 i love chanchan");
        assertAnalyzesTo(a, result,
                new String[] { "我", "愛", "大家", "i", "love", "chanchan" });
    }

    public void testStopAnalyzer() throws Exception {
        // StopAnalyzer goes beyond SimpleAnalyzer:
        // on top of SimpleAnalyzer's behavior it also removes StopWords.
        Analyzer a = new StopAnalyzer();
        assertAnalyzesTo(a, "foo bar FOO BAR",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo a bar such FOO THESE BAR",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo ./ a bar such ,./<> FOO THESE BAR ",
                new String[] { "foo", "bar", "foo", "bar" });
        // Pre-segment the Chinese text with ICTCLAS, then analyze the result
        ICTCLAS splitWord = new ICTCLAS();
        String result = splitWord.paragraphProcess("我愛大家 i Love chanchan such");
        assertAnalyzesTo(a, result,
                new String[] { "我", "愛", "大家", "i", "love", "chanchan" });
    }

    public void testStandardAnalyzer() throws Exception {
        // StandardAnalyzer is the most capable;
        // for Chinese it uses single-character splitting.
        Analyzer a = new StandardAnalyzer();
        assertAnalyzesTo(a, "foo bar Foo Bar",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo bar ./ Foo ./ BAR",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo ./ a bar such ,./<> FOO THESE BAR ",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "張步峰是天大學生",
                new String[] { "張", "步", "峰", "是", "天", "大", "學", "生" });
        // Verify that English (ASCII) punctuation is removed
        assertAnalyzesTo(a, "張,/步/,峰,.是.,天大<>學生",
                new String[] { "張", "步", "峰", "是", "天", "大", "學", "生" });
        // Verify that Chinese punctuation is removed
        assertAnalyzesTo(a, "張。、步。、峰是。天大。學生",
                new String[] { "張", "步", "峰", "是", "天", "大", "學", "生" });
    }

    public void testChineseAnalyzer() throws Exception {
        // ChineseAnalyzer is functionally close to StandardAnalyzer,
        // but it may be slower.
        Analyzer a = new ChineseAnalyzer();
        // Whitespace is removed
        assertAnalyzesTo(a, "foo bar Foo Bar",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo bar ./ Foo ./ BAR",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo ./ a bar such ,./<> FOO THESE BAR ",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "張步峰是天大學生",
                new String[] { "張", "步", "峰", "是", "天", "大", "學", "生" });
        // Verify that English (ASCII) punctuation is removed
        assertAnalyzesTo(a, "張,/步/,峰,.是.,天大<>學生",
                new String[] { "張", "步", "峰", "是", "天", "大", "學", "生" });
        // Verify that Chinese punctuation is removed
        assertAnalyzesTo(a, "張。、步。、峰是。天大。學生",
                new String[] { "張", "步", "峰", "是", "天", "大", "學", "生" });
        // Mixed Chinese and English in one input is not supported
        // assertAnalyzesTo(a, "我愛你 i love chanchan",
        //         new String[] { "我", "愛", "你", "i", "love", "chanchan" });
    }

    public void testCJKAnalyzer() throws Exception {
        // CJKAnalyzer, written by chedong, handles English the same way
        // StandardAnalyzer does, but for Chinese it uses bigram splitting
        // and cannot filter out punctuation.
        Analyzer a = new CJKAnalyzer();
        assertAnalyzesTo(a, "foo bar Foo Bar",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo bar ./ Foo ./ BAR",
                new String[] { "foo", "bar", "foo", "bar" });
        assertAnalyzesTo(a, "foo ./ a bar such ,./<> FOO THESE BAR ",
                new String[] { "foo", "bar", "foo", "bar" });
        // assertAnalyzesTo(a, "張,/步/,峰,.是.,天大<>學生",
        //         new String[] { "張步", "步峰", "峰是", "是天", "天大", "大學", "學生" });
        // assertAnalyzesTo(a, "張。、步。、峰是。天大。學生",
        //         new String[] { "張步", "步峰", "峰是", "是天", "天大", "大學", "學生" });
        // Mixed Chinese and English input is supported
        assertAnalyzesTo(a, "張步峰是天大學生 i love",
                new String[] { "張步", "步峰", "峰是", "是天", "天大", "大學", "學生", "i", "love" });
    }

    public void testTjuChineseAnalyzer() throws Exception {
        /**
         * TjuChineseAnalyzer is quite capable. For Chinese it calls the Java
         * interface of ICTCLAS, so its Chinese segmentation quality matches ICTCLAS.
         * For English it uses Lucene's StopAnalyzer, so it removes stop words,
         * ignores case, and filters out all kinds of punctuation.
         */
        Analyzer a = new TjuChineseAnalyzer();
        String input = "體育訊 在被尤文淘汰之后,皇馬主帥博斯克拒絕接受媒體對球隊后防線的批評,同時還為自己排出的首發陣容進行了辯護。"+
                "“失利是全隊的責任,而不僅僅是后防線該受指責,”博斯克說,“我并不認為我們踢得一塌糊涂。”“我們進入了半決賽,而且在晉級的道路上一路奮 "+
                "戰。即使是今天的比賽我們也有幾個翻身的機會,但我們面對的對手非常強大,他們踢得非常好。”“我們的球迷應該為過去幾個賽季里我們在冠軍杯中的表現感到驕傲。”"+
                "博斯克還說。對于博斯克在首發中排出了久疏戰陣的坎比亞索,賽后有記者提出了質疑,認為完全應該將隊內的另一 "+
                "名球員帕文派遣上場以加強后衛線。對于這一疑議,博斯克拒絕承擔所謂的“責任”,認為球隊的首發沒有問題。“我們按照整個賽季以來的方式做了,"+
                "對于人員上的變化我沒有什么可說的。”對于球隊在本賽季的前景,博斯克表示皇馬還有西甲聯賽的冠軍作為目標。“皇家馬德里在冠軍 "+
                "杯中戰斗到了最后,我們在聯賽中也將這么做。”"+
                "A Java User Group is a group of people who share a common interest in Java technology and meet on a regular basis to share"+
                " technical ideas and information. The actual structure of a JUG can vary greatly - from a small number of friends and coworkers"+
                " meeting informally in the evening, to a large group of companies based in the same geographic area. "+
                "Regardless of the size and focus of a particular JUG, the sense of community spirit remains the same. ";
        outputAnalyzer(a, input);
        // I have already tested this on large texts; it works well without problems
        outputAnalyzer(a, "我愛大家 ,,。 I love China 我喜歡唱歌 ");
        assertAnalyzesTo(a, "我愛大家 ,,。I love China 我喜歡唱歌",
                new String[] { "愛", "大家", "i", "love", "china", "喜歡", "唱歌" });
    }
}