<?xml version="1.0" encoding="utf-8" standalone="yes"?>
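$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)$$

$$\mathrm{tf}(t \in d) = \mathrm{frequency}^{1/2}$$

$$\mathrm{idf}(t) = 1 + \log\left( \frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1} \right)$$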
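As a rough sketch of the two per-term factors, the plain-Java snippet below computes tf and idf. It assumes the idf argument is numDocs / (docFreq + 1) with a natural logarithm, as in Lucene's DefaultSimilarity. The class and method names are made up for illustration and are not Lucene's API.

```java
/** Illustrative stand-alone helpers; not part of Lucene. */
final class TfIdfSketch {

    /** tf(t in d) = frequency^(1/2): the square root of the term's frequency in the document. */
    static double tf(int frequency) {
        return Math.sqrt(frequency);
    }

    /** idf(t) = 1 + ln(numDocs / (docFreq + 1)); the +1 avoids division by zero. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // A term that occurs 4 times in a document, and in 9 of 1000 documents overall.
        System.out.println("tf  = " + tf(4));
        System.out.println("idf = " + idf(9, 1000));
    }
}
```

Note that idf enters the score squared, so rare terms weigh considerably more than the raw idf value suggests.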
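coord(q,d) is computed by the Similarity in effect at search time.

queryNorm(q) is also computed at search time. The default computation is:

$$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

The sum of squared weights (of the query terms) is computed by the query's Weight object. For example, a boolean query computes this value as:

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \Big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \Big)^2$$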
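The query normalization can be sketched in a few lines of plain Java. This assumes the default queryNorm of 1 / sqrt(sumOfSquaredWeights) from Lucene's DefaultSimilarity; the class name and the array-based parameters are hypothetical and chosen only to keep the example self-contained.

```java
/** Illustrative stand-alone helpers; not part of Lucene. */
final class QueryNormSketch {

    /**
     * sumOfSquaredWeights = queryBoost^2 * sum over query terms of (idf(t) * termBoost(t))^2.
     * idfs[i] and termBoosts[i] describe the i-th term of the query.
     */
    static double sumOfSquaredWeights(double queryBoost, double[] idfs, double[] termBoosts) {
        double sum = 0.0;
        for (int i = 0; i < idfs.length; i++) {
            double weight = idfs[i] * termBoosts[i];
            sum += weight * weight;
        }
        return queryBoost * queryBoost * sum;
    }

    /** queryNorm = 1 / sqrt(sumOfSquaredWeights). */
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }
}
```

For example, two terms with idf 3 and 4 and boost 1 give a sum of 25 and a queryNorm of 0.2. Since queryNorm is the same for every document, it does not change the ranking within a single query; it only makes scores from different queries more comparable.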
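t.getBoost() is the search-time boost of term t in the query q, as set by setBoost(). Note that there is no direct API for accessing the boost of a single term in a multi-term query. Instead, multiple terms are represented in a query as multiple TermQuery objects, so the boost of a term in the query can be read by calling the sub-query's getBoost().

norm(t,d) combines several index-time factors:

Document boost: set by calling doc.setBoost() before adding the document to the index.
Field boost: set by calling field.setBoost() before adding the field to a document.
lengthNorm(field): computed when the document is added to the index, from that field of the document.

$$\mathrm{norm}(t,d) = \mathrm{doc.getBoost}() \cdot \mathrm{lengthNorm}(\mathrm{field}) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.\mathrm{getBoost}()$$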
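The norm product can be sketched as follows. The lengthNorm used here, 1 / sqrt(number of tokens in the field), is an assumption taken from Lucene's DefaultSimilarity rather than from the text above, and the sketch ignores that Lucene stores the result lossily in a single byte. All names are hypothetical.

```java
/** Illustrative stand-alone helpers; not part of Lucene. */
final class NormSketch {

    /** Assumed length normalization: shorter fields get a larger factor. */
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    /**
     * norm(t,d) = docBoost * lengthNorm(field) * product of the boosts of
     * every field instance in the document that has the same name as the term's field.
     */
    static double norm(double docBoost, int numTokens, double[] fieldBoosts) {
        double product = 1.0;
        for (double boost : fieldBoosts) {
            product *= boost;
        }
        return docBoost * lengthNorm(numTokens) * product;
    }
}
```

With a document boost of 2.0, a four-token field, and two same-named field instances boosted 1.5 and 2.0, this gives 2.0 · 0.5 · 3.0 = 3.0.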
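Because these factors are fixed at indexing time, search time is too late to modify the norm part of scoring, e.g. by using a different Similarity for search.
------------------------------------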
Store.COMPRESS Store the original field value in the index in a compressed form. This is useful for long documents and for binary valued fields. (Compressed storage.)
Store.YES Store the original field value in the index. This is useful for short texts like a document's title which should be displayed with the results. The value is stored in its original form, i.e. no analyzer is used before it is stored. (An index file normally holds only index data; with this option the original content, such as a document's title, is stored in the index file as well.)
Store.NO Do not store the field value in the index. (The original text is not kept in the index. After a search hit, the original is reopened using other attached attributes such as the file's path or a database primary key. This suits cases where the original content is large.)
These options determine the Field object's this.isStored and this.isCompressed.
------------------------------------
Index.NO Do not index the field value. This field can thus not be searched, but one can still access its contents provided it is stored (see Field.Store). (Not indexed; holds content that does not need to be searchable, such as auxiliary document attributes like the document type or URL.)
Index.TOKENIZED Index the field's value so it can be searched. An Analyzer will be used to tokenize and possibly further normalize the text before its terms are stored in the index. This is useful for common text. (Tokenized indexing.)
Index.UN_TOKENIZED Index the field's value without using an Analyzer, so it can be searched. As no analyzer is used, the value will be stored as a single term. This is useful for unique IDs like product numbers. (Indexed without tokenizing, e.g. author names and dates: "Rod Johnson" is itself a single term and needs no further tokenization.)
Index.NO_NORMS (Indexed without tokenizing. I was unsure at first what norms are; my guess was that the field's value is not saved as usual but reduced to a single byte to save space.) Index the field's value without an Analyzer, and disable the storing of norms. No norms means that index-time boosting and field length normalization will be disabled. The benefit is less memory usage, as norms take up one byte per indexed field for every document in the index. Note that once you index a given field with norms enabled, disabling norms will have no effect. In other words, for NO_NORMS to have the above described effect on a field, all instances of that field must be indexed with NO_NORMS from the beginning.
These options determine the Field object's this.isIndexed, this.isTokenized and this.omitNorms.
------------------------------------
Added in Lucene 1.4.3:
TermVector.NO Do not store term vectors.
TermVector.YES Store the term vectors of each document. A term vector is a list of the document's terms and their number of occurrences in that document.
TermVector.WITH_POSITIONS Store the term vector plus token position information.
TermVector.WITH_OFFSETS Store the term vector plus token offset information.
TermVector.WITH_POSITIONS_OFFSETS Store the term vector plus token position and offset information.
These options determine the Field object's this.storeTermVector, this.storePositionWithTermVector and this.storeOffsetWithTermVector.
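To make the mapping from the three option groups to the Field flags concrete, here is a simplified model in plain Java. It is a stand-in written for this note, not Lucene's Field class: the enum constants and flag names follow the lists above, but the class itself and its constructor are hypothetical, and Lucene's own argument checks are left out.

```java
/** Simplified model of how Store / Index / TermVector choices set a field's flags. Not Lucene code. */
final class FieldFlagsSketch {
    enum Store { YES, NO, COMPRESS }
    enum Index { NO, TOKENIZED, UN_TOKENIZED, NO_NORMS }
    enum TermVector { NO, YES, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS }

    final boolean isStored;
    final boolean isCompressed;
    final boolean isIndexed;
    final boolean isTokenized;
    final boolean omitNorms;
    final boolean storeTermVector;
    final boolean storePositionWithTermVector;
    final boolean storeOffsetWithTermVector;

    FieldFlagsSketch(Store store, Index index, TermVector termVector) {
        // Store: YES and COMPRESS both keep the original value; COMPRESS also compresses it.
        isStored = store != Store.NO;
        isCompressed = store == Store.COMPRESS;

        // Index: everything except NO is searchable; only TOKENIZED runs an Analyzer;
        // NO_NORMS indexes the value as a single term and drops the norms byte.
        isIndexed = index != Index.NO;
        isTokenized = index == Index.TOKENIZED;
        omitNorms = index == Index.NO_NORMS;

        // TermVector: positions and offsets are optional extras on top of the vector itself.
        storeTermVector = termVector != TermVector.NO;
        storePositionWithTermVector = termVector == TermVector.WITH_POSITIONS
                || termVector == TermVector.WITH_POSITIONS_OFFSETS;
        storeOffsetWithTermVector = termVector == TermVector.WITH_OFFSETS
                || termVector == TermVector.WITH_POSITIONS_OFFSETS;
    }
}
```

A title field, for instance, would typically be modelled as `new FieldFlagsSketch(Store.YES, Index.TOKENIZED, TermVector.NO)`: stored for display, tokenized for search, and without term vectors.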
Recently, ruby 1.9 added yet another way to define a lambda
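Everything below is reposted; see the individual links for the source URLs.
The four most common Analyzers, explained: http://windshowzbf.bokee.com/3016397.html
WhitespaceAnalyzer: only splits on whitespace. It does not lowercase characters and does not support Chinese.
SimpleAnalyzer: more capable than WhitespaceAnalyzer. It filters out every symbol that is not a letter and lowercases all characters. It does not support Chinese.
StopAnalyzer: goes beyond SimpleAnalyzer by also removing stop words. It does not support Chinese. The class keeps ENGLISH_STOP_WORDS in a static array: words too common to be worth indexing.
StandardAnalyzer: uses a strict grammar, an EBNF defined with JavaCC. Some say its English handling is the same as StopAnalyzer's. It supports Chinese by splitting text into single characters. I have not compared them closely, so I cannot say for sure.
Other extensions:
ChineseAnalyzer: from the Lucene sandbox. Its performance is similar to StandardAnalyzer's; the drawback is that it cannot segment mixed Chinese and English text.
CJKAnalyzer: written by chedong. Its English handling is the same as StandardAnalyzer's, but when segmenting Chinese it cannot filter out punctuation. It uses bigram segmentation.
TjuChineseAnalyzer: written by the author of http://windshowzbf.bokee.com/3016397.html, and the most capable of these. For Chinese it calls the Java interface of ICTCLAS, so its Chinese segmentation performs the same as ICTCLAS. For English it uses Lucene's StopAnalyzer, so it removes stop words, is case-insensitive, and filters out punctuation.
Examples:
http://www.langtech.org.cn/index.php/uid-5080-action-viewspace-itemid-68 (also includes a short code walkthrough)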
Analyzing "The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
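The first three behaviours in the listings above can be approximated in plain Java. This is a sketch, not Lucene's implementation: the class and method names are invented, the stop-word set is a small illustrative subset rather than the real ENGLISH_STOP_WORDS, and StandardAnalyzer is left out because its grammar-based rules (such as keeping `xy&z` and e-mail addresses whole) do not reduce to a few lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Rough approximations of three analyzers' tokenization; not Lucene code. */
final class AnalyzerSketch {
    // Assumption: a small illustrative subset, not Lucene's actual stop-word list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "is", "of", "the", "to"));

    /** WhitespaceAnalyzer-like: split on whitespace only, keep case and symbols. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** SimpleAnalyzer-like: tokens are maximal runs of letters, lowercased. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** StopAnalyzer-like: the simple tokens minus the stop words. */
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : simple(text)) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Running the three methods on the two sample sentences reproduces the WhitespaceAnalyzer, SimpleAnalyzer and StopAnalyzer rows shown above.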
Reference links:
http://macrochen.blogdriver.com/macrochen/1167942.html
http://macrochen.blogdriver.com/macrochen/1153507.html
http://my.dmresearch.net/bbs/viewthread.php?tid=8318
http://windshowzbf.bokee.com/3016397.html
ISO-10646 term | Unicode term |
---|---|
UCS-2 | BMP, UTF-16 |
UCS-4 | UTF-32 |
Also:
Java 1.0 supports Unicode version 1.1.
Java 1.1 onwards supports Unicode version 2.0.
Character handling in J2SE 1.4 is based on the Unicode 3.0 standard.
J2SE 1.5 supports the Unicode 4.0 character set.
And:
Unicode 3.0: September 1999; covers the 16-bit Universal Character Set (UCS) Basic Multilingual Plane from ISO 10646-1.
Unicode 3.1: March 2001; adds the Supplementary Planes defined by ISO 10646-2.
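The practical consequence for Java can be shown with a short sketch: a `char` is a 16-bit unit, so a BMP character fits in one `char`, while a character from a supplementary plane needs two (a surrogate pair). The methods used (`Character.toChars`, `Character.isSupplementaryCodePoint`, `String.codePointCount`) are available from J2SE 1.5 onwards; the class name and the choice of sample code points are only for illustration.

```java
/** Shows how many 16-bit char units Java needs for a given code point. */
final class SurrogateSketch {

    /** Number of UTF-16 units (Java chars) used for one code point: 1 for BMP, 2 otherwise. */
    static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int bmp = 0x4E2D;            // a CJK character inside the Basic Multilingual Plane
        int supplementary = 0x20000; // first code point of plane 2, outside the BMP

        System.out.println(utf16Units(bmp));
        System.out.println(utf16Units(supplementary));

        // length() counts char units, codePointCount() counts characters.
        String s = new String(Character.toChars(supplementary));
        System.out.println(s.length() + " char units, "
                + s.codePointCount(0, s.length()) + " code point");
    }
}
```

Code that treats one `char` as one character is therefore only safe for text that stays within the BMP.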
U-00000000 - U-0000007F: | 0xxxxxxx |
U-00000080 - U-000007FF: | 110xxxxx 10xxxxxx |
U-00000800 - U-0000FFFF: | 1110xxxx 10xxxxxx 10xxxxxx |
U-00010000 - U-001FFFFF: | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-00200000 - U-03FFFFFF: | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-04000000 - U-7FFFFFFF: | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
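The bit patterns in this table translate directly into shifts and masks. The sketch below hand-encodes a code point into UTF-8 for the first four rows only; the five- and six-byte forms are omitted because Java code points stop at U+10FFFF. The class and method names are made up, and surrogate code points (U+D800 to U+DFFF) are not rejected, which a production encoder would do.

```java
/** Hand-rolled UTF-8 encoder for one code point, following the bit patterns in the table. */
final class Utf8Sketch {

    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp < 0x80) {
            // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }
}
```

For valid characters the result should match what `new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8)` returns; for example, U+4E2D encodes to the three bytes E4 B8 AD.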
Code range (hex) | Scalar value (binary) | UTF-8 (binary / hex) | Notes |
---|---|---|---|
000000 - 00007F (128 codes) | 00000000 00000000 0zzzzzzz (seven z) | 0zzzzzzz (00-7F) (seven z) | ASCII-equivalent range; the byte starts with 0 |
000080 - 0007FF (1920 codes) | 00000000 00000yyy yyzzzzzz (three y; two y; six z) | 110yyyyy (C2-DF) 10zzzzzz (80-BF) (five y; six z) | The first byte starts with 110; the following byte starts with 10 |
000800 - 00FFFF (63488 codes) | 00000000 xxxxyyyy yyzzzzzz (four x; four y; two y; six z) | 1110xxxx (E0-EF) 10yyyyyy 10zzzzzz (four x; six y; six z) | The first byte starts with 1110; the following bytes start with 10 |
010000 - 10FFFF (1048576 codes) | 000wwwxx xxxxyyyy yyzzzzzz (three w; two x; four x; four y; two y; six z) | 11110www (F0-F4) 10xxxxxx 10yyyyyy 10zzzzzz (three w; six x; six y; six z) | Starts with 11110; the following bytes start with 10 |
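Most modern programming languages developed after about 1993 have a dedicated data type for Unicode/ISO 10646-1 characters. In Ada95 it is called Wide_Character; in Java it is called char.
ISO C also specifies mechanisms for handling multi-byte encodings and wide characters, and more were added when Amendment 1 to ISO C was published in September 1994. These mechanisms were designed mainly for the various East Asian encodings, and they are far more elaborate than what handling UCS requires. UTF-8 is an example of what the ISO C standard calls a multi-byte string encoding, and the wchar_t type can be used to hold Unicode characters.
The parsing part follows.
btw, a guess: in JavaCC, wrapping something in [] means it may occur zero times or once.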