persister's blog, category feed: http://www.tkk7.com/persister/category/38140.html (zh-cn, last updated Thu, 16 Sep 2010 17:00:17 GMT)

Hadoop Study Notes (1)
persister, Fri, 12 Mar 2010, http://www.tkk7.com/persister/archive/2010/03/12/315306.html

Today I downloaded Hadoop, worked through the tutorial in its documentation, and implemented a word-count example following this article:

用 Hadoop 进行分布式数据处理, 第 1 部分: 入门 (Distributed data processing with Hadoop, Part 1: Getting started)

Below are some notes from the reading:

The storage is provided by HDFS, and analysis by MapReduce.

MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once, and read many times, whereas a relational database is good for datasets that are continually updated.

MapReduce tries to colocate the data with the compute node, so data access is fast since it is local. This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default.

Reduce tasks don't have the advantage of data locality: the input to a single reduce task is normally the output from all mappers.

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function.

Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10ms, and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
This argument shouldn't be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.

In other words: with a large block, reading is dominated by transfer time and the seek cost is negligible; with a small block, the accumulated seek time becomes comparable to the transfer time, so a read costs roughly twice the pure transfer time, which is a bad deal. Concretely, for 100 MB of data at 100 MB/s: with a 100 MB block the transfer takes 1 s and a single seek adds only 10 ms, but with 1 MB blocks the transfer still takes 1 s while the 100 seeks add 10 ms * 100 = 1 s, so the whole read takes 2 s. Is bigger always better, then? No: if the block is too large, a file may not be stored in a distributed way at all, the MapReduce model cannot be exploited well, and the job may actually run slower.

VInt (Variable-Length Integer) in Lucene's Index Format
persister, Tue, 02 Feb 2010, http://www.tkk7.com/persister/archive/2010/02/02/311642.html

A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.

The variable-length integer format, restated: the high-order bit of each byte indicates whether more bytes remain to be read, and the low seven bits are the payload, appended into the resulting value.

For example, in 00000001 the high bit is 0, so the number is represented by a single byte; the payload is the remaining seven bits 0000001, i.e. the value 1. In 10000010 00000001 the high bit of the first byte is 1, meaning more bytes follow, and the high bit of the second byte is 0, meaning it is the last one, so the number takes two bytes. Note how the value is assembled: the seven payload bits of the last byte come first, then those of each earlier byte in turn, ending with the seven payload bits of the first byte. So this number reads as 0000001 0000010, which is the integer 130.

VInt Encoding Example

Value     First byte   Second byte   Third byte
0         00000000
1         00000001
2         00000010
...
127       01111111
128       10000000     00000001
129       10000001     00000001
130       10000010     00000001
...
16,383    11111111     01111111
16,384    10000000     10000000      00000001
16,385    10000001     10000000      00000001
...





In the Lucene source code, writing and reading these values looks like this. OutputStream is responsible for writing:

  /** Writes an int in a variable-length format.  Writes between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see InputStream#readVInt()
   */
  public final void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      writeByte((byte)((i & 0x7F) | 0x80));
      i >>>= 7;
    }
    writeByte((byte)i);
  }

InputStream is responsible for reading:
  /** Reads an int stored in variable-length format.  Reads between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see OutputStream#writeVInt(int)
   */
  public final int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = readByte();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

>>> is Java's unsigned right-shift operator.
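As a quick sanity check of the scheme described above, here is a small self-contained sketch (not Lucene code; it simply mirrors the two methods against a byte array) that encodes 130 and decodes it again; it should print the bytes 10000010 00000001 and the value 130.

import java.io.ByteArrayOutputStream;

public class VIntDemo {
    // Mirrors writeVInt: low 7 bits per byte, high bit means "more bytes follow".
    static byte[] encode(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
        return out.toByteArray();
    }

    // Mirrors readVInt: each byte contributes 7 bits, least-significant group first.
    static int decode(byte[] bytes) {
        int i = bytes[0] & 0x7F;
        for (int k = 1, shift = 7; (bytes[k - 1] & 0x80) != 0; k++, shift += 7) {
            i |= (bytes[k] & 0x7F) << shift;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] b = encode(130);
        for (byte x : b) {
            System.out.print(Integer.toBinaryString((x & 0xFF) | 0x100).substring(1) + " ");
        }
        System.out.println("-> " + decode(b));   // prints 10000010 00000001 -> 130
    }
}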



First Attempt at Nutch
persister, Thu, 23 Jul 2009, http://www.tkk7.com/persister/archive/2009/07/23/288039.html

Environment: Nutch 0.9 + Fedora 5 + Tomcat 6 + JDK 6

Step 1: Tomcat and the JDK are already installed.

Step 2: nutch-0.9.tar.gz
        Unpack the downloaded tar.gz under /opt and rename the directory:
        #gunzip -c nutch-0.9.tar.gz | tar xf -
        #mv nutch-0.9 /usr/local/nutch

       To verify the environment, run /usr/local/nutch/bin/nutch and check that it prints its command-line usage; if it does, everything is fine.

       Crawl procedure: #cd /opt/nutch
                         #mkdir urls
                         #vi urls/nutch.txt    and enter www.aicent.net
                         #vi conf/crawl-urlfilter.txt    add the following, a regular expression that selects which site URLs get crawled:
                        /**** accept hosts in MY.DOMAIN.NAME******/
                                +^http://([a-z0-9]*\.)*aicent.net/
                       #vi conf/nutch-site.xml    (give your crawler a name) and configure it as follows:
   <configuration>
<property>
    <name>http.agent.name</name>
    <value>test/unique</value>
</property>
</configuration>

                Start the crawl: #bin/nutch crawl urls -dir crawl -depth 5 -threads 10 >& crawl.log

Wait a while; the time depends on the size of the site and the configured crawl depth.


Step 3: apache-tomcat

                Here, if every search you run returns 0 hits, one parameter has to be changed, because the search path of the nutch webapp in Tomcat does not point at the generated index.
                #vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
<property>
<name>searcher.dir</name>
<value>/opt/nutch/crawl</value>   <!-- the directory holding the crawl output -->
<description>My path to nutch's searcher dir.</description>
</property>

                #/opt/tomcat/bin/startup.sh


OK, done.


Summary of problems encountered:


Running: sh ./bin/nutch crawl urls -dir crawl -depth 3 -threads 60 -topN 100 >& ./logs/nutch_log.log

1.Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Some posts online say this is a JDK version problem and that JDK 1.6 cannot be used, so I installed 1.5, but the same error remained, which was puzzling. Continuing to google, I found the following possibility:

Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

Explanation: this is usually a configuration problem in crawl-urlfilter.txt. For example, the filter rule should be
+^http://www.ihooyo.com , but if it is written as http://www.ihooyo.com instead, exactly this error is raised.
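For reference, a minimal crawl-urlfilter.txt fragment of the kind described above might look like this (ihooyo.com is just the example domain used in the quote; adapt the pattern to your own site):

# accept URLs under the target domain; note the leading + and the anchored regex
+^http://([a-z0-9]*\.)*ihooyo.com/
# reject everything else
-.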

But my own configuration had no such problem at all. Then I noticed that besides nutch_log.log, another log file, hadoop.log, is generated automatically in the logs directory,
and it contained an error:


2009-07-22 22:20:55,501 INFO  crawl.Crawl - crawl started in: crawl
2009-07-22 22:20:55,501 INFO  crawl.Crawl - rootUrlDir = urls
2009-07-22 22:20:55,502 INFO  crawl.Crawl - threads = 60
2009-07-22 22:20:55,502 INFO  crawl.Crawl - depth = 3
2009-07-22 22:20:55,502 INFO  crawl.Crawl - topN = 100
2009-07-22 22:20:55,603 INFO  crawl.Injector - Injector: starting
2009-07-22 22:20:55,604 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-07-22 22:20:55,604 INFO  crawl.Injector - Injector: urlDir: urls
2009-07-22 22:20:55,605 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-22 22:20:56,574 INFO  plugin.PluginRepository - Plugins: looking in: /opt/nutch/plugins
2009-07-22 22:20:56,720 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-22 22:20:56,720 INFO  plugin.PluginRepository - Registered Plugins:
2009-07-22 22:20:56,720 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository - Registered Extension-Points:
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-07-22 22:20:56,786 WARN  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2009-07-22 22:20:56,829 WARN  mapred.LocalJobRunner - job_2319eh
java.lang.RuntimeException: java.net.UnknownHostException: jackliu: jackliu
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:617)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:591)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:364)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:390)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.startPartition(MapTask.java:294)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:355)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$100(MapTask.java:231)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:180)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: java.net.UnknownHostException: jackliu: jackliu
        at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:614)
        ... 8 more

So it was a host configuration error. The fix:
Add the following to your /etc/hosts file
127.0.0.1    jackliu

Running again, this time it succeeded!

 

2. Open http://127.0.0.1:8080/nutch-0.9
 and enter nutch as the query; the result was an error:
 HTTP Status 500 -

type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value  language + "/include/header.html" is quoted with " which must be escaped when used within the value
 org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
 org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
 org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
 org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:299)
 org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:249)
 org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:211)
 org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:154)
 org.apache.jasper.compiler.Parser.parseInclude(Parser.java:867)
 org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1134)
 org.apache.jasper.compiler.Parser.parseElements(Parser.java:1461)
 org.apache.jasper.compiler.Parser.parse(Parser.java:137)
 org.apache.jasper.compiler.ParserController.doParse(ParserController.java:255)
 org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)
 org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:170)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:332)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:312)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:299)
 org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:717)

note The full stack trace of the root cause is available in the Apache Tomcat/6.0.20 logs.

Analysis: looking at search.jsp under the root of the nutch web application, it is a quote-matching problem.

<jsp:include page="<%= language + "/include/header.html"%>"/>  //line 152 search.jsp

The first quote is matched with the next quote that appears, rather than with the last quote on the line, hence the problem.

Solution:

Change that line to: <jsp:include page="<%= language+urlsuffix %>"/>

Here we define a string urlsuffix, placed right after the definition of the language string:

  String language =   // line 116 search.jsp
    ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale())
    .getLocale().getLanguage();
 String urlsuffix="/include/header.html";

After the change, restart Tomcat to make sure it takes effect; searching no longer reports the error.


3. No search results: the contents of nutch_log.log differed from what is described online, and the crawl directory contained only two folders, segments and crawldb. I crawled the site again and this time everything was fine; I am not sure why the first crawl failed.

4. cached.jsp, explain.jsp and the like all have the same error as in item 2 above; applying the same fix resolves them.

5. It took a whole morning and half an afternoon, but Nutch is finally installed and configured. I will continue studying it tomorrow.

PhraseQuery, SpanQuery and PhrasePrefixQuery
persister, Tue, 14 Jul 2009, http://www.tkk7.com/persister/archive/2009/07/14/286634.html

PhraseQuery uses position information for relevance queries. A TermQuery on, say, "我们" and "国" returns every document containing both terms. Sometimes, though, we only want documents where "我们" and "中国" are separated by at most one or two characters, not by miles of text; that is what PhraseQuery is for. For example, given:
    doc.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
then:
    String[] phrase = new String[] {"quick", "fox"};
    assertFalse("exact phrase not found", matched(phrase, 0));
    assertTrue("close enough", matched(phrase, 1));
multi-terms:
    assertFalse("not close enough", matched(new String[] {"quick", "jumped", "lazy"}, 3));
    assertTrue("just enough", matched(new String[] {"quick", "jumped", "lazy"}, 4));
    assertFalse("almost but not quite", matched(new String[] {"lazy", "jumped", "quick"}, 7));
    assertTrue("bingo", matched(new String[] {"lazy", "jumped", "quick"}, 8));

The number is the slop, set as shown below; it is the number of positional moves allowed, going in order from one term to the next:
    query.setSlop(slop);

Order matters:
    String[] phrase = new String[] {"fox", "quick"};
assertFalse("hop flop", matched(phrase, 2));
assertTrue("hop hop slop", matched(phrase, 3));
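The assertions above rely on a small matched() helper from Lucene in Action that the post does not show. A minimal sketch of it, assuming the same 1.4-era API as the other examples here (directory holds the single document added above, and "field" is its field name):

// Builds a PhraseQuery from the given terms, applies the slop, and reports
// whether the single indexed document matches.
private boolean matched(String[] phrase, int slop) throws Exception {
    PhraseQuery query = new PhraseQuery();
    query.setSlop(slop);
    for (int i = 0; i < phrase.length; i++) {
        query.add(new Term("field", phrase[i]));
    }
    IndexSearcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(query);
    boolean match = hits.length() > 0;
    searcher.close();
    return match;
}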

The principle is illustrated by a figure in the original post (not reproduced here):


For the query terms quick and fox, fox only has to move one position to match quick brown fox, whereas for fox and quick, fox has to move three positions. The larger the move, the lower the score of the match, and the less likely the record is to be returned.
SpanQuery uses position information for even more interesting queries:
SpanQuery type       Description
SpanTermQuery        Used in conjunction with the other span query types. On its own, it's functionally equivalent to TermQuery.
SpanFirstQuery       Matches spans that occur within the first part of a field.
SpanNearQuery        Matches spans that occur near one another.
SpanNotQuery         Matches spans that don't overlap one another.
SpanOrQuery          Aggregates matches of span queries.

SpanFirstQuery: To query for spans that occur within the first n positions of a field, use SpanFirstQuery.



quick = new SpanTermQuery(new Term("f", "quick"));
brown = new SpanTermQuery(new Term("f", "brown"));
red = new SpanTermQuery(new Term("f", "red"));
fox = new SpanTermQuery(new Term("f", "fox"));
lazy = new SpanTermQuery(new Term("f", "lazy"));
sleepy = new SpanTermQuery(new Term("f", "sleepy"));
dog = new SpanTermQuery(new Term("f", "dog"));
cat = new SpanTermQuery(new Term("f", "cat"));

SpanFirstQuery sfq = new SpanFirstQuery(brown, 2);
assertNoMatches(sfq);
sfq = new SpanFirstQuery(brown, 3);
assertOnlyBrownFox(sfq);

SpanNearQuery: spans that occur near one another.

First, a word about PhraseQuery: that class is not one of the span query types, but it can achieve span-style queries.

The terms of a matching document are normally adjacent to one another. To allow for intervening terms in the original document, or to allow the terms in inverted order, PhraseQuery provides a slop factor. The slop is the maximum allowed gap between two terms, but it is not distance in the conventional sense: it is the number of positional moves needed to arrange the terms, in order, into the given phrase. This means PhraseQuery always computes the span following the order in which the terms occur in the document. For the document quick brown fox, the two terms quick fox have a slop of 1 (one move), whereas fox quick needs quick to move back three times, so its slop is 3.

Next, consider SpanTermQuery, a subclass of SpanQuery.

It supports span queries that need not follow the order in which the terms appear in the document: a separate flag states whether the match must be in order or may be completed in reverse order. Also, the matching span is not measured as a number of positional moves; it runs from the start position of the first span to the end position of the last span.

Using SpanTermQuery objects as the SpanQuery objects inside a SpanNearQuery gives an effect very similar to PhraseQuery. The third argument of the SpanNearQuery constructor is the inOrder flag, which controls whether the terms must appear in document order or may appear in the reverse order.

With the document the quick brown fox jumps over the lazy dog:

      public void testSpanNearQuery() throws Exception {

           SpanQuery[] quick_brown_dog = new SpanQuery[]{quick, brown, dog};

           SpanNearQuery snq = new SpanNearQuery(quick_brown_dog, 0, true);   // in order, slop 0
           assertNoMatches(snq);      // no match

           snq = new SpanNearQuery(quick_brown_dog, 4, true);                 // in order, slop 4
           assertNoMatches(snq);      // still no match

           snq = new SpanNearQuery(quick_brown_dog, 5, true);                 // in order, slop 5
           assertOnlyBrownFox(snq);   // match

           snq = new SpanNearQuery(new SpanQuery[]{lazy, fox}, 3, false);     // reverse order, slop 3
           assertOnlyBrownFox(snq);   // match

           // The same query with PhraseQuery: since PhraseQuery is order-sensitive,
           // lazy and fox need a slop of 5.
           PhraseQuery pq = new PhraseQuery();
           pq.add(new Term("f", "lazy"));
           pq.add(new Term("f", "fox"));
           pq.setSlop(4);
           assertNoMatches(pq);       // slop 4 is not enough

           pq.setSlop(5);
           assertOnlyBrownFox(pq);    // slop 5 matches
      }


3. PhrasePrefixQuery is mainly used for synonym queries:
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
    Document doc1 = new Document();
    doc1.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
    writer.addDocument(doc1);
    Document doc2 = new Document();
    doc2.add(Field.Text("field","the fast fox hopped over the hound"));
    writer.addDocument(doc2);

    PhrasePrefixQuery query = new PhrasePrefixQuery();
    query.add(new Term[] {new Term("field", "quick"), new Term("field", "fast")});
    query.add(new Term("field", "fox"));

    Hits hits = searcher.search(query);
    assertEquals("fast fox match", 1, hits.length());
    query.setSlop(1);
    hits = searcher.search(query);
    assertEquals("both match", 2, hits.length());



Some Considerations on Handling Query Keywords in a Search Engine
persister, Sat, 11 Jul 2009, http://www.tkk7.com/persister/archive/2009/07/11/286377.html

2. Synonym queries: both SynonymAnalyzer and PhrasePrefixQuery can solve this problem.

Analyzer
persister, Tue, 07 Jul 2009, http://www.tkk7.com/persister/archive/2009/07/07/285833.html

Analyzer             Steps taken
WhitespaceAnalyzer   Splits tokens at whitespace
SimpleAnalyzer       Divides text at nonletter characters and lowercases
StopAnalyzer         Divides text at nonletter characters, lowercases, and removes stop words
StandardAnalyzer     Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words
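To see the differences concretely, a small sketch in the spirit of the Lucene in Action AnalyzerDemo could look like this; it assumes the old TokenStream API of that Lucene generation, where Token exposes termText():

import java.io.StringReader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerComparison {
    public static void main(String[] args) throws Exception {
        String text = "The quick brown fox can't jump 2 metres, info@example.com!";
        Analyzer[] analyzers = {
            new WhitespaceAnalyzer(), new SimpleAnalyzer(),
            new StopAnalyzer(), new StandardAnalyzer()
        };
        for (Analyzer analyzer : analyzers) {
            System.out.print(analyzer.getClass().getName() + ": ");
            // Pull tokens one by one and print their text.
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            for (Token token = stream.next(); token != null; token = stream.next()) {
                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println();
        }
    }
}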


Porter stemming algorithm
persister, Mon, 06 Jul 2009, http://www.tkk7.com/persister/archive/2009/07/06/285728.html

Stemming means reducing words to their root form (there is a good overview online). In Latin-derived languages such as English, a word takes many inflected forms, for example with the suffixes -ed, -ing, -ly. If these inflected forms can be reduced to their stem during tokenization, search results benefit a great deal. There are many stemming algorithms; the three mainstream ones are the Porter stemming algorithm, the Lovins stemming algorithm, and the Lancaster (Paice/Husk) stemming algorithm, plus various improved or alternative algorithms. The PorterStemmer invoked inside PorterStemFilter is an implementation of the Porter stemming algorithm.
How Lucene's Inverted Index Works
persister, Wed, 10 Jun 2009, http://www.tkk7.com/persister/archive/2009/06/10/281201.html

Inverted index

Lucene is a high-performance Java full-text search toolkit. It uses an inverted-file index structure; the structure and the algorithm that builds it are as follows:
0) Suppose there are two articles, 1 and 2:
Article 1: Tom lives in Guangzhou, I live in Guangzhou too.
Article 2: He once lived in Shanghai.

1) Since Lucene indexes and queries by keyword, we first have to extract the keywords of the two articles, which usually involves the following steps:
a. What we have is the article content, i.e. a string, so we first have to find all the words in it, that is, tokenize it. English words are separated by spaces and are easy to handle; Chinese words are written contiguously and need special word segmentation.
b. Words such as "in", "once", "too" carry no real meaning, and Chinese characters such as 的 and 是 usually have no concrete meaning either; words like these, which denote no concept, can be filtered out.
c. A user searching for "He" usually also wants articles containing "he" or "HE", so all words are normalized to lower case.
d. A user searching for "live" usually also wants articles containing "lives" or "lived", so "lives" and "lived" are reduced to "live".
e. Punctuation usually denotes no concept and can also be filtered out.
In Lucene, all of the above steps are performed by the Analyzer class.

After this processing:
l过上面处理?br />     文章1的所有关键词为:[tom] [live] [guangzhou] [i] [live] [guangzhou]
    All keywords of article 2: [he] [live] [shanghai]

2) With the keywords extracted, we can build the inverted index. The mapping so far is "article number" to "all keywords of that article". The inverted index flips this relation into "keyword" to "all article numbers containing that keyword". Articles 1 and 2, inverted, become:
keyword    article number
guangzhou  1
he         2
i           1
live       1,2
shanghai   2
tom         1

Usually it is not enough to know which articles a keyword appears in; we also need how often and at which positions it appears. There are two kinds of position: a) character position, i.e. the keyword is the Nth character of the article (its advantage is fast positioning when highlighting); b) keyword position, i.e. the keyword is the Nth keyword of the article (its advantages are a smaller index and fast phrase queries). Lucene records the latter.
With the "occurrence frequency" and "position" information added, our index structure becomes:
keyword   article[frequency]   positions
guangzhou 1[2]               3,6
he       2[1]               1
i         1[1]               4
live      1[2],2[1]           2,5,2
shanghai  2[1]               3
tom      1[1]               1

Take live as an example: it occurs twice in article 1 and once in article 2, and its position list is "2, 5, 2". What does that mean? We have to combine it with the article numbers and frequencies: live occurs twice in article 1, so "2, 5" are its two positions in article 1; it occurs once in article 2, and the remaining "2" says that live is the 2nd keyword of article 2.
    The above is the core of Lucene's index structure. Note that the keywords are arranged in character order (Lucene does not use a B-tree), so Lucene can locate a keyword quickly with a binary search.
    In the implementation, Lucene stores the three columns above as the term dictionary file (Term Dictionary), the frequencies file and the positions file. The dictionary file not only stores each keyword but also keeps pointers to the frequencies file and the positions file, through which the keyword's frequency and position information is found.
    Lucene uses the concept of a field to express where the information is located (e.g. in the title, the body, or the URL). When indexing, the field information is also recorded in the dictionary file, because every keyword belongs to one or more fields.
    To keep the index files small, Lucene also compresses the index. First, the keywords in the dictionary are prefix-compressed to <prefix length, suffix>: for example, if the current word is 阿拉伯语 and the previous word is 阿拉伯, then 阿拉伯语 is compressed to <3, 语>. Second, numbers are compressed heavily: a number is stored only as the difference from the previous value, which shortens it and reduces the bytes needed to store it. For example, if the current article number is 16389 (3 bytes uncompressed) and the previous article number is 16382, the value stored is 7 (one byte).
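A small illustrative sketch of the two tricks just described, delta-encoding the article numbers and prefix-compressing neighbouring dictionary terms (this is not Lucene's actual code, just the idea):

public class IndexCompressionSketch {
    // Delta-encode a sorted list of article numbers: keep the first value,
    // then only the gap to the previous one (16382, 16389 -> 16382, 7).
    static int[] deltaEncode(int[] docs) {
        int[] out = new int[docs.length];
        for (int k = 0; k < docs.length; k++) {
            out[k] = (k == 0) ? docs[0] : docs[k] - docs[k - 1];
        }
        return out;
    }

    // Prefix-compress a term against its predecessor: <shared prefix length, suffix>,
    // e.g. previous = "阿拉伯", current = "阿拉伯语" -> <3, 语>.
    static String prefixCompress(String prev, String curr) {
        int shared = 0;
        while (shared < prev.length() && shared < curr.length()
               && prev.charAt(shared) == curr.charAt(shared)) {
            shared++;
        }
        return "<" + shared + ", " + curr.substring(shared) + ">";
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(deltaEncode(new int[]{16382, 16389})));
        System.out.println(prefixCompress("阿拉伯", "阿拉伯语"));
    }
}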
    A query against this index shows why building it is worthwhile. Suppose we search for the word "live": Lucene binary-searches the dictionary to find the term, follows the pointer into the frequencies file to read out all article numbers, and returns the result. The dictionary is usually very small, so the whole process takes milliseconds. An ordinary sequential scan, by contrast, matches the string against the content of every article without any index; when the number of articles is large, the time needed is often unbearable.
Self-comment: see also http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95


Binary search
Finds a particular element in a sorted array. First compare the middle element of the array: if it equals the target, return a pointer to it, meaning it was found; if not, keep searching in whichever half could still contain the value, and so on. If the remaining array length is 0, the element is not present and the function ends. The function looks like this:
int *binarySearch(int val, int array[], int n)
{
int m = n/2;
if(n <= 0) return NULL;
if(val == array[m]) return array + m;
if(val < array[m]) return binarySearch(val, array, m);
else return binarySearch(val, array+m+1, n-m-1);
}


For an array of n elements, binary search performs at most 1 + log2(n) comparisons. With a million elements that is roughly 20 comparisons, i.e. at most 20 recursive calls of binarySearch().


Lucene Study Notes: Indexing
persister, Tue, 09 Jun 2009, http://www.tkk7.com/persister/archive/2009/06/09/281032.html

1. Adding documents to an index:
 protected String[] keywords = {"1", "2"};
 protected String[] unindexed = {"Netherlands", "Italy"};
 protected String[] unstored = {"Amsterdam has lots of bridges", "Venice has lots of canals"};
 protected String[] text = {"Amsterdam", "Venice"};
 Directory dir = FSDirectory.getDirectory(indexDir, true);
 IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
 writer.setUseCompoundFile(true);
 for (int i = 0; i < keywords.length; i++) {
  Document doc = new Document();
  doc.add(Field.Keyword("id", keywords[i]));
  doc.add(Field.UnIndexed("country", unindexed[i]));
  doc.add(Field.UnStored("contents", unstored[i]));
  doc.add(Field.Text("city", text[i]));
  writer.addDocument(doc);
 }
 writer.optimize();
 writer.close();
2.Removing Documents from an indexQ?br />  IndexReader reader = IndexReader.open(dir);
 reader.delete(1);
The approach above deletes only one document at a time; the method below deletes every document that matches a condition:
 IndexReader reader = IndexReader.open(dir);
 reader.delete(new Term("city", "Amsterdam"));
 reader.close();

3.Index dates
 Document doc = new Document();
 doc.add(Field.Keyword("indexDate", new Date()));

4.Tuning indexing performance
 IndexWriter field    System property                  Default value        Description
 --------------------------------------------------------------------------------------------------
 mergeFactor          org.apache.lucene.mergeFactor    10                   Controls segment merge frequency and size
 maxMergeDocs         org.apache.lucene.maxMergeDocs   Integer.MAX_VALUE    Limits the number of documents per segment
 minMergeDocs         org.apache.lucene.minMergeDocs   10                   Controls the amount of RAM used when indexing

mergeFactor controls how many documents are buffered in memory before being flushed to disk, and also how often index segments are merged. Its default is 10: once 10 documents are buffered they must be written to disk, and whenever the number of segments reaches a power of 10 they are merged into one segment (maxMergeDocs caps how many documents a single segment may hold). A larger mergeFactor makes better use of RAM and speeds up indexing, but it also means merges happen less often, so the number of segments can grow large (because nothing gets merged); searching then has to open more segment files, which slows searching down. minMergeDocs is another IndexWriter instance variable that affects indexing performance. Its value controls how many Documents have to be buffered before they're merged to a segment, so minMergeDocs also plays a role in controlling how many documents are buffered, like mergeFactor.
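A sketch of how these knobs are turned with the 1.4-era API used throughout this post, where they are public fields on IndexWriter (the concrete values are only illustrative):

IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
// Buffer more documents in RAM and merge segments less often:
writer.mergeFactor = 100;      // merge every 100 segments instead of every 10
writer.minMergeDocs = 1000;    // keep up to 1000 docs in RAM before writing a segment
writer.maxMergeDocs = 100000;  // never let a single segment exceed 100k docs
// ... addDocument() calls ...
writer.optimize();             // merge down to one segment before heavy searching
writer.close();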

5. RAMDirectory helps exploit RAM; you can also use a cluster or multiple threads to make full use of hardware and software resources and speed up indexing.
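A common pattern for this, sketched against the same old API (the paths are placeholders): index into a RAMDirectory first, then merge the in-memory index into the on-disk index in one shot:

FSDirectory fsDir = FSDirectory.getDirectory("/path/to/index", true);
RAMDirectory ramDir = new RAMDirectory();

IndexWriter ramWriter = new IndexWriter(ramDir, new SimpleAnalyzer(), true);
// ... add many documents to ramWriter (fast, all in memory) ...
ramWriter.close();

// Merge the RAM index into the disk index.
IndexWriter fsWriter = new IndexWriter(fsDir, new SimpleAnalyzer(), true);
fsWriter.addIndexes(new Directory[] { ramDir });
fsWriter.optimize();
fsWriter.close();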

6. Sometimes you want to bound the size of each field, for example indexing only the first 10,000 terms; maxFieldLength controls this.

7. IndexWriter's optimize() method merges segments, reducing their number and therefore the time spent reading the index when searching.

8. Be careful in multi-threaded environments: an index-modifying IndexReader operation can't be executed while an index-modifying IndexWriter operation is in progress. To prevent misuse, Lucene locks the index while certain APIs are in use.

Lucene's Query Classes
persister, Mon, 08 Jun 2009, http://www.tkk7.com/persister/archive/2009/06/08/280567.html

A basic Lucene query:
 Searcher searcher = new IndexSearcher(dbpath);
 Query query = QueryParser.parse(searchkey, searchfield,
     new StandardAnalyzer());
 Hits hits = searcher.search(query);
Below are the various Query subclasses; each of them has a corresponding QueryParser syntax.

1. TermQuery is the most common: it queries the index for a single Term, the smallest unit of the index, consisting of a field name and a value.
The Term corresponds directly to the key and field arguments passed to QueryParser.parse.
 IndexSearcher searcher = new IndexSearcher(directory);
 Term t = new Term("isbn", "1930110995");
 Query query = new TermQuery(t);
 Hits hits = searcher.search(query);

2. RangeQuery performs interval queries; its third argument says whether the interval is open or closed. QueryParser builds the N queries between begin and end and runs them.

 Term begin, end;
 Searcher searcher = new IndexSearcher(dbpath);
 begin = new Term("pubmonth","199801");
 end = new Term("pubmonth","199810");
 RangeQuery query = new RangeQuery(begin, end, true);

RangeQuery is essentially a magnitude comparison, so the following query is also possible, although its meaning is not quite the same as the one above: in general an interval is defined, and everything inside it can be found. A comparison rule applies; strings, for example, are compared starting from the first character, independent of string length.
 begin = new Term("pubmonth","19");
 end = new Term("pubmonth","20");
 RangeQuery query = new RangeQuery(begin, end, true);


3. PrefixQuery. With TermQuery, a field generated with Field.Keyword must be matched exactly to be found, which limits the flexibility of queries. PrefixQuery only needs to match the leading part of the value: if the field is name and the records contain jackliu, jackwu and jackli, then searching with jack returns all of them. QueryParser creates a PrefixQuery
 IndexSearcher searcher = new IndexSearcher(directory);
 Term term = new Term("category", "/technology/computers/programming");
 PrefixQuery query = new PrefixQuery(term);
 Hits hits = searcher.search(query);

4. BooleanQuery. All of the queries above target a single field. How do you query several fields? BooleanQuery solves the problem of combining multiple queries: sub-queries are added with add(Query query, boolean required, boolean prohibited), and by nesting BooleanQuerys you can compose very complex queries.
 IndexSearcher searcher = new IndexSearcher(directory);
 TermQuery searchingBooks =
 new TermQuery(new Term("subject","search"));

 RangeQuery currentBooks =
 new RangeQuery(new Term("pubmonth","200401"),
  new Term("pubmonth","200412"),true);
  
 BooleanQuery currentSearchingBooks = new BooleanQuery();
 currentSearchingBooks.add(searchingBooks, true, false);
 currentSearchingBooks.add(currentBooks, true, false);
 Hits hits = searcher.search(currentSearchingBooks);

BooleanQuery's add method takes two boolean parameters:
true, false: the added clause must be satisfied;
false, true: the added clause must not be satisfied;
false, false: the added clause is optional;
true, true: an invalid combination.
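To illustrate the nesting mentioned above, a sketch that combines these flags (the field names are made up for the example):

 // (subject:search AND subject:lucene) OR title:nutch
 BooleanQuery inner = new BooleanQuery();
 inner.add(new TermQuery(new Term("subject", "search")), true, false);   // required
 inner.add(new TermQuery(new Term("subject", "lucene")), true, false);   // required

 BooleanQuery outer = new BooleanQuery();
 outer.add(inner, false, false);                                         // optional
 outer.add(new TermQuery(new Term("title", "nutch")), false, false);     // optional
 Hits hits = searcher.search(outer);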

QueryParser handily constructs BooleanQuerys when multiple terms are specified.
Grouping is done with parentheses, and the prohibited and required flags are
set when the –, +, AND, OR, and NOT operators are specified.
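For example, an expression roughly equivalent to the BooleanQuery built above could be parsed like this (a sketch; "contents" is an arbitrary default field, and the bracketed range syntax assumes the QueryParser of the same Lucene generation as the rest of the post):

 Query q = QueryParser.parse(
     "+subject:search +pubmonth:[200401 TO 200412]",
     "contents",
     new StandardAnalyzer());
 Hits hits = searcher.search(q);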

5. PhraseQuery performs more precise matching: it can constrain the positions of two or more keywords in the indexed text, for example finding documents that contain both A and B with only one word between them. Terms surrounded by double quotes in
The slop factor defaults to zero, but you can adjust the slop factor
by adding a tilde (~) followed by an integer.
For example, the expression "quick fox"~3

6. WildcardQuery. WildcardQuery offers finer control and more flexibility than PrefixQuery; it is the easiest one to understand and use.

7. FuzzyQuery. This query is special: it also finds records whose terms merely look similar to the keyword. QueryParser
supports FuzzyQuery by suffixing a term with a tilde (~), for example wuzza~.

 public void testFuzzy() throws Exception {
  indexSingleFieldDocs(new Field[] {
  Field.Text("contents", "fuzzy"),
  Field.Text("contents", "wuzzy")
  });
  IndexSearcher searcher = new IndexSearcher(directory);
  Query query = new FuzzyQuery(new Term("contents", "wuzza"));
  Hits hits = searcher.search(query);
  assertEquals("both close enough", 2, hits.length());
  assertTrue("wuzzy closer than fuzzy",
  hits.score(0) != hits.score(1));
  assertEquals("wuzza bear","wuzzy", hits.doc(0).get("contents"));
 }



Lucene Study Notes
persister, Fri, 06 Mar 2009, http://www.tkk7.com/persister/archive/2009/03/06/258147.html

This deepened my understanding of full-text retrieval.
A simple comparison can be drawn between full-text search and a database:
full-text search has no notion of tables and therefore no fixed fields, but it does have records, and each record is a Document object;
every document can carry its own, different fields, as follows:
    Document doc = new Document(); 

   doc.add(Field.Keyword("filename",file.getAbsolutePath())); 
     
   // Use only one of the following two lines: the first indexes the content without storing it, the second indexes and stores it.
   //doc.add(Field.Text("content",new FileReader(file))); 
   doc.add(Field.Text("content",this.chgFileToString(file)));
   
   indexWriter.addDocument(doc);

Three important parameters are needed when querying.
First the index path, i.e. which index to search (analogous to the path of a database):
Searcher searcher = new IndexSearcher(dbpath);

Then, which field to search and for what keyword: the field gives access to the corresponding content, in which the keyword is then looked up. This loosely resembles an SQL statement, only without the notion of tables.
Query query
    = QueryParser.parse(searchkey,searchfield,new StandardAnalyzer()); 

Then run the query; the result is a collection of documents:
   Hits hits = searcher.search(query); 

Processing the resulting collection:
   if(hits != null)
  {
       list = new ArrayList();
       int temp_hitslength = hits.length();
       Document doc = null;
       for(int i = 0;i < temp_hitslength; i++){
           doc = hits.doc(i);
           //list.add(doc.get("filename"));
           list.add(doc.get("content"));
       }
   } 

  Appendix: the commonly used Field methods are as follows:

Method                                       Tokenized  Indexed  Stored  Use
Field.Text(String name, String value)        Yes        Yes      Yes     Tokenize, index and store; e.g. title and body fields
Field.Text(String name, Reader value)        Yes        Yes      No      Tokenize and index but do not store; e.g. META information that is not returned for display but must be searchable
Field.Keyword(String name, String value)     No         Yes      Yes     Index without tokenizing, and store; e.g. date fields
Field.UnIndexed(String name, String value)   No         No       Yes     Do not index, only store; e.g. file paths
Field.UnStored(String name, String value)    Yes        Yes      No      Full-text index only, do not store


Tokenized means the text is split into terms for indexing; as the table shows, everything that is tokenized is also indexed. Indexed means the field can be searched by term; Stored means the content itself is kept. Field.Keyword does not tokenize but does index, so such a field can be queried, whereas Field.UnIndexed cannot be queried at all. And because Field.Keyword does not tokenize, a query built with new Term(searchkey, searchfield) only matches when searchkey is exactly equal to the value argument; with Field.Text and Field.UnStored this is not the case.
Lucene中国 is a very good website with a detailed analysis of Lucene's internals; it is worth consulting.


