A variable-length format for positive integers is
defined where the high-order bit of each byte indicates whether more
bytes remain to be read. The low-order seven bits are appended as
increasingly more significant bits in the resulting integer value.
Thus values from zero to 127 may be stored in a single byte, values
from 128 to 16,383 may be stored in two bytes, and so on.
In other words: the high bit of each byte says whether more bytes follow, and the low seven bits carry the payload.
For example, in 00000001 the high bit is 0, so the value fits in one byte; the payload is the remaining seven bits 0000001, i.e. the value 1. In 10000010 00000001 the first byte's high bit is 1 (more bytes follow) and the second byte's high bit is 0 (this is the last byte), so the value occupies two bytes. When the value is assembled, the seven payload bits of the last byte are the most significant and those of the first byte the least significant, so the bits read 0000001 0000010, which is 130 in decimal.

VInt Encoding Example
Value      First byte   Second byte   Third byte
0          00000000
1          00000001
2          00000010
...
127        01111111
128        10000000     00000001
129        10000001     00000001
130        10000010     00000001
...
16,383     11111111     01111111
16,384     10000000     10000000      00000001
16,385     10000001     10000000      00000001
...
The Lucene source code stores and reads these values as shown in the writeVInt/readVInt listings further below; OutputStream is responsible for writing.
Using Hadoop for distributed data processing, Part 1: Getting started
Below are some notes from the theory:
The storage is provided by HDFS, and analysis by MapReduce.
MapReduce is a good fit for problems
that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
An RDBMS is good for point queries or updates, where the dataset has been indexed
to deliver low-latency retrieval and update times of a relatively small amount of
data.
MapReduce suits applications where the data is written once, and read many
times, whereas a relational database is good for datasets that are continually updated.
MapReduce tries to colocate the data with the compute node, so data access is fast
since it is local. This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the
user-defined map function for each record in the split.
On the other hand, if splits are too small, then the overhead of managing the splits and
of map task creation begins to dominate the total job execution time. For most jobs, a
good split size tends to be the size of an HDFS block, 64 MB by default.
Reduce tasks don’t have the advantage of data locality—the input to a single reduce
task is normally the output from all mappers.
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output—the combiner function’s
output forms the input to the reduce function.
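To make the combiner idea concrete, here is a minimal word-count sketch using the classic org.apache.hadoop.mapred API (class names and input/output paths are my own for illustration); the reducer is reused as the combiner, so partial sums are computed on the map side before anything crosses the network:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // emit (word, 1) for every token in the line
      for (String token : value.toString().split("\\s+")) {
        if (token.length() > 0) {
          word.set(token);
          output.collect(word, ONE);
        }
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // sum the counts; summing is associative, so the same class works as combiner and reducer
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // run the reduce logic on map output too
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}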
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk
can be made to be significantly larger than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer
rate.
A quick calculation shows that if the seek time is around 10ms, and the transfer rate is
100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS installations
use 128 MB blocks. This figure will continue to be revised upward as transfer
speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally
operate on one block at a time, so if you have too few tasks (fewer than nodes in the
cluster), your jobs will run slower than they could otherwise.
What this means: with a large block, the time spent seeking to the block is small compared with the time spent transferring it, so the overall cost is dominated by transfer. With a small block, the seek time becomes comparable to the transfer time. Concretely, if the data is 100 MB and the block size is 100 MB, the transfer takes 1 s (at 100 MB/s); if the block size is 1 MB, the transfer still takes 1 s in total, but the seeks take 10 ms * 100 = 1 s, so reading takes 2 s altogether. Is bigger always better? No: if blocks are too large, a file may end up on too few nodes, the data is no longer well distributed, MapReduce cannot exploit much parallelism, and the job may actually run slower.
/** Writes an int in a variable-length format.  Writes between one and
 * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
 * supported.
 * @see InputStream#readVInt()
 */
public final void writeVInt(int i) throws IOException {
  while ((i & ~0x7F) != 0) {
    writeByte((byte)((i & 0x7f) | 0x80));
    i >>>= 7;
  }
  writeByte((byte)i);
}
InputStream is responsible for reading:
/** Reads an int stored in variable-length format.  Reads between one and
 * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
 * supported.
 * @see OutputStream#writeVInt(int)
 */
public final int readVInt() throws IOException {
  byte b = readByte();
  int i = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = readByte();
    i |= (b & 0x7F) << shift;
  }
  return i;
}
>>> is the unsigned right-shift operator.
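For a quick sanity check of the scheme, here is a small self-contained sketch (my own class and method names, no Lucene types) that encodes and decodes the value 130 from the table above:
public class VIntDemo {
  // Encode a non-negative int using the 7-bits-per-byte scheme described above.
  static byte[] encode(int i) {
    byte[] buf = new byte[5];                   // a 32-bit value needs at most 5 bytes
    int n = 0;
    while ((i & ~0x7F) != 0) {
      buf[n++] = (byte) ((i & 0x7F) | 0x80);    // high bit set: more bytes follow
      i >>>= 7;                                 // unsigned shift to the next 7 bits
    }
    buf[n++] = (byte) i;                        // last byte has the high bit clear
    byte[] out = new byte[n];
    System.arraycopy(buf, 0, out, 0, n);
    return out;
  }

  // Decode: the low 7 bits of each byte become increasingly significant.
  static int decode(byte[] bytes) {
    int i = bytes[0] & 0x7F;
    for (int k = 1, shift = 7; (bytes[k - 1] & 0x80) != 0; k++, shift += 7) {
      i |= (bytes[k] & 0x7F) << shift;
    }
    return i;
  }

  public static void main(String[] args) {
    byte[] b = encode(130);
    // prints 10000010 00000001 -> 130, matching the table above
    for (byte x : b) {
      System.out.print(Integer.toBinaryString((x & 0xFF) | 0x100).substring(1) + " ");
    }
    System.out.println("-> " + decode(b));
  }
}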
One: Tomcat and the JDK are already installed.
Two: nutch-0.9.tar.gz
Unpack the downloaded tar.gz package into /opt and rename it:
# gunzip -c nutch-0.9.tar.gz | tar xf -
# mv nutch-0.9 /usr/local/nutch
To check that the environment is set up correctly, run /usr/local/nutch/bin/nutch and see whether it prints its command-line usage; if it does, everything is fine.
Crawl procedure: # cd /opt/nutch
# mkdir urls
# vi urls/nutch.txt    (add www.aicent.net as the seed URL)
# vi conf/crawl-urlfilter.txt    add the following, a regular expression that filters which site URLs get crawled:
/**** accept hosts in MY.DOMAIN.NAME******/
+^http://([a-z0-9]*\.)*aicent.net/
# vi conf/nutch-site.xml    (give your spider a name) and set it as follows:
<configuration>
<property>
<name>http.agent.name</name>
<value>test/unique</value>
</property>
</configuration>
Start the crawl: # bin/nutch crawl urls -dir crawl -depth 5 -threads 10 >& crawl.log
Wait a while; how long depends on the size of the site and the configured crawl depth.
Three: apache-tomcat
If every search returns 0 results here, you need to change one setting, because the search directory Nutch uses inside Tomcat does not point at the crawl output.
# vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
<property>
<name>searcher.dir</name>
<value>/opt/nutch/crawl</value>  <!-- the directory containing the crawl output -->
<description>My path to nutch's searcher dir.</description>
</property>
# /opt/tomcat/bin/startup.sh
OK, done.
Summary of problems encountered:
Run: sh ./bin/nutch crawl urls -dir crawl -depth 3 -threads 60 -topN 100 >& ./logs/nutch_log.log
1.Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Searching online, some said it was a JDK version problem and that JDK 1.6 could not be used, so I installed 1.5. Still the same problem. Strange.
Continuing to google, I found the following possibility:
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Explanation: this is usually a configuration problem in crawl-urlfilter.txt. For example, the filter rule should be
+^http://www.ihooyo.com, but if it is written as http://www.ihooyo.com (without the +^), the error above occurs.
My own configuration, however, had no such problem at all.
Besides nutch_log.log, another log file is generated automatically in the logs directory: hadoop.log.
It shows an error:
2009-07-22 22:20:55,501 INFO crawl.Crawl - crawl started in: crawl
2009-07-22 22:20:55,501 INFO crawl.Crawl - rootUrlDir = urls
2009-07-22 22:20:55,502 INFO crawl.Crawl - threads = 60
2009-07-22 22:20:55,502 INFO crawl.Crawl - depth = 3
2009-07-22 22:20:55,502 INFO crawl.Crawl - topN = 100
2009-07-22 22:20:55,603 INFO crawl.Injector - Injector: starting
2009-07-22 22:20:55,604 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-07-22 22:20:55,604 INFO crawl.Injector - Injector: urlDir: urls
2009-07-22 22:20:55,605 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-22 22:20:56,574 INFO plugin.PluginRepository - Plugins: looking in: /opt/nutch/plugins
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - Registered Plugins:
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - URL Query Filter (query-url)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Registered Extension-Points:
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-07-22 22:20:56,786 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2009-07-22 22:20:56,829 WARN mapred.LocalJobRunner - job_2319eh
java.lang.RuntimeException: java.net.UnknownHostException: jackliu: jackliu
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:617)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:591)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:364)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:390)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.startPartition(MapTask.java:294)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:355)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$100(MapTask.java:231)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:180)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: java.net.UnknownHostException: jackliu: jackliu
at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:614)
... 8 more
So the host configuration is wrong. Therefore:
Add the following to your /etc/hosts file
127.0.0.1 jackliu
Running it again, this time it succeeded!
2. Open http://127.0.0.1:8080/nutch-0.9,
enter nutch as the query, and it reports an error:
HTTP Status 500 -
type Exception report
message
description The server encountered an internal error () that prevented it from fulfilling this request.
exception
org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value
org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:299)
org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:249)
org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:211)
org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:154)
org.apache.jasper.compiler.Parser.parseInclude(Parser.java:867)
org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1134)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1461)
org.apache.jasper.compiler.Parser.parse(Parser.java:137)
org.apache.jasper.compiler.ParserController.doParse(ParserController.java:255)
org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:170)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:332)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:312)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:299)
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
note The full stack trace of the root cause is available in the Apache Tomcat/6.0.20 logs.
Analysis: looking at search.jsp in the root directory of the Nutch web application shows that this is a quote-matching problem.
<jsp:include page="<%= language + "/include/header.html"%>"/> //line 152 search.jsp
The first quote is paired with the next quote that appears, rather than with the last quote on the line, which is what causes the problem.
Solution:
Change that line to: <jsp:include page="<%= language + urlsuffix %>"/>
Here we define a String urlsuffix, placed right after the definition of the language string:
String language = // line 116 search.jsp
ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale())
.getLocale().getLanguage();
String urlsuffix="/include/header.html";
After the change, restart Tomcat to make sure it takes effect; searching no longer reports an error.
3. No search results:
The contents of nutch_log.log differed from what is described online, and the crawl directory contained only two folders, segments and crawldb. I crawled again,
and this time everything was fine. Strange; I don't know why the previous crawl failed.
4. cached.jsp, explain.jsp and the like all have the same error as in problem 2; the same fix makes them OK.
5. It took a whole morning plus half an afternoon, but the installation and configuration of Nutch is finally done. More studying tomorrow.
Order matters:
String[] phrase = new String[] {"fox", "quick"};
assertFalse("hop flop", matched(phrase, 2));
assertTrue("hop hop slop", matched(phrase, 3));
The principle works like this: for the query keywords quick and fox, fox only needs to move one position to match "quick brown fox"; but for the keywords fox and quick, in that order,
fox has to move three positions. The larger the move needed, the lower the score of that match and the less likely the record is to be returned.
SpanQuery uses position information to support more interesting queries:
SpanQuery type    Description
SpanTermQuery     Used in conjunction with the other span query types. On its own, it's functionally equivalent to TermQuery.
SpanFirstQuery    Matches spans that occur within the first part of a field.
SpanNearQuery     Matches spans that occur near one another.
SpanNotQuery      Matches spans that don't overlap one another.
SpanOrQuery       Aggregates matches of span queries.
SpanFirstQuery: To query for spans that occur within the first n positions of a field, use SpanFirstQuery.
quick = new SpanTermQuery(new Term("f", "quick"));
brown = new SpanTermQuery(new Term("f", "brown"));
red = new SpanTermQuery(new Term("f", "red"));
fox = new SpanTermQuery(new Term("f", "fox"));
lazy = new SpanTermQuery(new Term("f", "lazy"));
sleepy = new SpanTermQuery(new Term("f", "sleepy"));
dog = new SpanTermQuery(new Term("f", "dog"));
cat = new SpanTermQuery(new Term("f", "cat"));
SpanFirstQuery sfq = new SpanFirstQuery(brown, 2);
assertNoMatches(sfq);
sfq = new SpanFirstQuery(brown, 3);
assertOnlyBrownFox(sfq);
SpanNearQuery: matches spans that occur near one another, i.e. within a given number of positions of each other.
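For example (a sketch reusing the SpanTermQuerys defined above and the same two test documents as the SpanFirstQuery example; the slop values are only illustrative), SpanNearQuery takes an array of clauses, a slop, and an in-order flag:
// "brown" immediately followed by "fox": matches only the brown-fox document
SpanNearQuery brownFox = new SpanNearQuery(new SpanQuery[] {brown, fox}, 0, true);
assertOnlyBrownFox(brownFox);

// the same clauses in the opposite order with in-order required: no match,
// because "fox" never appears directly before "brown"
SpanNearQuery foxBrown = new SpanNearQuery(new SpanQuery[] {fox, brown}, 0, true);
assertNoMatches(foxBrown);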
3.PhrasePrefixQuery is mainly used for synonym queries:
IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
Document doc1 = new Document();
doc1.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
writer.addDocument(doc1);
Document doc2 = new Document();
doc2.add(Field.Text("field","the fast fox hopped over the hound"));
writer.addDocument(doc2);
PhrasePrefixQuery query = new PhrasePrefixQuery();
query.add(new Term[] {new Term("field", "quick"), new Term("field", "fast")});
query.add(new Term("field", "fox"));
Hits hits = searcher.search(query);
assertEquals("fast fox match", 1, hits.length());
query.setSlop(1);
hits = searcher.search(query);
assertEquals("both match", 2, hits.length());
/* Recursive binary search: returns a pointer to the matching element, or NULL. */
int *binarySearch(int val, int array[], int n)
{
    int m = n/2;
    if(n <= 0) return NULL;                                  /* empty range: not found */
    if(val == array[m]) return array + m;                    /* hit at the midpoint */
    if(val < array[m]) return binarySearch(val, array, m);   /* search the left half */
    else return binarySearch(val, array+m+1, n-m-1);         /* search the right half */
}
For an array of n elements, binary search performs at most 1 + log2(n) comparisons. With a million elements that is about 20 comparisons, i.e. at most about 20 recursive calls to binarySearch().
3.Index dates
Document doc = new Document();
doc.add(Field.Keyword("indexDate", new Date()));
4.Tuning indexing performance
IndexWriter      System property                  Default value        Description
--------------------------------------------------------------------------------------------------
mergeFactor      org.apache.lucene.mergeFactor    10                   Controls segment merge frequency and size
maxMergeDocs     org.apache.lucene.maxMergeDocs   Integer.MAX_VALUE    Limits the number of documents per segment
minMergeDocs     org.apache.lucene.minMergeDocs   10                   Controls the amount of RAM used when indexing
mergeFactor controls how many documents are buffered in memory before being written to disk, and also controls how often index segments are merged. Its default is 10: after 10 documents are buffered they are flushed to disk as a segment, and whenever the number of segments reaches a power of 10 they are merged into a single segment (maxMergeDocs caps how many documents a single segment may hold). A larger mergeFactor makes better use of RAM and speeds up indexing, but it also means merging happens less often, which can leave a large number of segments on disk (because little gets merged), so searching has to open more segment files and becomes slower. minMergeDocs is another IndexWriter instance variable that affects indexing performance. Its value controls how many Documents have to be buffered before they're merged to a segment. In other words, minMergeDocs also plays mergeFactor's role of controlling how many documents are buffered.
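In the Lucene 1.4.x line these knobs are public fields on IndexWriter; a rough sketch (the values and the index path are only illustrative, not recommendations):
IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index", true),
                                     new WhitespaceAnalyzer(), true);
writer.mergeFactor = 25;       // buffer more documents in RAM, merge segments less often
writer.minMergeDocs = 100;     // documents held in RAM before a segment is flushed
writer.maxMergeDocs = 100000;  // cap on the number of documents per merged segment
// ... addDocument() calls ...
writer.close();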
5.RAMDirectory helps exploit RAM; you can also use a cluster or multiple threads to make full use of hardware and software resources and improve indexing throughput.
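One common pattern (a sketch; the path and analyzer are placeholders) is to index a batch of documents into a RAMDirectory and then merge it into the on-disk index with addIndexes:
RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new WhitespaceAnalyzer(), true);
// ... add a batch of documents to ramWriter ...
ramWriter.close();

IndexWriter fsWriter = new IndexWriter(FSDirectory.getDirectory("/tmp/index", false),
                                       new WhitespaceAnalyzer(), false);
fsWriter.addIndexes(new Directory[] { ramDir });  // merge the in-memory index into the disk index
fsWriter.close();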
6.Sometimes you want to bound the size of each field, for example indexing only the first 1,000 terms; maxFieldLength controls this.
7.IndexWriter's optimize() method merges segments, reducing their number and hence the time spent reading the index during a search.
8.Be careful when working in a multithreaded environment: an index-modifying IndexReader operation can't be executed
while an index-modifying IndexWriter operation is in progress. To prevent misuse, Lucene locks the index while certain APIs are in use.
1.TermQuery is the most commonly used query. It searches the index for a single Term, the smallest unit of indexing, which consists of a field name and a value. A Term corresponds directly to the field and key passed to QueryParser.parse.
IndexSearcher searcher = new IndexSearcher(directory);
Term t = new Term("isbn", "1930110995");
Query query = new TermQuery(t);
Hits hits = searcher.search(query);
2.RangeQuery is used for range queries; its third parameter says whether the range is open or closed at the end points. QueryParser effectively builds a query covering all the terms between begin and end.
Term begin, end;
Searcher searcher = new IndexSearcher(dbpath);
begin = new Term("pubmonth","199801");
end = new Term("pubmonth","199810");
RangeQuery query = new RangeQuery(begin, end, true);
RangeQuery is essentially a comparison of magnitudes, so the following query is also legal, although its meaning differs from the one above. In general a range is set and everything inside it can be found; the comparison rule for strings is lexicographic, comparing the first characters first, so it does not depend on string length.
begin = new Term("pubmonth","19");
end = new Term("pubmonth","20");
RangeQuery query = new RangeQuery(begin, end, true);
3.PrefixQuery. With TermQuery, only an exact match (on a field created with Field.Keyword) can be found,
which limits the flexibility of queries. PrefixQuery only needs to match a leading fragment of the value. For example, if the field is name and the records
contain jackliu, jackwu and jackli, then searching with jack returns all of them. QueryParser creates a PrefixQuery
for a term when it ends with an asterisk (*) in query expressions.
IndexSearcher searcher = new IndexSearcher(directory);
Term term = new Term("category", "/technology/computers/programming");
PrefixQuery query = new PrefixQuery(term);
Hits hits = searcher.search(query);
4.BooleanQuery. All the queries above work on a single field; how do you query several fields at once? BooleanQuery
solves the problem of combining multiple queries. Sub-queries are added with add(Query query, boolean required, boolean prohibited),
and by nesting BooleanQuerys you can compose very complex queries.
IndexSearcher searcher = new IndexSearcher(directory);
TermQuery searchingBooks =
new TermQuery(new Term("subject","search"));
RangeQuery currentBooks =
new RangeQuery(new Term("pubmonth","200401"),
new Term("pubmonth","200412"),true);
BooleanQuery currentSearchingBooks = new BooleanQuery();
currentSearchingBooks.add(searchingBooks, true, false);
currentSearchingBooks.add(currentBooks, true, false);
Hits hits = searcher.search(currentSearchingBooks);
BooleanQuery's add method takes two boolean parameters:
true, false: the clause being added must be satisfied;
false, true: the clause being added must not be satisfied;
false, false: the clause being added is optional;
true, true: an invalid combination.
QueryParser handily constructs BooleanQuerys when multiple terms are specified.
Grouping is done with parentheses, and the prohibited and required flags are
set when the -, +, AND, OR, and NOT operators are specified.
5.PhraseQuery performs a more precise search. It can constrain the positions of two or more keywords
in the indexed text, for example finding text that contains A and B with exactly one other word between them. Terms surrounded by double quotes in
QueryParser parsed expressions are translated into a PhraseQuery.
The slop factor defaults to zero, but you can adjust the slop factor
by adding a tilde (~) followed by an integer.
For example, the expression "quick fox"~3
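The programmatic equivalent of "quick fox"~3 looks roughly like this (the field name is illustrative, and searcher is the IndexSearcher from the earlier snippets):
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents", "quick"));
phrase.add(new Term("contents", "fox"));
phrase.setSlop(3);                      // allow the terms to be up to 3 position moves apart
Hits hits = searcher.search(phrase);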
6.WildcardQuery. WildcardQuery offers finer control and more flexibility than PrefixQuery, and is the easiest to
understand and use.
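For example (the field and pattern are made up, searcher as before), ? matches exactly one character and * matches zero or more:
Query wildcard = new WildcardQuery(new Term("contents", "?ild*"));
Hits hits = searcher.search(wildcard);  // would match wild, mild, wildcard, ...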
7.FuzzyQuery. This query is special: it matches records whose terms look similar to the keyword. QueryParser
supports FuzzyQuery by suffixing a term with a tilde (~), for example wuzza~.
public void testFuzzy() throws Exception {
indexSingleFieldDocs(new Field[] {
Field.Text("contents", "fuzzy"),
Field.Text("contents", "wuzzy")
});
IndexSearcher searcher = new IndexSearcher(directory);
Query query = new FuzzyQuery(new Term("contents", "wuzza"));
Hits hits = searcher.search(query);
assertEquals("both close enough", 2, hits.length());
assertTrue("wuzzy closer than fuzzy",
hits.score(0) != hits.score(1));
assertEquals("wuzza bear","wuzzy", hits.doc(0).get("contents"));
}
Method                                      Tokenized   Indexed   Stored   Usage
Field.Text(String name, String value)       Yes         Yes       Yes      Tokenized, indexed and stored; e.g. title or body fields
Field.Text(String name, Reader value)       Yes         Yes       No       Tokenized and indexed but not stored; e.g. META content that must be searchable but is never displayed
Field.Keyword(String name, String value)    No          Yes       Yes      Indexed as a single token and stored; e.g. date fields
Field.UnIndexed(String name, String value)  No          No        Yes      Not indexed, only stored; e.g. a file path
Field.UnStored(String name, String value)   Yes         Yes       No       Full-text indexed only, not stored
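Putting the table to use, a small sketch of a document that exercises each factory method (field names and values are made up, and writer is assumed to be an open IndexWriter):
Document doc = new Document();
doc.add(Field.Text("title", "Lucene in Action"));                   // tokenized, indexed, stored
doc.add(Field.UnStored("contents", "full text of the book ..."));   // tokenized, indexed, not stored
doc.add(Field.Keyword("isbn", "1930110995"));                       // indexed as one token, stored
doc.add(Field.UnIndexed("path", "/data/books/lia.pdf"));            // stored only, not searchable
writer.addDocument(doc);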