
Full-Text Search with Lucene (Part 1) --- Handling the Index

scud(飛云小俠) http://www.jscud.com Please credit the source/author when reposting.

Keywords: lucene, html parser, full-text search, IndexReader, Document, Field, IndexWriter, Term, HTMLPAGE

Lucene is a full-text search engine, currently available in Java, .NET, and several other versions. The Java version lives at http://lucene.apache.org. A related project is Che Dong's (車東) WebLucene: http://sourceforge.net/projects/weblucene.

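To start, suppose we want to add full-text search to a simple news system. The administration side of the news system is not covered here; the class for a news item is listed below:

Note: the code uses a few utility classes that are not listed here; readers can implement them themselves.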

  package com.jscud.website.newsinfo.bean;


  import java.sql.Timestamp;

  import com.jscud.util.DateTime;
  import com.jscud.util.StringFunc;
  import com.jscud.website.newsinfo.NewsConst;


  /**
   * A news item.
   *
   * @author scud(飛云小俠) http://www.jscud.com
   *
   */
  public class NewsItem
  {

      private int nid; // news ID

      private int cid; // category ID

      private String title; // title

      private int showtype; // content type: currently url and html are supported

      private String content; // content

      private String url; // target URL, used when the content type is url

      private Timestamp addtime; // time added

      private int click; // click count

      // The corresponding getters and setters are numerous and omitted here; they can be generated with a tool
      //......


      /**
       * Format the content according to its type
       */
      public String getShowContent()
      {
          String sRes = content;
          if(showtype == NewsConst.ShowType_HTML)
          {
          }
          return sRes;
      }

      public String getTarget()
      {
          if(showtype == NewsConst.ShowType_URL)
          {
              return "_blank";
          }
          else
              return "";
      }

      /**
       * Path and name of the static HTML file
       */
      public String getHtmlFileName()
      {
          int nYear = DateTime.getYear_Date(getAddtime());
          int nMonth = DateTime.getMonth_Date(getAddtime());

          String sGeneFileName =
             "/news/" + getCid() + "/" + nYear + "/" + nMonth +"/" + getNid() + ".htm";

          return sGeneFileName;
      }

      /**
       * Path of the static HTML file
       */
      public String getHtmlFilePath()
      {
          int nYear = DateTime.getYear_Date(getAddtime());
          int nMonth = DateTime.getMonth_Date(getAddtime());

          String sGeneFilePath =
             getCid() + "_" + nYear + "_" + nMonth;

          return sGeneFilePath;
      }
  }
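
To make the two path helpers concrete, here is a minimal stand-alone sketch of the strings they build. The class name `NewsPathSketch` and its method names are hypothetical, and it uses `java.util.Calendar` in place of the author's unlisted `DateTime` utility, on the assumption that `DateTime.getMonth_Date` returns a 1-based month.

```java
import java.sql.Timestamp;
import java.util.Calendar;

/** Stand-alone sketch of the two path helpers; not the author's DateTime utility. */
class NewsPathSketch {

    /** Mirrors getHtmlFileName(): /news/{cid}/{year}/{month}/{nid}.htm */
    static String htmlFileName(int cid, int nid, Timestamp addtime) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(addtime);
        int year = cal.get(Calendar.YEAR);
        // Calendar months are 0-based; a 1-based month is an assumption of this sketch.
        int month = cal.get(Calendar.MONTH) + 1;
        return "/news/" + cid + "/" + year + "/" + month + "/" + nid + ".htm";
    }

    /** Mirrors getHtmlFilePath(): {cid}_{year}_{month} */
    static String htmlFilePath(int cid, Timestamp addtime) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(addtime);
        int year = cal.get(Calendar.YEAR);
        int month = cal.get(Calendar.MONTH) + 1;
        return cid + "_" + year + "_" + month;
    }

    public static void main(String[] args) {
        Timestamp added = Timestamp.valueOf("2005-08-12 17:31:00");
        System.out.println(htmlFileName(3, 42, added));
        System.out.println(htmlFilePath(3, added));
    }
}
```

Note that the month is not zero-padded, so August appears as `8` rather than `08` in both strings.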

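As you can see, we need to search on the title and the content. To do that, we first need to take a look at Lucene.

In Lucene, full-text search requires building an index first; only then can you search. In practice there will also be work to delete and update index entries.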

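Before that, here is an introduction to one of the most basic classes (quoted from http://www.tkk7.com/cap/archive/2005/07/17/7849.html):

Analyzer: the abstraction of a document analyzer. This class handles tokenization (especially important for Chinese), case conversion (Computer->computer, which makes queries case-insensitive), stemming (computers->computer), removal of stop words, and so on; it is also responsible for converting documents in other formats to plain text.

In Lucene, StandardAnalyzer is typically used to analyze content. It supports Chinese and other multi-byte languages, and you can of course implement your own special-purpose analyzer. StandardAnalyzer currently handles Chinese one character at a time. This is the simplest approach, but it has a drawback: it can produce some meaningless matches.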
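
The analysis steps described above can be sketched without Lucene. The following toy class is not StandardAnalyzer and does not use any Lucene API; `ToyAnalyzer`, its stop-word list, and its method names are made up for illustration. It lowercases words, drops stop words, and emits every Chinese character as its own token, which is the single-character strategy mentioned above. Stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy analyzer: NOT Lucene's StandardAnalyzer, only an illustration of similar steps. */
class ToyAnalyzer {

    // Hypothetical stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    /** Splits text into lowercase tokens; every CJK ideograph becomes its own token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // Single-character strategy: close any pending word, emit the ideograph alone.
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                // Case folding makes later matching case-insensitive.
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // Whitespace and punctuation end the current word.
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    /** Emits the pending word unless it is empty or a stop word, then resets the buffer. */
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            String token = word.toString();
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // The escapes spell four Chinese characters meaning "full-text search".
        System.out.println(tokenize("The Computer is \u5168\u6587\u68C0\u7D22"));
    }
}
```

Because each character is indexed separately, a query can match characters that merely sit next to each other without forming a real word, which is the source of the meaningless matches.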

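Let's first look at building the index. There are two cases: indexing a single news item, and indexing news items in bulk, either at startup or at scheduled times. To cover both, we write a general-purpose indexing function:

(The indexes of one kind are usually kept in a single directory. This setting can be defined in the function or stored in a configuration file and passed to the function as a parameter.)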

    /**
     * Build the index.
     *
     * @param doc the document to index
     * @param indexDir the index directory
     */
    public static void makeIndex(Document doc, String indexDir)
    {
        List aList = new ArrayList();
        aList.add(doc);
        makeIndex(aList, indexDir);
    }

    /**
     * Build the index.
     *
     * @param docs the documents to index
     * @param indexDir the index directory
     */
    public static void makeIndex(List docs, String indexDir)
    {
        if (null == docs)
        {
            return;
        }
        boolean indexExist = indexExist(indexDir);
        IndexWriter writer = null;
        try
        {
            StandardAnalyzer analyzer = new StandardAnalyzer();

            //If the index exists, append to it; otherwise create a new one. It would be nice if Lucene decided this automatically.
            if(indexExist)
            {
                writer = new IndexWriter(indexDir, analyzer, false);
            }
            else
            {
                writer = new IndexWriter(indexDir, analyzer, true);
            }

            //Add the documents one by one
            for (int i = 0; i < docs.size(); i++)
            {
                Document doc = (Document) docs.get(i);
                if (null != doc)
                {
                    writer.addDocument(doc);
                }
            }

            //Post-processing once indexing is done
            writer.optimize();
        }
        catch (IOException e)
        {
            LogMan.warn("Error in Make Index", e);
        }
        finally
        {
            try
            {
                if (null != writer)
                {
                    writer.close();
                }
            }
            catch (IOException e)
            {
                LogMan.warn("Close writer Error", e);
            }
        }
    }

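As you can see, indexing uses the IndexWriter class. It can either create a new index or append to an existing one, but you have to decide which yourself. The check is done through the IndexReader class, as follows: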

    /**
     * Check whether the index exists.
     * @param indexDir the index directory
     * @return true if an index exists in the directory
     */
    public static boolean indexExist(String indexDir)
    {
        return IndexReader.indexExists(indexDir);
    }

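If you create a new index every time, the existing records are deleted. I did not notice this when I first used it and only discovered the problem later by inspecting the index files.

You can also see that the index is built from Document objects supplied by the caller; a Document represents one document record in the index. So how do we build a document? Using the news system as an example, the code is as follows: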

     /**
      * Build the Document for a news item.
      *
      * @param aNews a news item.
      *
      * @return the Lucene document object
      */
     public static Document makeNewsSearchDocument(NewsItem aNews)
     {
         Document doc = new Document();

         doc.add(Field.Keyword("nid", String.valueOf(aNews.getNid())));

         doc.add(Field.Text("title", aNews.getTitle()));

         //Parse the HTML; if the content is not HTML, no parsing is needed. Alternatively, call your own parser depending on the format
         String content = parseHtmlContent(aNews.getContent());

         doc.add(Field.UnStored("content", content));

         doc.add(Field.Keyword("addtime", aNews.getAddtime()));

         //Other content can be added here, for example comments on the news item
         doc.add(Field.UnStored("other", ""));

         //URL for viewing the item
         String newsUrl = "/srun/news/viewhtml/" + aNews.getHtmlFilePath() + "/" + aNews.getNid()
                         + ".htm";

         doc.add(Field.UnIndexed("visiturl", newsUrl));

         return doc;
     }

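With the code above, we convert a news item into a Lucene Document object so that it can be indexed. The code also introduces Lucene's Field class. A Document is like a record in a database: it has many fields, and each field is a Field object.

Here is a description of Field quoted from another article (http://www.tkk7.com/cap/archive/2005/07/17/7849.html):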
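[quote]
    Type                                Analyzed  Indexed  Stored  Description
    Field.Keyword(String,String/Date)   N         Y        Y       Values that are searched on directly (ID, name, date, etc.)
    Field.UnIndexed(String,String)      N         N        Y       Information that is not searched on but must be displayed after a search, e.g. a hardware serial number or a document's URL
    Field.UnStored(String,String)       Y         Y        N       Large bodies of text that are searched on but need not be read back from the index; the real content can be loaded via the URL
    Field.Text(String,String)           Y         Y        Y       Content needed for both searching and retrieval, stored directly in the index, although this makes the index larger
    Field.Text(String,Reader)           Y         Y        N       Given a Reader, Lucene assumes the content is large and uses the UnStored strategy
[/quote]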

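The news ID is searched on directly, so it is a Keyword field. The title is needed for both searching and display, so it is a Text field. The content is HTML, so after it has been run through the parser it is added as an UnStored field. The time is searched on directly, so it is a Keyword field. So that users can reach the full news page from a search hit, there is also an UnIndexed field holding the access URL.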
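
As a quick cross-check of these choices against the table, the sketch below encodes each row's three flags in a plain Java enum and lists the field plan used for a news item. It does not call Lucene; `FieldKind` and `NewsFieldPlan` are hypothetical names introduced only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** The five rows of the table, reduced to their three yes/no flags. */
enum FieldKind {
    KEYWORD(false, true, true),
    UNINDEXED(false, false, true),
    UNSTORED(true, true, false),
    TEXT(true, true, true),
    TEXT_READER(true, true, false);

    final boolean analyzed; // run through the Analyzer before indexing
    final boolean indexed;  // can be searched on
    final boolean stored;   // original value can be read back from the index

    FieldKind(boolean analyzed, boolean indexed, boolean stored) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }
}

/** Field choices made for a news item in makeNewsSearchDocument. */
class NewsFieldPlan {

    static Map<String, FieldKind> plan() {
        Map<String, FieldKind> plan = new LinkedHashMap<>();
        plan.put("nid", FieldKind.KEYWORD);
        plan.put("title", FieldKind.TEXT);
        plan.put("content", FieldKind.UNSTORED);
        plan.put("addtime", FieldKind.KEYWORD);
        plan.put("other", FieldKind.UNSTORED);
        plan.put("visiturl", FieldKind.UNINDEXED);
        return plan;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, FieldKind> e : plan().entrySet()) {
            FieldKind k = e.getValue();
            System.out.println(e.getKey() + ": searchable=" + k.indexed
                    + ", readable from index=" + k.stored);
        }
    }
}
```

Read this way, `content` is searchable but cannot be read back from a hit, which is why the separate `visiturl` field is stored: it is the only route from a search result to the full article.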

(HTML parsing is explained later.)

Indexing a single news item takes two steps: get the Document, then pass it to the makeIndex function. The code is as follows:

    public static void makeNewsInfoIndex(NewsItem aNews)
    {
        if (null == aNews)
        {
            return;
        }
        makeIndex(makeNewsSearchDocument(aNews),indexDir);
    }

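That completes index creation: just call makeNewsInfoIndex(newsitem); after adding a news item and its index entry is created.

If a news item is deleted, its index entry must be deleted too. Deletion is done through the IndexReader class: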

    /**
     * Delete index entries.
     * @param aTerm the condition selecting the entries to delete
     * @param indexDir the index directory
     */
    public static void deleteIndex(Term aTerm, String indexDir)
    {
        List aList = new ArrayList();
        aList.add(aTerm);
        deleteIndex(aList, indexDir);
    }
    /**
     * Delete index entries.
     *
     * @param terms the conditions selecting the entries to delete
     * @param indexDir the index directory
     *
     */
    public static void deleteIndex(List terms, String indexDir)
    {
        if (null == terms)
        {
            return;
        }

        if(!indexExist(indexDir)) { return; }

        IndexReader reader = null;
        try
        {
            reader = IndexReader.open(indexDir);
            for (int i = 0; i < terms.size(); i++)
            {
                Term aTerm = (Term) terms.get(i);
                if (null != aTerm)
                {
                    reader.delete(aTerm);
                }
            }
        }
        catch (IOException e)
        {
            LogMan.warn("Error in Delete Index", e);
        }
        finally
        {
            try
            {
                if (null != reader)
                {
                    reader.close();
                }
            }
            catch (IOException e)
            {
                LogMan.warn("Close reader Error", e);
            }
        }
    }

Deleting from the index requires a condition, similar to a column condition in a database. For example, the code to delete one news item is:

     public static void deleteNewsInfoIndex(int nid)
     {
         Term aTerm = new Term("nid", String.valueOf(nid));
         deleteIndex(aTerm,indexDir);
     }

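With the news ID, a single news item can be deleted from the index.

What if a news item is updated; how do we update the index? Updating takes two steps: delete the old entry, then create a new one. It is simply a combination of the code above. For example, to update one news item: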

     public static void updateNewsInfoIndex(NewsItem aNews)
     {
         if (null == aNews)
         {
             return;
         }
         deleteNewsInfoIndex(aNews.getNid());
         makeNewsInfoIndex(aNews);
     }
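
Why the delete step matters can be seen in a toy in-memory model. The sketch below is not Lucene and does not reflect how Lucene stores documents; `ToyIndex` and its methods are hypothetical. It only shows that adding a new version without first deleting by key leaves two entries with the same `nid`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Toy in-memory stand-in for an index; it only mimics add and delete-by-term. */
class ToyIndex {

    // Each document is a map of field name -> value, like a row with columns.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    /** Removes every document whose field equals the value; returns how many were removed. */
    int delete(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** Update = delete by key, then add the new version. */
    void update(String keyField, Map<String, String> doc) {
        String key = Objects.requireNonNull(doc.get(keyField), "missing key field");
        delete(keyField, key);
        add(doc);
    }

    int size() {
        return docs.size();
    }

    /** Collects the value of one field from every document, in insertion order. */
    List<String> values(String field) {
        List<String> out = new ArrayList<>();
        for (Map<String, String> d : docs) {
            out.add(d.get(field));
        }
        return out;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> v1 = new HashMap<>();
        v1.put("nid", "7");
        v1.put("title", "old title");
        index.add(v1);

        Map<String, String> v2 = new HashMap<>();
        v2.put("nid", "7");
        v2.put("title", "new title");
        index.update("nid", v2);   // delete + add: still one entry for nid 7
        System.out.println(index.size() + " " + index.values("title"));

        index.add(v2);             // add without delete: a duplicate entry appears
        System.out.println(index.size() + " " + index.values("title"));
    }
}
```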

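That wraps up creating, updating, and deleting index entries. The code for updating news items in bulk is as follows:
(Bulk updates should run when traffic is low, or as a background job at night.)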

    public static void makeAllNewsInfoIndex(List newsList)
    {
        List terms = new ArrayList();
        List docs = new ArrayList();
        for (int i = 0; i < newsList.size(); i++)
        {
            NewsItem aitem = (NewsItem) newsList.get(i);
            if (null != aitem)
            {
                terms.add(new Term("nid", String.valueOf(aitem.getNid())));
                docs.add(makeNewsSearchDocument(aitem));
            }
        }

        deleteIndex(terms,indexDir);
        makeIndex(docs,indexDir);
    }

The next part explains how to parse the content to be indexed, for example HTML.

posted on 2005-08-12 17:31 Scud(飛云小俠) Views(1077) Comments(1) Category: Java

