<rt id="bn8ez"></rt>
<label id="bn8ez"></label>

  • <span id="bn8ez"></span>

    <label id="bn8ez"><meter id="bn8ez"></meter></label>

    gembin

    OSGi, Eclipse Equinox, ECF, Virgo, Gemini, Apache Felix, Karaf, Aires, Camel, Eclipse RCP

    HBase, Hadoop, ZooKeeper, Cassandra

    Flex4, AS3, Swiz framework, GraniteDS, BlazeDS etc.

    There is nothing that software can't fix. Unfortunately, there is also nothing that software can't completely fuck up. That gap is called talent.

    About Me

     

    XQuery Search and Update

    Abstract

    Abstract

    The XQuery 1.0 specification describes a powerful language for accessing and manipulating XML. In its initial form XQuery lacks two interesting features: fast token-based text search and node-level update. Specifications for these features are under development at the W3C with release dates after XQuery 1.0. This paper looks at the mechanisms by which XQuery can support search and update based on our experience at Mark Logic implementing these features.

    Are Search and Update Needed?

    To begin, let us ask the question, are search and update capabilities needed in an XQuery environment? The answer depends on the use case. When using XQuery to federate queries against data-oriented RDBMS stores, text-based search isn't required and updates can be handled through direct SQL calls. The XQuery in this situation acts as a read-only view, a convenience rather than a primary access and manipulation language.

    In other use cases the need can be acute. Take for example O'Reilly Media, a Mark Logic customer with a custom publishing site called SafariU that's targeted at helping university professors create custom course readers. Professors access book and article content (held as Docbook within Mark Logic) by performing advanced searches to locate items of interest -- chapters, sections, or code examples. Professors gather together these items into a new book, reviewing HTML and later PDF copies, and eventually receiving a stack of their custom books at the university bookstore.

    SafariU requires fast text search so professors can find content to select and update to track the new hierarchical book structure and content usage for royalty reports. In fact, before the first professor visits SafariU, O'Reilly uses XQuery update calls to "decorate" the book and article content: adding <indexterm> elements on articles, adding estimated page counts to content pieces, etc.

    Mark Logic has opted to provide search and update using function calls -- rather than language extensions -- so as to conform to the XQuery standard as much as possible. We have been able to make search and update calls efficient, executing quickly even against gigabyte documents in terabyte collections. In the rest of this paper we'll explain the Mark Logic approaches to search and update and compare and contrast that approach with the preliminary W3C proposals.

    Section Title

    If XQuery search were to provide only what a typical search engine provides, it would be fairly easy. You specify the search word or phrase, possibly add a basic document selection constraint, and retrieve a relevance ordered list of documents containing the search term(s). Using the Mark Logic search function calls and the input() function to return a sequence of all document nodes, the expression would look like this:

     

     
    cts:search(input(), "word or phrase")
     

     

    Yet XQuery enables so much more. First, XQuery has direct access to and can utilize the XML document structure. This lets you determine what subset of content should be searched. As shown in Example 1, you can look for the most relevant <sect1> blocks discussing "cougars" where the <sect1> blocks must be within chapters about "veterinary medicine" in books published after 2003.

     

    Example 1: A Structure-aware Search
     
    cts:search(
    input()/book[meta/pubdate >= 2003]
           
    //chapter[cts:contains(title, "veterinary medicine")]//sect1,
    "cougars
    )
     

    Second, XQuery can access the items returned from each search and programmatically manipulate each item for rendering or to feed into another subsequent search. In the cougar example, the search expression might return a sequence of <sect1> elements, then the query might continue on to extract all the figure blocks within those sections for display, along with any paragraph linking to the figure. Example 2 shows what that query would look like, and how a theoretical print-references() might be implemented:

     

    Example 2: Manipulating Search Results
     
    for $s in
    cts:search(...)[1 to 10]
    for $f in $s//figure
    return (
    print-figure($f),
    print-references($f)
    )
     
    define function print-figure($f as element(figure)) { ... }
     
    define function print-references($f as element(figure))
    {
    let $id := $f/@id
    for $link in $f/ancestor::book//link[@linkend=$id]
    return
    <p>
     
    { string-join($link/ancestor::para[1]//text(), " ") }
    </p>
    }
     

     

    The power to control what's searched and what's returned proves especially valuable when documents have useful semantic structure. Figure 1 shows a screen shot of a query for "tennis elbow" executed against a collection of medical textbooks:

    The results were rendered with custom XQuery logic. Start at the most relevant <para> elements, walk up the document hierarchy to the nearest containing section, then render that block as HTML. Clickable breadcrumbs were added based on the full ancestral history of the matching <para>. The search returns paragraphs, but for the end result we've opted to include full sections.

    By comparison, a search engine presented with a large number of Docbook files would return simply URL pointers to which files were most relevant. Any post processing would be entirely manual and external to the search engine.

    Executing a Search

    Mark Logic provides two functions to execute searches: cts:contains() and cts:search(). The "cts" namespace stands for "core text search" and is a standard extension in Mark Logic.

    The cts:contains() call is the simpler of the pair. It returns a boolean true() if the node passed as the first argument contains the word or phrase specified by the second argument. Example 3 returns all the programlistings within any example element whose title contains "black and white" within any book element whose title contains "java servlet". The query executes quickly based on indexes of both content and structure.

    Example 3: Searching content and structure
    input()/book[cts:contains(title, "java servlet")]
         
    //example[cts:contains(title, "black and white")]
        
    //programlisting
     

    Those familiar with XQuery may recall a standard fn:contains() function. The standard function operates on a character level rather than token (word), so fn:contains(title, "foo") somewhat unexpectedly matches the word "food". The cts:contains(title, "foo") call would not, because it matches tokens.

    The cts:contains() function works well for black and white searches. By this I don't mean searches for "black and white", but rather those cases where your item is either in or it's out. There's no relevance rank ordering. For that you use cts:search().

    The cts:search() call returns a result sequence in decreasing relevance order. In common usage it accepts two arguments, an expression to be searched and a query specifying what search to perform. Below is the Example 1 search query shown earlier:

     
    cts:search(
    input()/book[meta/pubdate >= 2003]
          
    //chapter[cts:contains(title, "veterinary medicine")]//sect1,
    "cougars"
    )[1 to 10]
     

     

    The first argument limits the search domain to <sect1> elements within specific chapters within specific books. The second argument indicates the search criteria. It's the search criteria that influences the relevance ranking. The following two queries are very different:

     
    //title[cts:contains(., "XQuery search")]
     
     
    cts:search(//title, "XQuery search")
     

    The first returns titles that contain "XQuery search", in document order. The second returns titles that contain "XQuery search", in decreasing relevance order.

    The second argument to cts:search() can accept not just simple words and phrases but cts:query instances. Through a cts:query instance you can apply an arbitrarily complex and nested search criteria. Each criteria can provide search options (case insensitivity, stemming enablement, wildcard enablement, language) and a term weight (an indicator of this term's relative importance in the search, between 0.0 and 1.0).

    You build a cts:query instance through of a series of "cts" function calls. Example 4 shows several examples:

     

    Example 4: Query Examples
     
    (: Simple strings are treated as
    cts:word-query() :)
    cts:word-query("XQuery search")
     
    (: How to use a
    cts:word-query() :)
    cts:search(input(),
    cts:word-query("rhinos"))
     
    (: Indicates to use stemming so "searching" would be a match also :)
    cts:word-query("XQuery search", "stemmed")
     
    (: Either wine color is acceptable, red 2x preferred :)
    cts:or-query(
    cts:word-query("white wine", "stemmed", 0.5),
    cts:word-query("red wine", "stemmed", 1.0)
    )
     
    (: Both explorers should appear :)
    cts:and-query(
    cts:word-query("Lewis", "case-sensitive"),
    cts:word-query("Clark", "case-sensitive")
    )
     
    (: Same as previous. Capitalized words trigger case sensitivity. :)
    cts:and-query(
    cts:word-query("Lewis"),
    cts:word-query("Clark")
    )
     
    (: Look for "spring" with relevance weighting by location :)
    cts:or-query(
    cts:element-word-query(xs:QName("title"), "spring", (), 1.0),
    cts:element-word-query(xs:QName("subtitle"), "spring", (), 0.3),
    cts:element-word-query(xs:QName("para"), "spring", 0.1)
    )
     
    (: Use
    cts:near-query() to specify a term distance :)
    cts:near-query(
    (cts:word-query("white wine", "stemmed", 0.5),
    cts:word-query("red wine", "stemmed", 1.0)),
    2)
     

    Scoring a Search

    The cts:search() call returns results in (decreasing) relevance order. You can quantify each item's relevance using the cts:score() function, as shown in Example 5:

    Example 5: Scoring Results
     
    for $x in
    cts:search(//para, "spring break")
    return <p score="{cts:score($x)}"> { $x/* } </p>
     

    A score is a non-negative integer, unbounded and increasing logarithmetically. The result of calling xdmp:score() for a given nodes depends on the structure of the most recent function call which returned the node as a value. This implementation of score violates XQuery side-effect free-ness. On the other hand, it appears to be the only feasible approach to scoring which is compatible with the current XQuery grammar. Recently the Free Text Task Force of the W3C XQuery Working Group has published a draft recommendation for XML free text search which includes a grammatical recommendation for managing relevance scoring. We will describe this is the next section and compare it to the Mark Logic infrastructure.

    Relevance scores are computed using a "tf-idf" formula (tf-idf stands for term frequency, inverse document frequency) with byte-length normalization. Element nodes with a higher occurrence of a given word or tag will score higher than nodes of the same size with a lower occurrence of the given word or tag. Between two element nodes with the same number of occurrences of a given word or tag, the smaller element will normalize to larger score, and therefore a higher relevance. Between two words or tags occurring with the same frequency in a given element, the word or tag which is less common in the collection will make a bigger contribution to the relevance score.

    Relation of Mark Logic Search to the W3C FTTF Activity

    The Free Text Task Force of the W3C XQuery Working Group has published a draft recommendation for XQuery text search. The recommendation describes a surface syntax for XML search which integrates with the existing XQuery grammar. The basic component of this syntax is the "ftcontains" predicate relation. This predicate maps very closely to the Mark Logic cts:contains() predicate. For example, the FTTF proposed search expression:

     
    /books/book[title
    ftcontains ("dog" with stemming) && "cat"]/author
     

    maps directly to the Mark Logic syntax

     
    /books/book[cts:contains(title,
     
    cts:and-query((cts:word-query("dog","stemmed"),"cat")))]/author
     

    The Mark Logic search function API acts effectively as an implementation API for the W3C draft recommendation. It would be premature to start implementing the FTTF grammar given the lack of maturity and large number of unresolved issues in that recommendation.

    The FTTF grammar also uses the "ftcontains" predicate relation as an integral part of the FLWOR expression. For example, the set of author children of book elements, where the book element has a title child containing both "dog" and "cat":

     
    for $b in //book
    where $b/title
    ftcontains "dog" && "cat"
    return $b/author
     

    maps to the Mark Logic function API:

     
    for $b in
    cts:search(//book,
     
    cts:element-query(QName("title"),
    cts:and-query(("dog","cat"))))
    return $b/author
     

    A key difference, however, is that the Mark Logic cts:search() function returns results in (decreasing) relevance order by default, whereas the FLWOR expression returns results in document order unless an explicit order by clause is included in the expression. The fully equivalent FTTF expression would be:

     
    for $b in //book
    score $s as $b/title
    ftcontains "dog" && "cat"
    where $s > 0
    order by $s descending
    return $b/author
     

    which is surprisingly more verbose than the lower-level Mark Logic implementation. De-coupling the search and the relevance expression is potentially very useful, but in practice the two expressions are almost always equal. The EBNF grammar allows that the right-hand-side of the "ftcontains" relation may be any expression, but the specification restricts the expression to be a Boolean combination of ft expressions.

    Considering FT Expressions

    FT search expressions appear on the right hand side of the ftcontains predicate. They are composed from the following primitives, which are almost all supported directly by the Mark Logic search function API:

    • Primitives
      • Boolean And, Or, Not -- supported;MildNot -- not supported

      • Proximity – supported

      • Case and diacritic sensitivity -- supported

      • Stemming and thesaurus options -- supported

      • Stop word options -- not supported

      • Language options -- supported

      • Wildcard matching -- supported

      • Match incidence counting -- not supported

      • Ordered matching -- supported

      • "at start" and "at end" anchors -- supported with regexp matching

      • "words", "sentences", and "paragraph" scope -- supported with appropriate markup

     

    From the FTTF Mildnot specification:

    If I want to find articles that mention Mexico, I might search for '"Mexico" mild not "New Mexico"'. '"Mexico" mild not "New Mexico"' matches any Expr that contains Mexico on its own. An Expr that contains "New Mexico" is not "excluded" from the result - it may mention "Mexico" as well.

    (The specification probably means to say "node" where it says "Expr" as one does not match expressions, one evaluates expressions to obtain a set of nodes which may match some search condition.)

     

    This seems like a nice idea, and it will be considered for subsequent releases of the Mark Logic server.

    Conversely, the Mark Logic search system includes important concepts which are not present in the FTTF draft recommendation. The Mark Logic server may be configured to "phrase through", "phrase around", or "phrase into" a given set of markup. Phrasing through has the effect of making markup "transparent". For example, inline emphasis tags should not interrupt the phrase structure of a text search.

     
    You <b>phrase through</b> a bold tag.
     

    Phrasing around has the effect of "ignoring" markup which interrupt the phrase structure. For example, inline footnotes should not contribute to the phrase structure of the footnoted text.

     
    A GPS <footnote>Global Positioning System</footnote> device should match a search for "GPS device".
     

    Lastly, there are cases when the best behavior is to allow the phrase structure to descend recursively into the string value of the subtree below a given element node.

    Mark Logic Update

    Mark Logic provides a handful of functions to enable document-level and node-level update. A single query can execute any number of these functions.

    Three basic functions perform document-level updates. These functions all use the Mark Logic standard "xdmp" namespace prefix, as shown in Example 6.

    Example 6: Document-level Updates
     
    (: Loads a document from the
    filesystem to the specified URI.
    The URI is what would be used as a parameter to
    fn:doc().
     
    Replaces any existing document at that URI. :)
    xdmp:load($filepath, $uri)
     
    (: Loads a dynamically constructed document or element node :)
    xdmp:document-insert($uri, $root as node())
     
    (: Deletes the specified document :)
    xdmp:document-delete($uri)
     

    Three other functions provide node-level insert abilities. Through the three basic calls shown in Example 7 a node can be added at any location:

    Example 7: Node-level Inserts
     
    (: Inserts the new node as the last child of the parent :)
    xdmp:node-insert-child($parent as node(), $new as node())
     
    (: Inserts the node as a following sibling to the reference node :)
    xdmp:node-insert-after($sibling as node(), $new as node())
     
    (: Inserts the node as a preceding sibling to the reference node :)
    xdmp:node-insert-before($sibling as node(), $new as node())
     

    A final pair of functions in Example 8 provide the rest of the node-level updates:

    Example 8: Node-level Replacements
     
    (: Replace the old with the new :)
    xdmp:node-replace($old as node(), $new as node())
     
    (: Delete the node (and its descendents) from the document :)
    xdmp:node-delete($old as node())

    There are additional methods to get and set document properties (used for WebDAV), to add and remove documents from collections, and to alter document permissions. Example 9 utilizing several of these calls appears below. It's an implementation of a named queue:

    Example 9: Named Queue
     
    define function
    enqueue($name as xs:string, $i as item()) as empty() {
    (: wrap $i in an element() so it can be added to a doc,
    : whether it's already an element or not: works for attributes, etc.
    :)
    let $n := <entry>{$i}</entry>
    let $queue := doc($name)/queue
    return
    if (exists($queue))
      
    then
    xdmp:node-insert-child($queue, $n)
      
    else
    xdmp:document-insert($name, <queue>{$n}</queue>)
    }
     
    define function
    dequeue($name as xs:string) as item()? {
    (: get the first child element under the queue root, and unwrap it :)
    let $node := doc($name)/queue/entry[1]
    return
     
    if (exists($node))
        
    then (xdmp:node-delete($node), $node/@*, $node/node())
        
    (: if there's nothing on the queue, return the empty sequence :)
        
    else ()
    }
     

    Everything is straightforward in this example except perhaps the return value from dequeue(). The "then" clause returns a sequence of three items: the result of the xdmp:node-delete() call (always empty), the $node/@* (any possible attribute on the <entry> in case the item stored was an attribute), and $node/node() (any nodes under the <entry> in case the item stored was an element, PI, text node, etc).

    To conform to the spirit of the XQuery specification in disallowing side effects, updates are not made visible to the query environment executing the update. This requires some getting used to, but doesn't hinder productivity.

    Updates occur after the query completes without error. Should a query produce an uncaught error, the set of updates from that query do not occur. Mark Logic does not yet provide explicit commit/rollback support. However, Mark Logic provides a try/catch expression so that errors can be caught and processed while allowing execution to continue. Its syntax:

     
    try {
    $try-expression
    }
    catch ($exception as element()) {
    $catch-expression
    }
     

     

    Example 10 loads files from a list of filenames, reporting errors but not letting any errors stop the other files from being loaded:

    Example 10: Handling Load Errors
     
    for $file in $files
    return
    try {
      
    <li>
     
    { xdmp:load($file, $file) }
     
    { $file } loaded
      
    </li>
    }
    catch ($exception) {
      
    <span>
        
    Problem while loading { $file }:<br/>
        
    <i xmlns:e="http://marklogic.com/xdmp/error">
          
    {$exception/e:message/text()}
        
    </i>
      
    </span>
    }
     

    ACID Semantics

    Mark Logic updates satisfy a subset of ACID semantics. Below are some definitions conveniently extracted from the Wikipedia that define ACID semantics.

    • ACID Semantics
    • Atomicity refers to the ability of the DBMS to guarantee that either all of the tasks of a transaction are performed or none of them are.
    • Consistency refers to the database being in a legal state when the transaction begins and when it ends. This means that a transaction can't break the rules, or integrity constraints, of the database.
    • Isolation refers to the ability of the application to make operations in a transaction appear isolated from all other operations. This means that no operation outside the transaction can ever see the data in an intermediate state.
    • Durability refers to the guarantee that once the user has been notified of success, the transaction will persist, and not be undone. This means it will survive system failure, and that the database system has checked the integrity constraints and won't need to abort the transaction. Typically, all transactions are written into a log that can be played back to recreate the system to its state right before the failure. A transaction can only be deemed committed after it is safely in the log.

    The Mark Logic Server evaluates an XQuery module by spawning an evaluation thread which operates on a snapshot of the database. All the updates appearing in the module are collected and a change vector computed and journaled. The XQuery module has an implied commit which guarantees that either all the changes or none of the changes become persistent. Update evaluation guarantees atomicity, isolation, and durability. Consistency is satisfied, but only in the weak sense, since integrity constraints are not explicitly part of the Mark Logic architecture. Integrity is expressed as de-normalized documents: all the elements collected in one document are maintained as a consistent collection. Isolation is maintained in the strongest sense: each query evaluation thread operates on a static snapshot, and all the updates refer to the original values, even if an update sub-expression modifies one of the values appearing elsewhere in the XQuery module.

    W3C XQuery Update Activity

    The W3C XQuery Working Group has published a draft requirements document for XML XQuery update. This draft is available at http://www.w3.org/TR/2005/WD-xquery-update-requirements-20050211/. The pertinent parts for this paper are the basic functional requirements and the ACID semantics requirements.

    W3C XQuery Update Functional Requirements

    From the specification:

    Locus of modifications -- supported

    The XQuery Update Facility MUST be able to change the properties of existing nodes while preserving their identity. The XQuery Update Facility MUST be able to create a new copy of a node with a specific set of changes.

    Delete -- supported

    The XQuery Update Facility MUST be able to delete nodes.

    Insert -- supported

    The XQuery Update Facility MUST be able to insert new nodes in specified positions.

    Replace -- supported

    The XQuery Update Facility MUST be able to replace a node.

    Changing values -- supported

    The XQuery Update Facility MUST be able to change the value returned by the typed-value accessor for a node.

    Modifying properties -- not supported

    The XQuery Update Facility SHOULD be able to modify some of the properties of a node such as the name, type, content, nilability, base-URI, etc.

    Moving nodes -- not supported

    The XQuery Update Facility MAY be able to move a node from one location to another.

    Conditional updates -- supported

    The XQuery Update Facility MUST be able to do conditional updates.

    Iterative updates -- supported

    The XQuery Update Facility MUST be able to iterate over nodes to do updates.

    Validation -- not supported

    The XQuery Update Facility MAY support an explicit XML Schema validation operation that preserves node identity.

    Compositionality -- supported

    The XQuery Update Facility MUST be able to compose update operators with other update operators. The XQuery Update Facility MAY be compositional with respect to XQuery expressions; that is, it may be possible to use an update wherever an XQuery expression is used.

    Parameterization -- supported

    The XQuery Update Facility SHOULD provide a means to parameterize update operations.

    W3C XQuery Update Semantics Requirements

    From the specification:

    Atomicity -- supported

    The XQuery Update Facility MUST provide a set of atomic operations, and MUST define a means to group atomic operations into an atomic execution unit.

    Consistency -- supported

    At the end of an outermost update operation (that is, an update operation invoked from the external environment), the data model MUST be consistent with respect to the constraints specified in the Data Model. In particular, all type annotations MUST be consistent with the content of the items they govern. The XQuery Update Facility MAY define additional levels of granularity at which Data Model constraints are enforced.

    Isolation -- supported

    The XQuery Update Facility SHOULD define the means by which operations can be isolated from concurrent operations.

    Durability -- supported

    The XQuery Update Facility MUST define a means to control the durability of atomic operations and atomic execution units.

    Conclusion

    Interested parties can download evaluation and community licensed copies of Mark Logic Server from http://xqzone.marklogic.com. Full function documentation for Mark Logic search and update functions can be found at http://xqzone.marklogic.com/pubs/.

     

    Biography

    Jason Hunter

    Mark Logic

    Jason Hunter works as a Lead Applications Engineer at Mark Logic. He's author of Java Servlet Programming (O'Reilly). He's a member of the expert groups responsible for Servlet, JSP, JAXP, and XQJ API development, and has assisted the W3C XQuery Working Group. He is also an Apache Member and as Apache's representative to the Java Community Process Executive Committee he established a landmark agreement allowing open source Java. He co-created the open source JDOM library to enable optimized Java and XML integration.

    posted on 2008-07-29 17:15 gembin 閱讀(775) 評論(0)  編輯  收藏 所屬分類: DatabaseXML 、XQuery

    導航

    統計

    常用鏈接

    留言簿(6)

    隨筆分類(440)

    隨筆檔案(378)

    文章檔案(6)

    新聞檔案(1)

    相冊

    收藏夾(9)

    Adobe

    Android

    AS3

    Blog-Links

    Build

    Design Pattern

    Eclipse

    Favorite Links

    Flickr

    Game Dev

    HBase

    Identity Management

    IT resources

    JEE

    Language

    OpenID

    OSGi

    SOA

    Version Control

    最新隨筆

    搜索

    積分與排名

    最新評論

    閱讀排行榜

    評論排行榜

    free counters
    主站蜘蛛池模板: 亚洲Av无码一区二区二三区| 亚洲精品无码久久久久牙蜜区| 亚洲午夜无码久久久久小说| 国产黄色免费观看| 国产电影午夜成年免费视频| 亚洲国产一区视频| 亚洲一区免费视频| 伊人久久大香线蕉免费视频| 成年轻人网站色免费看| 亚洲阿v天堂在线| 久久久久久久久无码精品亚洲日韩 | 亚洲AV无一区二区三区久久| 亚洲人成人网站18禁| 免费人成黄页在线观看日本| 在线观看免费国产视频| 亚洲欧洲第一a在线观看| 美女隐私免费视频看| 999国内精品永久免费视频| 亚洲日本一区二区三区在线| 亚洲精品宾馆在线精品酒店| 99精品一区二区免费视频| 亚洲国产精品自产在线播放| 亚洲三级高清免费| 日韩精品无码一区二区三区免费| 亚洲?V无码成人精品区日韩| 亚洲色图激情文学| 91在线手机精品免费观看| 三上悠亚亚洲一区高清| 亚洲AV无码一区二区三区网址| 亚洲美女视频免费| 亚洲综合精品香蕉久久网| 18禁亚洲深夜福利人口| 2021久久精品免费观看| 亚洲AV日韩AV天堂久久| 一级黄色免费网站| 四虎影库久免费视频| 亚洲最大的成人网| 青草草色A免费观看在线| 久久亚洲国产成人精品性色| 久久久久久久国产免费看| 免费人成视频在线观看视频|