A Storage policy is a set of rules and their values. A rule can be set for each XML element and it defines how this XML element is treated while storing or retrieving data.
Each XML element can have its own rule, but does not have to. If an XML element does not have its own rule, the rule is inherited from an ancestor. If no rule is set for an XML element or for any of its ancestors, the element is still stored and retrieved, but no specific action is performed on it.
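The inheritance behaviour described above can be sketched in Python. This is an illustration only, not Clusterpoint's implementation: the `effective_rule` function and the path-keyed `policy` dictionary are hypothetical names introduced for the example, and the sketch assumes that the nearest ancestor with a rule wins.

```python
from typing import Optional


def effective_rule(policy: dict[str, dict[str, str]], path: str, rule: str) -> Optional[str]:
    """Return the value of `rule` for the element at `path`, inheriting from ancestors.

    `policy` maps element paths such as "/document/text" to the rules set
    directly on that element (hypothetical structure, for illustration only).
    """
    parts = path.strip("/").split("/")
    # Walk from the element itself up to the root, nearest element first.
    while parts:
        current = "/" + "/".join(parts)
        rules = policy.get(current, {})
        if rule in rules:
            return rules[rule]
        parts.pop()
    # No rule on the element or any ancestor: no specific action applies.
    return None


policy = {
    "/document": {"document": "yes", "index": "text"},
    "/document/info": {"index": "no"},
}
print(effective_rule(policy, "/document/text/p", "index"))  # inherited from /document
print(effective_rule(policy, "/document/info", "index"))    # set on the element itself
print(effective_rule(policy, "/document/text", "list"))     # not set anywhere
```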
The following table lists all possible rules and their possible values.
|Rule||Value||Action applied to document part|
|document||no (default)||This is not the root part of the document.|
|yes||This is the root part of the document. The contents of this part are treated as a single Clusterpoint Server document.|
|id||no (default)||This part is not the identifier of the document.|
|yes||This part is the identifier (ID) of the document (i.e. the primary key). An ID can be a simple integer, an alphanumeric character string, a full file path on a file server including the file name, the URL of a Web page, or anything else that uniquely identifies a document. Each document must have exactly one ID, and duplicate IDs may not exist.|
|group||no (default)||This part does not denote the group a document belongs to.|
|yes||This part denotes the group a document belongs to.|
|index||no (default)||The contents of this part will be stored in the document repository and available for retrieval, but will not be indexed.|
|text||Textual information contained within this part is indexed and made available for FTS.|
|xml||Textual information contained within this part is indexed preserving XML markup. In this case FTS can only be performed according to the XML markup.|
|xml-text||Permits grouping the descendants of a document part under a single search path when performing search within markup. This can replace some OR operations.|
|facet, sparse_facet||Allows this part to be used for faceted navigation or classification. Faceted classification allows the assignment of multiple classifications to an object, enabling the classifications to be ordered in multiple ways rather than in a single, predetermined order. Documents can later be filtered or accessed using XPath-like expressions relative to this part. For more information see Faceted search. The "sparse_facet" value uses a different storage scheme, which is much more efficient for facet paths that are present in only a small subset of documents.|
|xml-text&xml&text&facet||Switches on all index modes. A selection of modes can be enabled by joining them with the '&' symbol.|
|alias||any valid XML element name||If an alias is defined for a document part, the index will also record the contents of this part as if located in an XML element named after the alias. Multiple document parts can share the same alias, which effectively creates an OR operation across them. Aliases can be used when performing search within markup.|
|tag1&tag2||To set multiple alias tags, use the '&' symbol as a separator.|
|index-dates||no (overrides default configuration)||Dates are not indexed independently of text in this part.|
|yes (overrides default configuration)||Dates are indexed independently of text in this part. If there is more than one date in this element, only the last date is indexed. The available formats are: YYYY/MM/DD [HH:MM:SS [am/pm]]; MM/DD/YY[YY] [HH:MM:SS [am/pm]]; DD Month YYYY [HH:MM:SS [am/pm]]; Month DD YYYY [HH:MM:SS [am/pm]].|
|index-numbers||no (overrides default configuration)||Numbers are not indexed independently of text in this part.|
|yes (overrides default configuration, number is treated as float)||Numbers are indexed independently of text in this part. This part must contain a numeric value.|
|int (overrides default configuration, number is treated as integer)||Numbers are indexed independently of text in this part. This part must contain a numeric value.|
|abs / absolute||The contents of the tag are indexed using the absolute path from the document root, including the document root tag. Queries must then specify an XPath that starts at the document root. This can be used together with any indexing policy.|
|rel / relative (default)||The contents of the tag are indexed using the path relative to the tag where an indexing policy was last set, including that tag itself. Queries must then specify an XPath relative to that tag. This can be used together with any indexing policy.|
|exact-match||binary||The contents of the tag are indexed byte-for-byte for exact matching purposes.|
|text||The contents of the tag are indexed as a set of words for exact matching purposes. Punctuation and other marks are ignored. Matching is case-insensitive.|
|all||All of the above|
|none (default)||The tag will not be indexed for exact match|
|stem-lang||lv||Specifies that the tag's contents should be stemmed in Latvian for the purposes of stemming search.|
|en||Specifies that the tag's contents should be stemmed in English.|
|fr, es, pt, it, ro, de, nl, sv, no, da, fi, hu, tr||Will stem in French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Finnish, Hungarian and Turkish|
|empty (default)||The contents of the tag will not be available for stemmed search.|
|weight||<min–max>||This rule only works together with the index rule set to text, xml, or all. Weight is an integer number in a range from 1 to 100. All words in this part are explicitly set to be relevant to a certain extent to the corresponding search term when performing FTS. If only a single number is set here, min and max are equal to it. For information on how weight affects search results, see Result ordering and grouping.|
|list||no (default)||This part will not be listed in the search results.|
|yes||This part will be listed in the search results.|
|highlight||This part will be listed in the search results and the search terms within this part will be highlighted.|
|snippet||A snippet (short extract) from this part will be shown in the search results. The search terms will be highlighted.|
|any valid XML element name||Defines a new tag name to use when listing documents (search, similar, etc.). Retrieve will return the original tag names. Please see the detailed description for more information. Can be overridden later using the list query parameter.|
|coll-lang||en, lv, ru, or any other POSIX locale without the charset part||Enables collated string indices for alphabetic ordering.|
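Two value formats in the table lend themselves to a short illustration: values joined with '&' (as for index and alias) and the weight range. The following Python sketch shows one way a client-side tool might parse them. The function names are hypothetical, the check that min does not exceed max is an assumption rather than something the table states, and only a plain hyphen is accepted as the range separator.

```python
def parse_modes(value: str) -> set[str]:
    """Split an '&'-joined rule value such as 'xml&text' into its parts."""
    return {part.strip() for part in value.split("&") if part.strip()}


def parse_weight(value: str) -> tuple[int, int]:
    """Parse a weight value: either 'min-max' or a single number."""
    low_text, sep, high_text = value.partition("-")
    low = int(low_text)
    # A single number means min and max are equal.
    high = int(high_text) if sep else low
    # The 1..100 range comes from the table; min <= max is an assumption.
    if not (1 <= low <= high <= 100):
        raise ValueError(f"weight must satisfy 1 <= min <= max <= 100, got {value!r}")
    return low, high


print(sorted(parse_modes("xml-text&xml&text&facet")))
print(parse_weight("65-90"))
print(parse_weight("50"))
```

Note that `parse_modes` splits only on '&', so a value such as xml-text, which contains a hyphen, stays intact.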
Setting Storage Policy
One comprehensive Storage policy can be set for the whole Clusterpoint Storage by using the Clusterpoint Manager, or specific rules and their values can be set for individual XML documents while importing.
For information on using the Clusterpoint Manager to set the Storage policy, see the Administrator's Guide.
To set a rule directly in the XML document, add the cps:rule=value attribute to the tag of the XML element, for example, cps:index="text".
These two mechanisms can be combined: first the Storage policy with general rules and values is set with the Clusterpoint Manager, and then exception rules and values are added to some XML documents during import by using the cps:rule attribute.
Default Storage Policy
The default Storage policy has been designed with the following assumptions:
- Data in existing filings, databases, or other storages can be perceived as documents that each have a unique ID, title, and content.
- Additional elements are supported by the default Storage policy, but are not required.
- When performing a search request, a reply of documents that match the search criteria is a list of IDs, titles and snippets (parts of documents in which the match was found).
The following table lists and describes the elements of a document that corresponds to the default Storage policy. When using the Clusterpoint API libraries, the contents of the document are enclosed between the root tags <document></document>.
|<document>||The root element of a document. Defines a single document.|
|<id>||Unique document identifier in which FTS is not performed. It is listed in the search results.|
|<title>||Document title in which FTS is performed. Has higher weight than text. It is listed in the search results, with the search term highlighted.|
|<rate>||An integer value in which FTS is not performed, assigned to a document with respect to other documents.|
|<group>||Document group. This element can be used as a classifier for any kind of documents. When performing a search, it is possible to limit the number of documents returned from a group in the search result.|
|<text>||Textual information in which FTS is performed. Clusterpoint Server also supports XML marked-up information and preserves the markup when searching in it. A snippet, which is a fragment of the document containing an occurrence of the search term, is returned in the search results.|
|<hidden>||Textual information in which FTS is performed, but which is not listed in the search results.|
|<info>||Additional information added to a document, but in which FTS is not performed. It can even be binary data, for example, picture files, MS Word or PDF document files, and so on. Note that these files must be appropriately formatted. For information on appropriate formatting, see formatting XML special characters.|
Therefore, an example document may appear as:
<document>
  <id> 1245 </id>
  <title> Board Meeting Minutes </title>
  <rate> 6545646479 </rate>
  <group> Minutes </group>
  <text> [..The text of the minutes..] </text>
  <hidden> [ID in a related database system] %*2225-8 </hidden>
  <info> [A word document containing the minutes, can be accessed by retrieving the document or listing this part] </info>
</document>
The default Storage policy defined in XML is the following.
<policy>
  <rule>
    <xpath>//document</xpath>
    <property>document=yes</property>
  </rule>
  <rule>
    <xpath>//document/id</xpath>
    <property>id=yes</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/info</xpath>
    <property>index=xml</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/rate</xpath>
    <property>index-numbers=int</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/text</xpath>
    <property>weight=1-75</property>
    <property>index=text</property>
    <property>list=snippet</property>
  </rule>
  <rule>
    <xpath>//document/group</xpath>
    <property>index=text</property>
    <property>list=yes</property>
    <property>group=yes</property>
  </rule>
  <rule>
    <xpath>//document/title</xpath>
    <property>weight=65-90</property>
    <property>index=text</property>
    <property>list=highlight</property>
  </rule>
  <rule>
    <xpath>//document/hidden</xpath>
    <property>index=xml</property>
  </rule>
</policy>
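The policy format shown above (a <policy> root containing <rule> elements, each with one <xpath> and one or more <property> entries of the form name=value) can be read with a few lines of Python, for example to review a policy before deploying it. This is a minimal sketch using only the standard library. The `load_policy` helper is a hypothetical name and not part of any Clusterpoint API, and the embedded policy is a shortened excerpt of the default policy.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the default policy, used as sample input.
POLICY_XML = """
<policy>
  <rule>
    <xpath>//document/title</xpath>
    <property>weight=65-90</property>
    <property>index=text</property>
    <property>list=highlight</property>
  </rule>
  <rule>
    <xpath>//document/hidden</xpath>
    <property>index=xml</property>
  </rule>
</policy>
"""


def load_policy(xml_text: str) -> dict[str, dict[str, str]]:
    """Map each rule's <xpath> to a dict of its name=value properties."""
    root = ET.fromstring(xml_text)
    policy: dict[str, dict[str, str]] = {}
    for rule in root.findall("rule"):
        xpath = rule.findtext("xpath", default="").strip()
        props: dict[str, str] = {}
        for prop in rule.findall("property"):
            # Each property holds text of the form "name=value".
            name, _, value = (prop.text or "").partition("=")
            props[name.strip()] = value.strip()
        policy[xpath] = props
    return policy


policy = load_policy(POLICY_XML)
for xpath, props in policy.items():
    print(xpath, props)
```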
Setting up preordering rules
Clusterpoint Server provides search result preordering functionality. Normally, every time you execute a search query, CPS computes the whole result set (document IDs matching the specified query) and reorders them according to your specified criteria, then shows you the top 10 (or any other number, depending on the docs parameter specified during the search query).
However, if your application often uses the same ordering rules, as most applications usually do, you can tell CPS to preorder part of the result set in your given order. This means that in most cases CPS will have results ready in your specified order before you execute the query, and thus there will often be no need to fetch the whole result set - the top items will suffice.
This significantly improves search performance for queries that match a large part of the storage. However, it also slightly reduces indexing performance and increases the disk space required for storing indices. Each specified preordering rule might increase disk space usage by roughly 10% and decrease indexing speed by about the same margin.
To configure which sorting parameters preordering applies to, copy the ordering rules that you use in your search request into the policy configuration file. For example, if your request is
<cps:request>
  ...
  <cps:command>search</cps:command>
  ...
  <cps:content>
    <query>...</query>
    <ordering>
      <numeric><amount>descending</amount></numeric>
    </ordering>
  </cps:content>
</cps:request>
then you have to copy the <ordering> tag from your request into policy.xml, e.g. like this:
<policy>
  <rule> ... </rule>
  <rule> ... </rule>
  <rule> ... </rule>
  <ordering>
    <numeric><amount>descending</amount></numeric>
  </ordering>
</policy>
You can add multiple preordering rules in the same manner, by specifying each one in its own <ordering> tag in the policy configuration. Starting with version 2.2, you can add the attribute default="yes" to one <ordering> tag. That ordering is then used whenever a query does not contain any ordering instructions at all. Without this attribute, the default order for such requests is rate in descending order.
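As an illustration of the steps above, the following Python sketch appends an <ordering> element to a policy document and optionally marks it as the default. The `add_preordering` helper is hypothetical, and the check that only one <ordering> carries default="yes" reflects the wording above rather than documented server behaviour.

```python
import xml.etree.ElementTree as ET


def add_preordering(policy: ET.Element, kind: str, field: str,
                    direction: str, default: bool = False) -> ET.Element:
    """Append <ordering><kind><field>direction</field></kind></ordering> to the policy."""
    if default and any(o.get("default") == "yes" for o in policy.findall("ordering")):
        raise ValueError("only one <ordering> may be marked as the default")
    ordering = ET.SubElement(policy, "ordering")
    if default:
        ordering.set("default", "yes")
    # Build the nested <kind><field>direction</field></kind> structure.
    ET.SubElement(ET.SubElement(ordering, kind), field).text = direction
    return ordering


policy = ET.fromstring(
    "<policy><rule><xpath>//document</xpath>"
    "<property>document=yes</property></rule></policy>"
)
add_preordering(policy, "numeric", "amount", "descending", default=True)
print(ET.tostring(policy, encoding="unicode"))
```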
When searching, if you specify ordering rules identical to the ones you specified in the policy file, Clusterpoint Server will use the preordered index automatically.
Sometimes you might want to retrieve a different minimal number of results than the docs parameter specifies. Reasons for this might include the need to facet the results - count how many items match each category - or to know the exact number of matching documents for pagination or other purposes.
In order to instruct CPS to have a different number as the minimum results to retrieve before sorting, you can specify the optimize_to parameter in the <cps:content> tag, e.g. like so:
<cps:request>
  ...
  <cps:command>search</cps:command>
  ...
  <cps:content>
    <query>...</query>
    <ordering> ... </ordering>
    <optimize_to>10000</optimize_to>
  </cps:content>
</cps:request>
This will fetch at least 10000 results before re-sorting them, thus providing a somewhat more accurate picture of the data in the storage in facet counters and/or estimated hit counters than if, say, only 10 results were fetched.
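A client could assemble the elements that go inside <cps:content> along the following lines. This Python sketch builds only the <query>, <ordering> and <optimize_to> children. The surrounding <cps:request> envelope is left out because its namespace declaration is not shown in this section. The helper name and the sample query text are hypothetical, and the ordering is fixed to the numeric amount example used earlier.

```python
from typing import Optional
import xml.etree.ElementTree as ET


def search_content_children(query: str, optimize_to: Optional[int] = None) -> str:
    """Serialise the elements placed inside <cps:content> for a search request."""
    parts = []

    query_el = ET.Element("query")
    query_el.text = query
    parts.append(query_el)

    # Same ordering as in the preordering example: numeric, amount, descending.
    ordering = ET.Element("ordering")
    amount = ET.SubElement(ET.SubElement(ordering, "numeric"), "amount")
    amount.text = "descending"
    parts.append(ordering)

    # Only emit <optimize_to> when the caller asks for a different minimum.
    if optimize_to is not None:
        optimize_el = ET.Element("optimize_to")
        optimize_el.text = str(optimize_to)
        parts.append(optimize_el)

    return "".join(ET.tostring(part, encoding="unicode") for part in parts)


print(search_content_children("board minutes", optimize_to=10000))
```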