Policy config

Configuration

A Database policy is a set of rules and their values. A rule can be set for each XML element and it defines how this XML element is treated while storing or retrieving data.

Each XML element can have its own rule, but does not have to. If an XML element does not have its own rule, the rule is inherited from its ancestor. If no rules are set for an XML element or any of its ancestors, the element is still stored and can be retrieved, but no specific action is performed on it.
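As a rough illustration of inheritance, the sketch below sets rules on a hypothetical <article> root and on two of its children. The element names article, body and notes are assumptions made for this example only; the rule/xpath/property layout follows the policy examples further down this page.

```xml
<policy>
  <rule>
    <!-- Root element: descendants without their own rule inherit these values -->
    <xpath>//article</xpath>
    <property>document=yes</property>
    <property>index=text</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <!-- Every document needs exactly one ID part -->
    <xpath>//article/id</xpath>
    <property>id=yes</property>
  </rule>
  <rule>
    <!-- Own rule: replaces the inherited list value for this element only -->
    <xpath>//article/notes</xpath>
    <property>list=no</property>
  </rule>
  <!-- //article/body (hypothetical) has no rule of its own,
       so it is expected to inherit index=text and list=yes from //article -->
</policy>
```

With this sketch, <body> would be indexed and listed like its parent, while <notes> would still be indexed as text but left out of the search results.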

The following table lists all possible rules and their possible values.

Rule Value Action applied to document part
document no (default) This is not the root part of the document.
yes This is the root part of the document. The contents of this part are treated as a single Clusterpoint Server document.
id no (default) This part is not the identifier of the document.
yes This part is the identifier (ID) of the document (i.e. the primary key). An ID can be a simple integer, an alphanumeric character string, a full file path and file name on a file server, the URL of a Web page, or anything else that uniquely identifies a document. There must be exactly one ID per document, and duplicate IDs are not allowed.
group no (default) This part does not denote the group a document belongs to.
yes This part denotes the group a document belongs to.
index no (default) The contents of this part will be stored in the document repository and available for retrieval, but they will not be indexed.
text Textual information contained within this part is indexed and made available for FTS (Full text search).
xml Textual information contained within this part is indexed preserving XML markup. In this case FTS can only be performed according to the XML markup.
xml-text Permits grouping the descendants of a document part under a single search path when performing search within markup. This can replace some OR operations.
facet or sparse_facet Allows this part to be used for faceted navigation or classification. Faceted classification allows multiple classifications to be assigned to an object, so that the classifications can be ordered in multiple ways rather than in a single, predetermined order. Documents can later be filtered or accessed using XPath-like expressions relative to this part. For more information, see Faceted search. The "sparse_facet" value uses a different database scheme, which is much more efficient for facet paths that are present in only a small subset of documents.
xml-text&xml&text&facet Switches on all index modes. A subset of modes can be selected by joining them with the '&' symbol.
alias any valid XML element name If an alias is defined for a document part, the index will also record the contents of this part as located in an XML element named as the alias. Multiple documents can have the same alias, therefore creating an OR operation. Aliases can be used when performing search within markup.
tag1&tag2 To set multiple alias tags, use the '&' symbol as the separator (written as '&amp;' inside the XML policy file).
index-dates no (overrides default configuration) Dates are not indexed independently of text in this part.
yes (overrides default configuration) Dates are indexed independently of text in this part. If there is more than one date in this element, only the last date is indexed. Supported date formats are listed here.
index-numbers no (overrides default configuration) Numbers are not indexed independently of text in this part.
yes (overrides default configuration, number is treated as float) Numbers are indexed independently of text in this part. This part must contain a numeric value.
int (overrides default configuration, number is treated as integer) Numbers are indexed independently of text in this part. This part must contain a numeric value.
index-empty yes Tag is indexed, even if it is empty.
no Tag is not indexed, if it is empty.
path (since v2.2) abs / absolute The contents of the tag are indexed using the absolute path from the document root, including the document root tag. In this case, queries must specify an XPath that starts at the document root. This can be used together with any indexing policy.
rel / relative (default) The contents of the tag are indexed using a path relative to the tag for which the last indexing policy was set, including that tag. In this case, queries must specify an XPath relative to that tag. This can be used together with any indexing policy.
exact-match binary The contents of the tag are indexed byte-to-byte for exact matching purposes.
text The contents of the tag are indexed as a set of words for exact matching purposes. Punctuation and other marks are ignored. Matching is case-insensitive.
all All of the above
none (default) The tag will not be indexed for exact match
stem-lang lv Specifies that the tag's contents should be stemmed in Latvian for the purposes of stemming search.
en Specifies that the tag's contents should be stemmed in English.
ru Russian
fr, es, pt, it, ro, de, nl, sv, no, da, fi, hu, tr Will stem in French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Finnish, Hungarian and Turkish
empty (default) The contents of the tag will not be available for stemmed search.
weight <min–max> This rule only works together with the index rule set to text, xml, or all. Weight is an integer number in a range from 1 to 100. All words in this part are explicitly set to be relevant to a certain extent to the corresponding search term when performing FTS. If only a single number is set here, min and max are equal to it. For information on how weight affects search results, see Result ordering and grouping.
list no (default) This part will not be listed in the search results.
yes This part will be listed in the search results.
highlight This part will be listed in the search results and the search terms within this part will be highlighted.
snippet A snippet (short extract) from this part will be shown in the search results. The search terms will be highlighted.
listas (since v2.3) any valid XML element name Defines a new tag name to use when listing documents (search, similar, etc.). Retrieve returns the original tag names. Please see the detailed description for more information. Can be overridden later using the list query parameter.
coll-lang en, lv, ru, or any other POSIX locale without the charset part Enables collated string indices for alphabetic ordering.
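The fragment below is a sketch of how several of the rules in the table might be combined in one policy. The element names summary, author, editor and price are hypothetical. Writing alias and stem-lang in the same name=value form as the other properties is an assumption based on the examples later on this page, as is the choice of index=xml for the aliased parts.

```xml
<policy>
  <rule>
    <xpath>//document</xpath>
    <property>document=yes</property>
  </rule>
  <rule>
    <xpath>//document/id</xpath>
    <property>id=yes</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <!-- Two index modes joined with '&', which must be escaped as '&amp;' in XML -->
    <xpath>//document/summary</xpath>
    <property>index=text&amp;xml</property>
    <!-- Assumed name=value form for English stemming -->
    <property>stem-lang=en</property>
    <property>weight=50-80</property>
    <property>list=snippet</property>
  </rule>
  <rule>
    <!-- Assumed form: author and editor are both also recorded under the alias "person" -->
    <xpath>//document/author</xpath>
    <property>index=xml</property>
    <property>alias=person</property>
  </rule>
  <rule>
    <xpath>//document/editor</xpath>
    <property>index=xml</property>
    <property>alias=person</property>
  </rule>
  <rule>
    <!-- Numeric part treated as an integer and shown in the results -->
    <xpath>//document/price</xpath>
    <property>index-numbers=int</property>
    <property>list=yes</property>
  </rule>
</policy>
```

Under these assumptions, a search within markup on the person element would match content from either <author> or <editor>, which is the OR behaviour described for aliases in the table.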

Setting Database Policy

One comprehensive Database policy can be set for the whole Clusterpoint Database by using the Clusterpoint Manager, or specific rules and their values can be set in a query.

For information on using the Clusterpoint Manager to set the Database policy, see the Administrator's Guide.

Default Cloud Database policy

When you create a new database in the Cloud UI (using the “+ Create database” button, or when importing a file via “Configure import” -> “create a new database”), the Database policy is automatically set to the default (see below). You can change the default policy using the “Configure” button -> “Policy”.

Default Database policy:

<policy>
	<rule>
		<xpath>//document</xpath>
		<property>document=yes</property>
		<property>index=xml&amp;text&amp;facet</property>
		<property>index-numbers=yes</property>
		<property>exact-match=binary</property>
		<property>index-empty=yes</property>
		<property>list=yes</property>
	</rule>
	<rule>
		<xpath>//document/id</xpath>
		<property>id=yes</property>
		<property>index=-facet</property>
	</rule>
</policy>

Where:

Xpath Index rule&value Description
//document document=yes This is the root part of the document. The contents of this part are treated as a single document in database.
index=xml&text&facet Contains mentioned index modes.
index-numbers=yes Numbers are indexed independently of text in this part.
exact-match=binary The contents of the tag are indexed byte-to-byte for exact matching purposes.
index-empty=yes Tag is indexed, even if it is empty.
list=yes This part will be listed in the search results.
//document/id id=yes This part is the identifier (ID) of the document. There must be exactly one ID for a document and duplicate IDs may not exist.
index=-facet The tag does not inherit the index=facet policy from its parent tag (all other index modes are inherited: index=xml&text).

Note that a child tag inherits policy rules from its parent tag (under the default policy above, all child tags inherit their index modes from the parent tag <document>).

The default Database policy is designed to make all search options (FTS, faceted search, numeric search, etc.) available in all document tags, which is why the <document> tag has several policy rules. Because database size grows with the number of indexed words and depends on the policy values, you should adjust the policy rules and values for individual tags as appropriate.
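One possible adjustment is sketched below: the default policy is kept as is, and a third rule switches off the more expensive index types for a single large tag. The tag name body is hypothetical; the values used (index=-facet, exact-match=none, index-numbers=no) are taken from the table and the default policy above, but how much space they save in practice is not stated here and will depend on the data.

```xml
<policy>
  <rule>
    <xpath>//document</xpath>
    <property>document=yes</property>
    <property>index=xml&amp;text&amp;facet</property>
    <property>index-numbers=yes</property>
    <property>exact-match=binary</property>
    <property>index-empty=yes</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/id</xpath>
    <property>id=yes</property>
    <property>index=-facet</property>
  </rule>
  <rule>
    <!-- Hypothetical large free-text tag: keep xml and text indexing,
         but drop facets, exact matching and numeric indexing -->
    <xpath>//document/body</xpath>
    <property>index=-facet</property>
    <property>exact-match=none</property>
    <property>index-numbers=no</property>
  </rule>
</policy>
```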

Database Policy example

As an example, the Database policy below has been designed with the following assumptions:

  • Data in existing databases can be perceived as documents that each have a unique ID.
  • Additional elements are supported by the default Database policy, but are not required.
  • When performing a search request, a reply of documents that match the search criteria is a list of IDs and snippets (parts of documents in which the match was found).

The following table lists and describes the elements of a document that corresponds to the Database policy (below). When using the Clusterpoint API libraries, the contents of the document are enclosed between the root tags <document></document>.

Element Description
<document> The root element of a document. Defines a single document.
<id> Unique document identifier in which FTS (full text search) is not performed. It is listed in the search results.
<title> Document title in which FTS is performed. Has higher weight than text. It is listed in the search results, with the search term highlighted.
<rate> An integer value in which FTS is not performed, assigned to a document with respect to other documents.
<group> Document group. This element can be used as a classifier for any kind of documents. When performing a search, it is possible to limit the number of documents returned from a group in the search result.
<text> Textual information in which FTS is performed. Clusterpoint Server also supports XML marked up information and preserves the markup, when searching in it. A snippet, which is a fragment of the document with an occurrence of the search term, is returned to the search results.
<hidden> Textual information in which FTS is performed, but which is not listed in the search results.
<info> Additional information added to a document, but in which FTS is not performed. It can even be binary data, for example, picture files, MS Word or PDF document files, and so on. Note that these files must be appropriately formatted. For information on appropriate formatting, see formatting XML special characters.

Therefore, an example document may appear as:

<document>
  <id>1245</id>
  <title>Board Meeting Minutes</title>
  <rate>6545646479</rate>
  <group>Minutes</group>
  <text>[..The text of the minutes..]</text>
  <hidden>[ID in a related database system] %*2225-8 </hidden>
  <info>[A word document containing the minutes, can be accessed by retrieving the document or listing this part]</info>
</document>

Database policy for this example defined in XML is the following:

<policy>
  <rule>
    <xpath>//document</xpath>
    <property>document=yes</property>
  </rule>
  <rule>
    <xpath>//document/id</xpath>
    <property>id=yes</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/title</xpath>
    <property>weight=65-90</property>
    <property>index=text</property>
    <property>list=highlight</property>
  </rule>
  <rule>
    <xpath>//document/rate</xpath>
    <property>index-numbers=int</property>
    <property>list=yes</property>
  </rule>
  <rule>
    <xpath>//document/group</xpath>
    <property>index=text</property>
    <property>list=yes</property>
    <property>group=yes</property>
  </rule>
  <rule>
    <xpath>//document/text</xpath>
    <property>weight=1-75</property>
    <property>index=text</property>
    <property>list=snippet</property>
  </rule>
  <rule>
    <xpath>//document/hidden</xpath>
    <property>index=xml</property>
  </rule>
  <rule>
    <xpath>//document/info</xpath>
    <property>index=xml</property>
    <property>list=yes</property>
  </rule>
</policy>

Setting up preordering rules

Clusterpoint Server provides search result preordering functionality. Normally, every time you execute a search query, CPS computes the whole result set (document IDs matching the specified query) and reorders them according to your specified criteria, then shows you the top 10 (or any other number, depending on the docs parameter specified during the search query).

However, if your application often uses the same ordering rules, as most applications usually do, you can tell CPS to preorder part of the result set in your given order. This means that in most cases CPS will have results ready in your specified order before you execute the query, and thus there will often be no need to fetch the whole result set - the top items will suffice.

This significantly improves search performance for queries that match a large part of the database. However, it also slightly reduces indexing performance and increases the disk space required for storing indices. Each preordering rule might increase disk space usage by roughly 10% and decrease indexing speed by about the same margin.

To configure which sorting parameters preordering should work on, copy the ordering rules that you use in your search request into the policy configuration file. For example, if your request is

<cps:request>
...
<cps:command>search</cps:command>
...
<cps:content>
  <query>...</query>
  <ordering>
    <numeric><amount>descending</amount></numeric>
  </ordering>
</cps:content>
</cps:request>

then you have to copy the <ordering> tag from your request into policy.xml, e.g. like this:

<policy>
  <rule>
    ...
  </rule>
  <rule>
     ...
  </rule>
  <rule>
     ...
  </rule>
  <ordering>
    <numeric><amount>descending</amount></numeric>
  </ordering>
</policy>

You can add multiple preordering rules in the same manner, by specifying each one in its own <ordering> tag in the policy config. Starting with version 2.2, you can add the attribute default="yes" to one <ordering> tag. That ordering is then used whenever a query does not contain any ordering instructions at all. Without this attribute, the default order for such requests is by rate, in descending order.
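A policy with two preordering rules, one of them marked as the default, might look like the sketch below. The amount ordering is taken from the example above; the second ordering on a price element is hypothetical and only shows where an additional <ordering> tag would go.

```xml
<policy>
  <rule>
    <!-- Rule contents omitted here, as in the policy example above -->
    ...
  </rule>
  <!-- Used when a query contains no ordering instructions at all (v2.2 and later) -->
  <ordering default="yes">
    <numeric><amount>descending</amount></numeric>
  </ordering>
  <!-- A second, independent preordering rule on a hypothetical price element -->
  <ordering>
    <numeric><price>descending</price></numeric>
  </ordering>
</policy>
```

Keep in mind that each additional <ordering> tag carries the disk space and indexing speed cost described earlier in this section.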

When searching, if you specify ordering rules identical to the ones you specified in the policy file, Clusterpoint Server will use the preordered index automatically.

Sometimes you might want to retrieve a different minimal number of results than the docs parameter specifies. Reasons for this might include the need to facet the results - count how many items match each category - or to know the exact number of matching documents for pagination or other purposes.

In order to instruct CPS to have a different number as the minimum results to retrieve before sorting, you can specify the optimize_to parameter in the <cps:content> tag, e.g. like so:

<cps:request>
...
<cps:command>search</cps:command>
...
<cps:content>
  <query>...</query>
  <ordering>
    ...
  </ordering>
  <optimize_to>10000</optimize_to>
</cps:content>
</cps:request>

This will fetch at least 10000 results before re-sorting them, thus providing a somewhat more accurate picture of the data in the database in facet counters and/or estimated hit counters than if, say, only 10 results were fetched.