Search query syntax

Clusterpoint Server provides several mechanisms for specifying your search query, each with a definite syntax. This section describes the syntax of specific parameters that are supported for the Search command.

Single search term

To search for documents that contain a single search term, the search term must be entered as-is.

Example:

<query>George</query>

The query returns documents that contain the word "George".

AND

To search for documents that contain all of several terms, which are not necessarily next to each other, the search terms must be separated by a space " ".

Example:

<query>George Brown</query>

The query returns documents that contain both the word "George" and the word "Brown".

Phrase search

To search for documents that contain an exact phrase, the search phrase must be enclosed in quotation marks (").

Example:

<query>"George Brown"</query>

The query returns documents that contain the exact phrase "George Brown".

Exact match

Searches for documents that have a full tag match - i.e. contain a tag with the exact content specified and nothing else. Available only for tags that have the exact-match policy set.

The search phrase must be enclosed in quotation marks (") and prefixed with "=".

Example:

<query><name>="George Brown"</name></query>
<exact-match>binary</exact-match>

The query returns documents that contain a <name>George Brown</name> tag.

The optional exact-match parameter specifies which match mode should be used for this search type. Possible values are described in the documentation of the exact-match parameter.

OR

To search for documents that contain any of the search terms, the search terms must be enclosed in braces ("{ }") and separated by a space " ".

Example:

<query>{George Brown}</query>

The query returns documents that contain either the word "George" or the word "Brown".

NOT

To exclude documents that contain a search term, the search term must be preceded by a tilde ("~"). Note: This option cannot be used alone; the query must also contain at least one term that is not excluded.

Example:

<query>Brown ~George</query>

The query returns documents that do contain the word "Brown" but do not contain the word "George".

Boolean expressions

AND, OR, and NOT logical connectives can be combined into more complex search expressions using parentheses "()", which allows any Boolean expression to be built.

Example:

<query>{(George Brown) (Mary Green)}</query>

The query returns documents that either contain the word "George" and the word "Brown", or the word "Mary" and the word "Green".


Figure 1: Parsing boolean expressions

<query>{(A B ~C) "D E"}</query>

The query is parsed into an expression tree: the root is an OR node with two children, an AND node that combines A, B, and NOT C, and the phrase "D E".
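The way the connectives nest can be sketched with a few small Python helpers that assemble the query text from the operators described above. This is only an illustration: the helper names (phrase, not_, and_, or_, query) are hypothetical and not part of any Clusterpoint client library, and the sketch does not XML-escape its input.

```python
def phrase(text):
    """Exact phrase: enclosed in quotation marks."""
    return '"{}"'.format(text)

def not_(term):
    """Excluded term: preceded by a tilde."""
    return "~" + term

def and_(*parts):
    """All parts required: space-separated, grouped in parentheses."""
    return "(" + " ".join(parts) + ")"

def or_(*parts):
    """Any part sufficient: space-separated, enclosed in braces."""
    return "{" + " ".join(parts) + "}"

def query(expression):
    """Wrap an expression in the <query> element (no XML escaping here)."""
    return "<query>{}</query>".format(expression)

# The two expressions used in this section.
print(query(or_(and_("George", "Brown"), and_("Mary", "Green"))))
print(query(or_(and_("A", "B", not_("C")), phrase("D E"))))
```

Each helper call corresponds to one node of the expression tree, so the nesting of the calls mirrors the nesting of the parsed query.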

Wildcard patterns

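To search for documents that contain a class of words, wildcards can be used to substitute characters: "?" stands for any single character, "*" stands for any sequence of zero or more characters, and "[...]" stands for any one of the characters listed inside the brackets.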

Note: When wildcard patterns are used to define a class of words to be searched, only a limited number of statistically frequent words are searched for. This limitation is introduced to preserve the high performance of the Clusterpoint Server. The extent of this limitation can be modified by changing the Storage configuration.

Example:

<query>ma?</query>

The query returns documents that contain the words "map", "may", "mat", "max", and so on.

<query>Geo*</query>

The query returns documents that contain the words "George", "Geothermal", "Geology", and so on.

<query>ma[py]</query>

The query returns documents that contain the words "map" or "may".

<query>c?[au]*</query>

The query returns documents that contain the words "counter", "club", "chapter", "country", "change", "chat", "council", "class", "cpu", "challenge", "church", "couple", "championship", and so on.
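The wildcard forms shown in these examples behave like shell-style patterns, so Python's standard fnmatch module can be used to preview locally which words a pattern would cover. This is a sketch under assumptions: the word list is made up for illustration, the real matching happens on the server against indexed words (and only against statistically frequent ones), and the server's handling of letter case may differ from fnmatchcase.

```python
from fnmatch import fnmatchcase

# Hypothetical word list standing in for words found in a Storage index.
WORDS = ["map", "may", "mat", "max", "man",
         "cpu", "club", "chat", "cat", "George", "Geology"]

def expand(pattern, words=WORDS):
    """Return the words from the list that a ?, * or [...] pattern covers."""
    return [word for word in words if fnmatchcase(word, pattern)]

print(expand("ma?"))      # every three-letter word starting with "ma"
print(expand("ma[py]"))   # only "map" and "may"
print(expand("c?[au]*"))  # "c", any character, then "a" or "u", then anything
print(expand("Geo*"))     # words starting with "Geo"
```

Note that "cpu" is covered by c?[au]* because "*" may stand for an empty sequence, while "cat" is not, because its third letter is neither "a" nor "u".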

Ignored words

By default, Clusterpoint Server ignores common words and characters such as "and", "where", and "how", as well as certain single characters (letters), because they tend to slow down the search without improving search results. Common words and characters like this are called ignored words.

The Clusterpoint Server detects words that appear in the Storage most often and adds them to the ignored words list. The extent of this limitation can be modified by changing the Storage configuration.

If a common word or a character is essential to getting the required results, it can be included by preceding it with a plus sign "+".

Example:

<query>George +and Mary</query>

The query returns documents that contain all three words: "George", "and", and "Mary".

Boosted words

Note that this feature is only available in Clusterpoint Server v2.2 and newer.

It is possible to alter the weight values for parts of the query in order to change the relative importance of different query terms when documents are sorted by relevance. For instance, when searching for "George Lucas movie", it might be beneficial to treat the terms "George Lucas" as more important for relevance than the term "movie". This can be achieved with the weight boosting feature.

The weight boosting operator (^) alters the weight of each term placed between the opening and closing ^ symbols. For more fine-grained control over the weight modification, three modes of operation are provided: incrementation, replacement, and multiplication.

Incrementation

The simplest way to increment the terms' weight by 10 is to surround them with the ^ operator:

<query>^George Lucas^ movie</query>

Be careful when combining this syntax with user-submitted queries: if numbers or special symbols follow the first ^ symbol, they might be interpreted as additional parameters of the weight boosting operator, as specified below.

Another way to specify the same operation is this:

<query>^+10 George Lucas ^ movie</query>

This will add 10 to the weight value of every occurrence of George or Lucas. For instance, if the original indexed weight value was 50, it will be incremented to 60.

Note that the increment can also be negative, so this is also valid:

<query>George Lucas ^-10 movie ^</query>

Replacement

<query>^100 George Lucas ^ movie</query>

This will replace the weight value of every occurrence of George or Lucas with 100. For instance, if the original indexed weight value was 50, it will be replaced with 100.

Multiplication

<query>^*1.5 George Lucas ^ movie</query>

This will multiply the weight value of every occurrence of George or Lucas by 1.5. For instance, if the original indexed weight value was 50, it will be replaced with 75.
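The arithmetic of the three modes can be summarised in a short Python sketch. It only models the calculations described above; the function name boosted_weight is hypothetical, and how the server parses the parameter, rounds the result, or limits the weight range is not covered here.

```python
def boosted_weight(weight, parameter=""):
    """Return the weight after applying a boost parameter.

    parameter is the text that follows the opening ^ symbol:
    ""      bare operator, increments by 10
    "+10"   signed number, incrementation (may be negative, e.g. "-10")
    "100"   unsigned number, replacement
    "*1.5"  leading asterisk, multiplication
    """
    parameter = parameter.strip()
    if parameter == "":
        return weight + 10
    if parameter[0] in "+-":
        return weight + float(parameter)
    if parameter[0] == "*":
        return weight * float(parameter[1:])
    return float(parameter)

# The worked examples from this section, starting from an indexed weight of 50.
print(boosted_weight(50))          # incrementation by the default of 10
print(boosted_weight(50, "+10"))   # explicit incrementation
print(boosted_weight(50, "100"))   # replacement
print(boosted_weight(50, "*1.5"))  # multiplication
```

The sketch also shows why user-submitted text after the opening ^ symbol needs care: a leading number or sign changes which branch is taken.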

Stemming

It is possible to include a word and its inflected forms, for example, "go" and "going", in one search request. This feature is especially useful for so-called synthetic languages such as German, Russian, and Latin, in which syntactic relations within sentences are expressed by changes in the form of a word that indicate distinctions of tense, person, gender, number, mood, voice, and case.

For the feature to work correctly, you have to specify the stem-lang parameter when using stem search, and the tag that you are searching on must have the stem-lang policy specified. Possible values for the parameter are listed in the documentation of the stem-lang parameter.

To search for documents that contain a word or its inflected forms, the word or phrase must be enclosed in dollar signs ("$").

Example:

<query>$George$</query>
<stem-lang>en</stem-lang>

The query returns documents that contain the word "George" or the word "Georges".

Case sensitivity for proper names

It is possible to perform a case-sensitive search for proper names, which means that case sensitivity is applied to the first letter of a search term. The feature is switched on by setting the <case_sensitive> parameter to "yes" in the search command’s XML request.

Example:

<query>Bank</query>
<case_sensitive>yes</case_sensitive>

The query returns documents in which the word "Bank" starts with a capital letter. Note that documents containing the all-capitals form "BANK" are also returned, because only the first letter is compared case-sensitively.
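The first-letter rule can be modelled in a few lines of Python. This is a sketch of the behaviour as described in this section, not the server's implementation; the function name matches_proper_name is hypothetical.

```python
def matches_proper_name(query_term, indexed_word):
    """Compare the first letter exactly and the remaining letters ignoring case."""
    if not query_term or not indexed_word:
        return False
    same_first_letter = query_term[0] == indexed_word[0]
    same_rest = query_term[1:].lower() == indexed_word[1:].lower()
    return same_first_letter and same_rest

for word in ["Bank", "BANK", "bank"]:
    print(word, matches_proper_name("Bank", word))
```

Under this model, "Bank" and "BANK" both match the query term "Bank", while "bank" does not.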

Search within markup

To search for documents that contain the search term in a specific XML element, the search term must be enclosed in the appropriate tags.

Note: Search within markup can be performed only if the policy rule index with values xml or all is used.

Example:

<query>
 <name>John</name>
 <surname>Smith</surname>
</query>

The query returns documents that contain the word "John" in the <name> tag and the word "Smith" in the <surname> tag.

<query>
 {
  <person>George</person>
  <address>"Great Britain"</address>
 }
</query>

The query returns documents that either contain the word "George" in the <person> tag, or the phrase "Great Britain" in the <address> tag.

Alias

If an alias is defined for a document part in the Storage policy, the index also records the contents of this part as if it were located in an XML element named after the alias. Multiple document parts can share the same alias, which effectively creates an OR operation across them.

Example policy defined within the document:

<document>
 <tag1 cpse:alias="x">Apple</tag1>
 <tag2 cpse:alias="z">Banana</tag2>
 <tag3 cpse:alias="x">Kiwi</tag3>
 <tag4 cpse:alias="z">Orange</tag4>
</document>

Example query:

<query>
 <x>Kiwi</x>
</query>

The query will look for "Kiwi" in both <tag1> and <tag3>.
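The effect of an alias can be pictured as an extra index keyed by alias name. The following Python sketch models that idea with the tags from the example document; the data structure and the function names are assumptions made for illustration and say nothing about how the server stores its index.

```python
from collections import defaultdict

# (tag, alias, content) triples copied from the example document.
PARTS = [
    ("tag1", "x", "Apple"),
    ("tag2", "z", "Banana"),
    ("tag3", "x", "Kiwi"),
    ("tag4", "z", "Orange"),
]

def build_alias_index(parts):
    """Group the content of every document part under its alias."""
    index = defaultdict(list)
    for tag, alias, content in parts:
        index[alias].append((tag, content))
    return index

def search_alias(index, alias, word):
    """Return the tags under the alias whose content contains the word."""
    return [tag for tag, content in index.get(alias, [])
            if word in content.split()]

index = build_alias_index(PARTS)
print(search_alias(index, "x", "Kiwi"))    # tag1 and tag3 are examined
print(search_alias(index, "x", "Banana"))  # Banana is only indexed under z
```

Searching under alias x examines both tag1 and tag3 in one lookup, which is the OR behaviour described above.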

Grouped elements

Instead of searching by elements using their full path (index=xml) or searching among all elements that have been indexed without markup (index=text), it is also possible to group all the descendants of an XML element under a single path, using the index=xml-text rule. For example:

<document>
 <group1 cpse:index="xml-text">
  <tag1 cpse:index="text">Kiwi</tag1>
  <tag2 cpse:index="text">Banana</tag2>
 </group1>
 <group2 cpse:index="xml-text">
  <tag3 cpse:index="text">Apple</tag3>
  <tag4 cpse:index="text">Orange</tag4>
 </group2>
</document>

To find the word "Kiwi", the requisite query would be:

<query><group1>Kiwi</group1></query>

This approach permits replacing OR operations:

<query>{<tag1>osos</tag1><tag2>asdasd</tag2>}</query>

with an alternative that has better performance.

Tag colocation (TODO)

Proximity search

It is possible to define the maximum number of words that may appear between certain search terms.

To use this feature, the search terms must be specified as: @ N term1 term2 @, where N is the maximum count of words between the search terms, and term1 and term2 are search terms. Any number of search terms can be included in a proximity search.

Example:

<query>@ 4 phone fax @</query>

The query returns documents that contain the words "phone" and "fax" with no more than 4 words between them.
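A minimal Python sketch of the proximity rule follows. The first helper assembles the @ N term1 term2 @ expression; the second models what "at most N words between the terms" means on plain text. Both function names are hypothetical, and the model assumes that word order does not matter and that words are separated by whitespace, which may not match the server's tokenisation.

```python
def proximity(max_gap, *terms):
    """Build a proximity expression such as '@ 4 phone fax @'."""
    return "@ {} {} @".format(max_gap, " ".join(terms))

def within(text, first, second, max_gap):
    """True if some occurrence of both words has at most max_gap words between."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) - 1 <= max_gap
               for i in first_positions for j in second_positions)

print(proximity(4, "phone", "fax"))

# Four words ("or", "mobile", "and", "also") separate the two terms.
sample = "phone or mobile and also fax"
print(within(sample, "phone", "fax", 4))
print(within(sample, "phone", "fax", 3))
```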

Numeric search

It is possible to perform a numeric search in document parts that are indexed with the index-numbers policy rule set to "yes". Numeric search allows searching for documents that contain numeric values within a certain range. For example, each document contains information about an object including geographic coordinate information. In that case, a numeric search can be performed to retrieve all objects within a definite range of a geographic coordinate. Thus, Clusterpoint Server can be used in online maps, where people can find information on different objects in a definite area.

A numeric search can be performed only together with a textual search.

Numeric values in documents are always indexed and stored as single-precision floats. These floating-point numbers hold up to six significant digits. Remember that a direct comparison of floats (such as with the = operator) will often be useless because of the imprecision of the float type.
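The loss of precision can be reproduced outside the server by rounding a value to single precision with Python's standard struct module. This sketch only demonstrates the general behaviour of 32-bit floats; the helper name to_float32 is hypothetical and the snippet makes no claim about how Clusterpoint Server converts values internally.

```python
import struct

def to_float32(value):
    """Round a Python float (double precision) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

# A decimal fraction is not stored exactly, so an equality test fails.
print(to_float32(0.1) == 0.1)

# Large integers lose their last digit: 16777217 does not fit in 24 bits.
print(to_float32(16777217.0))

# Small integers such as an age survive unchanged.
print(to_float32(26.0) == 26.0)
```

For values that are not small integers, an interval around the expected value is therefore usually a safer query than an exact comparison.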

To use the numeric search functionality, the search term must be as follows:

Additionally, the following operators are available, but they only work as expected if you have the precise_numeric_search option enabled in the storage configuration:

These are particularly useful for date searches.

Note that the symbols "<" and ">" must be written as "&lt;" and "&gt;" respectively, according to XML encoding rules.

Example:

Document content:

<document>
 <id>76541</id>
 <title>George’s profile</title>
 <text>
  <name>George Brown</name>
  <age>26</age>
 </text>
</document>

Search query that matches the document:

<query>
 <name>George</name>
 <age>20 .. 30</age>
</query>

Only one numeric interval can be queried per document part. However, it is possible to perform a numeric search in more than one document part.

Example:

Document content:

<document>
 <id>76541</id>
 <title>George’s profile</title>
 <text>
  <name>George Brown</name>
  <age>26</age>
  <children>4</children>
 </text>
</document>

Search query that matches the document:

<query>
 <name>George</name>
 <age>20 .. 30</age>
 <children>&gt; 3</children>
</query>
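When such a query is assembled in a program, an XML library can take care of the required encoding of the comparison symbols. The sketch below uses Python's standard xml.etree.ElementTree module; the function name build_query is hypothetical, and only the <query> element is produced, not a complete Search command request.

```python
import xml.etree.ElementTree as ET

def build_query(fields):
    """Build a <query> element from (tag, expression) pairs and serialise it."""
    query = ET.Element("query")
    for tag, expression in fields:
        ET.SubElement(query, tag).text = expression
    return ET.tostring(query, encoding="unicode")

# The comparison is written with a plain ">"; the serialiser encodes it.
print(build_query([
    ("name", "George"),
    ("age", "20 .. 30"),
    ("children", "> 3"),
]))
```

The serialised text contains &gt; in place of the plain symbol, which matches the query shown above apart from whitespace.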

Numeric search in more than one tag is especially useful and necessary for geographic coordinate searching, where it is necessary to search for an object by its longitude and latitude.

Note that numeric searches currently apply to the query as a whole and, regardless of how they are specified, are always combined with the rest of the query in an AND relationship. You cannot search several numeric fields in an OR relationship (e.g. a < 5 or b > 10) or apply a numeric search to only part of the query.

Date search

Dates are only indexed for fields with the index-dates policy set to "yes". The following formats are supported:

Starting from version 2.3, this syntax is also supported:

Date searches are performed in exactly the same manner as the numeric searches.
