Relevance ranking

From ClusterpointWiki2

Relevance

Relevance is a measure of the accuracy of the search result, which is calculated according to:

  1. the weight interval of the document part in which the search term appears
  2. the number of times the search term appears in the document
  3. the distance between the search terms in the document, if multiple words are being searched for

The weight interval can be customized using document policies to best reflect your document structure. A document part with a higher weight interval than other document parts is considered more important than those parts. For example, the document's title is more important than the document's text.

Relevance calculation algorithm

The Clusterpoint Server relevance calculation algorithm consists of two parts:

  1. calculating the weights of individual words (performed when storing documents to the Storage)
  2. calculating the relevance of the document (performed when searching for documents in the Storage)

Calculating the weight for each word in a document

In each document part, the weight of each word is calculated according to the weight interval of the document part the word occurs in.

The weight for a word in a document part is the minimum value of the following:

The weight interval minimum and maximum can be the same value. In that case, for all words in such document part, no matter how often they appear, the weight in the document part is the same: the weight of the document part.

The maximum value of the weights of a word in all document parts is then assigned as the weight of the word in the document.
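
The final step described above, taking the maximum of a word's per-part weights, can be sketched in Python. This is only an illustration, not Clusterpoint's implementation: the function name `document_word_weights`, the part names, and the per-part weights are all hypothetical, and the per-part weights are taken as given inputs rather than computed.

```python
def document_word_weights(part_weights):
    """Return each word's document weight: its maximum weight over all parts.

    part_weights maps a document part name to a {word: weight} dict.
    """
    doc_weights = {}
    for weights in part_weights.values():
        for word, weight in weights.items():
            # Keep the largest weight seen so far for this word.
            doc_weights[word] = max(weight, doc_weights.get(word, weight))
    return doc_weights


# Hypothetical per-part weights, chosen only for illustration.
part_weights = {
    "title": {"w1": 70, "w2": 66},
    "text": {"w1": 12, "w2": 40, "w3": 5},
}
doc_weights = document_word_weights(part_weights)
print(doc_weights)
```

A word that appears in only one part simply keeps that part's weight, as `w3` does here.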

Calculating the relevance of a document

When searching documents in the Storage, the relevance of the document according to the search request is calculated as follows:

Figure 2: Calculating weight for each document

Example:

A document consists of three document parts: heading, description, and note. Each document part contains words w1, w2, and w3 and has its own weight interval, as described in Figure 2.

First, the weights of words are calculated in each part of the document:

Then, the weights of words in the entire document are calculated:

Finally, the relevance of the document is calculated:

wtotal = w1d + w2d + w3d = 80 + 23 + 21 = 124

Relevance = wtotal * d
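
The arithmetic of this last step can be checked with a short Python sketch. The function name `document_relevance` is hypothetical, and because the factor d is not defined in this section, it is treated here as a value supplied by the caller.

```python
def document_relevance(word_weights, d):
    """Sum the document-level word weights and scale the total by d."""
    w_total = sum(word_weights)
    return w_total * d


# Document-level weights of w1, w2, and w3 from the example above.
example_weights = [80, 23, 21]
w_total = sum(example_weights)
print(w_total)  # 124

# d is an assumption here; 1 leaves the total unchanged.
print(document_relevance(example_weights, d=1))  # 124
```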

Customizing the Weight Interval

The following weights are defined in the Clusterpoint default Storage policy:

Document part            Minimum  Maximum
Title                    65       90
Text                     1        75
All other indexed parts  0        100

To ensure more precise relevance calculation, the weight values of the default document policy can be changed, or appropriate weights can be assigned to different document parts when creating a custom document policy. As with any document policy, it is possible both to assign weights for all documents in a Storage and to customize the weights of a single document when importing it to the Storage.
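
How default and custom intervals could combine is sketched below in Python. This is not Clusterpoint's policy syntax or API: the dictionary layout, the lowercase part names, and the function `weight_interval` are hypothetical, and the precedence of a per-document policy over a Storage-wide policy is an assumption made for this sketch.

```python
# Default intervals from the table above, as (minimum, maximum) pairs.
DEFAULT_INTERVALS = {
    "title": (65, 90),
    "text": (1, 75),
}
FALLBACK_INTERVAL = (0, 100)  # "All other indexed parts"


def weight_interval(part, storage_policy=None, document_policy=None):
    """Return the (minimum, maximum) weight interval for a document part.

    Assumed precedence: per-document policy, then Storage-wide policy,
    then the defaults.
    """
    for policy in (document_policy, storage_policy):
        if policy and part in policy:
            low, high = policy[part]
            # Assumption: an interval with minimum above maximum is invalid.
            if low > high:
                raise ValueError(f"invalid interval for {part!r}: {low} > {high}")
            return (low, high)
    return DEFAULT_INTERVALS.get(part, FALLBACK_INTERVAL)


print(weight_interval("title"))                                  # (65, 90)
print(weight_interval("note"))                                   # (0, 100)
print(weight_interval("note", storage_policy={"note": (10, 20)}))  # (10, 20)
```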
