Relevance ranking

Relevance

Relevance is a measure of the accuracy of the search result, which is calculated according to:

  1. the weight interval of the document part in which the search term appears
  2. the number of times the search term appears in the document
  3. the distance between the search terms in the document, if multiple words are being searched for

The weight interval can be customized using document policies to best reflect your document structure. A document part with a higher weight interval than other document parts is considered more important than those parts. For example, the document's title is more important than the document's text.

Relevance calculation algorithm

The Clusterpoint Server relevance calculation algorithm consists of two parts:

  1. calculating the weights of individual words (performed when storing documents to the Database)
  2. calculating the relevance of the document (performed when searching for documents in the Database)

Calculating the weight for each word in a document

In each document part, the weight of each word is calculated according to the weight interval of the document part the word occurs in.

The weight for a word in a document part is the minimum value of the following:

  • minimum value of the weight interval of the document part plus the number of times the word occurs in the document part
  • maximum value of the weight interval of the document part

The weight interval minimum and maximum can be the same value. In that case, every word in such a document part has the same weight, no matter how often it appears: the weight of the document part.

The maximum value of the weights of a word in all document parts is then assigned as the weight of the word in the document.
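
The two rules above can be sketched in Python as follows. This is a minimal illustration, not Clusterpoint code: the function names `part_weight` and `document_weight` are hypothetical, and the weight of 0 for a word that does not occur in a part is an assumption taken from the worked example on this page.

```python
def part_weight(occurrences, interval_min, interval_max):
    """Weight of a word within a single document part."""
    if occurrences == 0:
        # Assumption: a word that does not occur in the part gets weight 0,
        # as in the worked example (w2(heading)=0).
        return 0
    # Minimum of (interval minimum + occurrence count) and the interval maximum.
    return min(interval_min + occurrences, interval_max)


def document_weight(part_weights):
    """Weight of a word in the document: the maximum over all document parts."""
    return max(part_weights, default=0)


# A word occurring 3 times in a part with weight interval 20 to 50:
print(part_weight(3, 20, 50))
# The same word across three parts with weights 0, 23 and 11:
print(document_weight([0, 23, 11]))
```

When the interval minimum and maximum are equal, `part_weight` returns that value for any non-zero occurrence count, which matches the special case described above.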

Calculating the relevance of a document

When searching documents in the Database, the relevance of the document according to the search request is calculated as follows:

  • The weights of all search terms in a document are summed.
  • Relevance is calculated by multiplying the total weight by a value that represents the distance between the search terms in the document: the greater the distance, the smaller this value.

Figure 2: Calculating weight for each document

Example:

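A document consists of three document parts: heading, description, and note. The words w1, w2, and w3 occur in these parts, and each part has its own weight interval, as described in Figure 2. In the calculations below, the weight intervals are 80 to 80 for the heading, 20 to 50 for the description, and 10 to 12 for the note. The words w2 and w3 do not occur in the heading, so their heading weight is 0.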

First, the weights of words are calculated in each part of the document:

  • w1(heading)=min(80+2,80)=80,
    w1(description)=min(20+1,50)=21,
    w1(note)=min(10+4,12)=12
  • w2(heading)=0,
    w2(description)=min(20+3,50)=23,
    w2(note)=min(10+1,12)=11
  • w3(heading)=0,
    w3(description)=min(20+1,50)=21,
    w3(note)=min(10+2,12)=12

Then, the weights of words in the entire document are calculated:

  • w1d=max(w1(heading),w1(description),w1(note))=max(80,21,12)=80
  • w2d=max(w2(heading),w2(description),w2(note))=max(0,23,11)=23
  • w3d=max(w3(heading),w3(description),w3(note))=max(0,21,12)=21

Finally, the relevance of the document is calculated:

wtotal=w1d + w2d + w3d = 80+23+21 = 124

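Relevance = wtotal * d, where d is the value that represents the distance between the search terms in the document (the greater the distance, the smaller d).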
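
The whole example can be reproduced with a short Python sketch. The weight intervals and occurrence counts are read off the calculations in this example. The names `INTERVALS`, `OCCURRENCES`, `word_weight` and `relevance` are hypothetical, and because this page does not say how the distance value d is computed, it is passed in as a plain parameter.

```python
# Weight intervals (minimum, maximum) per document part, as used in the example.
INTERVALS = {"heading": (80, 80), "description": (20, 50), "note": (10, 12)}

# Occurrence counts per word and document part, as used in the example.
OCCURRENCES = {
    "w1": {"heading": 2, "description": 1, "note": 4},
    "w2": {"heading": 0, "description": 3, "note": 1},
    "w3": {"heading": 0, "description": 1, "note": 2},
}


def word_weight(counts):
    """Weight of a word in the document: maximum of its weights per part."""
    weights = []
    for part, (lo, hi) in INTERVALS.items():
        n = counts.get(part, 0)
        # A word that is absent from a part contributes weight 0 there.
        weights.append(min(lo + n, hi) if n > 0 else 0)
    return max(weights)


def relevance(terms, d):
    """Sum of the word weights, multiplied by the distance value d.

    How d is derived from term distance is not specified on this page,
    so the caller supplies it.
    """
    total = sum(word_weight(OCCURRENCES[term]) for term in terms)
    return total * d


for term in ("w1", "w2", "w3"):
    print(term, word_weight(OCCURRENCES[term]))  # 80, 23, 21
print(relevance(["w1", "w2", "w3"], d=1.0))      # 124.0
```

With d=1.0 the result equals wtotal; any smaller d, corresponding to search terms that lie further apart, scales the relevance down proportionally.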

Customizing the Weight Interval

The following weights are defined in the Clusterpoint default Database policy:

Document part            Minimum  Maximum
Title                    65       90
Text                     1        75
All other indexed parts  0        100
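
To get a feel for these defaults, the sketch below applies the word weight rule (the minimum of interval minimum plus occurrence count and the interval maximum) to the intervals in the table. The dictionary keys `"title"` and `"text"`, the fallback `OTHER_PARTS`, and the function `default_weight` are hypothetical names chosen for illustration; they are not Clusterpoint policy syntax.

```python
# Default weight intervals (minimum, maximum) from the table above.
DEFAULT_INTERVALS = {"title": (65, 90), "text": (1, 75)}
OTHER_PARTS = (0, 100)  # "All other indexed parts"


def default_weight(part, occurrences):
    """Weight of a word in one document part under the default intervals."""
    if occurrences == 0:
        return 0  # assumption: an absent word has weight 0 in that part
    lo, hi = DEFAULT_INTERVALS.get(part, OTHER_PARTS)
    return min(lo + occurrences, hi)


print(default_weight("title", 1))   # 66: one hit in the title
print(default_weight("text", 1))    # 2: one hit in the text
print(default_weight("text", 40))   # 41: many hits in the text
print(default_weight("title", 40))  # 90: capped at the title maximum
```

Under these defaults a single occurrence in the title (66) outweighs 40 occurrences in the text (41), which reflects the idea that the title is the more important document part.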

To ensure more precise relevance calculation, the weight values of the default document policy can be changed, or appropriate weights can be assigned to different document parts when creating a custom document policy. As with any document policy, weights can be assigned for all documents in a Database, or customized for a single document when it is imported into the Database.