Search glossary

Stop words

Stop words are words that are usually ignored in full-text indexing. Typical English stop words are "the", "at", "is", "on" and "which". Because a stop word has only a grammatical or syntactic function, it can be left out when describing the content of a text. Such words also occur very frequently, so indexing them costs space and performance.

Every language has common words that occur frequently but are typically not relevant in a query, e.g. "that", "the", "this", "to". To save space and improve performance, most full-text engines can be configured to ignore stop words. The censhare full-text engine supports stop words. In our experience, however, this makes little real difference and can even cause unexpected query results, so the recommendation is to turn this setting off.
Further reading:
http://de.wikipedia.org/wiki/Stoppwort
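
As a rough illustration of why stop-word removal can produce unexpected results, the following minimal sketch filters terms before indexing. The stop-word list and the function name index_terms are assumptions made for this example; they do not reflect the list or the API of the censhare full-text engine.

```python
# Hypothetical stop-word list for illustration only; not the list used by censhare.
STOP_WORDS = {"the", "at", "is", "on", "which", "that", "this", "to"}


def index_terms(text, remove_stop_words=True):
    """Lowercase the text, split it on whitespace and optionally drop stop words."""
    terms = text.lower().split()
    if remove_stop_words:
        terms = [term for term in terms if term not in STOP_WORDS]
    return terms


# With filtering, part of the name is lost before it ever reaches the index.
filtered = index_terms("The Who")
unfiltered = index_terms("The Who", remove_stop_words=False)
print(filtered)
print(unfiltered)
```

In this sketch a search for the band name "The Who" is reduced to the single term "who", which is the kind of surprise that speaks for leaving the setting off.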

N-Gram

In computational linguistics, an n-gram is a fragment obtained by breaking a text into sequences of N consecutive elements. These elements can be characters, words, phonemes or similar units. The length of an n-gram is named using numeral prefixes: a bigram has two elements, a trigram three, and a pentagram five (for example, five characters). Text analysis using n-grams is a proven method in computational linguistics because it is applicable to all languages. Among other things, it is used to estimate how likely a particular character or word is to follow a given sequence of characters or words.
Further reading:
http://de.wikipedia.org/wiki/N-Gram
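
The breakdown into fragments and the "how likely is the next character" question can be sketched as follows. The function names char_ngrams and next_char_probabilities are made up for this example and are not part of any censhare API; the probabilities are simple relative frequencies counted in the given text.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def next_char_probabilities(text, prefix):
    """Estimate how likely each character is to follow the prefix in the text."""
    n = len(prefix)
    followers = Counter(
        text[i + n] for i in range(len(text) - n) if text[i:i + n] == prefix
    )
    total = sum(followers.values())
    return {char: count / total for char, count in followers.items()}


print(char_ngrams("search", 3))
print(next_char_probabilities("banana bandana", "an"))
```

For the word "search" the trigrams are "sea", "ear", "arc" and "rch". In the sample text, "an" is followed by "a" three times and by "d" once.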

Stemming

Reduction of words to their stem for indexing. Example: "bowls", "bowler", "bowling" and "bowled" obviously belong together and can be stored under their root (stem) "bowl".

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form. E.g. the base form of "going" and "went" would be "go". The idea behind using stemming in full-text search is to find documents even though words are inflected, either in the document or in the query. Stemming can be implemented using a language-specific algorithm and/or a dictionary. Algorithms essentially cut words down by rule, which may lead to strange results, while the dictionary-based approach does not cover words that are missing from the dictionary. The censhare full-text engine supports algorithmic stemming. In our experience this leads to unexpected results, so the recommendation is to turn it off, especially because the censhare query engine supports prefix, infix and postfix matching as well as fuzzy matching.
Further reading:
http://de.wikipedia.org/wiki/Stemming
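
To make the trade-off of algorithmic stemming more concrete, here is a deliberately naive suffix-stripping sketch. The suffix list and the function naive_stem are assumptions for illustration; this is neither the stemmer used by censhare nor an established algorithm such as the Porter stemmer.

```python
# Hypothetical suffix list, checked in this order; chosen only for the example.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


for word in ("bowls", "bowler", "bowling", "bowled", "number", "went"):
    print(word, "->", naive_stem(word))
```

The four "bowl" variants collapse to the same stem as intended. The same rules, however, turn "number" into "numb", so a query for one would match the other, and the irregular form "went" is not reduced to "go" at all, which would require a dictionary.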

Fuzzy Search

Fuzzy search is about finding words even though they are misspelled. E.g. searching for "recomondation" would find documents that contain the word "recommendation". Fuzzy search is a double-edged sword: without it, simple typos do not return the expected results, but a fuzzy search that is too tolerant returns too many unexpected results. The censhare full-text engine supports fuzzy search and allows the tolerance to be configured precisely through the so-called editing distance (see http://en.wikipedia.org/wiki/Levenshtein_distance ), which is the number of character insertions, deletions and changes. Between "recomondation" and "recommendation" the editing distance is two, because to get from "recomondation" to "recommendation" one "m" has to be inserted and one "o" has to be changed to "e". The recommendation is to enable fuzzy search but to keep the tolerance low, with an editing distance of 1.
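
The editing distance described above can be computed with the classic dynamic-programming approach. The following is a minimal sketch; the names edit_distance and fuzzy_match and the max_distance parameter are chosen for this example and do not correspond to censhare configuration keys.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and changes."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # change char_a to char_b (or keep it)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, word, max_distance=1):
    """True if the word is within the tolerated editing distance of the query."""
    return edit_distance(query, word) <= max_distance


print(edit_distance("recomondation", "recommendation"))
print(fuzzy_match("recomendation", "recommendation"))
```

In this sketch a tolerance of 1 accepts "recomendation" (one missing "m") but rejects "recomondation", whose distance to "recommendation" is 2; raising max_distance to 2 would accept it at the price of more unrelated hits.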

Okapi BM25
BM25 is the underlying ranking algorithm used to determine the rank of the results of a censhare Quick search.
Further reading:
http://en.wikipedia.org/wiki/Okapi_BM25
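
For orientation, the textbook form of the BM25 score can be sketched as below. This is not the censhare implementation: the function name bm25_score, the parameter values k1=1.5 and b=0.75 (commonly quoted defaults) and the non-negative IDF variant are all assumptions made for this example.

```python
import math


def bm25_score(query_terms, document, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against the query terms (textbook BM25)."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        containing = sum(1 for doc in corpus if term in doc)
        # Rare terms get a higher weight than terms found in most documents.
        idf = math.log((n_docs - containing + 0.5) / (containing + 0.5) + 1)
        freq = document.count(term)
        # Term frequency saturates with k1 and is normalized by document length via b.
        norm = freq + k1 * (1 - b + b * len(document) / avg_len)
        score += idf * freq * (k1 + 1) / norm
    return score


corpus = [
    ["full", "text", "search"],
    ["search", "engine", "ranking", "search"],
    ["stop", "words"],
]
scores = [bm25_score(["search"], doc, corpus) for doc in corpus]
print(scores)
```

In this small corpus the document that mentions "search" twice ranks above the one that mentions it once, and the document without the term scores zero.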

Bitap algorithm
The matching algorithm used for finding text segments in a string.
Further reading:
http://de.wikipedia.org/wiki/String-Matching-Algorithmus
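
The basic idea of bitap can be sketched with its exact-matching "shift-and" variant, where a bit mask tracks which prefixes of the pattern currently match. The function name bitap_search is an assumption for this example, and the sketch says nothing about how the censhare engine uses the algorithm internally.

```python
def bitap_search(text, pattern):
    """Return the index of the first exact occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # For each character, bit i is set if pattern[i] equals that character.
    masks = {}
    for i, char in enumerate(pattern):
        masks[char] = masks.get(char, 0) | (1 << i)
    state = 0  # bit i set means pattern[0..i] matches the text ending here
    for pos, char in enumerate(text):
        state = ((state << 1) | 1) & masks.get(char, 0)
        if state & (1 << (m - 1)):
            return pos - m + 1
    return -1


print(bitap_search("full-text search engine", "search"))
print(bitap_search("full-text search engine", "index"))
```

Because the state is updated with a shift and a bitwise AND per character, the search needs only one pass over the text. Python integers are unbounded, so this sketch has no pattern-length limit, unlike implementations based on a fixed machine word.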
