FAQ – censhare Full-Text-Search

Search can be complex and its results confusing. Here are some frequently asked questions and their answers.

Customer question

Can I search for an exact text string?

censhare's answer

The search in a typical text processing application performs a character-by-character search in a single document, which is fast enough for one document. It is not feasible, however, to search character by character through, say, 1 million documents. Even at only 0.01 s per document, this would take 10,000 seconds, almost 3 hours. Full-text search engines (see http://en.wikipedia.org/wiki/Full_text_search) therefore work differently: each document is split into words and an index is built. For every word, the index has an entry listing all documents (assets, in our case) that contain this word. For a query, the entered text is again split into words, and the intersection (see http://en.wikipedia.org/wiki/Intersection_(set_theory)) of the corresponding document lists is computed. The result set contains all documents that contain these words but does not tell whether the words are adjacent.

To perform a (more) exact search with a phrase query that considers word positions, the index would have to contain positional information as well. This increases the index size by a factor of 2 to 4 (Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze) and considerably slows down index updates and queries. This positional information is therefore deliberately not stored in the censhare database, so a phrase search is not possible. Please note that search engine providers like Google that offer a phrase search run server farms with literally thousands of machines working in parallel.
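The word index and the intersection step described above can be sketched in a few lines of Python. This is only a minimal illustration of the general technique, not censhare's implementation; the function names and the sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query word.

    Word positions are not stored, so adjacency cannot be checked.
    """
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents (assets)
documents = {
    1: "The red car stopped at the old house",
    2: "A red house stands on the hill",
    3: "The car is blue",
}
index = build_index(documents)
print(sorted(search(index, "red house")))  # [1, 2]
```

Document 1 is returned although "red" and "house" are not adjacent in it, which is exactly the limitation of an index without positional information.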

Customer question

How does the result ranking work and how does that compare to something like Google?

censhare's answer

censhare uses the asset metadata and/or content to derive the ranking. For content (e.g., text) the so-called term frequency–inverse document frequency (Tf-idf) is used (see http://en.wikipedia.org/wiki/Tf–idf). It considers how "important" a word is in a document and in a collection of documents. A term that occurs in many documents has low selectivity (e.g., the word "the"), and a match adds less weight to the total ranking of a document than a word that rarely occurs (e.g., "Palpigradi"). A document that contains a word multiple times gets a higher rank than a document in which this word occurs only once. If two documents have the same number of occurrences of a specific word, then the shorter document gets a higher rank. There are multiple mathematical functions for calculating the ranking. censhare uses the so-called Okapi BM25 (which is based on Tf-idf) because it matches users' expectations well (see http://en.wikipedia.org/wiki/Okapi_BM25). However, it is still a heuristic that may sometimes lead to unexpected ranking results.

Google uses the patented so-called PageRank algorithm, named after one of its inventors and Google founders, Larry Page (see http://de.wikipedia.org/wiki/PageRank). The ranking is derived from the number of links between web pages. A page that is referenced by many other pages gets a higher rank, especially if the referencing pages themselves have a high rank. Because many webmasters try to manipulate the ranking (e.g., by creating link farms), Google has modified and enhanced this basic algorithm. So this ranking is not derived from the content. It, too, is a heuristic that may sometimes lead to strange ranking results, but with Google, people usually have no particular results in mind. Think about this: searching for "Peach" gives more or less 12 arbitrary results on the first page with the message "About 168,000,000 results".
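The three effects of the content ranking described above (rare words weigh more, repeated words weigh more, shorter documents rank higher) can be reproduced with a small sketch of the textbook Okapi BM25 formula. The parameter values K1 = 1.5 and B = 0.75 and the idf variant used here are common defaults and are assumptions; the parameters censhare actually uses are not stated in this FAQ.

```python
import math
from collections import Counter

K1 = 1.5   # term-frequency saturation (assumed common default)
B = 0.75   # strength of document-length normalisation (assumed common default)


def bm25_scores(docs, query_terms):
    """Score each tokenised document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        counts = Counter(d)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            # Rare terms get a high idf, frequent terms a low one
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Longer-than-average documents are penalised
            norm = tf + K1 * (1 - B + B * len(d) / avg_len)
            score += idf * tf * (K1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample documents, already split into words
docs = [
    "the spider web".split(),                     # one "spider", short
    "the spider spider".split(),                  # two "spider", short
    "the spider sat on the garden wall".split(),  # one "spider", long
    "the garden wall".split(),                    # no "spider"
]

spider = bm25_scores(docs, ["spider"])
print(spider[1] > spider[0] > spider[2])  # True
```

In the same collection, a match on "web" (one document) contributes more to the score of the first document than a match on "the" (all documents).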

Customer question

What are stop words?

censhare's answer

In each language there are common words that occur frequently and are typically not relevant in a query, e.g. "that", "the", "this", "to". To save space and improve performance, most full-text engines can be configured to ignore such stop words. The censhare full-text engine supports stop words. In our experience, this does not really make a difference and even causes unexpected query results, so the recommendation is to turn this off.
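A short sketch shows how stop word removal can cause the unexpected results mentioned above. The stop word list is hypothetical; only "that", "the", "this" and "to" come from the text, the remaining entries are added for illustration.

```python
# Hypothetical stop word list for illustration only
STOP_WORDS = {"that", "the", "this", "to", "be", "or", "not"}


def remove_stop_words(words):
    """Drop all words that appear in the stop word list."""
    return [w for w in words if w not in STOP_WORDS]


print(remove_stop_words("to be or not to be".split()))  # []
print(remove_stop_words("the who live".split()))        # ['who', 'live']
```

The first query loses every word and can no longer match anything specific, and the second one no longer distinguishes the band name from the question word.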

Customer question

What is stemming?

censhare's answer

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form. For example, the base form of "going" and "went" would be "go". The idea behind stemming in full-text search is to find documents even if words are inflected, either in the document or in the query. Stemming can be implemented with a language-specific algorithm, with a dictionary, or with both. Algorithms basically mangle words by rule, which may lead to strange results; the dictionary-based approach does not cover words that are missing from the dictionary. The censhare full-text engine supports algorithmic stemming. In our experience, this leads to unexpected results, so the recommendation is to turn this off, especially because the censhare query engine supports prefix, infix and postfix matching as well as fuzzy matching.
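To illustrate why algorithmic stemming can produce strange results, here is a deliberately naive suffix-stripping sketch. It is a toy rule set invented for this example, not the stemmer used by censhare and not a real algorithm such as Porter's.

```python
# Toy suffix rules, checked in this order (an assumption for illustration)
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least two characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for word in ("going", "went", "news", "string"):
    print(word, "->", naive_stem(word))
# going -> go
# went -> went
# news -> new
# string -> str
```

The rule handles "going" as intended, misses the irregular form "went", and mangles "news" and "string" into unrelated stems.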

Customer question

What is prefix, infix and postfix matching?

censhare's answer

Prefix matching is about finding words that start with the given character string. "recommend" is a prefix of "recommendation", so with prefix matching a query for "recommend" would return documents containing the word "recommendation". A typical database index performs well for exact matches and prefix matches. Infix matching and postfix matching are about finding words that contain or end with the given character string. "commend" is part of "recommendation" and would match as infix; "base" is a postfix of "database" and would match as postfix. A typical database index cannot handle infix and postfix matches efficiently. The censhare full-text engine is optimised to handle these cases as well, using n-grams (http://en.wikipedia.org/wiki/N-gram).
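The general idea of an n-gram index can be sketched as follows. This is a generic illustration with trigrams (n = 3) and made-up function names; how the censhare engine builds and queries its n-grams internally is not described in this FAQ.

```python
from collections import defaultdict

N = 3  # n-gram length (assumed for this sketch)


def ngrams(text, n=N):
    """Return the set of all substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_ngram_index(words):
    """Map each n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in ngrams(word):
            index[gram].add(word)
    return index


def ngram_search(index, query, mode="infix"):
    """Find words matching the query as prefix, infix or postfix."""
    grams = ngrams(query)
    if not grams:
        raise ValueError(f"query needs at least {N} characters")
    # Only words containing every n-gram of the query can match
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The n-grams could occur in another order, so verify each candidate
    check = {
        "prefix": str.startswith,
        "infix": str.__contains__,
        "postfix": str.endswith,
    }[mode]
    return {word for word in candidates if check(word, query)}


words = ["recommendation", "recommend", "commendable", "database", "basement"]
index = build_ngram_index(words)
print(sorted(ngram_search(index, "commend")))          # ['commendable', 'recommend', 'recommendation']
print(sorted(ngram_search(index, "base", "postfix")))  # ['database']
```

The index lookup narrows the search to a few candidates, so only those have to be compared against the query string instead of every word in the index.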

Customer question

What is fuzzy search?

censhare's answer

Fuzzy search is about finding words even if they are misspelled. For example, searching for "recomondation" would find documents that contain the word "recommendation". Fuzzy search is a double-edged sword: without it, simple typos do not return the expected results, but a fuzzy search that is too tolerant returns too many unexpected results. The censhare full-text engine supports fuzzy search and allows the tolerance to be configured precisely through the so-called editing distance (see http://en.wikipedia.org/wiki/Levenshtein_distance), which is the number of character insertions, deletions and changes needed to turn one word into the other. Between "recomondation" and "recommendation" the editing distance is two, because to get from "recomondation" to "recommendation" one "m" has to be inserted and one "o" has to be changed to "e". The recommendation is to enable fuzzy search but to use a low tolerance of 1 for the editing distance.
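The editing distance from the example above can be checked with a short implementation of the standard Levenshtein algorithm. The helper fuzzy_match and its max_distance parameter are hypothetical names for this sketch and do not refer to censhare configuration settings.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and changes from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # change char_a to char_b (or keep it)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(word, query, max_distance=1):
    """True if the query is within the tolerated distance of the word."""
    return edit_distance(query, word) <= max_distance


print(edit_distance("recomondation", "recommendation"))  # 2
print(fuzzy_match("recommendation", "recomendation"))    # True
```

With a tolerance of 1, a single missing "m" still matches, while the query with two errors from the text would require a tolerance of 2.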

Customer question

Why do I sometimes have to enter a search string with wildcard ("*") characters and sometimes not?

censhare's answer

It depends on the kind of index that is defined for the field (attribute) or feature. Standard fields/features have an index that is not full-text styled. Here an exact match is the default, and a prefix, infix or postfix matching search is performed only when wildcards are used. Note that infix and postfix searches are expensive (slow) because the whole index has to be scanned. It is like looking up a name in the white pages: finding all people whose names start with "Peter" ("Peter", "Peterman", "Peterson", etc.) is easy, but finding all people whose names contain or end with "win" is a huge effort. If required, a full-text index that is optimised to handle prefix, infix and postfix matching can be configured for a field/feature. Wildcards are then not necessary (and are ignored) because prefix, infix and postfix matching is always performed.
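The white pages comparison can be made concrete with a sorted list standing in for a standard (non full-text) index. This is a simplified sketch with invented names; the sentinel character used to find the end of the prefix range assumes that no name contains a character above U+FFFF.

```python
import bisect

# A sorted list plays the role of a standard database index
names = sorted([
    "Baldwin", "Goodwin", "Peter", "Peterman",
    "Peterson", "Petrov", "Smith", "Winter",
])


def prefix_lookup(sorted_names, prefix):
    """Binary search: jump straight to the range of matching names."""
    start = bisect.bisect_left(sorted_names, prefix)
    # "\uffff" sorts after every character assumed to occur in a name
    end = bisect.bisect_left(sorted_names, prefix + "\uffff")
    return sorted_names[start:end]


def infix_scan(sorted_names, part):
    """No shortcut available: every entry has to be inspected."""
    return [name for name in sorted_names if part in name]


print(prefix_lookup(names, "Peter"))  # ['Peter', 'Peterman', 'Peterson']
print(infix_scan(names, "win"))       # ['Baldwin', 'Goodwin']
```

The prefix lookup needs only a logarithmic number of comparisons, whereas the infix scan grows linearly with the size of the index.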