Search Glossary

Glossary of terms and acronyms related to enterprise search. 

Absolute boosting
Absolute boosting enables a document to be consistently displayed at a given position in the result set when a user searches with a specific query. It can also be used to prevent individual documents from being displayed for a specific query.
 
Accent normalization
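The normalization of accented characters to their unaccented base form so that, for example, a query for "hotel" also matches "hôtel". Character normalization can preserve both original and normalized forms for accented words.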
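
As a minimal sketch of how both forms might be produced, the following Python uses the standard unicodedata module. The function names strip_accents and index_forms are illustrative assumptions, not part of any particular search platform.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFD decomposition splits "ô" into "o" plus a combining circumflex.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(word: str) -> set[str]:
    """Return the forms to index: the original word and its normalized form."""
    return {word, strip_accents(word)}


print(index_forms("hôtel"))
```

Indexing both forms lets an accented query stay precise while an unaccented query still finds the document.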

Access control list (ACL)
A data set which grants permissions, or access rights, to each user or group for a specific system object, such as a directory or file.
 
When the ACL information from the content repositories is applied to the index, the same permissions apply to search results. This means that a user can only see the query results that he/she is entitled to view, based on his/her permissions in the source content repository.

Adjacent searching
Commonly referred to as proximity search.
 
An extension to Boolean searching, this technique checks the position of terms and only matches those within the specified distance. It’s a good way to cut down the irrelevant matches and get better results.
 
Alert
A message that the enterprise search engine broadcasts (for example, to a front-end application, or a messaging system such as e-mail, SMS or IM) when a document satisfies a stored query. Alerts are either near real-time or configured as asynchronous events run on a scheduled basis.

Alert engine
A matching engine which performs matching of incoming documents against stored queries (triggers). A match generates an alert.

Alert query
An alert query is the set of filtering conditions an end-user or external application sends to the Alert Engine.
 
Each alert query is composed of several matching conditions and Boolean operators in a similar way as a search query.
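
The matching of incoming documents against stored alert queries could be sketched as follows. This is an illustration only: the AlertQuery class, its all_of/any_of/none_of fields (standing in for AND, OR, and NOT conditions), and the alerts_for function are hypothetical names, and tokenization is reduced to a whitespace split.

```python
from dataclasses import dataclass, field


@dataclass
class AlertQuery:
    """A stored query (trigger) made of simple Boolean conditions."""
    name: str
    all_of: set[str] = field(default_factory=set)   # AND: every term must occur
    any_of: set[str] = field(default_factory=set)   # OR: at least one term must occur
    none_of: set[str] = field(default_factory=set)  # NOT: no term may occur

    def matches(self, terms: set[str]) -> bool:
        if not self.all_of <= terms:
            return False
        if self.any_of and not (self.any_of & terms):
            return False
        return not (self.none_of & terms)


def alerts_for(document_text: str, stored_queries: list[AlertQuery]) -> list[str]:
    """Return the names of all stored queries that the incoming document satisfies."""
    terms = set(document_text.lower().split())
    return [q.name for q in stored_queries if q.matches(terms)]


queries = [
    AlertQuery("mergers", all_of={"merger"}, none_of={"rumor"}),
    AlertQuery("search-news", any_of={"search", "index"}),
]
print(alerts_for("Search vendor announces merger", queries))
```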

Anchor text
Anchor text is the textual component of web hyperlinks (text links or ‘alt’ text associated with image hyperlinks). Anchor texts may provide additional descriptive information about the referred page, and are therefore often indexed as metadata to the referred document.
 
Anchor text analysis detects links from other pages to a given page and uses the anchor texts associated with these links to compute an authority rank component. The referring anchor texts may also be included as searchable content for the referred documents.

Anti-phrasing
Identifying word sequences in queries that do not contribute essentially to the query’s meaning, such as “Where can I find” or “Where is”.
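
A minimal sketch of anti-phrasing, assuming a small hand-maintained phrase list and handling only phrases at the start of the query. The ANTI_PHRASES tuple and the strip_anti_phrases function are illustrative names.

```python
# Illustrative list; a real deployment would use a much larger dictionary.
ANTI_PHRASES = ("where can i find", "where is")


def strip_anti_phrases(query: str) -> str:
    """Remove a leading anti-phrase from the query, if one is present."""
    cleaned = query.strip()
    lowered = cleaned.lower()
    # Try longer phrases first so the most specific one wins.
    for phrase in sorted(ANTI_PHRASES, key=len, reverse=True):
        if lowered.startswith(phrase + " "):
            return cleaned[len(phrase):].strip()
    return cleaned


print(strip_anti_phrases("Where can I find hotel prices"))
```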

Application programming interface (API)
A programmatic interface that enables software developers to access features and functions of a hardware or software platform.  An API is the specific method prescribed by a computer operating system or by an application by which a programmer writing an application program can make requests of the operating system or another application.

Approximate matching
Matching a query term and a term within a document based on approximations. Such approximations can be based on spell check (see Spell Checking) or linguistic normalization (lemmatization, accent normalization).

Asian language tokenization
Tokenization (word segmentation) for Asian languages requires special treatment. These languages do not allow text to be split into word entities by referring to whitespace or other separators. Asian language text needs to be split into tokens that can be treated as words during document processing and matching.

Authority
One dimension of search relevancy. This indicates that the document is considered to be an authority for this query.  That is, the document is being referred to by others, for example, through web anchor texts. Many items can be part of the analysis of documents to determine this parameter – Web link cardinality, article references, page impressions, and product revenue, to name a few.

Average query response time
The average time it takes for the search engine to respond to a given query. There are typically two times that can be measured: 1) the average response time of the search engine itself, and 2) that of the complete system for an end-to-end query (i.e. including the application and web server times).

Benchmarking
A process that allows organizations to evaluate various aspects of their processes in relation to best practice, usually within their own industry sector.  Benchmarking also allows organizations to develop plans on how to adopt such best practices, usually with the aim of increasing performance. Benchmarking may be a one-time event, but is often treated as a continuous process.

Bigram
A bigram is another term for a 2-word phrase. It can also be seen as an N-gram where N=2.
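
As an illustration, word N-grams can be generated with a sliding window over the tokens. The ngrams function below is a sketch that assumes whitespace tokenization.

```python
def ngrams(text: str, n: int = 2) -> list[tuple[str, ...]]:
    """Return all runs of n consecutive words in text (bigrams by default)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("enterprise search platform"))
```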

Boolean Language
A structured query language that implements Boolean algebra using the industry-standard AND, OR, and NOT operators.

Boolean Search or Boolean Query
A form of logical comparison. Boolean operators let you define whether multiple search terms are matched within a text block. A Boolean expression is constructed by joining terms together with the three special operators: AND, OR, and NOT. You can also combine sub-expressions within a query using these Boolean operators. Proximity search using the NEAR/ONEAR operators is somewhat related. NEAR is similar to AND, but implies an additional constraint that the terms should appear within a given distance in the document.

Boosting
Boosting may be used to alter the relevancy value of a document compared to other documents in a search index, typically because it is perceived to be a more valuable resource. It is the addition or subtraction of a value to a document’s rank (relevancy). By default, documents with the highest rank values are returned to the user before documents of lower rank values.

Boosting may be applied in two ways:
- Query independent (document boosting). This is used to boost high-quality pages for all queries that match the document.
- Query dependent (query boosting). In this case specific documents may be boosted for given queries.
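
The two kinds of boosting can be sketched as additions to a base rank value. The Hit class, the apply_boosts function, and the two boost tables are hypothetical names chosen for this example; an actual engine would combine boosts with its own rank model.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    rank: float  # base relevancy value computed by the engine


def apply_boosts(hits, query, doc_boosts, query_boosts):
    """Add query-independent and query-dependent boosts, then sort by rank."""
    boosted = []
    for hit in hits:
        # Query-independent boost: applies whenever the document matches.
        boost = doc_boosts.get(hit.doc_id, 0.0)
        # Query-dependent boost: applies only for this specific query.
        boost += query_boosts.get((query, hit.doc_id), 0.0)
        boosted.append(Hit(hit.doc_id, hit.rank + boost))
    # Highest rank values are returned first.
    return sorted(boosted, key=lambda h: h.rank, reverse=True)


hits = [Hit("a", 10.0), Hit("b", 8.0), Hit("c", 9.0)]
ranked = apply_boosts(hits, "laptop", {"b": 1.5}, {("laptop", "c"): 3.0})
print([h.doc_id for h in ranked])
```

A negative boost value lowers a document's rank in the same way.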

Boundary match 
The ability to limit a query term/phrase to the start and/or end of an indexed field/parameter. Combining start and end conditions provides an exact field/parameter match.

Call-backs
Programmatic alerts produced by an API. For a search platform, this is usually related to the content processing and indexing status of a document in order for the client application to keep track of the processing/indexing progress.
 
Case sensitive/insensitive searching
Search engines will most often normalize words to lower case. Some search applications, though, may want to use case sensitive search against specific content, such as metadata.

Categorization
Categorization is the process of organizing pieces of information into topical categories. Usually, these are hierarchical trees, with the most general topics at the top and the most specific at the bottom.

A department store might have the category Products, Shoes, Women, Cross-Trainers, while a gardening site might have the category Plants, Flowers, California Natives, Poppies. In either case, a searcher can understand more about the content of the page when they know the category. Some categorization products will attempt to classify data automatically, while others assist human catalogers.

A search engine may apply categorization of the documents in the index based on similarities (typically based on a training set), matching rules or programmatic rules. See also Result Clustering.

Classification or Categorization
The process of organizing pieces of information into topical categories. Usually, these are hierarchical trees, with the most general topics at the top and the most specific at the bottom (for example: Products, Shoes, Women, Cross-Trainers). A searcher can understand more about the content of a page when they know its category. See also Categorization.

Clusterisation
The on-the-fly gathering of the answers to a request into heterogeneous subsets, each relating to the same topic.

Collection
Content that is to be processed, made searchable, and retrieved as a logical unit. Content types can be grouped by source and by the processing rules that are to be applied to this type of content.

Collection-level security
The application tier will assign different authorization levels to various collections within the search index. End users then have access to the set of collections that map to their authorization levels.

Completeness
In relation to relevancy, a gauge of how well the document matches superior document contexts such as the title or the URL. It describes what matches the query: document title, author, mention in the body text, metadata linked to the document, and both root and expanded forms of words.

Concept extraction
The ability to mine concepts from data using linguistic analysis.

Connector / Agent
A program that connects to an information source either to query it in real time (connector) or to extract specific elements from its documents for indexing (agent).

Content
Content is the external data input to the enterprise search platform. Content is converted into internal document representation after being fed into the system.

Content Connector
A content connector extracts content from an external content repository (file systems, content management systems, databases, collaboration applications) and inputs this to the search system for indexing.

Connectors may be based on push or pull technology, depending on the capabilities of the content repository.

Content aggregation
The bringing together of content from multiple source repositories for retrieval at a later time.  In some cases, this term is also used for the amalgamation of search results into a comprehensive whole.

Content management system (CMS)
A software system for organizing and facilitating collaborative creation and publishing of documents and other content. 

Content routing
In a large search system the index is split into multiple columns/partitions. Different algorithms may be chosen for routing content to columns/partitions. An efficient method is to apply a statistical distribution of documents based on a hashing algorithm. In other cases it may be desirable to base the routing on collections or other attributes of the content.
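
Both routing strategies can be sketched briefly. The function names, the choice of SHA-256 as the hash, and the collection-to-partition map below are assumptions made for illustration.

```python
import hashlib


def route_by_hash(doc_id: str, num_partitions: int) -> int:
    """Pick a partition from a stable hash of the document ID."""
    # A cryptographic digest is used only for its even, repeatable spread;
    # Python's built-in hash() is salted per process and would not be stable.
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route_by_collection(collection: str, collection_map: dict[str, int]) -> int:
    """Pick a partition from a fixed collection-to-partition assignment."""
    return collection_map[collection]


print(route_by_hash("doc-1", 4))
print(route_by_collection("news", {"news": 0, "sports": 1}))
```

Hash routing spreads documents evenly without any lookup table, while collection routing keeps related content together at the cost of possibly uneven partition sizes.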

Context relevancy
One dimension of search relevancy. The importance of a term/phrase/entity match depends on the matching context. Contexts may be fields or semantic structures of the document, such as paragraph, sentence or title. 

Contextual Insight
Next generation search intelligence that dynamically identifies relationships so that the users can quickly find facts and answers to questions like “When did the Berlin wall fall?” The users get both the contextual results with extreme precision and the contextual navigation for further investigation of related information.

Contextual entity extraction
Extracted entities (see Entity extraction) can be automatically annotated to semantic structures in the text, such as paragraphs or sentences. Such annotation enables normalized matching of entities as well as contextual navigation into detected entities from search results.

Crawler
The active component of the indexer, which fetches and indexes HTTP sources and file systems. The crawler can follow HTML links.

Crawling
The act of accessing Web servers and/or file systems in order to extract information to feed into the enterprise search platform. By following links, a Crawler is able to traverse Web content hierarchies based on a single start URL.

DMS
Document management system

Date range
A search engine may provide an option to search for documents modified on a specific date, before a date, after a date, or between two dates. See also Freshness boosting.

Deep navigators
A type of dynamic drill-down navigator which applies on-the-fly aggregation of result values across the entire result set for a query. See also Navigator.

Dictionary/Thesaurus
Within the context of search, a dictionary supports linguistic processing of content and queries against a list of words/terms/phrases in order to improve recall and/or precision for a query. A compiled dictionary structure is normally used for performance reasons.

Directed search
A narrow search within a specified area of the indexed content.  Users may choose to search within “news” if they want the latest updates on today’s game, for example, instead of having to search within “news”, “culture”, and “sports.”

Document
A piece of content that has been normalized into the enterprise search platform’s document structure, as opposed to the raw content itself.

Document element
A document element is a part of a document within the document processing framework. A document is divided into elements in order to enable individual processing and indexing of structural parts of the documents. These may be the heading, body, and metadata of HTML documents, or fields within a database schema. Document elements are mapped to searchable fields within the index.

Document processing pipeline
The document processing pipeline is a sequential set of pre-index document processing stages.

Document summary
A document summary represents the subset of a matching document's content that is returned with a query result.

Document summary field
The content of an individual field within a document summary. The set of summary fields returned is typically a sub-set of the indexed fields of a document. See also Dynamic teaser.

Document vector
A document vector is simply a set of (keyword, weight) pairs, where keyword is a word or a phrase associated with the document, and weight is a numerical measure of how important keyword is for the document.

Vectors are a kind of document signature (word-weight pairs) representing a document's content in a way that allows comparison between documents. It is the numerical representation of the unstructured textual content of a document.  Vectors can be used to enable clustering and refinement operations.
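
One common way to compare two such vectors is cosine similarity, which the glossary does not name, so treat it here as an assumed choice. The sketch represents each vector as a dictionary of keyword-weight pairs.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Compare two (keyword, weight) vectors; 1.0 means identical direction."""
    # Only keywords present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(keyword, 0.0) for keyword, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = {"search": 0.8, "index": 0.5}
doc2 = {"search": 0.6, "crawler": 0.7}
print(cosine_similarity(doc1, doc2))
```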

Document-level security
Within a search engine document-level security implies that the search index provides the same document access control granularity as the source content repositories. This may be facilitated by mapping the ACL information from the content repositories to the index.

Document-processing stage
The document-processing stage may modify, remove, or add information to a document, such as adding new meta information for linguistic processing, or extracting information about the language the document is written in. Also known as a document processor.

Duplicate detection
Search engines may apply different levels of duplicate detection. Exact duplicates are copies of the same document located in different repositories. The next level of duplicate detection is typically to look for documents with equal visible content (excluding metadata). Certain applications may want to apply even more aggressive duplicate detection, e.g. based on a set of fields that are equal. See also Field collapsing.

Dynamic concept extraction
The ability to mine concepts from data present in the result set of a query through statistical and linguistic analysis. See also Entity extraction.

Dynamic drill-down
A navigation tool for structured data; it provides multidimensional drill-down in structured data based on facets of content. This enables on-the-fly aggregation of result values for multiple fields across the documents in a result set. For numeric data this also includes dynamic binning of result values based on statistical value distribution across the result set.
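
The on-the-fly aggregation of field values across a result set can be sketched with a counter per field. The navigators function and the example field names are illustrative; dynamic binning of numeric values is not shown.

```python
from collections import Counter


def navigators(result_set: list[dict], fields: list[str]) -> dict[str, Counter]:
    """Count how often each value occurs per field across the result set."""
    return {
        name: Counter(doc[name] for doc in result_set if name in doc)
        for name in fields
    }


results = [
    {"brand": "A", "color": "red"},
    {"brand": "A", "color": "blue"},
    {"brand": "B"},
]
print(navigators(results, ["brand", "color"]))
```

Each counter supplies the value list and hit counts that a drill-down navigator would display next to the results.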

Dynamic rank
The process by which rank components are computed during matching, based on the level of match between document and query.

Dynamic teaser
A short summary of a document, generated based on the actual query, showing the regions of the document matching the query – with the query terms highlighted.

ETL-type tools
Extract, transform, and load (ETL) is a data-integration function that involves extracting data from outside sources, transforming it to fit business needs, and ultimately loading it into a data warehouse.  In a search application ETL tools may be used for merging of database records and content normalization.

Entity Extraction
The ability of an enterprise search platform to parse and recognize informational entities, such as geographic names, personal names, and company names. Entities can typically be annotated to the indexed documents, and enhance the search and navigation experience. See also Contextual entity extraction.

Exact match
Match query terms to document words exactly. This will not allow fuzzy matching based on spell check and/or linguistic normalization.

False positives
When a search returns results that do not contain what was searched for. 

Federated search
In a federated search, users receive results from multiple search and retrieval systems – for example, from other search engines, commercial information services, or internal databases.
 
Federation is the blending of results from multiple, often non-compatible search systems.

Field
The schema of a search index splits documents into fields. Fields specify those elements of a document that are to be searchable or presented in the result.

Field collapsing
Used to collapse a group of results with a similar value for a given field into a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site are collapsed into one or two entries in the result set, typically with an associated “more documents from this site” link. See also Duplicate detection.
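
A sketch of field collapsing over a ranked result list follows. The collapse_by_field function, its max_per_value parameter, and the "site" field are hypothetical; the input is assumed to be sorted by rank already.

```python
def collapse_by_field(results: list[dict], field: str, max_per_value: int = 1):
    """Keep at most max_per_value results per distinct field value.

    Returns the kept results plus the total count per value, which can
    drive a "more documents from this site" link.
    """
    kept: list[dict] = []
    totals: dict = {}
    for doc in results:
        value = doc.get(field)  # documents lacking the field share the value None
        seen = totals.get(value, 0)
        if seen < max_per_value:
            kept.append(doc)
        totals[value] = seen + 1
    return kept, totals


ranked = [
    {"url": "a/1", "site": "a"},
    {"url": "a/2", "site": "a"},
    {"url": "b/1", "site": "b"},
    {"url": "a/3", "site": "a"},
]
kept, totals = collapse_by_field(ranked, "site")
print([doc["url"] for doc in kept], totals)
```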

File Traverser
A tool for accessing the files (e.g. Microsoft Word, HTML, and XML files) that are located on a standard file system in order to bring them into the index of the enterprise search platform.
A File traverser works along file system directory structures, whereas the Web Crawler crawls Web servers along URI structures.

Footprint
The amount of computing resources, typically RAM, CPU time, and disk space, that a piece of software requires in order to run.

Filter
A program that reads the contents of a document to extract the information to be taken into account (embedded text, metadata) or to carry out a specific process on certain documents. New filters can be added alongside the existing filters.

Indexer
A program that creates an index. Information sources that have no native query mechanism must be indexed before they can be searched.

Freshness
The "age" of the document compared to the time of query. 

Freshness boosting
Enhances relevancy by boosting documents based on their relative age – that is, compared to the time of query.

Full-text sorting
Sorting of search results based on the full textual content of a field (or a configurable number of characters).

Fuzzy matching
Exact matching is very strict: either a word matches or it doesn’t. Fuzzy matching attempts to improve search recall by matching more than the exact word: fuzzy matching techniques try to reduce words to their core and then match all forms of the word. See also Approximate matching.

Generation of Hypothesis
The enrichment of a user request with closely matching vocabulary and correlated information contained within the Knowledgebase.

Geo enabled search 
The ability to sort, filter and/or rank documents based on their geographical distance from a given source position, typically the position of the end-user.
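
Sorting by geographical distance can be sketched with the haversine great-circle formula, which is an assumed choice here; engines may use other distance models. The "lat" and "lon" field names and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(docs: list[dict], origin: tuple[float, float]) -> list[dict]:
    """Order documents by distance from the origin (latitude, longitude)."""
    return sorted(
        docs,
        key=lambda d: haversine_km(origin[0], origin[1], d["lat"], d["lon"]),
    )


stores = [{"name": "far", "lat": 0.0, "lon": 2.0}, {"name": "near", "lat": 0.0, "lon": 1.0}]
print([s["name"] for s in sort_by_distance(stores, (0.0, 0.0))])
```

The same distance value could also serve as a filter (maximum radius) or as one component of the rank.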

Geo/Location
In relation to relevancy, the importance of location in relation to the query term.

Golden set
The set of documents and queries that are to be used for testing; a minimum of 2,000 documents and at least 50 queries. Typically these are manually selected.

Impression logging
The ability to log all query results that are displayed to the end-user. This means that each query result generates an impression log entry for each document returned on the result page.

Incremental indexing
Incremental indexing enables an efficient combination of freshness and scalability in a search node. The index can be partitioned into different segments (partitions) with different refresh rates.

Data is initially indexed in the smallest index, where it resides for a given period of time before it propagates to a larger index. This procedure repeats itself; once data has entered index P, it stays there for a configurable amount of time before it moves to index P+1. This concept takes place within one search node, and is independent of the partitioning performed using multiple columns of search nodes.

Index
An index is an inverted representation of searchable content: a list of the terms occurring in the documents, each with a reference to all documents that contain the term.
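
A toy version of such a structure maps each term to the set of documents that contain it. The build_index and search_and functions are illustrative names, and tokenization is reduced to a lowercase whitespace split.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the IDs of all documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search_and(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Return the documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


docs = {"d1": "enterprise search engine", "d2": "search index"}
index = build_index(docs)
print(search_and(index, ["search", "index"]))
```

Looking up a term is a single dictionary access, which is why queries do not need to scan the documents themselves.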

Index profile
The configuration file that defines the schema for a searchable index. This includes fields and properties of the index, similar to an XML schema, but also specifying field types and search engine-specific field features.

Index-based security
The resolution of a repository’s document ACL permissions at query time by the index itself through the use of stored meta-data. Using this method, results lists will only include hits for which the searcher has viewing permissions. Compared to post-processing, the index-based security method gives higher query performance and enables the search engine to return correct counts for navigators and related concepts.

Indexing latency
The time from when a document is added to the search system to when the document is included in the searchable index.

Ingestion rate
The number of documents per unit time that an enterprise search platform can process.

Intranet
An Intranet is a network which provides similar services within an organization to those provided by the Internet. It is normally separated from the Internet via a firewall.

Knowledgebase
Internal database containing vocabulary and synonyms and able to import Thesaurus data. Used to enrich search requests automatically and to better describe the search subject. The Knowledgebase is fed both by the user and automatically by dedicated learning algorithms.

Lemmatization
Utilizing lemmatization enables the search system to recognize and match different grammatical forms of a word.  For example, searching for “mouse” will also produce hits on “mice”.

Lemmatization by expansion
A type of lemmatization which expands words into the full set of inflected forms.

Lemmatization by reduction
A type of lemmatization, also referred to as “base form reduction”, that normalizes indexed terms and query terms to their grammatical base form. For example, “ate” becomes “eat.”
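
Reduction can be sketched as a dictionary lookup applied to both indexed terms and query terms. The BASE_FORMS table below is a toy stand-in for a real lemmatization dictionary, and the function names are assumptions.

```python
# Toy dictionary of inflected form -> base form; real systems use
# language-specific dictionaries with many thousands of entries.
BASE_FORMS = {"ate": "eat", "eaten": "eat", "eats": "eat", "mice": "mouse"}


def reduce_to_base(term: str) -> str:
    """Return the grammatical base form of a term, or the term itself if unknown."""
    lowered = term.lower()
    return BASE_FORMS.get(lowered, lowered)


def terms_match(query_term: str, indexed_term: str) -> bool:
    """Two terms match when they reduce to the same base form."""
    return reduce_to_base(query_term) == reduce_to_base(indexed_term)


print(reduce_to_base("ate"), terms_match("mouse", "mice"))
```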

Linguistics
The study of the nature, structure, and variation of language. In advanced enterprise search platforms, linguistic analysis enables transformation of content and queries for the purposes of improving relevancy, recall, and precision.

Link cardinality
The number of links in a set that refer to a given document.  It is best used to determine the relevancy of a Web page by factoring in how many other pages refer to the page under consideration.

Metadata
Metadata is often described as “data about data”.  It typically augments the full text of a document to help with recall, precision, creating filters, and working with navigators.

Mining
Finding useful facts in databases of text; evaluating large amounts of stored data and looking for useful patterns.

More like this
A way to refine search by identifying the right set of documents and then locating similar documents. This allows the searcher to control the direction of the search and focus on the most fruitful lines of inquiry.

Morphologic analysis
Used in query analysis, this analysis includes all forms of a given word via linguistic normalization (lemmatization).

Morphology
The study of the structure and form of words in language or a language, including inflection, derivation, and the formation of compounds.

Multi-level sorting
Sorting by multiple fields. Both text and integer fields may be sorted upon (ascending or descending). Field sorting may be combined with rank sorting.

Name value-pairs
In a search context, name value-pairs are raw data that is normalized into a structured “tree” of information. They are then sent downstream to waiting document processors.  For example, name value-pairs can be data about cars that is structured into categories containing information about “make”, “color”, “year”, and “mileage.”

Natural language processing (NLP)
The process of using linguistic analysis to infer meaning from human-written text that could not be extracted from the individual word meanings alone. Instead of using Boolean logic, the user can simply type in a question as a query. The simplest processing just removes stop words and uses statistical approaches.

Navigation
Information discovery through drill-down into query results. Navigation is possible both on document level attributes/entities and contextual entities within the matching context of the search results. Dynamic Drill-Down may be used to drill down into any dimension of the documents that can be represented as numeric information or well-defined terms/strings. The combination of entity extraction and drill-down provides a powerful way of drilling down into the results.
See also Navigator.

Navigators
 
A navigator is a construct that enables filtering and grouping of search results. On an international site, you may have a navigator that enables you to only display results with content in a given language – for instance, “Display English results only.”

 
Network Attached Storage
 
A specialized file server that connects to the network. A Network Attached Storage (NAS) device contains a slimmed-down (microkernel) operating system and file system and processes only I/O requests by supporting popular file sharing protocols such as NFS (Unix) and SMB/CIFS (DOS/Windows). Using traditional LAN protocols such as Ethernet and TCP/IP, the NAS enables additional storage to be quickly added by plugging it into a network hub or switch. As network transmission rates have increased from Ethernet to Fast Ethernet to Gigabit Ethernet, NAS devices have come up to speed parity with direct attached storage devices.

 
Node
 
In general, a node is a basic unit used to build data structures, such as linked lists and tree data structures. In an enterprise search system, a node usually refers to a server.

 
Noun Phrase Extraction
 
Noun Phrase Extraction implies that phrases like “competitive advantage”, “key driver” and “seller’s market” can be extracted and annotated to a document prior to indexing.

 
Offensive content filter
 
An offensive content filter detects offensive content (sexual, drugs, violence) in a document and can tag the document as offensive. Based on that tag, the document can either be excluded from the index (as spam) or be removed from results on a per-query basis.

 
Ontology
An ontology defines concepts, providing a way to move towards consistency in vocabulary. It provides a working model of the entities and interactions of a particular topic, such as dentistry or anthropology. It also captures the specific knowledge of a given domain, for example finance or pharmaceuticals.

 
Orthographic analysis
 
Orthographic analysis is used in checking for typos and official spelling variants (for example, German spelling).

 
Parametric search
 
Parametric searching allows people to find items of interest based on an individual item’s parameters, or particular characteristics. Such parameters or facets may be represented as Fields within a search index.

 
Parsing
The process of analyzing input to determine its grammatical structure with respect to a formal grammar. A parser is a computer program that carries out this task. Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. Generally, parsers operate in two stages, first identifying the meaningful tokens in the input and then building a parse tree from those tokens.

 
Phonetic search
Phonetic search is the analysis of words that are pronounced similarly in order to detect all possible variants.

 
Phrase detection
 
The recognition and grouping of an idiom such as “home run” or “Christmas tree”. Detection of an implicit phrase in a query may improve precision of a query.

 
Phrase searching
 
A search engine may provide an option to search a set of words as a phrase, either by typing in quotation marks (“”), by using a command or clicking a button. When it receives this kind of search, the engine will generally locate all words that match the search terms, then discard those which are not next to each other in the correct order. To perform this task, the index must store the position of the word in the document, so the search engine can tell where the words are located. See also Proximity search.

 
Phrasing
 
The recognition and grouping of an idiom such as “home run” or “Christmas tree”. See also Phrase detection.

Post Ranking
The cross-referencing of results coming from various sources and their sorting by decreasing relevance.

Preferential Pages
Documents that are presented first for certain predefined search requests, regardless of their relevance.

Precision and Recall
Precision is the proportion of the retrieved results that are relevant to the query. Higher precision means better relevance and more precise results, but may imply fewer results returned.
For a query, recall means the ability to retrieve as many documents as possible that match or are related to a query, that is, the proportion of all relevant documents that is actually retrieved. Recall may be improved by linguistic processing such as lemmatization, spell-checking, and synonym expansion.

In information retrieval, there’s a classic tension between recall and precision. If you aim for more recall (trying to find all the relevant items), you often get a lot of junk. If you limit your search to find only precisely relevant items, you can miss important items because they don’t use quite the same vocabulary.
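
Both measures can be computed from two sets of document IDs, using the standard set-based formulation from information retrieval. The precision_recall function and the example IDs are illustrative.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four results returned, two of them relevant; three relevant documents exist.
print(precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"}))
```

Returning more results can only keep or raise recall, but it usually lowers precision, which is the tension described above.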

 
Processing pipeline
 
Sequential stages of processing within the search engine before the creation of the final index of the content.

Proper name recognition
Proper name recognition is a way of identifying word sequences in text that are defined as proper names or phrases in the appropriate dictionary. See also Spell Check.

Proximity boosting
Documents that contain the query terms closer together are ranked higher than documents that contain these terms distributed throughout the document. This may also be referred to as Implicit Proximity.

Proximity search
An extension to Boolean searching, this technique checks the position of terms and only matches those within the specified distance. It’s a good way to cut down the irrelevant matches and get better results.

Searching with the NEAR/ONEAR operators applies an explicit proximity constraint to the operands of the operator. NEAR is similar to AND, but requires that the terms appear within a given word distance in the document. ONEAR additionally requires that the terms appear in the same order as in the query.
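
The behaviour of NEAR and ONEAR can be sketched over token positions. This illustration assumes that distance means the difference between word positions (adjacent words have distance 1) and that text is tokenized on whitespace; actual engines may count distance differently. The near function and its ordered flag are hypothetical names.

```python
def near(text: str, first: str, second: str, distance: int, ordered: bool = False) -> bool:
    """Return True if both terms occur within `distance` word positions.

    ordered=False models NEAR; ordered=True models ONEAR, where `first`
    must appear before `second`.
    """
    tokens = text.lower().split()
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if ordered and gap < 0:
                continue  # ONEAR: wrong order, skip this pair
            if 0 < abs(gap) <= distance:
                return True
    return False


text = "the quick brown fox"
print(near(text, "quick", "fox", 2), near(text, "fox", "quick", 2, ordered=True))
```

This relies on the index storing word positions, the same information that phrase searching needs.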

Quality
In relation to relevancy, the quality of the document, and how important it is as viewed by the content owner or search application.

Queries per second (QPS)
The number of queries that the enterprise search platform will process in one second.  This is normally a function of hardware (capability) and licensing (what is allowed due to contract terms).

Query
The combination of the word or words used for searching, and any options allowed by the search engine.

Query Syntax
The syntactic rules that must be observed when submitting queries to a search engine, for example the use of parentheses and Boolean operators. Sometimes, a query transformation stage may be used to allow end-users to use a different syntax from the one expected by the search engine.

Query and result processing
The application of algorithms to the original query or to the raw results returned by the search engine. This is useful for modifying queries to reflect an inferred behaviour (for example, using synonym expansion), for applying business rules to modify the results (re-sorting, teaser modification, etc.), and for customizing the search experience. The overall goal is to analyze and identify the essence of the searcher’s intent from the query, and to return the most relevant set of results.

Query term weight
The ability to support different relevance weights for different terms in a query.

Query transformation
The analysis and subsequent rewriting of a query, using linguistic transformations such as lemmatization and spell-checking. Custom query transformation stages may also be used if necessary. Equivalent to Query Processing.

Range restrictions
The ability to limit a search to a specified range of a numerical metadata field. For example, a search for a digital camera between $250 and $400.

Rank profile
A rank profile enables full control of the relative weight of each relevance component for a given query (for example, how important an article’s title is relative to the main text, or how important proximity is versus freshness). This enables individual relevance tuning for different query applications.
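As a rough illustration, a rank profile can be thought of as a set of weights applied to per-document relevance components. The sketch below assumes a simple weighted sum and component values between 0 and 1; the component names, weights and scoring formula are hypothetical and not those of any specific engine.

```python
def score(components, profile):
    """Weighted sum of a document's relevance components under a rank profile."""
    return sum(profile.get(name, 0.0) * value
               for name, value in components.items())


# Two hypothetical profiles for two different query applications.
news_profile = {"title": 0.2, "body": 0.2, "proximity": 0.1, "freshness": 0.5}
archive_profile = {"title": 0.4, "body": 0.4, "proximity": 0.2, "freshness": 0.0}

# Relevance components of one (old, but well-matching) document.
doc = {"title": 0.9, "body": 0.5, "proximity": 0.3, "freshness": 0.1}

print(score(doc, news_profile))     # freshness dominates, so the score is lower
print(score(doc, archive_profile))  # freshness is ignored, so the score is higher
```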

Ranking
Ranking is a way of arranging result documents according to their relevance to a query.

Ranking models
Models used to determine how closely content matches a particular query, and whether it should be included in the search results.

Real-time Indexing
The ability to index content with short latency, typically within seconds from when the enterprise search platform receives a document for indexing.

Recall
For a query, recall means the ability to retrieve as many documents as possible that match or are related to a query.  Recall may be improved by linguistic processing such as lemmatization, spell-checking, and synonym expansion.
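In evaluation terms, recall is usually measured as the fraction of relevant documents that were retrieved, and precision as the fraction of retrieved documents that are relevant. The sketch below uses these standard definitions with made-up document identifiers to show how widening a search trades precision for recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}
narrow = ["d1", "d2"]                                      # strict query
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]  # expanded query

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```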

Relative boosting
This enables a document to always be displayed among the first 20 documents in the result list, provided a user searched with a specific query.  For all other queries, the ranking position of the document will not be affected.

Relevance or Relevance ranking
Relevancy is the measure of how well the indexed page answers the question. Only the searcher can actually define how relevant a document is in relation to their query: there is no way to automate it. When there are many query matches, the search engine must rank the results by relevance score, sorting the results listing so that the pages most likely to be useful appear first. Varying algorithms are used to define relevancy.

Real Language
Use of natural language to form search queries without the use of specific operators or particular syntax. The search engine then seeks the documents containing as much information as possible relating to the terms of the request.

Request
Commonly referred to as “Search”, the Request represents the terms sought in the searched-for documents. Requests can be in either Boolean or real language.

Result set
The result set is a set of Document Summaries returned for a query.

Result-side (shallow) navigators
A type of dynamic drill-down navigator. Drill-down navigators are created across an extended but non-exhaustive result set (typically, the 200 highest ranked results).

Results clustering
Grouping similar results together to make it easy to see which results relate to each other. This can be supervised (based on a taxonomy) or unsupervised (based on on-the-fly similarity analysis).

Results transformation 
The algorithmic processing of search results, which includes result-set reordering (e.g. duplicate removal), adding navigation information, and result content conversion or reformatting. Equivalent to Results Processing (above).

Results-based binning
Results-based binning performs ad-hoc clustering of results into dynamic bins, based on the distribution of a given parameter’s values in the results. See also Dynamic drill-down.

Rows and columns
A search installation may be configured in a row/column configuration for performance and fault-tolerance reasons.
Multiple columns are used to partition the indexed content for large data volumes. Each column contains a unique subset of the indexed content.
Multiple rows are used for query performance scaling and fault-tolerance. All rows within a column hold identical indexed content.
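The row/column layout can be sketched as follows. This is only an illustration under stated assumptions: documents are assigned to columns by a stable hash of a hypothetical document ID, and queries are spread over rows in round-robin fashion. Real installations may partition and route differently.

```python
import zlib

COLUMNS = 3  # partitions of the index (assumed value)
ROWS = 2     # replicas of each partition (assumed value)


def column_for(doc_id):
    """Pick the column that stores a document, using a stable hash of its ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % COLUMNS


def nodes_for_query(query_number):
    """A query is sent to one row and must visit every column in that row."""
    row = query_number % ROWS  # round-robin over the replicas
    return [(row, column) for column in range(COLUMNS)]


print(column_for("doc-42"))  # the same document always lands in the same column
print(nodes_for_query(0))    # all columns of row 0
print(nodes_for_query(1))    # all columns of row 1
```

If one row fails, queries can still be answered by the remaining row, because each row holds a full copy of every column.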

Scalability
Scalability indicates the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added. 

Scope Search
Scope Search enables search in hierarchical content structures without a need to know the schema in advance.

Scope field 
A scope field contains hierarchically structured content. It enables schema flexibility and the ability to conserve hierarchical relationships rather than flattening the data as is often required by meta-data engines.

Search Profile
A concept used in order to identify the set of search attributes common for a given search application. This includes global filter constraints (e.g. collection), query processing parameters (such as linguistics) and result handling parameters (such as Navigation settings).

Search cluster
A search cluster is a group of search nodes (a row/column matrix) that share the same index schema (index profile).

Search terms 
The search terms are the words entered by the searcher, which are part of the query, along with other instructions. The search engine will look for these words in the index, and return the matching results, usually sorted by relevance. Some search engines will allow Boolean operators, adjacency, match phrases, partial words and provide other options.

Semantic analysis
This means applying a combination of general and specific thesauri and ontologies, together with automatic phrasing, for example to understand the intention of the query.

Semantic indexing 
The indexing of content by detecting and annotating sentences, paragraphs and other semantic structures in unstructured content. This enables you to limit your search to paragraphs or other semantic elements in the text.

Sentiment analysis 
The evaluation of the sentiment (general tone) of a text, typically positive or negative, based on its use of language. The sentiment of a document is determined by applying computational linguistics algorithms.

Similarity searching
The ability to search for similar documents. In a search context, “similar” may mean similar to a document in a result set or similar to an example document.
Similarity searching may be based on:
Find similar: Find documents similar to the selected document, or based on input of a full document or a chunk of a document submitted via the search interface. 
Refine similar: Within the scope of the original query, find documents similar to the selected document 
Exclude similar: Within the scope of the original query, find documents different from the selected document
See also Document vector
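One common way to compare documents is to represent each one as a vector of term counts and measure the cosine of the angle between the vectors. The sketch below assumes this approach and a naive white-space tokenizer; it is an illustration of "Find similar", not the method of any particular search platform.

```python
import math
from collections import Counter


def vector(text):
    """Represent a text as a bag of lowercase term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_similar(selected, candidates):
    """Order candidate texts from most to least similar to the selected text."""
    sv = vector(selected)
    return sorted(candidates, key=lambda text: cosine(sv, vector(text)), reverse=True)


candidates = ["cooking recipes for beginners", "search engine index tuning"]
print(find_similar("enterprise search engine", candidates))
```

"Refine similar" would apply the same ordering only to documents that match the original query, and "Exclude similar" would keep the documents with the lowest similarity.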

Site Descriptor
Set of directives associated with a source allowing the indexer to optimise the process of accessing only certain parts of the source and to index only selected elements. 
 
Spell Check Optimization
The process of optimizing a spell check dictionary towards a live search index. In this way the dictionary is better aligned with the actual domain of the given search application, taking into consideration term frequencies and domain-specific terminology.

Spell checking
Individual query terms and phrases are spell-checked against a dictionary. The spell check algorithm is normally based on the edit distance between a query term and the dictionary term. The edit distance is given by the number of basic character operations (add, delete, swap) required to transform the misspelled query term or phrase into the closest term in the dictionary.
A special variant of spell check is the phonetic spell check, where the edit distance is computed based on a phonetic representation of the words.
See also Approximate matching.
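A minimal sketch of edit-distance-based spell checking follows. It assumes the classic Levenshtein distance, reading "swap" as replacing one character with another; engines that also count transpositions of adjacent characters would use a variant. The dictionary and the suggest function are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of add, delete and swap operations."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # add cb
                previous[j - 1] + (ca != cb),  # swap ca for cb (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest(term, dictionary):
    """Return the dictionary term closest to the (possibly misspelled) query term."""
    return min(dictionary, key=lambda word: edit_distance(term, word))


print(edit_distance("serch", "search"))                  # 1
print(suggest("serch", ["search", "server", "speech"]))  # search
```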
 
Statistics
In relation to relevancy, how well the content of the overall document matches the query in statistical terms. One measure is the number of times the query terms appear in the document, and how rare those terms are within the complete corpus. Another is the proximity of the words in the document – how close they are to one another.

Stemming 
Stemming uses linguistic analysis to reduce a word to its root form (stem), and then matches all forms of a word in a search query to all forms of the same word in documents. Stemming, in contrast to lemmatization, is normally based only on removing trailing parts of a word, leaving the stem. Lemmatization is normally based on dictionary lookup.
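The contrast between the two approaches can be sketched as below. The suffix list and the lemma dictionary are tiny, made-up examples; production stemmers and lemmatizers use far richer rules and dictionaries.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative suffix list only


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


LEMMAS = {"ran": "run", "running": "run", "mice": "mouse"}  # toy dictionary


def lemmatize(word):
    """Look the word up in a dictionary of base forms."""
    return LEMMAS.get(word, word)


for word in ("searching", "searches", "running", "ran"):
    print(word, stem(word), lemmatize(word))
```

Suffix stripping handles regular forms such as "searching" but leaves "ran" unchanged and turns "running" into "runn", while the dictionary lookup maps both to "run".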

Stop words 
Words which are very frequent and have little meaning. They are removed and not indexed.  In advanced enterprise search platforms, customers can control the list of stop words by managing the stop word dictionary.

Storage Area Network (SAN) 
A Storage Area Network (SAN) is a network of storage disks. In large enterprises, a SAN connects multiple servers to a centralized pool of disk storage. Compared to managing hundreds of servers, each with their own disks, SANs improve system administration. By treating all the company's storage as a single resource, disk maintenance and routine backups are easier to schedule and control. In some SANs, the disks themselves can copy data to other disks for backup without any processing overhead at the host computers. See also NAS.

Structural analysis
Structural analysis allows documents to be classified based on structure and linguistic analysis (for example, recognizing the home page of an Internet service provider (ISP)). It also allows the detection and extraction of more complex elements, such as the opening hours of the ISP’s customer service operations.

Subject-Matter Expert (SME) 
A subject-matter expert (SME) is a person knowledgeable about a given topic or subject area.

Substring search
Searching for parts of a string as with a wildcard search ("*term*").
A word or token (for Asian language documents) is split up into smaller entities, so-called substrings, each consisting of a defined number of characters. This is often used for Asian languages, which do not have a word structure similar to that of Western languages.
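Splitting a token into fixed-length substrings is often implemented as overlapping character n-grams. The sketch below assumes that approach; the function name ngrams and the chosen lengths are illustrative only.

```python
def ngrams(token, n):
    """Split a token into overlapping substrings of n characters."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


print(ngrams("search", 3))  # ['sea', 'ear', 'arc', 'rch']
print(ngrams("東京都庁", 2))  # ['東京', '京都', '都庁']
```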

Supervised clustering 
Supervised clustering provides a grouped view based on pre-defined categories, and maps results to pre-determined categories (that is, category information provided for the documents prior to indexing). 

Synchroniser
A service offered by the indexer that updates the contents of an index when a document is inserted into, removed from, or modified within a corpus.

Synonym expansion
The expansion of a query or document with a defined list of synonyms for the words it originally contains.
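A minimal sketch of query-side synonym expansion follows, assuming a simple in-memory synonym list. The SYNONYMS table and the expand function are hypothetical examples, not part of any product.

```python
SYNONYMS = {"car": ["automobile", "vehicle"], "film": ["movie"]}  # sample list


def expand(terms, synonyms=SYNONYMS):
    """Return the original terms, each followed by its listed synonyms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(s for s in synonyms.get(term, []) if s not in expanded)
    return expanded


print(expand(["cheap", "car"]))  # ['cheap', 'car', 'automobile', 'vehicle']
```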

Syntactic analysis
Used to analyze a query through entity/phrase extraction and anti-phrasing, and to remove word-sense ambiguity (the color orange versus the fruit, for example).

Syntactical patterns
Used for detecting information entities such as people, places, product codes, and prices.

TF-IDF
TF and IDF are used together as a measure of the statistical strength of a given word relative to a query.  TF (term frequency) is the measure of how often a word appears in a document. IDF (inverse document frequency) is the measure of the rarity of a word within the search index. 
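The two measures are commonly multiplied together. The sketch below assumes one widespread formulation, a raw term count for TF and the logarithm of (documents in the index / documents containing the term) for IDF. Search engines use many variants of this formula, and the three-document corpus is made up.

```python
import math


def tf(term, document):
    """Term frequency: how often the term appears in the document."""
    return document.count(term)


def idf(term, corpus):
    """Inverse document frequency: higher for terms that are rare in the index."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)


corpus = [
    "the search engine ranks the results".split(),
    "the index stores the documents".split(),
    "the query returns results".split(),
]

for term in ("the", "results", "engine"):
    print(term, tf_idf(term, corpus[0], corpus))
```

A word such as "the" appears in every document, so its IDF is zero and it contributes nothing, while the rarer "engine" scores highest.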

Taxonomy
Taxonomy is a defined hierarchy of categories – a treelike structure of customer- or market-specific terminology that defines how categories relate to one another.  It provides a conceptual framework for discussion, analysis, or information retrieval. For example, a car manufacturer may have a taxonomy based on the type of car (convertible, SUV, wagon, etc.).  Taxonomies help partition the search environment and experience, based on a pre-defined knowledge of categories. This helps limit the number of “noisy” results returned to the user.

Term, expression
The simplest search engines index documents based on the words they contain, but can calculate neither the importance of those words nor the meaning they may have, especially when cross-referenced.

Thesaurus
A thesaurus stores synonyms and related words. This allows a search engine to map city planning to land use, for example, and show the relevant pages even if the vocabulary of the text does not match.

Tokenization
Tokenization involves detection of white space characters and other symbols that separate words from each other and that are not relevant to the matching process. It is part of the linguistic analysis, where text is split into word entities.  More complex tokenization is used for CJK languages, where semantic analysis is required to identify word boundaries.
 
FAST ESP provides highly configurable tokenization, which lets you configure whether special characters such as –‘.\/ are discarded, treated as white space, or indexed as normal characters.
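The three treatments of special characters can be sketched with a small tokenizer. This is a generic illustration and not FAST ESP's configuration format or API: the tokenize function and its discard and keep parameters are hypothetical, and any special character not listed is assumed to be treated as white space.

```python
def tokenize(text, discard="", keep=""):
    """Split text into lowercase tokens.

    Characters in `discard` are dropped, characters in `keep` are indexed as
    normal characters, and every other non-alphanumeric character is treated
    as white space.
    """
    out = []
    for ch in text:
        if ch in discard:
            continue
        if ch.isalnum() or ch in keep:
            out.append(ch.lower())
        else:
            out.append(" ")
    return "".join(out).split()


print(tokenize("AC/DC e-mail"))                         # ['ac', 'dc', 'e', 'mail']
print(tokenize("AC/DC e-mail", discard="-", keep="/"))  # ['ac/dc', 'email']
```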

URI
Internet space is inhabited by many points of content. A URI (Uniform Resource Identifier) is the way you identify any of those points of content, whether it be a page of text, a video or sound clip, a still or animated image, or a program. The most common form of URI is the Web page address, which is a particular form or subset of URI called a Uniform Resource Locator (URL). A URI typically describes:

The mechanism used to access the resource
The specific computer that the resource is housed in
The specific name of the resource (a file name) on the computer

URL
A URL (Uniform Resource Locator) is the address of a file (resource) accessible on the Internet. The type of resource depends on the Internet application protocol. Using the World Wide Web's protocol, the Hypertext Transfer Protocol (HTTP), the resource can be an HTML page, an image file, a program such as a common gateway interface application or Java applet, or any other file supported by HTTP.

The URL contains the name of the protocol required to access the resource, a domain name that identifies a specific computer on the Internet, and a hierarchical description of a file location on the computer. A URL is a type of URI (Uniform Resource Identifier).
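The three parts of a URL described above can be extracted with the urlsplit function from Python's standard library. The address used here is a made-up example on the reserved example.com domain, and the describe helper is hypothetical.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return the protocol, the computer name and the file location of a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path


protocol, computer, location = describe("http://www.example.com/docs/glossary.html")
print(protocol)  # http
print(computer)  # www.example.com
print(location)  # /docs/glossary.html
```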
 
Unsupervised clustering
Unsupervised clustering provides grouping of related documents on the basis of their content without referring to a taxonomy; it creates a taxonomy “on-the-fly,” parceling documents into dynamic partitions.

User interface (UI)
The end-user application linking a person to a computer program.  Mos