What does a search engine do?
Search engines have three major elements. First is the spider, also called the
crawler. The spider visits a web page, reads it, and then follows links to other
pages within the site. This is what it means when someone refers to a site being
"spidered" or "crawled." The spider returns to the site on a regular basis, such as
every month or two, to look for changes.
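The visit, read, and follow-links cycle described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `crawl` function, the `fetch` callable, and the toy `SITE` dictionary standing in for the web are all assumptions made for the example, and no network access is involved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit start_url, then follow links to other pages on the same host.

    `fetch` is a hypothetical callable that returns a page's HTML,
    or None if the page cannot be read.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # the page has now been "spidered"
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# A toy three-page site (an assumption for this sketch).
SITE = {
    "http://example.com/": '<a href="/about.html">About</a> '
                           '<a href="http://other.example/">Elsewhere</a>',
    "http://example.com/about.html": '<a href="/">Home</a> '
                                     '<a href="contact.html">Contact</a>',
    "http://example.com/contact.html": "<p>Write to us.</p>",
}

pages = crawl("http://example.com/", SITE.get)
```

Here the spider reaches all three pages of the toy site and ignores the link that leads to a different host. A real spider would also schedule return visits to look for changes.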
Everything the spider finds goes into the second part of a search engine, the index.
The index, sometimes called the catalog, is like a giant book containing a copy of
every web page that the spider finds. If a web page changes, then this book is
updated with new information.
Sometimes it can take a while for new pages or changes that the spider finds to be
added to the index. Thus, a web page may have been "spidered" but not yet
"indexed." Until it is indexed -- added to the index -- it is not available to those
searching with the search engine.
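The gap between "spidered" and "indexed" can be illustrated with a small sketch. The `Index` class and its `add_spidered`, `flush`, and `lookup` methods are hypothetical names chosen for this example. The sketch assumes a simple word-to-pages table, which is one common way to build an index but not necessarily what any given engine uses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    def __init__(self):
        self.pending = {}                 # spidered, but not yet indexed
        self.pages = {}                   # url -> stored copy of the page
        self.postings = defaultdict(set)  # word -> urls containing it

    def add_spidered(self, url, text):
        """Record what the spider found; searchers cannot see it yet."""
        self.pending[url] = text

    def flush(self):
        """Move pending pages into the index, replacing older copies."""
        for url, text in self.pending.items():
            if url in self.pages:
                # The page changed: forget the words of the old copy.
                for word in tokenize(self.pages[url]):
                    self.postings[word].discard(url)
            self.pages[url] = text
            for word in tokenize(text):
                self.postings[word].add(url)
        self.pending.clear()

    def lookup(self, word):
        """Return the urls of indexed pages containing the word."""
        return set(self.postings.get(word.lower(), set()))


index = Index()
index.add_spidered("a.html", "Fresh bread daily")
before = index.lookup("bread")   # spidered, not indexed: nothing found
index.flush()
after = index.lookup("bread")    # indexed: the page is now found
```

Until `flush` runs, the page exists only in `pending`, so a search for "bread" finds nothing even though the spider has already read the page.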
Search engine software is the third part of a search engine. This is the program
that sifts through the millions of pages recorded in the index to find matches to a
search and rank them in order of what it believes is most relevant.
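As a rough sketch of the match-and-rank step, the following Python counts how often the search words appear on each page and lists the pages with the highest counts first. Counting words is an assumption made to keep the example short. Real search engines weigh many more signals, and the `search` function and `PAGES` data are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, pages):
    """Return urls of pages containing every query word, best first."""
    terms = tokenize(query)
    results = []
    for url, text in pages.items():
        counts = Counter(tokenize(text))
        if terms and all(counts[term] > 0 for term in terms):
            # Score: total number of times the query words appear.
            score = sum(counts[term] for term in terms)
            results.append((score, url))
    # Highest score first; ties are broken alphabetically by url.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for score, url in results]


PAGES = {
    "pets.html": "Cats and dogs. Cats sleep; dogs bark.",
    "cats.html": "Cats, cats, cats: all about cats.",
    "birds.html": "Birds sing in the morning.",
}

cat_results = search("cats", PAGES)
both_results = search("cats dogs", PAGES)
```

A search for "cats" ranks cats.html ahead of pets.html because the word appears there more often, while a search for "cats dogs" matches only pets.html.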
What are people looking for on the web?
http://www.excite.com/search/voyeur/
http://www.askjeeves.com/docs/peek/
Search Engine Glossary
Boolean search: A search allowing the inclusion or exclusion of documents
containing certain words through the use of operators such as AND, NOT and
OR.
Concept search: A search for documents related conceptually to a word, rather
than specifically containing the word itself.
Full-text index: An index containing every word of every document cataloged,
including stop words (defined below).
Fuzzy search: A search that will find matches even when words are only partially
spelled or misspelled.
Index: The searchable catalog of documents created by search engine software.
Also called "catalog." Index is often used as a synonym for search engine.
Keyword search: A search for documents containing one or more words that are
specified by a user.
Phrase search: A search for documents containing an exact sentence or phrase
specified by a user.
Precision: The degree to which the documents a search engine lists actually
match a query. The larger the share of listed documents that match, the higher
the precision. For example, if a search engine lists 80 documents found to match
a query but only 20 of them contain the search words, then the precision would
be 25%.
Proximity search: A search where users specify that documents returned
should have the words near each other.
Query-By-Example: A search where a user instructs an engine to find more
documents that are similar to a particular document. Also called "find similar."
Recall: Related to precision, this is the degree to which a search engine returns all
the matching documents in a collection. There may be 100 matching documents,
but a search engine may only find 80 of them. It would then list these 80 and have
a recall of 80%.
Relevancy: How well a document provides the information a user is looking for,
as measured by the user.
Search Engine: The software that searches an index and returns matches.
Search engine is often used synonymously with spider and index, although these
are separate components that work with the engine.
Spider: The software that scans documents and adds them to an index by
following links. Spider is often used as a synonym for search engine.
Stemming: The ability for a search to include the "stem" of words. For example,
stemming allows a user to enter "swimming" and get back results also for the stem
word "swim."
Stop words: Conjunctions, prepositions, articles and other words such as
AND, TO and A that appear often in documents yet carry little meaning on
their own.
Thesaurus: A list of synonyms a search engine can use to find matches for
particular words if the words themselves don't appear in documents.
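
The precision and recall entries in the glossary above can be checked with a short calculation. This is a minimal sketch: the function names and the use of numbered document IDs are assumptions for the example, and the figures are the ones given in the two definitions.

```python
def precision(listed, matching):
    """Share of the listed documents that actually match the query."""
    listed, matching = set(listed), set(matching)
    return len(listed & matching) / len(listed) if listed else 0.0


def recall(listed, matching):
    """Share of all matching documents that the engine listed."""
    listed, matching = set(listed), set(matching)
    return len(listed & matching) / len(matching) if matching else 0.0


# Precision example: 80 documents listed, only 20 of them match.
p = precision(listed=range(80), matching=range(20))   # 0.25, or 25%

# Recall example: 100 matching documents exist, 80 of them are listed.
r = recall(listed=range(80), matching=range(100))     # 0.8, or 80%
```

The two measures answer different questions: precision asks how much of what was listed is useful, while recall asks how much of what is useful was listed.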