Friday, January 16, 2009

Lesson (1): Classification of Search Engines

The term "search engine" (SE) is often misused to describe both directories and pure search engines. In fact, they are not the same; the difference lies in how result listings are generated.

There are four major search engine types you should know about. They are:

  • crawler-based (traditional, common) search engines;
  • directories (mostly human-edited catalogs);
  • hybrid engines (META engines and those using other engines' results);
  • pay-per-performance and paid inclusion engines.

Crawler-based SEs, also referred to as spiders or Web crawlers, use special software to automatically and regularly visit websites to create and supplement their giant Web page repositories.

This software is referred to as a "bot", "robot", "spider", or "crawler". All these terms denote the same concept. These programs run on the search engines. They browse pages that already exist in their repositories, and find your site by following links from those pages. Alternatively, after you have submitted pages to a search engine, these pages are queued for scanning by a spider; it finds your page by looking through the lists of pages pending review in this queue.
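The discovery process described above, following links from already-known pages plus working through a queue of submitted URLs, can be sketched as a simple breadth-first crawl. This is a toy illustration only; real crawlers also honor robots.txt, throttle their requests, and handle errors. The `fetch_links` function here is a hypothetical stand-in for downloading a page and extracting its outgoing links:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Toy breadth-first crawl: visit queued pages, then follow their links.

    fetch_links(url) stands in for downloading a page and extracting
    its links; real spiders also respect robots.txt and rate limits.
    """
    queue = deque(seed_urls)   # submitted pages wait here for the spider
    seen = set(seed_urls)
    repository = []            # the engine's growing page collection

    while queue and len(repository) < max_pages:
        url = queue.popleft()
        repository.append(url)          # "scan" the page
        for link in fetch_links(url):   # discover new pages via links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return repository
```

For example, with a tiny link graph `{"a": ["b", "c"], "b": ["c"], "c": []}` and seed `["a"]`, the crawl visits "a", then "b" and "c" once each, never re-queuing pages it has already seen.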

After a spider has found a page to scan, it retrieves this page via HTTP (like any ordinary Web surfer who types a URL into a browser's address field and presses "enter"). Just like any human visitor, the crawling software leaves a record on your server about its visit. Therefore, it's possible to tell from your server log when a search engine has dropped in on your online estate.
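You can spot these visits because crawlers identify themselves in the user-agent field of each request. The sketch below assumes the common Apache combined log format, where the user-agent is the last quoted field on the line; the bot names are illustrative examples:

```python
def is_crawler_hit(log_line, bot_names=("Googlebot", "Slurp", "msnbot")):
    """Check whether a log line's user-agent mentions a known bot.

    Assumes the Apache combined log convention of quoting the
    user-agent as the last quoted field on the line.
    """
    if log_line.count('"') < 2:
        return False
    # rsplit at the last two quote marks isolates the user-agent field
    ua = log_line.rsplit('"', 2)[-2].lower()
    return any(bot.lower() in ua for bot in bot_names)
```

Running this over your access log lines would flag, for instance, a hit whose user-agent contains "Googlebot" while ignoring an ordinary browser visit.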

Your Web server returns the HTML source code of your page to the spider. The spider then reads it (this process is referred to as "crawling" or "spidering"), and this is where the difference between a human visitor and crawling software begins.

While a human visitor can appreciate the quality graphics and impressive Flash animation you've loaded onto your page, a spider won't. A human visitor does not normally read the META tags, but a spider does. Only seasoned users might be curious enough to read the page's code when seeking additional information about it. A human visitor will first notice the largest and most attractive text on the page. A spider, on the other hand, gives more weight to text closest to the beginning and end of the page, and to text wrapped in links.

Perhaps you've spent a fortune creating a killer website designed to immediately captivate your visitors and gain their admiration. You've even embedded lots of quality Flash animation and JavaScript tricks. Yet, a search engine spider is a robot which only sees that there are some images on the page and some code embedded into the "script" tag that it is instructed to skip. These design elements are additional obstacles on its way to your content. What's the result? The spider ranks your page low, no one finds it on the search engine, and no one is able to appreciate the design.

SEO (search engine optimization) is the solution for making your page more search-engine friendly. The optimization is mostly oriented towards crawler-based engines, which are the most popular on the Internet. We're not telling you to avoid design innovations; instead, we will teach you how to properly combine them with your optimization needs.

Let's return to the way a spider works. After it reads your pages, it compresses them into a form convenient to store in a giant repository of Web pages called a search engine index. The data are stored in the index in a way that makes it possible to quickly determine whether a page is relevant to a particular query and to pull it out for inclusion in the results shown in response to that query. The process of placing your page in the index is referred to as "indexing". After your page has been indexed, it will appear on search engine results pages for the words and phrases most common on the indexed page. Its position in the list, however, may vary.
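The data structure that makes this quick lookup possible is commonly an inverted index: a mapping from each word to the set of pages containing it. The sketch below is a bare-bones illustration with made-up page ids; production indexes also store word positions, frequencies, and compressed postings lists:

```python
from collections import defaultdict

def build_index(pages):
    """Toy inverted index: word -> set of page ids containing it.

    `pages` maps a page id to its text. Real engines also record
    positions and frequencies for ranking purposes.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

# Hypothetical two-page repository for illustration
pages = {
    "page1": "search engine optimization basics",
    "page2": "crawler based search engines",
}
index = build_index(pages)
# Looking up index["search"] instantly yields both pages,
# without re-reading any page text at query time.
```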

Later, when someone searches the engine for particular terms, your page will be pulled out of the index and included in the search results. The search engine now applies a sophisticated technique to determine how relevant your page is to these terms. It considers many on-page and off-page factors and the page is given a certain position, or rank, within other results found for the surfer's query. This process is called "ranking".
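A drastically simplified way to picture ranking: score each candidate page by how often it contains the query words, then sort by score. Real engines combine dozens of on-page and off-page factors (link popularity among them); this toy function only shows the idea of turning relevance into an ordering:

```python
def rank(query, pages):
    """Naive ranking: score pages by query-word frequency, best first.

    `pages` maps page id -> text. Returns page ids ordered by score,
    omitting pages that match no query term. Real ranking formulas
    are far more sophisticated.
    """
    terms = query.lower().split()
    scored = []
    for page_id, text in pages.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score:
            scored.append((score, page_id))
    return [pid for score, pid in sorted(scored, reverse=True)]
```

With hypothetical pages where one mentions "search" twice and "engine" once, and another mentions "search" once, the first page outranks the second for the query "search engine".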

Google is a perfect example of a crawler-based SE.

Human-edited directories are different. Pages enter their repository solely through manual submission, and most directories use certain mechanisms (particularly CAPTCHA images) to prevent pages from being submitted automatically. After completing the submission procedure, your URL will be queued for review by an editor, who is, luckily, a human.

When directory editors visit and read your site, the only decision they make is to accept or reject the page. Most directories do not have their own ranking mechanism; they use various obvious factors to sort URLs, such as alphabetic sequence or Google PageRank™ (explained later in this course). It is very important to submit a relevant and precise description to the directory editor, as well as to take the other parts of this manual submission seriously.

Spider-based engines often use directories as a source of new pages to crawl. As a result, it's self-evident in SEO that you should treat directory submission and directory listings as seriously and responsibly as possible.

While a crawler-based engine would visit your site regularly after it has first indexed it, and detect any change you make to your pages, it's not the same with directories. In a directory, result listings are influenced by humans. Either you enter a short description of your website, or the editors will. When searching, only these descriptions are scanned for matches, so website changes do not affect the result listing at all.

As directories are usually created by experienced editors, they generally produce better (at least better-filtered) results. The best-known and most important directories are Yahoo and DMOZ.

Hybrid engines. Some engines also have an integrated directory linking to them. These directories contain websites which have already been reviewed or evaluated. When a search query is sent to a hybrid engine, the sites already evaluated are usually not scanned for matches; the user has to select them explicitly. Whether a site is added to an engine's directory generally depends on a mixture of luck and content quality. Sometimes you may "apply" for a review of your website, but there's no guarantee that it will be done.

Yahoo and Google, although mentioned here as examples of a directory and a crawler respectively, are in fact hybrid engines, as are most major search machines nowadays. As a rule, a hybrid search engine will favor one type of listing over another. For example, Yahoo is more likely to present human-powered listings, while Google prefers its crawled listings.

Meta Search Engines. Another approach to searching the vast Internet is the use of a multi-engine search, or meta-search engine that combines results from a number of search engines at the same time and lays them out in a formatted result page. A common or natural language request is translated to multiple search engines, each directed to find the information the searcher requested. The search engine's responses thus obtained are gathered into a single result list. This search type allows the user to cover a great deal of material in a very efficient way, retaining some tolerance for imprecise search questions or keywords.
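Conceptually, a meta engine fans the query out to several underlying engines, then merges the ordered lists it gets back. One simple merging strategy (an illustration, not how any particular meta engine actually works) is to interleave the lists round-robin and drop duplicates, so each engine's top results appear early:

```python
def meta_search(query, engines):
    """Send one query to several engines and merge their result lists.

    `engines` is a list of functions, each returning an ordered list
    of URLs for the query. Results are interleaved round-robin and
    de-duplicated; real meta engines apply their own re-ranking.
    """
    result_lists = [engine(query) for engine in engines]
    merged, seen = [], set()
    for position in range(max(map(len, result_lists), default=0)):
        for results in result_lists:          # take each engine's next result
            if position < len(results) and results[position] not in seen:
                seen.add(results[position])
                merged.append(results[position])
    return merged
```

With two stub engines returning ["x", "y"] and ["y", "z"], the merged list is ["x", "y", "z"]: the shared result "y" appears once, and each engine's best hit stays near the top.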

Examples of multi-engines are MetaCrawler and DogPile. MetaCrawler refers your search to seven of the most popular search engines (including AltaVista and Lycos), then compiles and ranks the results for you.

Pay-for-performance and paid inclusion engines. As the title makes clear, with these engines you must pay a recurring or one-time fee to keep your site listed, re-spidered, or top-ranked for keywords of your choice. Very few search engines focus solely on paid listings; however, most major search engines offer a paid listing option as part of their indexing and ranking system.

Unlike paid inclusion where you just pay to be included in search results, in an advertising program listings are guaranteed to appear in response to particular search terms, and the higher your bid, the higher your position will be for these terms. Paid placement listings can be purchased from a portal or a search network. Search networks are often set up in an auction environment where keywords and phrases are associated with a cost-per-click (CPC) fee. Such a scheme is referred to as Pay-Per-Click (PPC). Yahoo and Google are the largest paid listing providers, and Windows Live Search (formerly MSN) also sells paid placement listings.
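The core of the bid-ordered placement described above is trivially simple: sort advertisers by their cost-per-click bid, highest first. The advertiser names and bid amounts below are made up; real auctions (such as generalized second-price schemes) also weigh ad quality, not just the raw bid:

```python
def ppc_order(bids):
    """Order paid listings for a keyword by bid, highest CPC first.

    `bids` maps advertiser -> cost-per-click bid in dollars. Real
    PPC auctions also factor in quality signals, not only the bid.
    """
    return sorted(bids, key=bids.get, reverse=True)

# Hypothetical bids on one keyword
bids = {"shop-a": 0.45, "shop-b": 0.80, "shop-c": 0.32}
# shop-b bids the most, so its listing appears first
```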

So here's what you should remember from this lesson:

  1. Search engines (SEs) are classified into crawlers, directories, META engines and paid-inclusion engines.
  2. Crawler-based SEs use software called robots, spiders, or crawlers to add new pages to their databases, which are called indexes. Directories use humans to manually fill their databases.
  3. After your site has been included in an index of a crawler-based search engine, you will appear in its results, and your position for a certain search query depends on how relevant the spider finds your page for this query.
  4. Your directory listings strongly influence your positions in crawling search engines.
