Detecting search engine spiders, tracking and analyzing their behavior


Search engine crawlers can be identified by their user agent. For example, Google's web robots (Googlebot and Googlebot-Image) include the string 'Googlebot' in the HTTP User-Agent header. You can also detect crawlers by IP address, but as long as you don't intend to cheat, you don't need to maintain bulletproof IP lists. Here is a PHP example:

function isSpider ( $userAgent ) {
    /* Substrings that identify crawlers; matched case-insensitively */
    $signatures = array(
        "Googlebot",    /* Google */
        "Slurp",        /* Inktomi/Y! */
        "MSNBOT",       /* MSN */
        "teoma",        /* Teoma */
        "ia_archiver",  /* Alexa */
        "Scooter",      /* Altavista */
        "Mercator",     /* Altavista */
        "FAST",         /* AllTheWeb */
        "MantraAgent",  /* LookSmart */
        "Lycos",        /* Lycos */
        "ZyBorg"        /* WISEnut */
    );
    foreach ( $signatures as $signature ) {
        if ( stripos((string) $userAgent, $signature) !== FALSE ) return TRUE;
    }
    return FALSE;
}

/* For crawlers: suppress session IDs in URLs and log the visit */
if (isSpider(getenv("HTTP_USER_AGENT"))) {
    $useSessionID = FALSE;
    $logAccess = TRUE;
}

This example is only a snapshot. Look up the user agents of other search engine crawlers and compile your own list.
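If you compile your own list, it can help to keep the signatures in one place and to record which engine matched, so the log can group visits by crawler. The following is a minimal sketch of that idea, not part of the original example: the function name identifySpider, the array $spiderSignatures, and the sample user agent strings are assumptions made for illustration.

```php
<?php
// Hypothetical signature map: user agent substring => engine label.
// Extend it with the crawlers that matter for your site.
$spiderSignatures = array(
    "Googlebot" => "Google",
    "Slurp"     => "Inktomi/Yahoo!",
    "msnbot"    => "MSN",
);

// Returns the engine label for a known crawler, or null for any other visitor.
function identifySpider($userAgent, array $signatures) {
    foreach ($signatures as $needle => $engine) {
        // stripos() matches case-insensitively; compare against false explicitly
        if (stripos((string) $userAgent, $needle) !== false) {
            return $engine;
        }
    }
    return null;
}

// Sample user agent strings, for illustration only.
$engine = identifySpider(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    $spiderSignatures
);
$visitor = identifySpider(
    "Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0",
    $spiderSignatures
);
```

Short signatures such as "FAST" can also match unrelated user agents, so it is worth checking new entries against your own access logs before relying on them.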

Before your scripts close the database connection, call a function that logs the crawler's visit in a database table. After outputting the final closing tag of the page, call flush() before you insert the row into the log table. This ensures that the content is delivered completely even if the logging step is delayed. In your log table, index all columns that appear in WHERE and GROUP BY clauses. On very large sites this basic procedure will need refinement.
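The logging step could look roughly like the sketch below. It assumes PDO with the SQLite driver only so that the example is self-contained; the table name crawler_log, its columns, the index names, and the functions createCrawlerLog and logCrawlerVisit are hypothetical and would need adapting to your own database.

```php
<?php
// Hypothetical schema: one row per crawler request, indexed on the columns
// that the reports filter (WHERE) and aggregate (GROUP BY) on.
function createCrawlerLog(PDO $db) {
    $db->exec("CREATE TABLE IF NOT EXISTS crawler_log (
        id         INTEGER PRIMARY KEY,
        visited_at TEXT NOT NULL,
        user_agent TEXT NOT NULL,
        ip         TEXT NOT NULL,
        url        TEXT NOT NULL
    )");
    $db->exec("CREATE INDEX IF NOT EXISTS idx_log_url ON crawler_log (url)");
    $db->exec("CREATE INDEX IF NOT EXISTS idx_log_agent_time
               ON crawler_log (user_agent, visited_at)");
}

// Call this after the page's final closing tag has been output.
function logCrawlerVisit(PDO $db, $userAgent, $ip, $url) {
    // Push the finished page to the client before touching the log table.
    // If PHP output buffering is active, ob_flush() would be needed as well.
    flush();
    $stmt = $db->prepare(
        "INSERT INTO crawler_log (visited_at, user_agent, ip, url)
         VALUES (?, ?, ?, ?)"
    );
    $stmt->execute(array(gmdate("Y-m-d H:i:s"), $userAgent, $ip, $url));
}

// Usage with an in-memory database (assumption: pdo_sqlite is available).
$db = new PDO("sqlite::memory:");
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
createCrawlerLog($db);
logCrawlerVisit(
    $db,
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "192.0.2.1",
    "example.com/index.php?page=1"
);
```

The url column here stores server, requested file, and query string as one value, which keeps the frequency reports simple.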

Write a few reports that query your log table, for example a tracker that follows each bot to learn where it starts and which links it follows. You also need statistics showing the crawling frequency by URL (server + requested file + query string) to find out which of your pages the spiders like most, and which of your spider food they refuse to eat. Study these reports frequently and improve your linking when you find areas of your site that are rarely or never spidered. Give these pages a few links from frequently crawled pages, put up themed site maps linked from the root index page, and so on.
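As an illustration, the reports described above might be queried as in this sketch. It uses a hypothetical crawler_log table (columns visited_at, user_agent, url) in an in-memory SQLite database; the function names, the sample rows, and the list of site URLs are invented to make the example runnable and do not come from a real log.

```php
<?php
$db = new PDO("sqlite::memory:");
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE crawler_log (visited_at TEXT, user_agent TEXT, url TEXT)");
$db->exec("CREATE INDEX idx_log_url ON crawler_log (url)");

// Sample rows, for illustration only.
$insert = $db->prepare("INSERT INTO crawler_log VALUES (?, ?, ?)");
$rows = array(
    array("2005-06-01 10:00:00", "Googlebot", "example.com/index.php"),
    array("2005-06-01 10:00:05", "Googlebot", "example.com/articles.php?id=1"),
    array("2005-06-02 09:30:00", "Googlebot", "example.com/index.php"),
);
foreach ($rows as $row) {
    $insert->execute($row);
}

// Report 1: crawling frequency by URL, most requested first.
function crawlFrequency(PDO $db) {
    $sql = "SELECT url, COUNT(*) AS hits, MAX(visited_at) AS last_visit
            FROM crawler_log GROUP BY url ORDER BY hits DESC, url";
    return $db->query($sql)->fetchAll(PDO::FETCH_ASSOC);
}

// Report 2: the trail of one bot in chronological order.
function botTrail(PDO $db, $agent) {
    $stmt = $db->prepare(
        "SELECT visited_at, url FROM crawler_log
         WHERE user_agent LIKE ? ORDER BY visited_at"
    );
    $stmt->execute(array("%" . $agent . "%"));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Report 3: known pages that no crawler has requested so far.
function neverCrawled(PDO $db, array $siteUrls) {
    $crawled = $db->query("SELECT DISTINCT url FROM crawler_log")
                  ->fetchAll(PDO::FETCH_COLUMN);
    return array_values(array_diff($siteUrls, $crawled));
}

$frequency = crawlFrequency($db);
$trail     = botTrail($db, "Googlebot");
$orphans   = neverCrawled($db, array("example.com/index.php", "example.com/contact.php"));
```

Pages returned by the third report are the candidates for additional links from frequently crawled pages.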


Author: Sebastian
Last Update: Monday, June 20, 2005


Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited · Contact · Privacy