Detecting search engine spiders, tracking and analyzing their behavior


Search engine crawlers can be identified by their user agent. For example, Google's web robots (Googlebot and Googlebot-Images) send the string 'Googlebot' in the HTTP user agent header. You can also detect crawlers by IP address, but as long as your intention is not to cheat, you don't need to maintain bulletproof IP lists. Here is a PHP example:


function isSpider($userAgent) {
    if ( stristr($userAgent, "Googlebot")   || /* Google */
         stristr($userAgent, "Slurp")       || /* Inktomi/Yahoo! */
         stristr($userAgent, "MSNBOT")      || /* MSN */
         stristr($userAgent, "teoma")       || /* Teoma */
         stristr($userAgent, "ia_archiver") || /* Alexa */
         stristr($userAgent, "Scooter")     || /* AltaVista */
         stristr($userAgent, "Mercator")    || /* AltaVista */
         stristr($userAgent, "FAST")        || /* AllTheWeb */
         stristr($userAgent, "MantraAgent") || /* LookSmart */
         stristr($userAgent, "Lycos")       || /* Lycos */
         stristr($userAgent, "ZyBorg")         /* WISEnut */
    ) return TRUE;
    return FALSE;
}

// getenv("HTTP_USER_AGENT") reads the user agent sent with the request
if (isSpider(getenv("HTTP_USER_AGENT"))) {
    $useSessionID = FALSE; // don't pollute crawler URLs with session IDs
    $logAccess    = TRUE;  // record this visit in the crawler log
}


This list is only a snapshot. Search for other user agents used by search engine crawlers and compile your own list.

Before your scripts close the database connection, call a function that logs the crawler's visit in a database table. After outputting the final closing tag of the page, do a flush() before you insert the row into the log table; this ensures complete content delivery even if the logging process is delayed. In your log table, index all attributes that appear in WHERE clauses and GROUP BY statements. On very large sites, refine this basic procedure.
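
As a rough illustration, here is a minimal sketch of that sequence. It assumes an open PDO connection in $pdo, the isSpider() function from above, and MySQL-flavored SQL; the spider_log table and its columns (visit_time, user_agent, ip, url) are hypothetical placeholders, not names prescribed by this article:


// Minimal sketch, assuming a PDO connection in $pdo and isSpider()
// from the example above. The spider_log table and its columns are
// hypothetical placeholders -- adapt them to your own schema, and
// index the columns you filter and group on (e.g. url, visit_time).
$userAgent = getenv("HTTP_USER_AGENT");

if (isSpider($userAgent)) {
    flush(); // push the finished page to the client first ...

    // ... then log the visit, so a slow INSERT cannot delay delivery
    $stmt = $pdo->prepare(
        "INSERT INTO spider_log (visit_time, user_agent, ip, url)
         VALUES (NOW(), ?, ?, ?)"
    );
    $stmt->execute(array(
        $userAgent,
        getenv("REMOTE_ADDR"),
        getenv("REQUEST_URI")
    ));
}
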

Write a few reports querying your log table, for example a tracker following each bot to learn where it starts and which links it follows. You also need statistics showing crawling frequency by URL (server + requested file + query string) to find out which of your pages the spiders like most, and which of your spider food they refuse to eat. Study these reports frequently and improve your linking when you find rarely or never spidered areas of your site: give those pages a few links from frequently crawled pages, put up themed site maps linked from the root index page, and so on.
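
A crawl-frequency report over such a table can be as simple as the following sketch (again using the hypothetical spider_log schema and MySQL date arithmetic from the example above):


// Crawl frequency by URL over the last 30 days: which pages do the
// spiders like most, and which do they ignore? Table and column
// names are the hypothetical placeholders from the sketch above.
$stmt = $pdo->query(
    "SELECT url, COUNT(*) AS hits, MAX(visit_time) AS last_visit
       FROM spider_log
      WHERE visit_time >= NOW() - INTERVAL 30 DAY
      GROUP BY url
      ORDER BY hits DESC"
);
foreach ($stmt as $row) {
    printf("%6d  %s  %s\n", $row['hits'], $row['last_visit'], $row['url']);
}


URLs that never show up in such a report are exactly the rarely spidered pages that need additional internal links.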



Author: Sebastian
Last Update: Monday, June 20, 2005
