Detecting search engine spiders, tracking and analyzing their behavior
Search engine crawlers can be identified by their user agent. For example, Google's web robots (Googlebot and Googlebot-Image) send the string 'Googlebot' in the HTTP user agent header. You can also detect crawlers by IP address, but as long as you don't intend to cheat, you don't need to maintain bulletproof IP lists. Here is a PHP example:
function isSpider ( $userAgent ) {
    if ( stristr($userAgent, "Googlebot")   || /* Google     */
         stristr($userAgent, "Slurp")       || /* Inktomi/Y! */
         stristr($userAgent, "MSNBOT")      || /* MSN        */
         stristr($userAgent, "teoma")       || /* Teoma      */
         stristr($userAgent, "ia_archiver") || /* Alexa      */
         stristr($userAgent, "Scooter")     || /* Altavista  */
         stristr($userAgent, "Mercator")    || /* Altavista  */
         stristr($userAgent, "FAST")        || /* AllTheWeb  */
         stristr($userAgent, "MantraAgent") || /* LookSmart  */
         stristr($userAgent, "Lycos")       || /* Lycos      */
         stristr($userAgent, "ZyBorg")         /* WISEnut    */
       ) return TRUE;
    return FALSE;
}

if (isSpider(getenv("HTTP_USER_AGENT"))) {
    $useSessionID = FALSE;
    $logAccess    = TRUE;
}
This list is just a snapshot; search for other user agents used by search engine crawlers and compile your own list.
Before your scripts close the database connection, call a function that logs the crawler's visit in a database table. After outputting the page's final closing tag, do a flush() before you insert the row into the log table; this ensures complete content delivery even if the logging process is delayed. In your log table, index all attributes that appear in WHERE clauses and GROUP BY statements. On very large sites, refine this basic procedure.
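The logging step might look like the following minimal sketch, assuming a MySQL connection handle in $db and a log table named crawler_log; the table and column names are illustrative, not prescribed:

/* Hypothetical schema, indexed on the attributes queried later:
   CREATE TABLE crawler_log (
       id          INT AUTO_INCREMENT PRIMARY KEY,
       user_agent  VARCHAR(255),
       remote_ip   VARCHAR(15),
       request_uri VARCHAR(255),
       visit_time  DATETIME,
       INDEX (request_uri),
       INDEX (visit_time)
   );
*/
function logSpiderVisit ( $db ) {
    $ua  = mysql_real_escape_string(getenv("HTTP_USER_AGENT"), $db);
    $ip  = mysql_real_escape_string(getenv("REMOTE_ADDR"), $db);
    $uri = mysql_real_escape_string(getenv("REQUEST_URI"), $db);
    mysql_query("INSERT INTO crawler_log
                   (user_agent, remote_ip, request_uri, visit_time)
                 VALUES ('$ua', '$ip', '$uri', NOW())", $db);
}

/* after echoing the page's final close tag: */
flush();                 /* deliver the complete page to the client first */
if ($logAccess) {
    logSpiderVisit($db); /* then write the visit to the log table */
}
mysql_close($db);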
Write a few reports querying your log table, for example a tracker that follows each bot to learn where it starts and which links it follows. You also need statistics showing the crawling frequency by URL (server + requested file + query string) to find out which of your pages the spiders like most, and which of your spider food they refuse to eat; a sample query follows below. Study these reports frequently and improve your linking when you find rarely or even never spidered areas of your site: give these pages a few links from frequently crawled pages, put up themed site maps linked from the root index page, and so on.
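For illustration, a crawl-frequency report against the hypothetical crawler_log table sketched above could be as simple as this:

$result = mysql_query(
    "SELECT request_uri, COUNT(*) AS fetches, MAX(visit_time) AS last_visit
       FROM crawler_log
      GROUP BY request_uri
      ORDER BY fetches DESC", $db);
while ($row = mysql_fetch_assoc($result)) {
    echo $row['request_uri'] . ": " . $row['fetches'] .
         " fetches, last on " . $row['last_visit'] . "\n";
}

Note that pages the spiders never fetched do not show up in this result at all; to find them, LEFT JOIN your full URL list against the log table and look for NULL counts.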
Author: Sebastian
Last Update: Monday, June 20, 2005