Preventing search engine crawlers from fetching particular files and directories



The Robots Exclusion Protocol from 1994 defines "a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot". It is only a quasi-standard, but the crawlers sent out by the major search engines do comply with it.

robots.txt is a plain text file located in the root directory of your server. Web robots read it before they fetch a document. If the document a bot is about to fetch is excluded for that particular robot by statements in the robots.txt file, the bot will not request it. The syntax is described in the protocol specification; run your robots.txt through a validator before you publish it.

In order to track spider visits, your robots.txt should be a script that logs each request for robots.txt in a database table. Here is an example for Apache/PHP:


Configure your web server to parse .txt files as PHP, e.g. by adding this statement to your root's .htaccess file:

AddType application/x-httpd-php .htm .txt

Now you can use PHP in all .php, .htm, and .txt files. For security reasons, make sure your users cannot upload .txt files, because those would now be executed as PHP. http://www.yourdomain.com/robots.txt then behaves like any other PHP script.
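
A minimal sketch of such a logging script, assuming a MySQL database and a hypothetical table spider_log with the columns requested_at, user_agent, and ip; the connection details are placeholders, not part of the original article. The script records the request, then emits the actual exclusion rules as plain text:

<?php
// robots.txt, parsed as PHP via the AddType statement above.
// Log the request, then output the exclusion rules as plain text.
header('Content-Type: text/plain');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = $_SERVER['REMOTE_ADDR'];

// Placeholder credentials and a hypothetical table layout -- adjust to your setup.
$db = new mysqli('localhost', 'dbuser', 'dbpass', 'tracking');
if (!$db->connect_errno) {
    $stmt = $db->prepare('INSERT INTO spider_log (requested_at, user_agent, ip) VALUES (NOW(), ?, ?)');
    if ($stmt) {
        $stmt->bind_param('ss', $ua, $ip);
        $stmt->execute();
        $stmt->close();
    }
    $db->close();
}
?>
User-agent: *
Disallow: /development/

Everything after the closing PHP tag is sent to the robot unchanged, so the file still works as an ordinary robots.txt; replace the two sample lines with your real exclusion rules.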


Your file system's directory structure has nothing to do with your linking structure, that is, your site's hierarchy as visitors and search engines see it. However, you can store scripts delivering content that is not meant for public access in directories blocked by robots.txt. Since robots.txt only keeps out well-behaved robots, add user/password protection to shield this content from all unwanted views (a sketch follows after the example below).


User-agent: MyIntranetSpider
Disallow: /development/
Disallow: /extranet/

User-agent: *
Disallow: /intranet-login.htm
Disallow: /extranet-login.htm
Disallow: /developer-login.htm
Disallow: /development/
Disallow: /intranet/
Disallow: /extranet/
Disallow: /*.gif$
Disallow: /*.jpg$


This example allows 'MyIntranetSpider' to crawl the /intranet/ directory while keeping all other web robots out of it. Note that file and directory names as well as query string arguments are case sensitive, and that excluding by file extension (the /*.gif$ and /*.jpg$ patterns) may not work with every web robot out there.
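
To add the user/password protection mentioned above, Apache's HTTP Basic authentication works per directory. A sketch, assuming a password file created with the htpasswd utility at a hypothetical path:

# .htaccess inside the protected directory, e.g. /development/
AuthType Basic
AuthName "Restricted area"
AuthUserFile /home/site/.htpasswd
Require valid-user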

Google's crawler Googlebot and some other Web robots support exclusion by wildcard patterns too, e.g.


User-agent: Googlebot
Disallow: /*affid=
Disallow: /*sessionID=
Disallow: /*visitorID=
Disallow: /*.aspx$

User-agent: Googlebot-Image
Disallow: /*.gif$

"*" matches any sequence of characters, "$" indicates the end of a name.


The first example would disallow all dynamic URLs where the variable 'affid' (affiliate ID) is part of the query string. The second and third examples disallow URLs containing a session ID or a visitor ID. The fourth example excludes .aspx page scripts without a query string from crawling. The fifth example tells Google's image crawler to fetch all image formats except .gif files. Because not all Web robots understand this syntax, it is a sound precaution to add a robots META tag with a 'NOINDEX' value, just to be sure that search engines do not index unwanted pages.
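
For instance, a page that must stay out of the index regardless of a crawler's wildcard support would carry this tag in its HEAD section:

<meta name="robots" content="noindex">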

Use Google's cool robots.txt validator to check your syntax and to simulate how a crawler behaves under your Disallow statements.

If you add a User-agent: Googlebot section, you must duplicate all exclusions from the general User-agent: * section, because once Googlebot finds a record addressing it by name, it ignores all other records. Other crawlers may handle this the same way, so create a complete section per spider if you really need to distinguish crawling exclusions between search engines.
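
For example, if the general section blocks /intranet/ and the Googlebot section only adds a pattern rule, the /intranet/ exclusion has to be repeated in the Googlebot record:

User-agent: Googlebot
Disallow: /intranet/
Disallow: /*sessionID=

User-agent: *
Disallow: /intranet/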



Author: Sebastian
Last Update: Monday, June 20, 2005