Preventing search engine crawlers from fetching particular files and directories
The Robots Exclusion Protocol from 1994 defines "a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot". It is only a quasi-standard, but the crawlers sent out by the major search engines do comply with it.
robots.txt is a plain text file located in the root directory of your server. Web robots read it before they fetch a document. If the document a bot is going to fetch is excluded for that particular robot by statements in the robots.txt file, the bot will not request it. The syntax is described here; use this tool to validate your robots.txt.
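A minimal sketch of the format, keeping every crawler out of a hypothetical /private/ directory, looks like this:
User-agent: *
Disallow: /private/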
In order to track spider visits, your robots.txt should be a script that logs each request for robots.txt in a database table. Here is an example for Apache/PHP:
Configure your web server to parse .txt files as PHP, e.g. by adding this statement to your root's .htaccess file:
AddType application/x-httpd-php .htm .txt
Now you can use PHP in all .php, .htm, and .txt files. For security reasons, ensure your users cannot upload .txt files, because such files would now be executed as PHP. http://www.yourdomain.com/robots.txt then behaves like any other PHP script.
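A minimal logging sketch, assuming a MySQL table named spider_log with the columns user_agent, ip, and visit_time (the table layout and the connection credentials are assumptions, not part of the original example):
<?php
// Log this request for robots.txt before serving the exclusion rules.
// Connection parameters and the spider_log table are assumptions of this sketch.
$db = new mysqli('localhost', 'dbuser', 'dbpass', 'mydb');
$stmt = $db->prepare('INSERT INTO spider_log (user_agent, ip, visit_time) VALUES (?, ?, NOW())');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = $_SERVER['REMOTE_ADDR'];
$stmt->bind_param('ss', $ua, $ip);
$stmt->execute();

// Serve the actual robots.txt content as plain text.
header('Content-Type: text/plain');
echo "User-agent: *\n";
echo "Disallow: /development/\n";
?>
Since the file is now parsed as PHP, the plain-text rules are emitted with echo, so crawlers still receive ordinary robots.txt content.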
Your file system's directory structure has nothing to do with your linking structure, that is, your site's hierarchy. However, you can store scripts delivering content that is not meant for public access in directories excluded by robots.txt. Because robots.txt only keeps out well-behaved robots, add user/password protection to shield this content from all unwanted views (a sketch follows the example below).
User-agent: MyIntranetSpider
Disallow: /development/
Disallow: /extranet/
User-agent: *
Disallow: /intranet-login.htm
Disallow: /extranet-login.htm
Disallow: /developer-login.htm
Disallow: /development/
Disallow: /intranet/
Disallow: /extranet/
Disallow: /*.gif$
Disallow: /*.jpg$
This example allows 'MyIntranetSpider' to crawl the intranet directory while keeping all other web robots out. Note that file and directory names as well as query string arguments are case sensitive, and that excluding by file extension may not work with every web robot out there.
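For the user/password protection mentioned above, a minimal sketch using Apache's HTTP Basic Authentication in the protected directory's .htaccess could look like this (the AuthUserFile path and the realm name are assumptions):
AuthType Basic
AuthName "Development area"
AuthUserFile /path/to/.htpasswd
Require valid-user
Create the password file with Apache's htpasswd utility. robots.txt then merely keeps compliant crawlers away, while the authentication blocks everyone without credentials.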
Google's crawler Googlebot and other Web robots support exclusion by patterns too, e.g.
User-agent: Googlebot
Disallow: /*affid=
Disallow: /*sessionID=
Disallow: /*visitorID=
Disallow: /*.aspx$
User-agent: Googlebot-Image
Disallow: /*.gif$
"*" matches any sequence of characters, "$" indicates the end of a name.
The first example would disallow all dynamic URLs where the variable 'affid' (affiliate ID) is part of the query string. The second and third examples disallow URLs containing a session ID or a visitor ID. The fourth example excludes .aspx page scripts without a query string from crawling. The fifth example tells Google's image crawler to fetch all image formats except .gif files. Because not all Web robots understand this syntax, it makes sense to add a robots META tag with a 'NOINDEX' value as well, just to be sure that search engines do not index unwanted pages.
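Such a page would carry the tag in its HTML head section, for instance:
<meta name="robots" content="noindex">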
Use Google's cool robots.txt validator to check your syntax and to simulate a crawler's behavior under your Disallow statements.
If you add a User-agent: Googlebot section, you must duplicate all exclusions from the general User-agent: * section, because once Googlebot finds a section addressing it by name, it ignores all other directives. Other crawlers may handle this the same way, so create complete sections per spider if you really need to distinguish crawling exclusions between search engines.
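For example, if the generic section disallows /development/, a dedicated Googlebot section must repeat that rule alongside its own additions:
User-agent: *
Disallow: /development/

User-agent: Googlebot
Disallow: /development/
Disallow: /*sessionID=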
Author: Sebastian
Last Update: Monday, June 20, 2005