Telling search engine spiders how to index and cache a particular page
The Robots META Tag, introduced in 1996, "allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links". It goes in the <HEAD> section of the HTML document:
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
The content of the robots META tag contains directives separated by commas:
INDEX|NOINDEX - Tells the SE spider whether the page may be indexed or not
FOLLOW|NOFOLLOW - Tells the SE crawler whether it may follow links provided on the page or not
ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
NOODP - Tells search engines not to use page titles and descriptions from the ODP on their SERPs
NOYDIR - Tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on its SERPs
NOARCHIVE - Google specific [1], prevents Google from archiving (caching) the page
NOSNIPPET - Google specific, prevents Google from displaying text snippets for your page on its SERPs
If you provide more than one view of the same content, use INDEX|NOINDEX to keep duplicate content out of the index. On the page you want search engines to index, put "INDEX, FOLLOW"; on all alternate views, put "NOINDEX, FOLLOW". Do not trick SE crawlers into indexing printer-friendly layouts and the like; chances are you will get banned sooner or later.
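As a minimal sketch of the setup described above, assume an article is served in a regular layout and in a printer-friendly layout. The file names article.html and article-print.html are hypothetical and used only for illustration:

```html
<!-- article.html: the view that should appear in search results -->
<HEAD>
<TITLE>Example article</TITLE>
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
</HEAD>

<!-- article-print.html: alternate view of the same content;
     kept out of the index, but its links may still be followed -->
<HEAD>
<TITLE>Example article (print version)</TITLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</HEAD>
```

Keeping FOLLOW on the alternate view lets crawlers still discover the links it contains, while only the regular layout competes for a place in the index.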
Nowadays search engines are smart enough to extract text content from page templates. Comparing similar text found on different pages, they try to guess which source page is worth indexing. Unfortunately, these guesses are sometimes wrong, and unimportant URLs end up on the SERPs. By the way, filtering duplicate content is not a penalty - it's a method of optimizing the search results in the best interest of search engine users (hardcore spammers and scraper site operators may not agree).
Note that the robots META tag is for use in HTML documents only. If you also offer your content in PDF or DOC format and don't want those files to show up in search results, store them in a directory disallowed in robots.txt, or disallow these file extensions altogether.
[1] Do not use search engine specific values in the standard robots META tag. Add a separate META tag per search engine, for example:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
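To illustrate how the standard tag and an engine-specific tag might sit side by side, here is a hedged sketch of a complete <HEAD> section. The page title is made up, and the combination of values is only an example, not a recommendation:

```html
<HEAD>
<TITLE>Example page</TITLE>
<!-- Standard directives, addressed to all crawlers -->
<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
<!-- Google-specific directives go in their own tag -->
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE, NOSNIPPET">
</HEAD>
```

Splitting the values this way keeps the standard tag limited to directives every crawler is expected to understand.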
Author: Sebastian
Last Update: Monday, June 20, 2005