Telling search engine spiders how to index and cache a particular page



The Robots META Tag, introduced in 1996, "allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links". It goes in the <HEAD> section of the HTML document:


<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">


The content of the robots META tag contains directives separated by commas:
INDEX|NOINDEX - Tells the SE spider whether the page may be indexed or not
FOLLOW|NOFOLLOW - Tells the SE crawler whether it may follow links provided on the page or not
ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
NOODP - Tells search engines not to use page titles and descriptions from the ODP on their SERPs
NOYDIR - Tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on its SERPs
NOARCHIVE - Google specific [1], used to prevent archiving
NOSNIPPET - Google specific, prevents Google from displaying text snippets for your page on its SERPs
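
To make the directive list concrete, here is a minimal Python sketch of how a CONTENT value could be reduced to effective settings. The function name `parse_robots_content` and the "last directive wins" handling of contradictory values are assumptions made for illustration; real crawlers may resolve conflicts differently.

```python
def parse_robots_content(content):
    """Reduce a robots META CONTENT value to effective settings (illustrative)."""
    # Defaults correspond to ALL, i.e. INDEX, FOLLOW.
    settings = {"index": True, "follow": True, "extras": set()}
    for raw in content.split(","):
        directive = raw.strip().upper()
        if directive in ("INDEX", "NOINDEX"):
            settings["index"] = directive == "INDEX"
        elif directive in ("FOLLOW", "NOFOLLOW"):
            settings["follow"] = directive == "FOLLOW"
        elif directive == "ALL":
            settings["index"] = settings["follow"] = True
        elif directive == "NONE":
            settings["index"] = settings["follow"] = False
        elif directive:
            # NOODP, NOYDIR, NOARCHIVE, NOSNIPPET and unknown values are
            # only collected here, not interpreted.
            settings["extras"].add(directive)
    return settings


print(parse_robots_content("NOINDEX, FOLLOW"))
print(parse_robots_content("INDEX, FOLLOW, NOODP"))
```

The sketch treats directive names as case-insensitive and ignores surrounding whitespace, so "noindex,follow" and "NOINDEX, FOLLOW" give the same result.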


If you provide more than one view of the same content, use INDEX|NOINDEX to keep duplicate content out of the index. On the page you want search engines to index, put "INDEX, FOLLOW"; on all alternate views, put "NOINDEX, FOLLOW". Do not trick SE crawlers into indexing printer-friendly layouts and the like; chances are you will get banned sooner or later.
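
If your pages come out of a template or script, the rule above can be applied in one place. The following Python sketch assumes a hypothetical helper `robots_meta_tag` and made-up URLs; it is one way to do it, not a prescribed method.

```python
def robots_meta_tag(is_primary_view):
    """Build the robots META tag for one view of a piece of content."""
    # The one view meant for the index gets INDEX; every alternate view
    # (print layout etc.) gets NOINDEX but keeps FOLLOW.
    content = "INDEX, FOLLOW" if is_primary_view else "NOINDEX, FOLLOW"
    return '<META NAME="ROBOTS" CONTENT="%s">' % content


# Hypothetical views of the same article; True marks the view to be indexed.
views = {
    "/article.html": True,
    "/article-print.html": False,
}
for url, is_primary in views.items():
    print(url, robots_meta_tag(is_primary))
```

Deciding per piece of content which single view is the primary one keeps the alternate views crawlable for their links without offering them for indexing.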

Nowadays search engines are smart enough to extract text content from page templates. Comparing similar text found on different pages, they try to guess which source page is worth indexing. Unfortunately, these guesses are sometimes off, and unimportant URLs end up on the SERPs. By the way, filtering duplicate content is not a penalty; it's a method of optimizing the search results in the best interest of search engine users (hardcore spammers and scraper site operators may not agree).

Note that the robots META tag is for use in HTML documents only. If you also offer your content in PDF or DOC format and you don't want the PDF/DOC files to appear in search results, store them in a directory protected by robots.txt, or disallow these file extensions in general.
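
The directory approach can be checked before you rely on it. The Python sketch below feeds a robots.txt to the standard library's `urllib.robotparser` and asks whether a URL may be crawled. The directory name `/downloads/`, the example.com URLs and the helper `may_crawl` are assumptions for illustration. Blocking by file extension is left out because it depends on pattern matching that not every robots.txt parser supports.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /downloads/ is where the PDF/DOC copies live.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
"""


def may_crawl(url, user_agent="*"):
    """Return True if ROBOTS_TXT allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_crawl("http://www.example.com/downloads/guide.pdf"))  # False
print(may_crawl("http://www.example.com/guide.html"))           # True
```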







[1] Do not use search engine specific values in the standard robots META tag. Add a separate META tag per search engine, for example:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">




Author: Sebastian
Last Update: Monday, June 20, 2005



Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited