Since Google launched the Sitemap protocol in June 2005, webmasters and search engine optimizers have had to rethink how they deal with search engine crawlers. Assuming Google will remove the Sitemap protocol's 'beta' label faster than the one on its search engine, this article tries to give web site owners an idea of where to place Google Sitemaps in their toolset. To define the playground, it starts as a tutorial on supporting and steering search engine crawlers.


Index

Basic Search Engine Crawler Support
Supporting crawlers in indexing a web site

Identifying and Tracking SE Crawling
Detecting search engine spiders, tracking and analyzing their behavior

The Gatekeeper: robots.txt
Preventing search engine crawlers from fetching particular files and directories

URL Specific Control: the Robots META Tag
Telling search engine spiders how to index and cache a particular page

Link Specific Regulation: REL=NOFOLLOW
Preventing search engines from interpreting a link as a vote for the link target

Tagging irrelevant page areas: class=robots-nocontent
How to make cluttered page areas, such as blocks with ads, unsearchable. The class name robots-nocontent can be applied to everything not related to the page's main content.

User and Crawler Friendly Navigation
Leading search engine bots to the content they shall index

Search Engine Friendly Query Strings
If you can't avoid query strings in URLs, keep them short

What Google's Sitemap Protocol May Change
Educating Googlebot and (hopefully, in the future) other crawlers too

Recap: Methods to Support Search Engines in Crawling and Ranking
The webmaster's toolset to support and control search engine spiders



Basic Search Engine Crawler Support


If you want SE spiders to fetch your content, the most important hint to a crawler is a link pointing to the page that is already known to the search engine. Other hints are URL submissions, unlinked URLs found on the web, and perhaps, even now, directory listings. SEs consider pages with no incoming links pretty useless and usually don't bother indexing them (by the way, a page without outgoing links may be considered useless too). That means: forget submitting your stuff to the major search engines and concentrate your efforts on linkage.

To attract SE spiders, acquire valuable inbound links from related web sites. To keep SE crawlers interested in your site, provide a natural link scheme, avoiding too many hops to the last page in the hierarchy. Search engine web robots are designed to find valuable content for search engine users. Ranking algorithms analyze a site's internal linking and honor an easy and user friendly navigation. There is nothing wrong with a few shortcuts implemented for robots, but you really should try to design a navigation scheme that leads both users and crawlers on the shortest way to the content deeply buried in the site's hierarchy.

Think of the search engine crawler as a user. Build your site to be comfortable for your visitors, then implement special crawler support where it is needed. Steering and supporting search engine crawling is basically done by steering and supporting visitors on their way to the content they are interested in.

Stay away from cloaking if you're keen on free and highly targeted search engine traffic. Do not deliver 'search engine optimized versions' of your pages to crawlers. Feed spiders with the page as seen by users. There are very few tolerated exceptions to this rule, for example geo targeting and hiding user tracking from robots.



Identifying and Tracking SE Crawling


Search engine crawlers can be identified by their user agent. For example, Google's web robots (Googlebot and Googlebot-Image) provide the string 'Googlebot' in the HTTP user agent name. You can also detect crawlers by IP address, but as long as your intention is not cheating you don't need to maintain bulletproof IP lists. Here is a PHP example:


function isSpider ( $userAgent ) {
    if ( stristr($userAgent, "Googlebot")   || /* Google */
         stristr($userAgent, "Slurp")       || /* Inktomi/Y! */
         stristr($userAgent, "MSNBOT")      || /* MSN */
         stristr($userAgent, "teoma")       || /* Teoma */
         stristr($userAgent, "ia_archiver") || /* Alexa */
         stristr($userAgent, "Scooter")     || /* Altavista */
         stristr($userAgent, "Mercator")    || /* Altavista */
         stristr($userAgent, "FAST")        || /* AllTheWeb */
         stristr($userAgent, "MantraAgent") || /* LookSmart */
         stristr($userAgent, "Lycos")       || /* Lycos */
         stristr($userAgent, "ZyBorg")         /* WISEnut */
    ) return TRUE;
    return FALSE;
}

if (isSpider(getenv("HTTP_USER_AGENT"))) {
    $useSessionID = FALSE;
    $logAccess = TRUE;
}


This example shows just a snapshot. Search for other user agents used by search engine crawlers and compile your own list.

Before your scripts close the database connection, call a function which logs the crawler's visit in a database table. After outputting the final close tag of the page, do a flush() before you insert the tuple into the log table. This ensures complete content delivery, just in case of delays during the logging process. In your log table, index all attributes appearing in WHERE clauses and GROUP BY statements. On very large sites, refine this basic procedure.
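Here is a minimal sketch of such a logging call. It assumes a PDO connection in $db and a hypothetical crawler_log table with user_agent, request_uri and requested_at columns - adapt names and schema to your environment:


function logCrawlerVisit($db, $userAgent, $uri) {
    /* assumed table and columns: crawler_log(user_agent, request_uri, requested_at) */
    $stmt = $db->prepare(
        "INSERT INTO crawler_log (user_agent, request_uri, requested_at)
         VALUES (?, ?, NOW())"
    );
    $stmt->execute(array($userAgent, $uri));
}

/* after the closing HTML tag has been sent: */
flush(); /* ensure the complete page reached the client before logging */
if (isSpider(getenv("HTTP_USER_AGENT"))) {
    logCrawlerVisit($db, getenv("HTTP_USER_AGENT"), $_SERVER["REQUEST_URI"]);
}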

Write a few reports querying your log table, for example a tracker following each bot to learn where it starts and which links it follows. Also, you need statistics showing the crawling frequency by URL (server + requested file + query string) to find out which of your pages the spiders like most, and which of your spider food they refuse to eat. Study these reports frequently and improve your linking when you find rarely or even never spidered areas of your site. Give these pages a few links from often crawled pages, put up themed site maps linked from the root index page, and so on.
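A basic crawling frequency report can be as plain as this sketch, again assuming the crawler_log table used above:


/* crawling frequency by URL, most fetched first */
$sql = "SELECT request_uri, COUNT(*) AS fetches, MAX(requested_at) AS last_visit
        FROM crawler_log
        GROUP BY request_uri
        ORDER BY fetches DESC";
foreach ($db->query($sql) as $row) {
    printf("%s\t%d\t%s\n", $row["request_uri"], $row["fetches"], $row["last_visit"]);
}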



The Gatekeeper: robots.txt


The Robots Exclusion Protocol from 1994 defines "a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot". It is only a quasi-standard, but the crawlers sent out by the major search engines do comply.

robots.txt is a plain text file located in the root directory of your server. Web robots read it before they fetch a document. If the document the bot is going to fetch is excluded for that particular robot by statements in the robots.txt file, the bot will not request it. The syntax is described in the Robots Exclusion Protocol documentation; run your robots.txt through a validator to check it.

In order to track spider visits, your robots.txt should be a script which logs each request for robots.txt in a database table. Here is an example for Apache/PHP:


Configure your webserver to parse .txt files for PHP, e.g. by adding this statement to your root's .htaccess file:

AddType application/x-httpd-php .htm .txt

Now you can use PHP in all .php, .htm, and .txt files. For security reasons, ensure your users cannot upload .txt files. http://www.yourdomain.com/robots.txt then behaves like any other PHP script.
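A stripped-down robots.txt script could look like this sketch; it assumes the isSpider() and logCrawlerVisit() helpers from above are provided, together with the $db connection, by a hypothetical include file:


<?php
/* robots.txt served as a PHP script */
require "crawler-lib.php"; /* assumed file providing $db, isSpider() and logCrawlerVisit() */
header("Content-Type: text/plain");
if (isSpider(getenv("HTTP_USER_AGENT"))) {
    logCrawlerVisit($db, getenv("HTTP_USER_AGENT"), "/robots.txt");
}
?>
User-agent: *
Disallow: /intranet/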


Your file system's directory structure has nothing to do with your linking structure, that is, your site's hierarchy. However, you can store scripts delivering content which is not meant for public access in directories protected by robots.txt. To shield this content from all unwanted views, add user/password protection.


User-agent: MyIntranetSpider
Disallow: /development/
Disallow: /extranet/

User-agent: *
Disallow: /intranet-login.htm
Disallow: /extranet-login.htm
Disallow: /developer-login.htm
Disallow: /development/
Disallow: /intranet/
Disallow: /extranet/
Disallow: /*.gif$
Disallow: /*.jpg$


This example allows 'MyIntranetSpider' to crawl the intranet directory while keeping all other web robots out. Note that file and directory names as well as query string arguments are case sensitive, and that excluding by file extension may not work with every web robot out there.

Google's crawler Googlebot and other Web robots support exclusion by patterns too, e.g.


User-agent: Googlebot
Disallow: /*affid=
Disallow: /*sessionID=
Disallow: /*visitorID=
Disallow: /*.aspx$

User-agent: Googlebot-Image
Disallow: /*.gif$

"*" matches any sequence of characters, "$" indicates the end of the URL.


The first example would disallow all dynamic URLs where the variable 'affid' (affiliate ID) is part of the query string. The second and third examples disallow URLs containing a session ID or a visitor ID. The fourth example excludes .aspx page scripts without a query string from crawling. The fifth example tells Google's image crawler to fetch all image formats except .gif files. Because not all Web robots understand this syntax, it makes sense to put in a robots META tag with a 'NOINDEX' value, just to be sure that search engines do not index unwanted pages.

Use Google's cool robots.txt validator to check your syntax and to simulate a crawler's behavior ruled by your Disallow statements.

If you add a User-agent: Googlebot section, you must duplicate all exclusions from the general User-agent: * section, because if Googlebot finds a section mentioning it by name, it will ignore all other sections. Other crawlers may handle this the same way, so create complete sections per spider if you really need to distinguish crawling exclusions between search engines.
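For example, a per-crawler setup with complete sections could look like this sketch (the paths are just placeholders); note that the shared exclusion is repeated in the Googlebot section:


User-agent: Googlebot
Disallow: /intranet/
Disallow: /*sessionID=

User-agent: *
Disallow: /intranet/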



URL Specific Control: the Robots META Tag


The Robots META tag, introduced in 1996, "allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links". It is put in the <HEAD> section of the HTML document:


<META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">


The content of the robots META tag contains directives separated by commas:
INDEX|NOINDEX - tells the SE spider whether the page may be indexed or not
FOLLOW|NOFOLLOW - tells the SE crawler whether it may follow links provided on the page or not
ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
NOODP - tells search engines not to use page titles and descriptions from the ODP on their SERPs
NOYDIR - tells Yahoo! Search not to use page titles and descriptions from the Yahoo! directory on the SERPs
NOARCHIVE - Google specific (see note 1 below), used to prevent archiving
NOSNIPPET - Google specific, prevents Google from displaying text snippets for your page on its SERPs


If you provide more than one view of the same content, use INDEX|NOINDEX to avoid indexing of duplicate content. On the page desired for indexing by search engines put "INDEX, FOLLOW"; on all alternate views put "NOINDEX, FOLLOW". Do not trick SE crawlers into indexing printer friendly layouts and the like; chances are you get banned sooner or later.
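For example, the printer friendly version of a page would carry

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

while the primary version keeps the default "INDEX, FOLLOW".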

Nowadays search engines are smart enough to extract text content from page templates. Comparing similar text found on different pages, they try to guess which source page is worth indexing. Unfortunately, these guesses are sometimes weird and they deliver unimportant URLs on the SERPs. By the way, filtering duplicate content is not a penalty - it's a method of optimizing the search results in the best interest of search engine users (hardcore spammers and scraper site operators may not agree).    

Note that the robots META tag is for use in HTML documents only. If you also offer your content in PDF or DOC format, and you don't want to find the PDF/DOC files in search results, store them in a directory protected by robots.txt or disallow these extensions in general.
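For example, a robots.txt along these lines keeps such files out of the index (the /downloads/ path is just a placeholder, and remember that the wildcard syntax is not understood by every robot):


User-agent: *
Disallow: /downloads/
Disallow: /*.pdf$
Disallow: /*.doc$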




Note 1: Do not use search engine specific values in the standard robots META tag. Add a separate META tag per search engine, for example:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">




Link Specific Regulation: REL=NOFOLLOW


Google introduced the NOFOLLOW value for the <A> tag's REL attribute in January 2005 as an instrument to prevent comment spam. Yahoo, MSN and guestbook/forum/blog software makers quickly joined the initiative. In the meantime this syntax has found its way into Google's guidelines, and Google reps encourage webmasters to use it anytime they can't vouch for a link.

If you can't vouch for a link, for example because it's added by a user or scraped from a foreign source, add REL=NOFOLLOW to the link. If you want to make other use of the REL attribute, separate the values by spaces. Example links:


<A HREF='http://foreignDomain.com/anyPage.html' REL='NOFOLLOW'>anchor text</A>
<A HREF='http://foreignDomain.com/anyPage.html' REL='external nofollow'>anchor text</A>


This is not a negative vote for the page you link to. Use REL=NOFOLLOW carefully and don't abuse it. Links marked with REL=NOFOLLOW do not pass PageRank™ to the linked page, which means they do not contribute to the link popularity of the target. There are good reasons not to use REL=NOFOLLOW to hoard PageRank™. First, PageRank™ hoarding is easy to discover and you will earn a ranking penalty for over-optimizing. Second, other webmasters are smart too and will cancel link trades if you cheat.

Search engine crawlers harvest even unlinked URLs. In the rare cases where it's necessary to hide a URL from spiders completely, don't output it server-sided; use JavaScript to write the URL string in the web browser instead.



Tagging irrelevant page areas: class=robots-nocontent


Telling a search engine that particular page areas aren't related to a page's core content was a problem until Yahoo! introduced the "robots-nocontent" class name in May 2007. Perhaps other search engines will follow and support this mechanism too. Google has something similar, called section targeting, for the AdSense crawler, but it puts the crawler directives in HTML comments instead of the class attribute.

Yahoo's implementation of a great idea is based on the draft of a flawed microformat, and it is somewhat hapless. Using a class name to apply crawler directives comes with a lot of work for Webmasters of (not only) static sites. Introducing CSS-like syntax in robots.txt to apply crawler directives to existing class names and DOM-IDs as well would have been a way better approach. However, here is how it works:

<div class="css-class robots-nocontent"> [any X/HTML] </div>

X/HTML classes are designed to be populated with multiple values in a space delimited list. That means a class name cannot contain a space, and it shouldn't contain characters other than a-z, 0-9, hyphens and underscores. Until now the class attribute has been used for formatting purposes only, so multiple class names per X/HTML element are somewhat uncommon even to CSS-savvy Web designers.

The class attribute can be used with every X/HTML element in BODY. The predefined robots-nocontent class name takes effect on child nodes when it is assigned to a parent node. Say you have a P element which contains several A, B, EM and STRONG elements. When the P element is tagged with the robots-nocontent class name, the A, B, EM and STRONG elements within the paragraph inherit this attribute value. Hence an elegant implementation would assign class="robots-nocontent" to DIV or SPAN elements (or table rows [TR] and cells [TD]) spanning a block of code and content which is not relevant to the page's message or core content. For example, on this page we could tell crawlers that the search box, the ads on the sidebars and the footer are not relevant to this article.
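A sketch of how that could look on a typical page (the markup and the class names besides robots-nocontent are just examples):


<div class="sidebar robots-nocontent">
  <!-- ads, search box, unrelated teasers -->
</div>
<div class="content">
  The article's main content stays untagged and fully searchable.
</div>
<div class="footer robots-nocontent">
  <!-- copyright notice, terms of shipping, site-wide links -->
</div>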

According to Yahoo's specs the robots-nocontent class name marks the tagged page area as "unsearchable". That means words and phrases within a robots-nocontent block will not appear in text snippets on the SERPs, and they will not trigger search query relevancy. So if absolutely-unique-string-on-the-whole-internet appears in a paragraph tagged with robots-nocontent, a search for exactly this phrase will not return the page.

Links within a block tagged with the robots-nocontent class name will be followed and they should pass reputation, so robots-nocontent does not come with an implicit rel=nofollow! This allows tagging of navigational page elements like unrelated site-wide links at the very bottom with robots-nocontent. Crawlers will follow the links and index the link destination, but the anchor text is not counted as main content.

Look at your pages and decide which page areas are useful for human visitors but useless for indexing purposes. Advertisements certainly belong to this category; repeated navigational text links and crawlable popup menus where every page links to every page should be tagged as unsearchable too. TOS excerpts or terms of shipping on e-commerce sites aren't search query relevant, and the same goes for quotes from content licenses or copyright notices.

Bear in mind that not all search engines support the robots-nocontent class name, and that some search engines may implement it differently from the inventor's specifications. We've seen that happen with rel=nofollow and other standards as well.



User and Crawler Friendly Navigation


A good website provides easy navigation, so that the site's visitors can easily find what they are looking for. Do not create a twelve levels deep hierarchy, regardless of how large your site is. You lose a large share of your audience (visitors as well as spiders) with every hop. Define logical nodes in few hierarchy levels, and offer cross-links to related topics where it makes sense.

If you can, go for menu bars on the top left side, which should include global navigation elements like site maps, search forms, contact forms, and the root and category index pages as well. Add more navigation bars as the user goes deeper into the site. Provide links to the most important sections at the top and bottom of each page, along with a path to the root where each level is a stand-alone link. Search boxes should use the GET method; predefined searches for popular topics, spread as links, are useful for users and crawlers alike. Whatever you do, make it simple, consistent, and logical.

Easy navigation implies fast page loads, so dump your buttons and go for text links instead. Considering keyword relevancy, a text link is a stronger vote for the linked page. With an image link, you have the ALT and TITLE attributes stuffed with text explaining what the linked page is all about. With a text link you can use the keywords as anchor text, which counts more because it's visible to surfers before they do a mouseover. The ideal internal link looks like


<A HREF='http://www.myDomain.com/keyword-phrase.html' TITLE='Learn more about keyword-phrase'>keyword-phrase</A>


Create themed sitemaps, each with no more than 100 links, which are linked from the root index page and related category pages. Do not rely on navigation by combo boxes, DHTML menus, or other fancy stuff which is pretty useless for most users and totally useless for search engine crawlers. Do not use frames, especially not for navigation.

As a tweak addressing crawlers and savvy users, you can put one or two lines with keywords in small but visible fonts, linked to related site maps, category index pages, and pages deeper in the hierarchy, at the bottom of your pages.



Search Engine Friendly Query Strings


A search engine friendly URL doesn't contain a question mark followed by a list of variables and their values. A search engine friendly URL is short and contains the keywords describing the page's content best, separated by hyphens. This not only helps with rankings, it helps visitors and especially bookmarkers too.

However, it's not always possible to avoid query strings. All the major search engines have learned to crawl dynamic pages, but there are limits:

• Search engine spiders dislike long and ugly URLs. Such URLs do get indexed from very popular sites, but on smaller web sites spiders usually don't bother fetching the page.
• Links from dynamic pages seem to count less than links from static pages when it comes to ranking based on link popularity. Also, some crawlers don't follow links from dynamic pages more than one level deep.
• To reduce server loads, search engine spiders crawl dynamic content slower than static pages. On large sites, it's pretty common that a huge amount of dynamic pages buried in the 3rd linking level and below never get indexed.
• Most search engine crawlers ignore URLs with session IDs and similar stuff in the query string, to prevent the spiders from fetching the same content over and over in infinite loops. Search engine robots do not provide referrers and they do not accept cookies, thus every request gets a new session ID assigned, and each variant of a query string creates a new unique URL.
• Keywords in variables and their values are pretty useless for ranking purposes, if they count at all. If you find a page identified by the search term in its query string on the SERPs, in most cases the search term is present as visible or even invisible text too, or it was used as anchor text of inbound links.
• There are still search engine crawlers out there which refuse to eat dynamic spider food.


Some rules of thumb on search engine friendly query strings:

• Keep them short. Fewer variables gain more visibility.
• Keep your variable names short, but do not use 'ID' or composites of entity names and 'ID'.
• Hide user tracking from search engine crawlers in all URLs appearing in (internal) links - see the sketch after this list. That's tolerated cloaking, because it helps search engines. Make sure to output useful default values when a page gets requested without a session ID and the client does not accept cookies.
• Keep the values short. If you can, go for integers. Don't use UUIDs/GUIDs and similar randomly generated stuff in query strings if you want the page indexed by search engines. Exception: in forms enabling users to update your database, use GUIDs/UUIDs only, because integers encourage users to play with them in the address bar, which leads to unwanted updates and other nasty effects.
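Here is a minimal sketch of such tolerated cloaking, reusing the isSpider() check from the crawler detection example above; the function name and the link building details are just illustrative, and it assumes session_start() has already been called:


/* append the session ID to internal links for cookie-less human visitors only */
function internalLink($path, $anchorText) {
    $href = $path;
    if (!isSpider(getenv("HTTP_USER_AGENT")) && !isset($_COOKIE[session_name()])) {
        $href .= (strpos($href, "?") === false ? "?" : "&")
               . session_name() . "=" . session_id();
    }
    return "<a href=\"" . htmlspecialchars($href) . "\">" . $anchorText . "</a>";
}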


Consider providing static looking URLs, for example by using mod_rewrite on Apache to translate static URLs into script URLs plus query string; make sure your server does not send a redirect response (301/302) in the process. Alternatively, on insert of tuples into a 'pages' database table, you can store a persistent file for each dynamic URL which calls a script on request. For example, a static URL like http://www.yourDomain.com/nutrition/vitamins-minerals-milk-4711.htm can include a script parsing the file name to extract the parameter(s) necessary to call the output script. In this example the keywords were extracted from the page's title, and the pageID '4711' makes the URL unique within the domain's namespace.
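On Apache, a mod_rewrite rule along these lines would do the trick; the script name and parameter names are just placeholders:


# .htaccess sketch: map static looking URLs to the output script without redirecting
RewriteEngine On
# /nutrition/vitamins-minerals-milk-4711.htm -> /page.php?category=nutrition&pageID=4711
RewriteRule ^([a-z-]+)/[a-z-]+-([0-9]+)\.htm$ /page.php?category=$1&pageID=$2 [L,QSA]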



What Google's Sitemap Protocol May Change


If you're not familiar with Google's Sitemap Service, please read the tutorial How to make use of Google SiteMaps first. Google's sitemap protocol, offered under the terms of the Attribution-ShareAlike Creative Commons License, is open for other search engines. A big player like Google has the power to talk its competitors into using this protocol. Thus, ignoring the 'beta' label (which is there to stay for a while), we are most probably discussing a future standard here.

For large dynamic web sites, Google SiteMaps is the instrument of choice to improve the completeness of search engine crawls. However, each and every web site out there should make use of Google SiteMaps. SEO firms should develop dynamic Google SiteMaps for all their clients - fully automated and reflecting a site's current state on every request by a search engine crawler. Content management systems should come with this feature built in.

In all the euphoria caused by Google's launch of Sitemaps, do not forget to read the fine print. Webmasters providing a Google SiteMap containing all crawlable URLs should not become lazy. Google SiteMaps do not replace established methods of web site crawling, they do not affect rankings, and they do not guarantee spidering and inclusion in a search engine's index. Google SiteMaps do give webmasters an opportunity to inform search engine crawlers about fresh content, relative priorities and change frequencies per URL, but these are hints, not commands.
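Those per-URL hints are expressed in the sitemap XML. Under the protocol version current at the time of writing (0.84), a minimal sitemap looks roughly like this; the URL, date and values are placeholders:


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.yourDomain.com/nutrition/vitamins-minerals-milk-4711.htm</loc>
    <lastmod>2005-06-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>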

Google SiteMaps should be used as an additional tool for steering and supporting search engine crawling. Provided all page attributes in the sitemap XML file(s) are populated honestly, search engine crawlers should learn to fetch even deeply buried content in time, while seldom touched URLs get scheduled in less frequent crawls. Webmasters should not expect too much in the beginning. Taking the learning curve into account, search engine crawlers will most probably have to perform sitemap based crawls in many iterations before they come close to 'perfection'. Also, webmasters should not forget that every now and then search engines have a very special understanding of importance, which can be pretty different from a site owner's point of view.



Recap: Methods to Support Search Engines in Crawling and Ranking


Let's recap the basic methods of steering and supporting search engine crawling and ranking:


  • Provide unique content. A lot of unique content. Add fresh content frequently.
  • Acquire valuable inbound links from related pages on foreign servers, regardless of their search engine ranking. Actively acquire deep inbound links to content pages, but accept home page links. Do not run massive link campaigns if your site is rather new. Let the amount of relevant inbound links grow smoothly and steadily to avoid red-flagging.
  • Put in carefully selected outbound links to on-topic authority pages on each content page. Ask for reciprocal links, but do not dump your links if the other site does not link back.
  • Implement a surfer friendly, themed navigation. Go for text links to support deep crawling. Provide each page with at least one internal link from a static page, for example from a site map page.
  • Encourage other sites to make use of your RSS feeds and the like. To protect the uniqueness of your site's content, do not put text snippets from your site into feeds or submitted articles. Write short summaries instead and use a different wording.
  • Use search engine friendly, short but keyword rich URLs. Hide user tracking from search engine crawlers.
  • Log each crawler visit and keep these data forever. Develop smart reports querying your logs and study them frequently. Use these logs to improve your internal linking.
  • Make use of the robots exclusion protocol to keep spiders away from internal areas. Do not try to hide your CSS files from robots.
  • Make use of the robots META tag to ensure that only one version of each page on your server gets indexed. When it comes to pages carrying partial content of other pages, make your decision based on common sense, not on any SEO bible.
  • Use rel="nofollow" in your links when you cannot vouch for the linked page (user submitted content in guestbooks, blogs ...). Do not hoard PageRank™.
  • Make use of Google SiteMaps as a 'robots inclusion protocol'.
  • Do not cheat the search engines.


Author: Sebastian
Last update: Monday, June 20, 2005
