Everything you need to know about Google Sitemaps

In June 2005, Google® launched a service called Google® SiteMaps to optimize Googlebot's web site crawling. It allows webmasters to submit new and modified URLs to Google's spider Googlebot. Google SiteMap submissions have no impact on rankings on Google's SERPs, nor do they influence PageRank™ calculations, but they will most probably help webmasters get their new stuff crawled by Googlebot faster than before. Although we can't predict how, or how much, this new service (still in BETA) will help site owners, we'll tell you how to make use of Google SiteMaps. [Update July 2005: It works like a charm, even better than expected. It improves search engine visibility to a great degree, as long as webmasters care about what they submit to Google.]


Index

Google SiteMaps - Overview

What is a Google SiteMap and how can this service work for you?

Understanding the Google SiteMap Protocol

How the communication between Google and the web site owner works

Populating the sitemap.xml File

How to populate the sitemap.xml file honestly and nevertheless get the most value out of its submission to Google

How to Create a Dynamic Google SiteMap XML File

Make your sitemap.xml file dynamic to ensure it delivers your recent changes whenever it is fetched by Google's bot

Ask Googlebot to Crawl New and Modified Pages on Your Web Site

How to submit your sitemap to Google and how to inform Googlebot about new and changed content

Google Sitemaps Crawler Stats

Google's sitemap program provides detailed crawler reports which make it easy to fix issues like broken links, conflicts with robots.txt exclusions and the like in no time. There is even a great robots.txt validator.

Is Google Sitemaps an Index Wiper?

Ensure your web site is in pretty good shape before you submit a sitemap

Google Sitemaps Myths and Fictions

About site owners influencing Google's rankings via sitemaps, and other fictions

Google SiteMap Discussions

Read here what webmasters have to say about their experiences with Google Sitemaps, starting with the first announcement

Links to Google SiteMap Tools and Generators

Web sites without an underlying database, and many smaller sites, can't use the dynamic approach to fully automating the Google SiteMaps channel outlined here, so here you go ...

Professional Services / Implementation of Google SiteMaps

If you prefer to buy your Google Sitemaps implementation instead of bothering with the technical details ...



Google SiteMaps - Overview


Usually Google's crawler Googlebot will find each and every page on your web site, as long as a link known to Google points to it. Once Googlebot has spidered a page, it returns every once in a while to check for updates. Shortly after Googlebot's visit, Google updates its index and changes the ranking of your pages. If you've changed the content (visible text, image descriptions...) of a page, Google will deliver your pages on its SERPs for new keywords or rank your pages differently. If you've added new pages and provide links to them, Google includes them in its index. That's a simplified picture, but to make use of Google's SiteMap service you don't need to understand the details. Hire an SEO if you want your web site ranked better.

This established procedure has its disadvantages, for Google and webmasters alike. It burns resources, and a high percentage of the results are pretty much useless. Googlebot has to crawl zillions of pages daily, just to find out that they weren't changed. Because Googlebot is so busy fetching pages archived in the stone age of the internet, it may find new content way too late. What was missing was a communication channel between Google and site owners, dedicated to adjusting the crawling process. Dealing with billions of pages on the net, Google had no chance to communicate with webmasters on an individual or per-site basis. Google SiteMaps now opens this channel for everybody, offering a method to exchange information on new and modified content in a timely manner. Both sides benefit: Google saves a whole lot of machine time and bandwidth costs, while site owners get their new content onto Google's SERPs earlier and reduce their server load, because Googlebot no longer spiders archived content too frequently.

Due to the nature of the beast, this channel needs full automation on both sides. Google, as operator of the service, has taken care of its end, but a few million webmasters around the globe are called on to implement suitable solutions that fully automate Google SiteMap generation and maintenance without reinventing the wheel. Google offers a Sitemap Generator, which requires Python 2.2 or higher installed on the web server. A couple of webmasters have posted their scripts on the boards, a few blogs offer solutions written in PHP, ASP and other programming languages, and a handful of tools are available from messages in Google's SiteMap Group. Sooner or later all content management systems will come with this functionality built-in.

Fortunately, there is a lot of useful stuff out there. Unfortunately, within the fast-growing pile of posts and code snippets scattered across message boards, blogs, usenet groups and search engine related web sites, a webmaster searching for an adaptable solution that fits a particular web site's needs is looking for a needle in a haystack. Provided a webmaster has enough code-monkey skills to customize a script, we don't offer just another superfluous piece of code, but a tutorial on how to make use of Google SiteMaps.



Understanding the Google SiteMap Protocol


First of all, Google's SiteMap service does not replace the established crawling procedure. It's offered as an addition to the old-fashioned spidering by following links. That means webmasters don't need to send each and every URL through this new channel. Googlebot will still find (all) pages, whether they are listed in the web site's sitemap or not. Also, Google SiteMaps do not make standard site maps obsolete. Googlebot will continue to follow links from these site navigation elements.


This said, here is how Google SiteMap works:

1. The webmaster compiles a list of useful URLs and adds a few optional attributes (date of last modification, priority and change frequency) to each URL entry. This list must be served as an XML file according to the sitemap protocol defined by Google. Usually a file named 'sitemap.xml' gets placed in the web server's root directory. Google accepts plain text files too, but processes sitemaps provided in XML format with a higher priority.

2. The webmaster submits the URL of the sitemap to Google. Google checks it for valid syntax and provides online stats showing the submission state. For accepted sitemaps, Google schedules a crawl using the information provided by the webmaster. Shortly after each download of a sitemap, Googlebot visits the web site and fetches new and modified content. From this point on, the established procedure applies.

3. On every change of content, the webmaster updates the sitemap and resubmits it to Google, and the cycle continues at step 2.


It's that easy. Even the XML format requires neither additional software nor any understanding of XML. Once the initial Google SiteMap implementation works, resubmits can be fully automated.
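
For illustration, here is a minimal sketch of a sitemap containing a single URL entry with nothing but the mandatory <loc> element (the complete output example further down shows the optional tags as well):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
    <url>
        <loc>http://www.yourdomain.com/</loc>
    </url>
</urlset>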



Populating the sitemap.xml File


As said before, it's not necessary to put every URL available from a web site into the sitemap, although Google encourages webmasters to submit even images and movies, which makes little sense without META data describing the content [1]. Google SiteMaps was launched to give webmasters an opportunity to tell Google which pages they consider valuable for search engine users. For example, if your contact page behaves dynamically depending on the referring page, you don't need to submit every permutation to Google. Also, don't bother submitting URLs excluded in your robots.txt. And, actually a no-brainer: don't submit doorway pages, duplicated content and the like; chances are good that Google will ignore your sitemaps after a while if you cheat.

Concentrate your efforts on pages which are hard to spider, for example dynamic URLs having many arguments in the query string, pages linked from dynamic pages, and pages deeply buried in your linking hierarchy. If you're using session IDs, provide Google with clean URLs (all randomly generated noise truncated). In the sitemap you can use long dynamic URLs up to 2048 characters.

Mass submissions of URLs are not a new thing, but the possibility to suggest how a search engine crawler should handle them is new and pioneering. Google's sitemap protocol defines three optional attributes of URLs: priority, change frequency and last modification. If you can't provide a particular attribute for a page (yet), skip it. The <url> tag is perfectly valid containing the page location alone. Put in additional information as you can, but don't try to populate these tags with more or less useless values just because they are defined.

The most important tag is <lastmod>, telling Google when a page was actually modified or created. This enables Googlebot to pick up fresh content in a targeted way, probably long before it would stumble upon the very first link pointing to it. By the way, changes of this attribute in the underlying database should trigger a sitemap resubmission. It seems important to avoid abuse of <lastmod>, in the webmaster's own best interest. Minor template changes affecting a bunch of pages are no reason to submit all pages based on the altered template as modified. Modifications are different wording, additional text information and brand new content.

The <priority> tag is meant as a hint to balance crawling capacities. Say a sitemap contains 10,000 modified URLs, but Googlebot's time slot scheduled for the web site in question would allow the fetching of only 1,000 pages. Now Googlebot should extract 1,000 URLs ordered by priority and probably last modification from the sitemap, fetch these pages and return later on to eat the 9,000 remaining pages.

Google says 'Search engines use this information when selecting between URLs on the same site, so you can use this tag to increase the likelihood that your more important pages are present in a search index.' This statement made many site owners hope that they could influence rankings on Google's SERPs. That's wishful thinking. It simply means that Googlebot will probably crawl high-priority URLs before low-priority pages.

Assign reasonable priorities from 0.0 to 1.0 to your pages. For example, a brand new article should get a higher priority than the more or less static home page. Priorities are interpreted relative to the other pages on the same web site. The best advice is: honestly assign high priorities to frequently changed pages which are of great interest to your users, and low priorities to static stuff.

The <changefreq> tag seems to be meant as an educated guess, just a hint to the crawler. The list of valid values is short: "always", "hourly", "daily", "weekly", "monthly", "yearly" and "never". Irregular changes are not covered, so assign your best guess or even skip the tag and rely on <lastmod>. "Never" stands for archived content. Use "always" for frequently updated news feeds and other stuff triggering content changes on (nearly) every page view.
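
If your content management system stores a page type, a small helper along these lines can translate the advice above into default attribute values. This is just an illustrative sketch: the page types and values are assumptions, not part of the sitemap protocol.

function defaultSitemapAttributes ($pageType) {
    // illustrative defaults only; adjust them to your site's real update patterns
    switch ($pageType) {
        case "news":    return array("changefreq" => "daily",   "priority" => "1.0");
        case "article": return array("changefreq" => "weekly",  "priority" => "0.8");
        case "archive": return array("changefreq" => "never",   "priority" => "0.3");
        default:        return array("changefreq" => "monthly", "priority" => "0.5");
    }
}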




[1]

META data describing non-textual content means title/alt text in image elements, anchor text in links, and surrounding text, as well as META description tags. HTML pages get crawled more frequently than images or videos. Image and video URIs harvested during regular crawls get queued into the specific crawling schedules. Since there is a relation between descriptive META data and non-textual content, it makes good sense to submit all kinds of content via sitemaps. It surely helps Google to make its image/video search more current.




How to Create a Dynamic Google SiteMap XML File


Scheduling batch jobs to generate RSS feeds and similar stuff like the sitemap.xml file is way too complex a procedure for such a simple task, and this approach is fault-prone. Better to implement your sitemap generator as a dynamic XML file, that is, a script reflecting the current state of your web site on each request [1]. After submitting a sitemap to Google, you don't know when Googlebot will find the time to crawl your web site. Most probably you'll release a lot of content changes between the resubmit and Googlebot's visit. Also, crawlers of other search engines may be interested in your XML sitemap in the future. There are other advantages too, so you really should ensure that your sitemap reflects the current state of your web site every time a web robot fetches it.

You can use any file name for your sitemap. Google accepts what you submit; 'sitemap.xml' is just a default. So you can go for 'sitemap.php', 'sitemap.asp', 'mysitemap.xhtml' or whatever scripting language you prefer, as long as the content is valid XML. However, there are good reasons to stick with the default 'sitemap.xml'. Here is an example for Apache/PHP:


Configure your webserver to parse .xml files for PHP, e.g. by adding this statement to your root's .htaccess file:

AddType application/x-httpd-php .htm .xml .rss

Now you can use PHP in all .php, .htm, .xml and .rss files. http://www.yourdomain.com/sitemap.xml behaves like any other PHP script. Note: once .xml files are parsed as PHP, static XML files may produce a PHP error, because the '<?xml' declaration in the XML version header can be mistaken for a PHP (short) open tag; print the declaration from PHP instead of writing it literally.
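
If you'd rather not have all .xml files parsed as PHP, an alternative is to serve the default file name from a PHP script via mod_rewrite. The sketch below assumes Apache with mod_rewrite enabled and a script named sitemap.php (both assumptions); add it to your root's .htaccess file:

RewriteEngine On
RewriteRule ^sitemap\.xml$ sitemap.php [L]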



You don't need XML software to produce the pretty simple XML of Google's sitemap protocol. The PHP example below should be easy to understand, even if you prefer another programming language. Error handling as well as elegant programming was omitted to make the hierarchical XML structure transparent and understandable.


$isoLastModifiedSite = "";
$newLine = "\n";
$indent = " ";
if (!$rootUrl) $rootUrl = "http://www.yourdomain.com";

$xmlHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>$newLine";


$urlsetOpen = "<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"
xsi:schemaLocation=\"http://www.google.com/schemas/sitemap/0.84
http://www.google.com/schemas/sitemap/0.84/sitemap.xsd\">$newLine";
$urlsetValue = "";
$urlsetClose = "</urlset>$newLine";

function makeUrlString ($urlString) {
    return htmlentities($urlString, ENT_QUOTES, 'UTF-8');
}

function makeIso8601TimeStamp ($dateTime) {
    if (!$dateTime) {
        $dateTime = date('Y-m-d H:i:s');
    }
    if (is_numeric(substr($dateTime, 11, 1))) {
        $isoTS = substr($dateTime, 0, 10) ."T"
                 .substr($dateTime, 11, 8) ."+00:00";
    }
    else {
        $isoTS = substr($dateTime, 0, 10);
    }
    return $isoTS;
}

function makeUrlTag ($url, $modifiedDateTime, $changeFrequency, $priority) {
    GLOBAL $newLine;
    GLOBAL $indent;
    GLOBAL $isoLastModifiedSite;
    $urlOpen = "$indent<url>$newLine";
    $urlValue = "";
    $urlClose = "$indent</url>$newLine";
    $locOpen = "$indent$indent<loc>";
    $locValue = "";
    $locClose = "</loc>$newLine";
    $lastmodOpen = "$indent$indent<lastmod>";
    $lastmodValue = "";
    $lastmodClose = "</lastmod>$newLine";
    $changefreqOpen = "$indent$indent<changefreq>";
    $changefreqValue = "";
    $changefreqClose = "</changefreq>$newLine";
    $priorityOpen = "$indent$indent<priority>";
    $priorityValue = "";
    $priorityClose = "</priority>$newLine";

    $urlTag = $urlOpen;
    $urlValue     = $locOpen .makeUrlString("$url") .$locClose;
    if ($modifiedDateTime) {
     $urlValue .= $lastmodOpen .makeIso8601TimeStamp($modifiedDateTime) .$lastmodClose;
     if (!$isoLastModifiedSite) { // last modification of web site
         $isoLastModifiedSite = makeIso8601TimeStamp($modifiedDateTime);
     }
    }
    if ($changeFrequency) {
     $urlValue .= $changefreqOpen .$changeFrequency .$changefreqClose;
    }
    if ($priority) {
     $urlValue .= $priorityOpen .$priority .$priorityClose;
    }
    $urlTag .= $urlValue;
    $urlTag .= $urlClose;
    return $urlTag;
}


Now fetch the URLs from your database. It's a good idea to have a boolean attribute to exclude particular pages from the sitemap. Also, you should have an indexed date-time attribute storing the last modification. Your content management system should expose the attributes ChangeFrequency, Priority, PageInSitemap and perhaps even LastModified on the user interface. Example query: "SELECT pageUrl, pageLastModified, pagePriority, pageChangeFrequency FROM pages WHERE pages.pageSiteMap = 1 AND pages.pageActive = 1 AND pages.pageOffsite <> 1 ORDER BY pages.pageLastModified DESC". Then loop over the result set and call makeUrlTag() for each row (a fuller sketch follows below):


$urlsetValue .= makeUrlTag ($pageUrl, $pageLastModified, $pageChangeFrequency, $pagePriority);
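
For completeness, here is a minimal sketch of the surrounding loop, assuming PDO and the table/column names from the example query above (adjust to your database layer):

// assumption: $db is an existing PDO connection
$sql = "SELECT pageUrl, pageLastModified, pagePriority, pageChangeFrequency
        FROM pages
        WHERE pages.pageSiteMap = 1 AND pages.pageActive = 1
          AND pages.pageOffsite <> 1
        ORDER BY pages.pageLastModified DESC";
foreach ($db->query($sql) as $page) {
    // pageUrl is assumed to store the fully qualified URL
    $urlsetValue .= makeUrlTag ($page['pageUrl'],
                                $page['pageLastModified'],
                                $page['pageChangeFrequency'],
                                $page['pagePriority']);
    // on very large sites, print the tag and flush() here instead of
    // concatenating everything in memory (see below)
}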


After the loop you can add a few templated pages/scripts which are not stored as content pages, whether they change on each content modification or not:


if (!$isoLastModifiedSite) { // last modification of web site
    $isoLastModifiedSite = makeIso8601TimeStamp(date('Y-m-d H:i:s'));
}
$urlsetValue .= makeUrlTag ("$rootUrl/what-is-new.htm", $isoLastModifiedSite, "daily", "1.0");


Now write the complete XML. If you're dealing with a larger number of pages, you should print each <url> tag on each iteration, followed by a flush(), instead of concatenating everything in memory. If you publish tens of thousands of pages, you should provide multiple sitemaps and a sitemap index. Each sitemap file that you provide must contain no more than 50,000 URLs and must be no larger than 10MB.


header('Content-type: application/xml; charset="utf-8"',true);
print "$xmlHeader
$urlsetOpen
$urlsetValue
$urlsetClose
";


Google will process all <url> entries where the URL begins with the URL of the sitemap file. If your website is distributed over many domains, provide one sitemap per domain. Subdomains and the 'www' prefix are treated as separate domains. URLs like 'http://www.domain.us/page' are not valid in a sitemap located on 'http://domain.us/'. The script's output should look something like this:


<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
    <url>
        <loc>http://www.smart-it-consulting.com/</loc>
        <lastmod>2005-06-04T00:00:00+00:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.6</priority>
    </url>
    <url>
        <loc>http://www.smart-it-consulting.com/database/progress-database-design-guide/</loc>
        <lastmod>2005-06-04T00:00:00+00:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>1.0</priority>
    </url>
    <url>
        <loc>http://www.smart-it-consulting.com/catindex.htm?node=2</loc>
        <lastmod>2005-05-31T00:00:00+00:00</lastmod>
        <priority>0.5</priority>
    </url>
    <url>
        <loc>http://www.smart-it-consulting.com/what-is-new.htm</loc>
        <lastmod>2005-06-04T08:31:12+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
</urlset>


Feel free to use and customize the code above. If you do so, put this comment into each source code file containing our stuff:


COPYRIGHT (C) 2005 BY SMART-IT-CONSULTING.COM
* Do not remove this header
* This program is provided AS IS
* Use this program at your own risk
* Don't publish this code, link to http://www.smart-it-consulting.com/ instead




[1]

On large sites it may be a good idea to run the script querying the database on another machine, to avoid web server slowdowns. Also, using the sitemap index file creatively can help: reserve one or more dynamic sitemap files for fresh content and provide static sitemaps, updated weekly or so, containing all URLs. The <sitemap> tag of the sitemap index offers a <lastmod> tag to tell Google which sitemaps have been modified since the last download. Use this tag to avoid downloads of unchanged static sitemaps.
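
A sitemap index following this pattern might look like the snippet below (the file names are assumptions); thanks to the <lastmod> values, Google should be able to skip the static archive sitemap as long as it hasn't changed:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
    <sitemap>
        <loc>http://www.yourdomain.com/sitemap-fresh.xml</loc>
        <lastmod>2005-06-04T08:31:12+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.yourdomain.com/sitemap-archive.xml</loc>
        <lastmod>2005-05-31T00:00:00+00:00</lastmod>
    </sitemap>
</sitemapindex>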




Ask Googlebot to Crawl New and Modified Pages on Your Web Site


Create a Google Account, then go to the Google SiteMap Submit Page. Enter your sitemap URL and wait for the first download displayed on the stats page. If the status is not 'Ok', correct the errors and resubmit your sitemap until it's approved. Bookmark the stats page and check back every once in a while (and after script changes!) to track Googlebot's usage of your sitemap.

You don't need to resubmit your sitemap manually. Being a smart webmaster, you'll automate the resubmits. The easiest way to automate sitemap resubmits to Google is to trigger an HTTP request whenever released content pages change. After updating your database, call a function to ping Google. Since your dynamic sitemap file is always up to date, you don't need to do anything more. A PHP example:


function pingGoogleSitemap ( $rootUrl ) {

    $fileName = "http://www.google.com/webmasters/sitemaps/ping?sitemap=" .urlencode("$rootUrl/sitemap.xml");

    $url = parse_url($fileName);
    if (!isset($url["port"])) $url["port"] = 80;
    if (!isset($url["path"])) $url["path"] = "/";

    $fp = @fsockopen($url["host"],
                     $url["port"],
                     $errno, $errstr, 30);

    if ($fp) {
        $head = "";
        $httpRequest = "HEAD ". $url["path"] ."?"
                     .$url["query"] ." HTTP/1.1\r\n"
                     ."Host: ". $url["host"] ."\r\n"
                     ."Connection: close\r\n\r\n";

        fputs($fp, $httpRequest);
        while(!feof($fp)) $head .= fgets($fp, 1024);
        fclose($fp);

        return $head;

    }

return "ERROR";

}


This function returns something like "HTTP/1.1 200 OK Content-Type: text/html; charset=UTF-8 Content-Language: en Cache-control: private Content-Length: 0 Date: Sat, 04 Jun 2005 21:41:00 GMT Server: GFE/1.3" or "ERROR" on failure. If the string doesn't contain the return code "200 OK" something is fishy. Resubmits via ping don't appear in your account's sitemap stats.
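
A typical call right after a content update could look like this sketch (the logging is up to you):

$response = pingGoogleSitemap("http://www.yourdomain.com");
if (strpos($response, "200 OK") === false) {
    // the ping failed or returned an unexpected status; log it and retry later
    error_log("Google sitemap ping failed: " .$response);
}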



If your content changes frequently, you should set up a cron job pinging Google twice a working day or so, instead of bothering Google with a ping on each record change.
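
For example, a crontab entry like this one (the script path is a made-up placeholder) would ping Google at 9:00 and 17:00 from Monday through Friday:

0 9,17 * * 1-5 /usr/bin/php /path/to/ping-google-sitemap.php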

To enable auto-discovery of your sitemap (supported by Google, Yahoo, MSN and Ask), add this line to your robots.txt file:
Sitemap: http://www.example.com/sitemap.xml
(replace domain and file name if necessary)



Google Sitemaps Crawler Stats


Before Yahoo's Site Explorer went live, Google already provided advanced statistics in the sitemap program. The initial 'lack of stats' produced hundreds of confused posts in the Google Groups. When Google Sitemaps was announced on June/02/2005, Shiva Shivakumar stated: "We are starting with some basic reporting, showing the last time you've submitted a Sitemap and when we last fetched it. We hope to enhance reporting over time, as we understand what the webmasters will benefit from". Google's Sitemaps team closely monitored the issues and questions brought up by webmasters, and since August/30/2005 there are enhanced stats (last roll-out: February/06/2006). Here is how it works.

Google provides crawling reports for sites whose sitemap was submitted via a Google account. To view the reports, a site's ownership must be verified first. On the Sitemap Stats Page each listed sitemap has a verify link (which changes to a stats link after verification). On the verification form Google provides a case-sensitive unique file name, which is assigned to the current Google account and the sitemap's location.

Uploading and submitting an empty verification file with this name per domain tells Google that it is indeed the site owner who requests access to the site's crawler stats. If a sitemap is located in the root directory, the verification file must be served from the root too (detailed instructions here). Do not delete the verification file after the verification procedure, because Google checks for its existence periodically. If you delete this file, you'll have to verify your site again.

That means that, for example, bloggers on free hosts submitting their RSS feed as a sitemap don't get access to their stats, because they can't upload files to their subdomain's root directory. In the first version, Google didn't provide stats for mobile sitemaps [1], that is, Google Sitemaps restricted to content made for WAP devices (cell phones, PDAs and other handhelds) got basic reports only. Verification and enhanced stats for mobile sitemaps were added on November/16/2005.

Correct setups:


http://www.domain.com/GOOGLE11e5844324b7354e.html [verification file]
http://www.domain.com/sitemap.xml
http://www.domain.com/directory/sitemap.xml

http://www.domain.com/directory/GOOGLE11e5844324b7354e.html
http://www.domain.com/sitemap.xml
http://www.domain.com/directory/sitemap.xml
http://www.domain.com/directory/directory/sitemap.xml


In the first example, crawling stats get enabled for all URIs in and below the domain's root. In the second example, crawling stats get enabled for all URIs in and below /directory/, but not in upper levels. By the way, the crawler reports provide information on URIs spidered from sitemaps and URIs found during regular crawls by following links, regardless of whether the URI is listed in a sitemap or not.

Sounds pretty easy, but there is a pitfall. For security reasons, Google will not verify a location if the web server's response to invalid page requests is not a 404 [2] (the error message says "We've detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.", but it occurs even on redirects, e.g. 302). For example, if a site's customized error document is referenced by a full URL, the HTTP response code is not the requested 404: the server redirects to the custom error page and sends a 302 header, which cannot be overwritten by the error page itself. Here is a .htaccess example:


# 302, doesn't work for verification purposes:
ErrorDocument 404 http://www.domain.com/err.htm?errno=404

# 404, verification goes thru, because the custom error document
# is disabled:
#ErrorDocument 404 http://www.domain.com/err.htm?errno=404

# 404, verification process goes thru, because the server
# doesn't redirect:
ErrorDocument 404 /err.htm?errno=404

Hint: the verification process is a one-time thing per domain.

Once the verification process is finished, the crawler stats (or rather, problem reports) are accessible from the sitemaps stats page. In the first version, released August/30/2005, those reports showed all kinds of errors per URI, but said nothing about successful fetches. However, because every error status was linked to an explanation, this tool made it pretty easy to fix the issues [3]. On August/30/2005 I wrote in my wishlist "I could think of enhancements though":

The error 'HTTP Error' doesn't tell the error code; it's linked to the 'page not found' FAQ entry. However, 'HTTP Error' occurs on all sorts of problems, for example crawling of URIs in password-protected areas, harvested from the toolbar or linked from outside. Providing the HTTP response code, as well as the date and time of crawling, would simplify debugging.

In case of invalid URIs found on foreign pages, it would be extremely helpful to know which page contains the broken link. Firing up an email to the other site's webmaster would make everyone happy, including Google.

Well, probably I'm greedy. Google's crawler report is a great tool, kudos to the sitemaps team! In combination with my spider tracker, sitemap generator and some other tools I have everything I need to monitor and support the crawl process.


Google listens, and here is what the sitemaps team launched on November/16/2005: enhanced web site statistics for everybody. Everybody? Yep, those statistics are accessible to every site owner, regardless of whether the site makes use of Google Sitemaps or not. Everybody can verify a site to get Google's view of it. Here is what you get when you click on the stats link:

Query stats, that is a list of the top 5 [4] queries to Google that return pages from your site, and the top 5 [4] search query clicks, that is the most popular search terms (keyword phrases) that directed traffic from Google's SERPs to your site, based on user clicks counted by Google. Google runs click tracking periodically, so those stats aren't based on the real numbers of visitors per search term, but as statistical trends they are useful.

On March/01/2006 Google added the top position (highest ranking on the SERPs, averaged over the last three weeks) to each keyword phrase listed in the stats. It's nice to see that results from the second or even third SERP often get more traffic than expensive money terms ranked in the top five positions. We'll see whether Google will expand the view from the current maximum of 20 lines.

Crawl stats, that is graphically enhanced stats on crawler problems and PageRank distribution. The HTTP error list now provides HTTP error codes, but the URL of the source page carrying the broken link is still missing. Page requests resulting in a custom error page are listed as "general HTTP error", regardless of whether the error page responds with a 200 or 302 return code. Also, it seems Google still limits the number of errors shown. Besides URLs where Googlebot ran into HTTP errors, you get information on unreachable URLs (connectivity issues), URLs restricted by robots.txt, URLs not followed, and URLs timed out. The list of URLs not followed is interesting; it shows pages Googlebot began to crawl but was unable to follow due to redirect chains [5]. As for the PageRank distribution within a site, those stats seem to be based on the real PageRank used in rankings, not the outdated snapshot used to display green pixels on the toolbar. That, and the page with the highest PageRank, are neat goodies for the PageRank addicts.

Page analysis provides stats on content types like text/html, application/octet-stream, text/plain, application/pdf and so on, as well as encodings like UTF-8, ISO-8859-1 (Latin-1), US-ASCII, or CP1252 (Windows Latin-1). Since February 2006 Google also provides a site-wide word analysis of textual content and external anchor text.

Index stats provide help on advanced operators like site:, allinurl:, link:, info: and related:, along with linked sample queries. Note that Google does not show all links pointing to your site. This page should be very useful for site owners not familiar with Google's advanced search syntax.

Every once in a while I got "Data is not available at this time. Please check back later for statistics about your site." responses, but after a while the data I had seen previously reappeared. Don't worry if this happens to you too.

Overall, I'm impressed. I still have wishes, but honestly Google's crawler stats deliver way more useful stuff than I expected in the short period of time since the launch, and way more than any other search engine (well, there is still an issue w.r.t. inbound links, but I doubt Google will ever fix it, and there is an alternative). I promise not to refer to Google's crawler stats as "extracts of crawler problem reports" any more.


Update February/06/2006: The Sitemaps team has launched a few very cool goodies.

  • The robots.txt validation tool shows when Googlebot last fetched the robots.txt (this usually happens once per day; the file is then cached) and whether it blocks access to the home page or not.

    The robots.txt contents are displayed in a text area where a webmaster can edit them, and Google simulates accesses to particular URLs. That's really cool. Google is the only search engine supporting wildcard syntax in robots.txt, and blocking particular pages, or even URLs with a particular variable or value in the query string, can be a tough job. Now it's easy: just fire up Google's robots.txt validator, enter the URLs to exclude in a text box, then change the robots.txt until the disallow statements do exactly what they are supposed to do. It works like a charm; it's even possible to optimize a robots.txt file for different user agents, as well as for the plain old standard which doesn't support Google's extended syntax.

    If a statement is erroneous, Google's robots.txt syntax checker issues a warning. That's pretty useful, but it may be misleading in case the robots.txt interpreter runs into a section for another web robot supporting syntax Google didn't implement, for example crawl-delay and such. So look at the affected section before you edit supposedly malformed syntax; it may be correct.
  • Crawl stats now include the page that had the highest PageRank, by month, for the last three months. That's a nice feature because the toolbar PR values are updated only every 3-4 months. Also, this feature ends the debate whether dynamic URLs have their PageRank assigned to the script's base URI or the fully qualified URI. In my stats the pages with the highest PageRank all have a query string with several variable/value pairs.
  • Page analysis now includes a list of the most common words in a site's textual content and in the anchor text of external links to the site. Note that those statistics are on a per-site basis, which means they cannot be used to optimize particular pages (well, except with one-pagers). Interestingly, Google counts words even in templated page areas, for example in site-wide links, and there seems to be a correlation with the selection of related pages in the "similar pages" link on the SERPs. Fragments of URLs dropped as external anchor text are treated as words delimited by non-alpha characters, so even "http", "www", "com", "htm" and terms from file names appear in the stats. These word statistics will start a few very interesting SEO debates.

Update March/01/2006: The Sitemaps team has launched more new features: stats on mobile search, the average top position per search term covered above under "query stats", and all stats are downloadable in CSV format.




[1]

A mobile sitemap is a standard Google-compliant sitemap, populated with URIs of WAP pages in one particular markup language (XHTML, WML...), submitted via a separate form. Currently accepted markup languages are XHTML mobile profile (WAP 2.0), WML (WAP 1.2) and cHTML (iMode).

[2]

To check a site's 404 handling, Google requests randomly generated files like /GOOGLE404probee4736a7e0e55f592.html or /noexist_d6f4a5fc020d3ee8.html
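
You can run the same check yourself before requesting verification. Here is a quick sketch using PHP 5's get_headers() (the probe file name is made up):

$headers = get_headers("http://www.yourdomain.com/noexist_0123456789abcdef.html");
echo $headers[0]; // should print a 404 status line, e.g. "HTTP/1.1 404 Not Found"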

[3]

If you don't fix all reported issues with URIs on your site, you'll miss out on at least some traffic. So track down the errors. If you find broken links on your pages, correct them. If you submit invalid URIs via sitemap, change them. A few errors will stay in the section listing unreachable URIs found during the regular crawl process. Try to find the source of each invalid inbound link in your referrer stats, 404 logs and such, and write the other webmaster a polite letter asking them to edit the broken link. If you can't track down the source, guess the inbound link's target as best you can. Then put up a simple script under the invalid URI doing a permanent (301!) redirect, pointing to the page on your site which is, or could be, the link's destination. This way you waste neither the traffic nor the ranking boost earned from those inbounds.
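
Such a redirect script can be as simple as this sketch (the destination URL is a placeholder, of course):

<?php
// permanent redirect from an invalid, but inbound-linked URI
// to the page that is (or most likely is) the intended destination
header("HTTP/1.1 301 Moved Permanently");
header("Location: http://www.yourdomain.com/intended-destination.htm");
exit;
?>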

[4]

Since December/13/2005 Google has raised the limit of 5 search queries. That is, for popular or established sites you'll see more search queries in your stats. New sites, on the other hand, most probably get fewer entries, or a message "no data available".

[5]

When Ms. Googlebot requests a resource and gets a redirect header (HTTP response code 301 or 302), she doesn't follow the redirect immediately. That is, instead of delivering fetched content to the indexing process, she reports the location provided in the redirect header back to the crawling engine. Depending on PageRank and other factors, a request for the new URI may occur within the current crawling process, or later. Because of this potential delay, sometimes the destination of a redirecting resource is crawled weeks after the scheduled fetch, and the search index gets no update. Although this behavior is supposed to change in fall/winter 2005, it is a good idea to avoid redirects, and especially redirect chains where the destination initiates yet another redirect.




Is Google Sitemaps an Index Wiper?


A few weeks after Google's launch of SiteMaps, more and more webmasters are complaining about their sites disappearing from Google's index shortly after a sitemap submission. Did Google trick innocent newbies and not-so-savvy webmasters into a very smart (but, being a beta version, still error-prone) spammer and scraper trap? Tired of countless attempts to abuse its search services, the Google empire strikes back! Seriously, Google launched SiteMaps to explore the 'hidden web', and to learn more about web site structures - including widely used 'helpers' like feeder pages and similar stuff.

Danny Sullivan asked Shiva Shivakumar, engineering director and the technical lead for Google SiteMaps: "How will you prevent people from using this to spam the index in bulk?" He said: "We are always developing new techniques to manage index spam. All those techniques will continue to apply with the Google Sitemaps." Analyzing a few of the disappeared web sites and their sitemaps, it seems that the causes of their disappearance from Google's index can be found in Shiva Shivakumar's answer.

A few examples cannot prove that there is no bug in Google's new service causing removals of clean web sites. But before webmasters complain, they should be sure that their sites do comply with Google's guidelines. Shit happens. Even experienced webmasters can fail.

One circumstance commonly applies to web sites wiped from Google's index after sitemap-based deep crawls: the sitemaps were generated by third-party tools which spider for links and/or collect URLs from the web server's file system. With large sitemaps, human review is limited, especially if the file names don't follow human-readable naming conventions and/or the query strings are not meaningful.

Google applies spam filters to unintentional spider food supplied in sitemaps too. Some scenarios of unintentional cheating:

  • Huge assorted links pages from spider traps, which were very popular in 1999/2000, were not deleted from the web server; the webmaster only removed the links from the home page.

  • A developer playing with a vendor's data feed many months ago generated zillions of interlinked product pages in a forgotten directory, all linking to the domain's home page; Google sees these as doorway pages.
  • A formerly spamming site was completely revamped and reindexed after a reinclusion request. The webmaster switched the HTML file name extension from .html to .htm in his development tool, kept the directory structure, and forgot to delete the old stuff on the web server. Unfortunately the sitemap generator submitted the spammy .html files, packed with hidden links and invisible text.
  • A bunch of rarely crawled printer-friendly pages without a robots NOINDEX meta tag gets submitted via sitemap. The primary versions of these pages were well ranked, thanks to lots of deep inbound links from other sites. For some odd reason the duplicate content filter likes the more or less unlinked printer-friendly pages better. Those cannot be found by site:+unique-word-appearing-in-every-bottom-line searches, because the printer-friendly pages lack the bottom line containing the search term.
  • ...

If one of the mistakes above or a similar scenario applies, clean up your web server, remove the offending pages from Google's index and send a reinclusion request to Google.



Google Sitemaps Myths and Fictions

From reading the boards and the feedback to this tutorial, we'd like to add a few items to Google's facts and fiction listing:

Fiction: Assigning a high priority in a Google SiteMap increases the URL's PageRank™.
Fact:    PageRank™ calculations have nothing to do with sitemap priority. It simply means that Googlebot will probably crawl high-priority URLs before low-priority pages.

Fiction: According to Google's TOS, commercial sites cannot participate in the sitemap program.
Fact:    Every web site can submit sitemaps to Google. You don't need a Google account to participate. Even webmasters of commercial sites may use their Google accounts to track sitemap downloads.

Fiction: Participating web sites must have Python 2.2 installed on their web servers.
Fact:    Only Google's free sitemap generator requires Python. You can use everything you have to create and submit the sitemaps. Even Notepad and a web browser will do the job, the sitemap protocol is that simple.

Fiction: Google penalizes web sites for frequent submissions.
Fact:    There are no such penalties. Google encourages sitemap submissions on content changes. However, if your content changes every minute, you should go for a reasonable submission frequency.



Google SiteMap Discussions

Here are some links to Google SiteMap discussions. We won't update this listing, so please use it as a starting point.

  • Google Blog: Google Launches Sitemaps - the official announcement by Shiva Shivakumar, Google's Engineering Director
  • Search Engine Watch Blog: New 'Google Sitemaps' Web Page Feed Program - Danny Sullivan interviewing Shiva Shivakumar, Google's Engineering Director
  • Theodore R. Smith: Google Sitemap Horror - first experience with Google's SiteMap Generator - read with a grain of salt
  • WebmasterWorld: Google Sitemaps - webmasters speculating about Google's new service
  • WebProNews: Google Sitemaps: RSS For The Entire Website? - a first overview
  • GoogleGroups: Google Sitemaps - this user group is read by Google staff
  • Search Engine Watch Forums: Google Sitemaps - Danny Sullivan, GoogleGuy, Google's SitemapsAdvisor and webmasters discussing the new service
  • Developer blog: SE Side - Tobias and John, who both offer sitemap related tools, blogging about their studies and other interesting Google Sitemaps insights from Switzerland.



Links to Google SiteMap Tools and Generators

Static web sites need a different solution to generate a valid Google SiteMaps XML file, and unfortunately many webmasters cannot use the free sitemap generator provided by Google for various reasons. Not even a week after Google's announcement, all search engine marketing related forums, blogs and usenet groups were providing links to more or less useful Google Sitemap tools. There is a lot of crap floating around, so we have tried to collect a few 'nuggets'. We didn't evaluate the tools listed below, and we cannot vouch for them, but they seem to be pretty decent. We don't link to tools without positive webmaster feedback on the boards.


Do not download files if your client machine lacks suitable protection, and never, ever download executables.

    • Google's Third Party Tool Listings is a page linking to Google SiteMaps related tools and articles gathered by Google staff.
    • Google SiteMaps Pal is an online service generating the sitemap.xml file containing a maximum of 100 URLs, spidered from a submitted URL.
    • Google SiteMap XML Validator is an online service validating the XML structure of your Google SiteMap. It can submit your sitemap.xml file to Google, if you don't want to use your Google account.
    • Node Map is a web site packed with information on Google SiteMaps, including tools for generation and validation of sitemap XML files. For example they provide a free Google Sitemap XML Updater to check all URLs to make sure they are under the base URL, to check the status (HTTP response code) of each URL, and to compare the last-modified date reported by the web server with the last-modified date in the sitemap file. The validation results can be downloaded as new XML Sitemap.
    • phpSitemapNG is a PHP script compiling the XML file from the web server's file system, and/or spidering a web site to include dynamic links. Alternatively the site offers an online generator, where you can enter the index URL to get an instant XML sitemap.
    • Google Sitemap Generator for Wordpress is a well documented WordPress Plugin.
    • Gsitemap is a Windows/.Net site map generator and submitter.
    • GSiteCrawler is a Windows tool creating Google SiteMaps by project, thus it's suitable for multi-domain web sites.
    • Sitemap Editor is an on-line tool to edit an XML sitemap.
    • Google Sitemaps StyleSheets display XML sitemaps in a user-friendly way: see the sample sitemap, and download the code here or an enhanced version here.
    • Simple Sitemaps is a PHP script generating a dynamic Google XML Sitemap plus a pseudo-static HTML site map and an RSS 2.0 site feed from a simple text file. Simple Sitemaps is suitable for smaller web sites with no more than 100 pages.

If you operate a forum, blog or similar kind of web site based on third-party software, chances are good that the vendor supplies a sitemap generator. Visit the vendor's web site before you implement a hack.



Professional Services / Implementation of Google SiteMaps

Smart IT Consulting offers professional implementation services for Google Sitemaps, as well as reviews, advice and the like. To get in touch with us, please click here.










Besides Google Sitemaps, consider making use of several more methods to support search engines in crawling your web site. To learn more, please read our tutorial on Steering and Supporting Search Engine Crawling.




Author: Sebastian
Last Update: Saturday, June 04, 2005
