How to Make Use of Google SiteMaps


Even before Yahoo's Site Explorer went live, Google provided advanced statistics in the sitemap program. The 'lack of stats' had produced hundreds of confused posts in the Google Groups. When Google Sitemaps was announced on June/02/2005, Shiva Shivakumar stated: "We are starting with some basic reporting, showing the last time you've submitted a Sitemap and when we last fetched it. We hope to enhance reporting over time, as we understand what the webmasters will benefit from". Google's Sitemaps team closely monitored the issues and questions brought up by webmasters, and since August/30/2005 there are enhanced stats (last roll-out: February/06/2006). Here is how it works.

Google provides crawling reports for sites whose sitemaps were submitted via a Google account. To view the reports, a site's ownership must be verified first. On the Sitemap Stats page, each listed sitemap has a verify link (which changes to a stats link after verification). On the verification form, Google provides a unique, case-sensitive file name, which is assigned to the current Google account and the sitemap's location.

Uploading and submitting an empty verification file with this name (one per domain) tells Google that the site owner indeed requests access to the site's crawler stats. If a sitemap is located in the root directory, the verification file must be served from the root too (detailed instructions here). Do not delete the verification file after the verification procedure, because Google checks for its existence periodically. If you delete this file, you'll have to verify your site again.

That means that, for example, bloggers on free hosts who submit their RSS feed as a sitemap don't get access to their stats, because they can't upload files to their subdomain's root directory. In the first version, Google didn't provide stats for mobile sitemaps [1]; that is, Google Sitemaps restricted to content designed for WAP devices (cell phones, PDAs and other handheld devices) provided basic reports only. Verification and enhanced stats for mobile sitemaps were added on November/16/2005.

Correct setups:


http://www.domain.com/GOOGLE11e5844324b7354e.html [verification file]
http://www.domain.com/sitemap.xml
http://www.domain.com/directory/sitemap.xml

http://www.domain.com/directory/GOOGLE11e5844324b7354e.html
http://www.domain.com/sitemap.xml
http://www.domain.com/directory/sitemap.xml
http://www.domain.com/directory/directory/sitemap.xml


In the first example, crawling stats get enabled for all URIs in and below the domain's root. In the second example, crawling stats get enabled for all URIs in and below /directory/, but not in upper levels. By the way, the crawler reports provide information on URIs spidered from sitemaps as well as URIs found during regular crawls by following links, regardless of whether a URI is listed in a sitemap or not.
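That scoping rule boils down to a simple path-prefix check. Here is a minimal Python sketch of it; the covers() helper and the example URLs are made up for illustration and have nothing to do with Google's actual verification code.

# Minimal sketch of the scoping rule described above. The covers() helper and
# the example URLs are made up for illustration; this is not Google's code.
from urllib.parse import urlsplit
import posixpath

def covers(verification_file_url, sitemap_url):
    """A verification file covers sitemaps at or below its own directory."""
    v, s = urlsplit(verification_file_url), urlsplit(sitemap_url)
    if v.netloc.lower() != s.netloc.lower():
        return False
    v_dir = posixpath.dirname(v.path)
    if not v_dir.endswith("/"):
        v_dir += "/"
    return s.path.startswith(v_dir)

print(covers("http://www.domain.com/GOOGLE11e5844324b7354e.html",
             "http://www.domain.com/directory/sitemap.xml"))    # True
print(covers("http://www.domain.com/directory/GOOGLE11e5844324b7354e.html",
             "http://www.domain.com/sitemap.xml"))              # False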

Sounds pretty easy, but there is a pitfall. For security reasons, Google will not verify a location if the Web server's response to invalid page requests is anything other than 404 [2] (the error message says "We've detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.", but it also occurs on redirects, e.g. 302). For example, if a site makes use of customized error documents, the HTTP response code may not be the required 404: the server redirects to the custom error page and sends a 302 header, which cannot be overwritten by the error page itself. Here is a .htaccess example:


# 302, doesn't work for verification purposes:
ErrorDocument 404 http://www.domain.com/err.htm?errno=404

# 404, verification goes through, because the custom error document
# is disabled:
#ErrorDocument 404 http://www.domain.com/err.htm?errno=404

# 404, verification goes through, because the server serves the error
# document internally and doesn't redirect:
ErrorDocument 404 /err.htm?errno=404

Hint: the verification process is a one-time thing per domain.
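Before requesting verification, it's easy to run the same two probes yourself. The Python sketch below requests the verification file (which should return 200) and a randomly named, nonexistent file (which should return 404, not 200 and not a 302 redirect). The domain and file name are placeholders, this is not Google's actual probe, and it uses the third-party requests library for the HTTP calls.

# Self-check before verification: the verification file must answer 200,
# a bogus URL must answer 404 (not 200, not a 302 to a custom error page).
# SITE and VERIFICATION_FILE are placeholders for your own values.
import uuid
import requests  # third-party HTTP client: pip install requests

SITE = "http://www.domain.com"
VERIFICATION_FILE = "/GOOGLE11e5844324b7354e.html"

def status(path):
    # allow_redirects=False, so a 302 to a custom error page stays visible
    return requests.get(SITE + path, allow_redirects=False).status_code

print("verification file:", status(VERIFICATION_FILE))                  # expect 200
print("bogus URL:", status("/noexist_" + uuid.uuid4().hex + ".html"))   # expect 404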

Once the verification process is finished, the crawler stats (or rather, problem reports) are accessible from the sitemaps stats page. In the first version, released August/30/2005, those reports showed all kinds of errors per URI, but said nothing about successful fetches. However, because every error status was linked to an explanation, this tool made it pretty easy to fix the issues [3]. On August/30/2005 I wrote my wish list ("I could think of enhancements though"):

The 'HTTP Error' entry doesn't tell the error code; it's linked to the 'page not found' FAQ entry. However, 'HTTP Error' occurs on all sorts of problems, for example crawling of URIs in password-protected areas, harvested from the toolbar or linked from outside. Providing the HTTP response code, along with the date and time of crawling, would simplify debugging.

In case of invalid URIs found on foreign pages, it would be extremely helpful to know which page contains the broken link. Firing off an email to the other site's webmaster would make everyone happy, including Google.

Well, probably I'm greedy. Google's crawler report is a great tool, kudos to the sitemaps team! In combination with my spider tracker, sitemap generator and some other tools, I have everything I need to monitor and support the crawl process.


Google listens, and here is what the sitemaps team launched on November/16/2005: enhanced Web site statistics for everybody. Everybody? Yep, those statistics are accessible to every site owner, regardless of whether the site makes use of Google Sitemaps or not. Everybody can verify a site to get Google's view on it. Here is what you get when you click on the stats link:

Query stats, that is a list of the top 5 [4] queries to Google that return pages from your site, and the top 5 [4] search query clicks, that is the most popular search terms (keyword phrases) that directed traffic from Google's SERPs to your site, based on user clicks counted by Google. Google runs click tracking periodically, so those stats aren't based on the real number of visitors per search term, but as statistical trends they are useful.

On March/01/2006 Google added the top position (highest ranking on the SERPs, averaged over the last three weeks) to each keyword phrase listed in the stats. It's nice to see that results from the second or even third SERP often get more traffic than expensive money terms ranked in the top five positions. We'll see whether Google will expand the view from the current maximum of 20 lines.

Crawl stats, that is graphically enhanced stats on crawler problems and PageRank distribution. The HTTP error list now provides HTTP error codes, but the URL of the source page carrying the broken link is still missing. Page requests resulting in a custom error page are listed as "general HTTP error", regardless of whether the error page responds with a 200 or a 302 return code. Also, it seems Google still limits the number of errors shown. Besides URLs where Googlebot ran into HTTP errors, you get information on unreachable URLs (connectivity issues), URLs restricted by robots.txt, URLs not followed, and URLs that timed out. The list of URLs not followed is interesting: it shows pages Googlebot began to crawl but was unable to follow due to redirect chains [5]. As for the PageRank distribution within a site, those stats seem to be based on the real PageRank used in rankings, not the outdated snapshot used to display green pixels on the toolbar. That, and the page with the highest PageRank, are neat goodies for the PageRank addicts.

Page analysis, that is stats on content types like text/html, application/octet-stream, text/plain, application/pdf and so on, as well as encodings like UTF-8, ISO-8859-1 (Latin-1), US-ASCII, or CP1252 (Windows Latin-1). Since February 2006 Google also provides a site-wide word analysis of textual content and external anchor text.

Index stats provide help on advanced operators like site:, allinurl:, link:, info: and related:, along with linked sample queries. Note that Google does not show all links pointing to your site. This page should be very useful for site owners not familiar with Google's advanced search syntax.

Every once in a while I got "Data is not available at this time. Please check back later for statistics about your site." responses, but after a while the data I had seen previously reappeared. Don't worry if this happens to you too.

Overall, I'm impressed. I still have wishes, but honestly Google's crawler stats deliver way more useful stuff than I expected in the short period of time since the launch, and way more than any other search engine (well, there is still an issue w.r.t. inbound links, but I doubt Google will ever fix it, and there is an alternative). I promise that I won't refer to Google's crawler stats as "extracts of crawler problem reports" any more.


Update February/06/2006: The Sitemaps team has launched a few very cool goodies.

  • The robots.txt validation tool shows when Googlebot last fetched the robots.txt (this usually happens once per day; the robots.txt is then cached) and whether or not it blocks access to the home page.

    Its contents are displayed in a text area, where a Webmaster can edit it while Google simulates accesses to particular URLs. That's really cool. Google is the only search engine supporting wildcard syntax in robots.txt, and blocking particular pages, or URLs with a particular variable or value in the query string, can be a tough job. Now it's easy: fire up Google's robots.txt validator, enter the URLs to exclude in a text box, then change the robots.txt until the disallow statements do exactly what they are supposed to do (a rough offline sketch of this kind of matching follows below the list). It works like a charm, and it's even possible to optimize a robots.txt file for different user agents, including the plain old standard that doesn't support Google's extended syntax.

    If a statement is erroneous, Google's robots.txt syntax checker issues a warning. That's pretty useful, but it may be misleading if the robots.txt interpreter runs into a section for another Web robot that supports syntax Google didn't implement, for example crawl-delay and such. So look at the flagged section before you edit 'malformed' syntax; it may be correct.
  • Crawl stats now include the page that had the highest PageRank, by month, for the last three months. That's a nice feature because the toolbar PR values are updated only every 3-4 months. Also, this feature ends the debate whether dynamic URLs have their PageRank assigned to the script's base URI or the fully qualified URI. In my stats the pages with the highest PageRank all have a query string with several variable/value pairs.
  • Page analysis now includes a list of the most common words in a site's textual content and in the anchor text of external links to the site. Note that those statistics are on a per-site basis, which means they cannot be used to optimize particular pages (well, except one-pagers). Interestingly, Google counts words even in templated page areas, for example in site-wide links, and there seems to be a correlation to the selection of related pages in the "similar pages" link on the SERPs. Parts of URL segments from URL drops in external anchor text are treated as words delimited by all non-alpha characters; even "http", "www", "com", "htm" and terms in file names appear in the stats (a tiny sketch of that splitting follows below the list). These word statistics will start a few very interesting SEO debates.
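As a quick illustration of that word splitting, here is a tiny Python sketch; it only mirrors the behavior observed in the stats, not Google's actual tokenizer, and the sample anchor text is made up.

# Splitting anchor text on every non-alphabetic character, as observed in the stats.
import re

anchor_text = "see http://www.domain.com/some-page.htm for details"
words = [w.lower() for w in re.split(r"[^A-Za-z]+", anchor_text) if w]
print(words)
# ['see', 'http', 'www', 'domain', 'com', 'some', 'page', 'htm', 'for', 'details']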

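For readers who want to experiment with the extended Disallow syntax offline, here is a rough Python approximation of the matching: '*' matches any sequence of characters, '$' anchors the end of the URL. It deliberately ignores user-agent sections, Allow lines and precedence rules, it is not the validator's actual algorithm, and the rules and URLs are made-up examples.

# Rough approximation of Google's extended Disallow matching: '*' matches any
# character sequence, '$' anchors the end of the URL. User-agent sections,
# Allow lines and precedence rules are deliberately ignored in this sketch.
import re

def disallowed(url_path, disallow_patterns):
    for pattern in disallow_patterns:
        regex = "".join(".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
                        for ch in pattern)
        if re.match(regex, url_path):   # robots.txt rules match from the left
            return True
    return False

rules = ["/*?sessionid=", "/*.pdf$", "/private/"]
for url in ["/page.htm?sessionid=123", "/docs/file.pdf",
            "/docs/file.pdf?x=1", "/private/secret.htm", "/public/page.htm"]:
    print(url, "->", "blocked" if disallowed(url, rules) else "allowed")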
Update March/01/2006: The Sitemaps team has launched more new features: stats on mobile search, the average top position per search term (covered above under "query stats"), and all stats are now downloadable in CSV format.



Next page: Is Google Sitemaps an Index Wiper?

Previous page: Ask Googlebot to Crawl New and Modified Pages on Your Web Site






[1] A mobile sitemap is a standard Google-compliant sitemap, populated with URIs of WAP pages in one particular markup language (XHTML, WML...) and submitted via a separate form. Currently accepted markup languages are XHTML Mobile Profile (WAP 2.0), WML (WAP 1.2) and cHTML (iMode).

[2] To check a site's 404 handling, Google requests randomly generated files like /GOOGLE404probee4736a7e0e55f592.html or /noexist_d6f4a5fc020d3ee8.html.

[3] If you don't fix all reported issues with URIs on your site, you'll miss out on at least some traffic, so track down the errors. If you find broken links on your pages, correct them. If you submitted invalid URIs via sitemap, fix the sitemap. A few errors will remain in the section listing unreachable URIs found during the regular crawl process. Try to find the source of each invalid inbound link in your referrer stats, 404 logs and such, and write the other Webmaster a polite letter asking to edit the broken link. If you can't track down the source, guess the inbound link's target as best as you can. Then put up a simple script under the invalid URI doing a permanent (301!) redirect, pointing to the page on your site which is, or could be, the link's destination (see the sketch below). This way you waste neither the traffic nor the ranking boost earned from those inbounds.
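Such a "simple script" can be as small as a CGI handler that sends the permanent redirect header. Here is a minimal Python sketch; the destination URL is a placeholder for the page you believe the inbound link meant.

#!/usr/bin/env python3
# Minimal CGI sketch for an invalid URI: answer with a permanent redirect (301)
# to the page the inbound link most likely meant. DESTINATION is a placeholder.
DESTINATION = "http://www.domain.com/the-intended-page.htm"

print("Status: 301 Moved Permanently")
print("Location: " + DESTINATION)
print()  # empty line terminates the CGI header block

Hook the script up to the invalid URI via your server configuration, for example with an Apache rewrite or alias rule.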

[4] Since December/13/2005 Google has raised the limit of 5 search queries. That is, for popular or established sites you'll see more search queries in your stats. New sites, on the other hand, most probably get fewer entries, or a "no data available" message.

[5] When Ms. Googlebot requests a resource and gets a redirect header (HTTP status code 301 or 302), she doesn't follow the redirect immediately. That is, instead of delivering the fetched content to the indexing process, she reports the location provided in the redirect header back to the crawling engine. Depending on PageRank and other factors, a request of the new URI may occur within the current crawling process, or later. Because of this potential delay, the destination of a redirecting resource is sometimes crawled weeks after the scheduled fetch, and the search index gets no update. Although this behavior is supposed to change in fall/winter 2005, it is a good idea to avoid redirects, and especially redirect chains where the destination initiates another redirect.




Author: Sebastian
Last Update: Saturday, June 04, 2005


