Google Sitemaps was launched in June 2005 to enhance Google's Web crawling process in cooperation with Webmasters and site owners. Collaborative crawling brings Webmasters on board to some degree, and both sides have learned a lot from each other over the last months. Google's Sitemaps Team does listen to Joe Webmaster's needs, questions, and suggestions. They have implemented a lot of very useful features based on suggestions in the Google Sitemaps Group, an open forum where members of the Sitemaps team communicate with their users, handing out technical advice even on weekends. The nickname Google Employee used by the Sitemaps team regularly makes the list of This month's top posters.
Vanessa: You're right. Our team is located in offices around the globe, which means someone on the team is working on Sitemaps nearly around the clock. A few team members were able to take some time to answer your questions, including Shiva Shivakumar, engineering director who started the Google Sitemaps project (and whose interview with Danny Sullivan you may have seen when we initially launched), Grace and Patrik from our Zurich office, Michael and Andrey from our Kirkland office, and Shal from our Mountain View office. We also got Matt Cutts to chime in.
My amateurish try to summarize the Google Sitemaps program is "Aimed crawling makes Google's search results fresh, and Webmasters happy". How would you outline your service, its goals, intentions, and benefits?
Our goal is two-way communication between Google and webmasters. Google Sitemaps is a free tool designed so webmasters can let us know about all the pages on their sites and so we can provide them with detailed reports on how we see their sites (such as their top Google search queries and URLs we had trouble crawling).
You've announced collaborative crawling as "an experiment called Google Sitemaps that will either fail miserably, or succeed beyond our wildest dreams, in making the web better for webmasters and users alike". Eight months later I think it's safe to say that your experiment has grown into a great success. How much has your project contributed to the growth, freshness and improved quality of Google's search index?
We've had a huge response from webmasters, who have submitted a great deal of high-quality pages. Many pages would never have been found through our usual crawl process, such as URLs with content found behind forms or locked up in databases or behind content management systems.
A major source of misunderstandings and irritations is the common lack of knowledge on how large IT systems -- especially search engines -- work. This leads to unrealistic expectations and rants like "Google is broke because it had fetched my sitemap, the download status shows an OK, but none of my new pages appear in search results for their deserved keywords".
Your description is pretty close. Sitemaps are downloaded periodically and then scanned to extract links and metadata. The valid URLs are passed along to the rest of our crawling pipeline -- the pipeline takes input from 'discovery crawl' and from Sitemaps. The pipeline then sends out the Googlebots to fetch the URLs, downloads the pages and submits them to be considered for our different indices.
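The "scan to extract links and metadata" step described above can be pictured with a small sketch. It is only an illustration, not Google's pipeline: the function name `extract_entries` and the validity rule (absolute http(s) URL under the site's own prefix) are assumptions made for this example. The namespace is the one used by the Sitemaps 0.84 schema.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace of the Sitemaps 0.84 schema.
NS = {"sm": "http://www.google.com/schemas/sitemap/0.84"}

SAMPLE = """<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url><loc>http://www.example.com/</loc><lastmod>2006-01-30</lastmod></url>
  <url><loc>http://other.example.org/page</loc></url>
</urlset>"""


def extract_entries(sitemap_xml, site_prefix):
    """Return (url, metadata) pairs for entries that pass a basic validity check."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=NS).strip()
        scheme = urlparse(loc).scheme
        # Assumed rule for this sketch: absolute http(s) URL below the site prefix.
        if scheme not in ("http", "https") or not loc.startswith(site_prefix):
            continue
        metadata = {
            tag: url_el.findtext(f"sm:{tag}", namespaces=NS)
            for tag in ("lastmod", "changefreq", "priority")
        }
        entries.append((loc, metadata))
    return entries


entries = extract_entries(SAMPLE, "http://www.example.com/")
```

In this sketch the second sample entry is dropped because it points to a different host. Whatever survives would then be handed to the crawl queue alongside URLs found by discovery crawling.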
Obviously you can't reveal details about the scores applied to Web sites (besides PageRank) which control priorities and frequency of your (sitemap based as well as regular) crawling and indexing, and -- due to your efforts to ensure a high quality of search results and other factors as well -- you cannot guarantee crawling and indexing of all URLs submitted via Sitemaps. However, for a quite popular and reputable site which meets your quality guidelines, what would you expect as the best/average throughput, or time to index, for both new URLs and updated content as well?
You're right. Groups users report being indexed in a matter of days or weeks. Crawling is a funny business when you are crawling several billions of pages -- we need to crawl lots of new pages and refresh a large subset of previously crawled pages periodically as well, with finite capacity. So we're always working on decreasing the time it takes to index new information and to process updated information, while focusing on end user search results quality. Currently Sitemaps feeds the URL and metadata into the existing crawling pipeline like our discovered URLs.
My experiments have shown that often a regular crawl spots and fetches fresh content before you can process the updated Sitemap, and that some URLs harvested from Sitemaps are even crawled when the page in question is unlinked and its URL cannot be found elsewhere. Also, many archived pages last crawled in the stone age get revisited all of a sudden. These findings lead to two questions.
Sitemaps offer search engines more precise information than can be found through discovery crawling. All sites can potentially benefit from Sitemaps in this way, particularly as the metadata is used in more and more ways.
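To make the metadata concrete, here is a minimal sketch of how a site might emit a Sitemap with the optional `lastmod`, `changefreq`, and `priority` elements. The helper name `build_sitemap` and the page records are hypothetical. A real generator would also handle gzip compression and the per-file URL limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_sitemap(pages):
    """Build a Sitemap XML string from dicts with 'loc' plus optional metadata keys."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url_el = ET.SubElement(urlset, "url")
        # Emit only the elements this page actually provides.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url_el, tag).text = str(page[tag])
    # ElementTree escapes characters such as '&' in URLs automatically.
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/a?x=1&y=2", "lastmod": "2006-02-01", "priority": 0.8},
])
```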
Some Webmasters fear that a Google Sitemap submission might harm their positioning, and in the discussion group we can often read postings asserting that Google has removed complete Web sites from the search index shortly after a Sitemap submission. My standard answer is "don't blame the Sitemap when a site gets tanked", and looking at most of the posted URLs the reasons causing invisibility on the SERPs become obvious at first glance: the usual quality issues. However, in very few of the reported cases it seems possible that a Sitemaps submission could result in a complete wipe-out or move to the supplemental index, followed by heavy crawling and fresh indexing in pretty good shape after a while.
Sitemap URLs are currently handled in the same way as discovery URLs in terms of penalties. If a site is penalized for violating the webmaster guidelines, that penalty would apply whether Googlebot followed links from a Sitemap, as part of the regular discovery crawl, or from the Add URL page.
Many users expect Google Sitemaps to work like Infoseek submissions in the last century, that is, instant indexing and ranking of each and every URL submission. Although unverified instant listings are technically possible, they would dilute the quality of any search index, because savvy spammers could flood the index with loads of crap in no time.
Our FAQ pages are a good starting place for all webmasters. For those who are using hosting or CMS systems they don't have a lot of experience with, Sitemaps can help alert them to issues they may not know about, such as problems Googlebot has had crawling their pages.
Many Web sites use eCommerce systems and CMS software producing cluttered non-standard HTML code, badly structured SE-unfriendly navigation, and huge amounts of duplicated textual content available from various URLs. In conjunction with architectural improvements like SE-friendly cloaking to enhance such a site's crawl-ability, a Google Sitemap will help to increase the number of crawled and indexed pages. In some cases this may be a double-edged sword, because on formerly rarely indexed sites a full crawl may reveal unintended content duplication, which leads to suppressed search results caused by your newer filters.
Everything that applies to regular discovery crawling of sites applies to pages listed in Sitemaps. Our webmaster guidelines provide many tips about these issues. Take a look at your site in a text-only browser. What content is visible?
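For readers without a text-only browser at hand, the following sketch approximates that view with Python's standard `html.parser`. It is a rough stand-in, not how Googlebot reads a page: it keeps text nodes and image `alt` text and ignores the contents of `script` and `style` elements. The names `TextOnly` and `visible_text` are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect roughly what a text-only browser shows, skipping script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            # Text browsers typically display the alt text in place of the image.
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    '<html><head><style>p{}</style>'
    '<script>document.write("menu")</script></head>'
    '<body><h1>Widgets</h1><img src="a.gif" alt="Blue widget">'
    '<p>Buy now</p></body></html>'
)
text = visible_text(page)
```

Navigation written out by script, as in the sample page, does not appear in the result, which is the kind of gap the text-only check is meant to reveal.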
Google is the first major search engine informing Webmasters -- not only Sitemaps users -- about the crawlability of their Web sites. Besides problem reports, your statistics even show top search queries, the most clicked search terms, PageRank distribution, overviews on content types, encoding and more. You roll out new reports quite frequently. What is the major source of your inspiration?
We look at what users are asking for in the discussion groups and we work very closely with the crawling and indexing teams within Google.
On the error pages you show a list of invalid URLs, for example 404 responses and other errors. If the source is "Web", that is, the URL was found on the site or anywhere on the Web during the regular crawling process and not harvested from a Sitemap, it's hard to locate the page(s) carrying the dead link. A Web search does not always lead to the source, because you don't index every page you've crawled. Thus in many cases it's impossible to ask the other Webmaster for a correction.
We have a long list of things we'd love to add, but we can only work so fast. :) Meanwhile, we do read every post in the group to look for common requests and suggestions, so please keep them coming!
HTTP/1.1 introduced the 410-Gone response code, which is supported by the newer Mozilla-compatible Googlebot, but not by the older crawler which still does HTTP/1.0 requests. The 404-Not found response indicates that the requested resource may reappear, so the right thing to do is to respond with a 410 error if a resource has been removed permanently.
Would you consider it safe with Google to make use of the 410 error code, or do you prefer a 404 response and want Webmasters to manually remove outdated pages with the URL console, which scares the hell out of most Webmasters who fear getting their complete site suspended for 180 days?
Webmasters can use either a 404 or 410. When Googlebot receives either response when trying to crawl a page, that page doesn't get included in the refresh of the index. So, over time, as the Googlebot recrawls your site, pages that no longer exist should fall out of our index naturally.
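On the server side, the distinction discussed here comes down to which status a handler picks for a missing path. A minimal sketch, assuming the site keeps its own record of live and permanently removed paths (the function `status_for` and both path sets are hypothetical):

```python
from http import HTTPStatus


def status_for(path, live_paths, removed_paths):
    """Pick a response status for a requested path."""
    if path in live_paths:
        return HTTPStatus.OK
    if path in removed_paths:
        # Deliberately and permanently removed: 410 Gone.
        return HTTPStatus.GONE
    # Unknown path that might exist again later: 404 Not Found.
    return HTTPStatus.NOT_FOUND


LIVE = {"/", "/products.html"}
REMOVED = {"/old-catalog-2003.html"}

gone = status_for("/old-catalog-2003.html", LIVE, REMOVED)
unknown = status_for("/typo.html", LIVE, REMOVED)
```

According to the answer above, either status leads to the page dropping out of the index over time, so the choice mainly documents intent.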
Although the data appearing in the current reports aren't that sensitive (for now), you've a simple and effective procedure in place to make sure that only the site owner can view the statistics. To get access to the stats one must upload a file with a unique name to the Web server's root level, and you check for its existence. To ensure this lookup can't be fooled by a redirect, you also request a file which should not exist. The verification is considered successful when the verification URL responds with a 200-Ok code, and the server responds to the probe request with a 404-Not found error. To enable verification of Yahoo stores and sites on other hosts with case restrictions you've recently changed the verification file names to all lower case.
Can you think of an alternative verification procedure to satisfy the smallest as well as the largest Web sites out there?
We are actively working on improving the verification process. We know that there are some sites that have had issues with our older verification process, and we updated it to help (allowing lowercase verification filenames), but we know there are still users out there who have problems. We'd love to hear suggestions from users as to what would or wouldn't work for them!
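The two-request check described in the question (the verification file must answer 200, and a file that should not exist must answer 404) can be sketched as follows. This is an illustration of the logic only. The file names are invented, and `fetch_status` stands for any callable that returns the HTTP status code for a URL.

```python
def is_verified(fetch_status, verification_url, probe_url):
    """Two-request check: the real file answers 200, the bogus file answers 404."""
    file_found = fetch_status(verification_url) == 200
    probe_missing = fetch_status(probe_url) == 404
    return file_found and probe_missing


# Hypothetical URLs; actual verification file names are assigned per site.
VERIFY_URL = "http://www.example.com/google0123456789abcdef.html"
PROBE_URL = "http://www.example.com/noexist_0123456789abcdef.html"


def honest_server(url):
    # Serves the uploaded file and reports 404 for everything else.
    return 200 if url == VERIFY_URL else 404


def catch_all_server(url):
    # Answers 200 for every request, for example via a custom error page.
    return 200


ok = is_verified(honest_server, VERIFY_URL, PROBE_URL)
fooled = is_verified(catch_all_server, VERIFY_URL, PROBE_URL)
```

The probe request is what keeps the catch-all server from passing: without it, a site that answers 200 for every URL would verify for anyone.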
In some cases verification requests seem to get stuck in the queue, and every once in a while a verified site falls back into the pending status. You've posted that this is an inconvenience you're working on. Can you estimate when delayed verifications will be yesterday's news?
We've been making improvements in this area, but we know that some webmasters are still having trouble and it's very frustrating. We are working on a complete resolution as quickly as we can. We have sped up the time from "pending" to "verified" and you shouldn't see issues with verified sites falling back into pending.
Once you've checked these things and made any needed changes, click Verify again.
The Google Sitemaps Protocol is sort of a "robots inclusion protocol" respecting the gatekeeper robots.txt, standardized in the "robots exclusion protocol". Some Sitemaps users are pretty much confused by those mutually exclusive standards with regard to Web crawler support. Some of their suggestions like restricting crawling to URIs in the Sitemap make no sense, but adding a way to remove URIs from a search engine's index for example is a sound idea.
Because the protocol is an open standard, we don't want to make too many changes. We don't want to disrupt those who have adopted it. However, over time, we may augment the protocol in cases where we find that to be useful. The protocol is designed to be extensible, so if people find extended uses for it, they should go for it.
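The relationship raised in the question, with robots.txt acting as gatekeeper for URLs listed in a Sitemap, can be illustrated with Python's standard `urllib.robotparser`. This is a sketch of the principle, not a description of Googlebot's implementation. The helper `allowed_sitemap_urls` and the sample rules are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_sitemap_urls(robots_txt, sitemap_urls, user_agent="Googlebot"):
    """Keep only the Sitemap URLs that robots.txt permits for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if parser.can_fetch(user_agent, url)]


ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

SITEMAP_URLS = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
]

crawlable = allowed_sitemap_urls(ROBOTS_TXT, SITEMAP_URLS)
```

Listing a URL in a Sitemap is an invitation to crawl it, while robots.txt still has the final say, which is why the second URL is filtered out here.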
You've launched the Google Sitemaps Protocol under the terms of the Attribution-ShareAlike Creative Commons License, and your Sitemaps generator as open source software as well. The majority of the 3rd party Sitemaps tools are free too.
Eight months after the launch of Google Sitemaps, MSN search silently accepts submissions of XML sitemaps, but does not officially support the Sitemaps Protocol. Shortly after your announcement Yahoo started to accept mass submissions in plain text format, and added support for RSS and Atom feeds a few months later. In the meantime each and every content management system produces XML Sitemaps, there are tons of great sitemap generators out there, and many sites have integrated Google Sitemaps individually.
We hope that all search engines adopt the standard so that webmasters have one easy way to tell search engines comprehensive information about their sites. We are actively helping more content management systems and websites to support Sitemaps. We believe the open protocol will help the web become cleaner with regard to crawlers, and are looking forward to widespread adoption.
Is there anything else you'd like to tell your users, and the Webmasters not yet using your service as well?
We greatly appreciate the enthusiastic participation of webmasters around the world. The input continues to help us learn what webmasters would most like and how we can deliver that to them.
Thank you very much!
Wednesday, February 01, 2006 by Sebastian