Google Sitemaps experts provide their corporate knowledge
The Google Sitemaps Knowledge Base Project was launched by the Google Sitemaps Group in December 2005. Its goal is to provide a comprehensive help system for Google Sitemaps users. Thanks to the contributors!
Webmasters, don't miss out on the Google Sitemaps Team interview!
About the Google Sitemaps Knowledge Base Project
Browse the index or use the site search to find the topic related to your Google Sitemaps question. If the short answer isn't enough, follow the links to more detailed information.
Google Sitemaps Team Interview
Google's Sitemaps Team, interviewed in January 2006, provides great insights and information about the Sitemaps program, crawling and indexing in general, handling of vanished pages (404 vs. 410) and the URL removal tool, and valuable assistance on many frequently asked questions. Matt Cutts chimed in and stated '...It's definitely a good idea to join Sitemaps so that you can be on the ground floor and watch as Sitemaps improves'. This interview is a must-read for Webmasters.
Canonical server name issues
Declined submissions and error messages like "URL not under Sitemap path" are often caused by incorrect use of multiple server names. Why are example.com and www.example.com different when both serve the same content? How can you avoid confusion?
I have a site hosted under two domains, example.com and example.net. Can I have a Google Sitemap on both servers, or should I consolidate the site's various addresses? I want to consolidate several brands with Web sites hosted on separate domains and sub-domains into my main site, and I need a checklist.
I've moved my site to a new domain. Can I submit a Sitemap to tell Google to index the new site rather than the old site? What else should I do to ensure a smooth move?
Google Sitemaps Stylesheets make XML Sitemaps human readable
By default, XML-based sitemaps are not human readable: the browser just renders a big bunch of XML code and pure data. To make your sitemap look like a normal HTML webpage, you only need to add one line to your Google Sitemaps file.
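As a sketch, that one line is an xml-stylesheet processing instruction near the top of the sitemap file; the stylesheet filename gss.xsl below is a placeholder for whichever XSLT stylesheet you actually use:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- the one extra line: tell the browser to render the XML through an XSLT stylesheet -->
<?xml-stylesheet type="text/xsl" href="gss.xsl"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>
```

Search engine crawlers ignore the processing instruction; only browsers apply the stylesheet.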
Google's time to index
My Google Sitemap submission was accepted but I can't find (all) my Web pages with a site:example.com search. Why doesn't Google index my Web site (completely)?
Google doesn't index all pages from a Sitemap
Googlebot downloads my XML sitemap frequently, but didn't crawl and index all pages of my Web site yet. What can I do to improve my search engine visibility?
Google Python Sitemap Generator - Introduction
Google's free Python sitemap generator can be used to create Google Sitemaps in XML format by walking the file system on the web server and scanning access logs. It requires Python 2.2 (or a compatible newer version) installed on your server.
Google Sitemaps for (tens of) thousands of sub-domains
How blog networks, (free) hosting services, and communities where each user publishes on a different sub-domain can provide a centralized Google Sitemap service.
Which major search engines support the Google Sitemaps Protocol?
Although Google dominates search, I'd like to get the most out of my efforts to create and maintain a Google Sitemap. Are there other search engines which accept mass URL submissions via XML sitemaps? Yes.
Which URLs should I put in my Google Sitemap?
Should I include all URLs in my sitemap, even feeds, images, videos and other Web objects without META data? Will a Google Sitemap help to get framed pages indexed? Should I submit thin pages via Google Sitemaps or would this hurt?
About the Google Sitemaps Knowledge Base Project
Wednesday, December 14, 2005 by Sebastian
Neither the generic official Google Sitemaps FAQ nor the various attempts to provide in-depth knowledge on the most common Google Sitemaps questions from a Webmaster's perspective solved the problem of repetitive topics in the Google Sitemaps Group. Questions like "How long does it take to get indexed after a Google Sitemap submission?" were posted several times a day in countless variations. Regular posters answering the same questions over and over grew frustrated and ignored repetitive topics, or slipped into harsh mode every now and then.
The Google Sitemaps Knowledge Base Project tries to provide easy-to-find solutions to the most common problems and misunderstandings, as well as guides on particular tasks. Each topic can and should be discussed in the Google Sitemaps Group, and each is signed by a contributor:
- Cristina Wood from Asymptoticdesign, mastering Google's free sitemap generator,
- John, aka Johannes Müller, inventor of the free Google Sitemap generator GSitemapCrawler,
- Shawn K. Hall, CEO of the Web services empire 12 Point Design,
- Tobias Kluge, inventor of the free Google Sitemaps generator phpSitemapNG,
- Vanessa Fox from Google's Sitemaps Team (thanks for the permission to repost articles and helpful advice from other sources),
- Sebastian from Smart IT Consulting (editor).
Google Sitemaps Team Interview
Wednesday, February 01, 2006 by Sebastian
Google Sitemaps was launched in June 2005 to enhance Google's Web crawling process in cooperation with Webmasters and site owners. Collaborative crawling brings Webmasters on board to some degree, and both sides have learned a lot from each other over the past months. Google's Sitemaps Team does listen to Joe Webmaster's needs, questions, and suggestions. They have implemented a lot of very useful features based on suggestions in the Google Sitemaps Group, an open forum where members of the Sitemaps team communicate with their users, handing out technical advice even on weekends. The nickname Google Employee, used by the Sitemaps team, regularly makes the list of the month's top posters.
The Sitemaps community, producing an average of 1,500 posts monthly, suffered from repetitive topics diluting the archives. As the idea of a Google Sitemaps Knowledge Base was born in the group, I've discussed the idea with the Sitemaps team. Vanessa Fox, who is blogging for the Sitemaps team from Kirkland, Washington, suggested doing "an e-mail interview to answer some of the more frequently asked questions", so here we are.
Vanessa, thank you for taking the time to support the group's knowledge base project. Before we discuss geeky topics which usually are dull as dust, would you mind introducing the Sitemaps team? I understand that you're an international team: your team members work in Kirkland, Mountain View, and Zürich on different components of the Google Sitemaps program. Can you tell us who is who on your team?
Vanessa: You're right. Our team is located in offices around the globe, which means someone on the team is working on Sitemaps nearly around the clock. A few team members were able to take some time to answer your questions, including Shiva Shivakumar, engineering director who started the Google Sitemaps project (and whose interview with Danny Sullivan you may have seen when we initially launched), Grace and Patrik from our Zurich office, Michael and Andrey from our Kirkland office, and Shal from our Mountain View office. We also got Matt Cutts to chime in.
My amateurish try to summarize the Google Sitemaps program is "Aimed crawling makes Google's search results fresh, and Webmasters happy". How would you outline your service, its goals, intentions, and benefits?
Our goal is two-way communication between Google and webmasters. Google Sitemaps is a free tool designed so webmasters can let us know about all the pages on their sites and so we can provide them with detailed reports on how we see their sites (such as their top Google search queries and URLs we had trouble crawling).
We can reach many pages that our discovery crawl cannot find and Sitemaps convey some very important metadata about the sites and pages which we could not infer otherwise, like the page's priority and refresh cycle. In particular, the refresh cycle should allow us to download pages only when they change and thus reduce needless downloads, saving bandwidth.
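The metadata mentioned here corresponds to optional per-URL elements of the Sitemaps protocol. A sketch of a single entry (the URL and values are made-up examples):

```xml
<url>
  <loc>http://www.example.com/news/index.html</loc>
  <lastmod>2006-01-15</lastmod>
  <!-- refresh cycle: a hint at how often the page changes -->
  <changefreq>daily</changefreq>
  <!-- priority relative to other URLs on the same site, 0.0 to 1.0 -->
  <priority>0.8</priority>
</url>
```

All three metadata elements are optional hints; only the loc element is required.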
You've announced collaborative crawling as "an experiment called Google Sitemaps that will either fail miserably, or succeed beyond our wildest dreams, in making the web better for webmasters and users alike". Eight months later I think it's safe to say that your experiment has grown into a great success. How much has your project contributed to the growth, freshness, and improved quality of Google's search index?
We've had a huge response from webmasters, who have submitted a great deal of high-quality pages. Many pages would never have been found through our usual crawl process, such as URLs with content hidden behind forms, locked up in databases, or behind content management systems.
We also have many clients supporting Sitemaps natively (many are listed at http://code.google.com/sm_thirdparty.html), in addition to our initial open source Python client. We are working with more such clients to support Sitemaps natively, and with larger websites to generate Sitemaps automatically as well.
Also, Michael points out "As some of you may have noticed, Sitemaps has also served as an...uhm...impromptu stress test of different parts of the Google infrastructure, and we're working hard to fix those parts."
A major source of misunderstandings and irritations is the common lack of knowledge on how large IT systems -- especially search engines -- work. This leads to unrealistic expectations and rants like "Google is broke because it had fetched my sitemap, the download status shows an OK, but none of my new pages appear in search results for their deserved keywords".
Matt's recent article How does Google collect and rank results? sheds some light on the three independent processes: crawling, indexing, and ranking in response to user queries, and I've published some speculations in my Sitemaps FAQ. Can you provide us with a spot-on description of the process, starting with a Sitemap download, its validation, and the passing of its URLs to the crawling engine, which sends out the Googlebots to fetch the files and hand them over to the indexer? I'm sure such an anatomical insight would help Sitemaps users think in realistic timetables.
Your description is pretty close. Sitemaps are downloaded periodically and then scanned to extract links and metadata. The valid URLs are passed along to the rest of our crawling pipeline -- the pipeline takes input from 'discovery crawl' and from Sitemaps. The pipeline then sends out the Googlebots to fetch the URLs, downloads the pages and submits them to be considered for our different indices.
Obviously you can't reveal details about the scores applied to Web sites (besides PageRank) which control priorities and frequency of your (sitemap based as well as regular) crawling and indexing, and -- due to your efforts to ensure a high quality of search results and other factors as well -- you cannot guarantee crawling and indexing of all URLs submitted via Sitemaps. However, for a quite popular and reputable site which meets your quality guidelines, what would you expect as the best/average throughput, or time to index, for both new URLs and updated content as well?
You're right. Groups users report being indexed in a matter of days or weeks. Crawling is a funny business when you are crawling several billion pages -- we need to crawl lots of new pages and periodically refresh a large subset of previously crawled pages as well, with finite capacity. So we're always working on decreasing the time it takes to index new information and to process updated information, while focusing on end-user search result quality. Currently Sitemaps feeds the URLs and metadata into the existing crawling pipeline like our discovered URLs.
Matt adds "it's useful to remember that our crawling strategies change and improve over time. As Sitemaps gains more and more functionality, I wouldn't be surprised to see this data become more important. It's definitely a good idea to join Sitemaps so that you can be on the 'ground floor' and watch as Sitemaps improves."
My experiments have shown that a regular crawl often spots and fetches fresh content before you can process the updated Sitemap, and that some URLs harvested from Sitemaps are crawled even when the page in question is unlinked and its URL cannot be found elsewhere. Also, many archived pages last crawled in the stone age suddenly get revisited. These findings lead to two questions.
First, to what degree can a Google Sitemap help to direct Googlebot to updates and new URLs on well structured sites, when the regular crawling is that sophisticated? Second, how much of the formerly 'hidden Web' -- especially unhandily linked contents -- did you discover with the help of Sitemaps, and what do you do with unlinked orphan pages?
Sitemaps offer search engines more precise information than can be found through discovery crawling. All sites can potentially benefit from Sitemaps in this way, particularly as the metadata is used in more and more ways.
Grace says: "As for the 'hidden Web', lots of high quality pages have indeed been hidden. In many high quality sites that have submitted Sitemaps, we now see 10-20 times as many pages to consider for crawling."
Some Webmasters fear that a Google Sitemap submission might harm their positioning, and in the discussion group we can often read postings asserting that Google has removed complete Web sites from the search index shortly after a Sitemap submission. My standard answer is "don't blame the Sitemap when a site gets tanked", and looking at most of the posted URLs the reasons causing invisibility on the SERPs become obvious at first glance: the usual quality issues. However, in very few of the reported cases it seems possible that a Sitemaps submission could result in a complete wipe-out or move to the supplemental index, followed by heavy crawling and fresh indexing in pretty good shape after a while.
Machine readable mass submissions would allow a few holistic quality checks before the URLs are passed to the crawling engine. Do you handle URLs harvested from XML sitemaps other than URLs found on the Web or submitted via the Add-Url page? Do mass submissions of complete sites speed up the process of algorithmic quality judgements and spam filtering?
Sitemap URLs are currently handled in the same way as discovery URLs in terms of penalties. If a site is penalized for violating the webmaster guidelines, that penalty would apply whether Googlebot followed links from a Sitemap, as part of the regular discovery crawl, or from the Add URL page.
Many users expect Google Sitemaps to work like Infoseek submissions in the last century, that is, instant indexing and ranking of each and every URL submission. Although unverified instant listings are technically possible, they would dilute the quality of any search index, because savvy spammers could flood the index with loads of crap in no time.
Experienced Webmasters and SEOs do read and understand your guidelines, play by the rules, and get their contents indexed. Blogs and other easy to use content management systems (CMS) brought millions of publishers to the Web, who usually can't be bothered with all the technical stuff involved.
Those technically challenged publishers deserve search engine traffic, but most probably they will never visit a search engine's Webmasters section. Natural publishing and networking leads to indexing eventually, but there are lots of pitfalls, for example the lack of search-engine-friendly CMS software and the many Web servers that are misconfigured by default.
What's your best advice for the novice publisher not willing -- or not able -- to wear a Webmaster hat? What are your plans to reach those who don't get your message yet, and how do you think you can help to propagate realistic management of expectations?
Our FAQ pages are a good starting place for all webmasters. For those who are using hosting or CMS systems they don't have a lot of experience with, Sitemaps can help alert them to issues they may not know about, such as problems Googlebot has had crawling their pages.
Michael adds that "our webmaster guidelines are intended to be readable and usable by non-experts. Create lots of good and unique content, don't try to be sneaky or underhanded, and be a bit patient. The Web is a very, very big place, but there's still a lot of room for new contributions. If you want to put up a new website, it may be helpful to think about how your website will be an improvement over whatever is already out there. If you can't think of any reasons why your site is special or different, then it's likely that search engines won't either, and that may be frustrating."
Many Web sites use eCommerce systems and CMS software producing cluttered non-standard HTML code, badly structured SE-unfriendly navigation, and huge amounts of duplicated textual content available from various URLs. In conjunction with architectural improvements like SE-friendly cloaking to enhance such a site's crawlability, a Google Sitemap will help to increase the number of crawled and indexed pages. In some cases this may be a double-edged sword, because on formerly rarely indexed sites a full crawl may reveal unintended content duplication, which leads to suppressed search results caused by your newer filters.
What is your advice for (large) dynamic sites suffering from session IDs and UI-specific query string parameters, case issues in URLs, navigation structures which create multiple URLs pointing to the same content, excessively repeated text snippets, or thin product pages without unique content except for the SKU and a picture linked to the shopping cart? Will or do you use the (crawling) priority attribute to determine whether you index a URL from the sitemap, or a variant -- with similar or nearly duplicated content -- found by a regular crawl?
Everything that applies to regular discovery crawling of sites applies to pages listed in Sitemaps. Our webmaster guidelines provide many tips about these issues. Take a look at your site in a text-only browser. What content is visible?
As for dynamic pages that cause duplicate content listings in the Sitemap, make sure that a version of each page exists that doesn't include things like a session ID in the URL and then list that version of the page in your Sitemap.
Make sure that the Sitemap doesn't include multiple versions of the same page that differ only in session ID, for instance.
If your site uses a content management system or database, you probably want to review your generated Sitemap before submitting to make sure that each page of your site is only listed once. After all, you only want each page listed once in the search results, not listed multiple times with different variations of the URL.
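The review step described above can be automated. A minimal sketch in Python, assuming hypothetical session parameter names (adjust SESSION_PARAMS to whatever your CMS actually emits), that strips session IDs and lists each page only once:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical query parameters that identify a session, not content.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}

def canonical_url(url):
    """Strip session parameters so the Sitemap lists one URL per page."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    # drop the fragment too; crawlers ignore it anyway
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))

def dedupe_for_sitemap(urls):
    """Canonicalize, then keep only the first occurrence of each page."""
    seen, out = set(), []
    for url in urls:
        c = canonical_url(url)
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out
```

For example, two URLs differing only in PHPSESSID collapse into a single Sitemap entry, while content-relevant parameters like a product ID survive the cleanup.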
Google is the first major search engine informing Webmasters -- not only Sitemaps users -- about the crawlability of their Web sites. Besides problem reports your statistics even show top search queries, the most clicked search terms, PageRank distribution, overviews on content types, encoding and more. You roll out new reports quite frequently, what is the major source of your inspiration?
We look at what users are asking for in the discussion groups and we work very closely with the crawling and indexing teams within Google.
Andrey adds, "if we find a particular statistic that is useful to webmasters and doesn't expose confidential details of the Google algorithms, we queue that up for release."
On the error pages you show a list of invalid URLs, for example 404 responses and other errors. If the source is "Web", that is, the URL was found on the site or anywhere on the Web during the regular crawling process and not harvested from a Sitemap, it's hard to locate the page(s) carrying the dead link. A Web search does not always lead to the source, because you don't index every page you've crawled. Thus in many cases it's impossible to ask the other Webmaster for a correction.
Since Google knows every link out there, or at least should know the location where an invalid URL was found in the first place, can you report the sources of dead links on the error page? Also, do you have plans to show more errors?
We have a long list of things we'd love to add, but we can only work so fast. :) Meanwhile, we do read every post in the group to look for common requests and suggestions, so please keep them coming!
HTTP/1.1 introduced the 410-Gone response code, which is supported by the newer Mozilla-compatible Googlebot, but not by the older crawler, which still issues HTTP/1.0 requests. The 404-Not Found response indicates that the requested resource may reappear, so the right thing to do is to respond with a 410 code if a resource has been removed permanently.
Would you consider it safe with Google to make use of the 410 response code, or do you prefer a 404 response and want Webmasters to manually remove outdated pages with the URL console, which scares the hell out of most Webmasters, who fear getting their complete site suspended for 180 days?
Webmasters can use either a 404 or 410. When Googlebot receives either response when trying to crawl a page, that page doesn't get included in the refresh of the index. So, over time, as the Googlebot recrawls your site, pages that no longer exist should fall out of our index naturally.
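A site that tracks its removed pages can answer 410 for them explicitly. A minimal WSGI sketch (the path in GONE is a hypothetical example, and a real application would of course serve its existing pages before falling through to these error responses):

```python
# Hypothetical set of paths that have been removed permanently.
GONE = {"/discontinued-product.html"}

def app(environ, start_response):
    """Answer 410 Gone for permanently removed pages, 404 otherwise."""
    path = environ.get("PATH_INFO", "/")
    if path in GONE:
        # 410 tells HTTP/1.1 clients the resource is gone for good
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"This page has been removed permanently."]
    # unknown URL: the resource may reappear, so answer 404
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found."]
```

As the answer above notes, either status code causes the page to fall out of the index over time; 410 is simply the more precise signal for HTTP/1.1 clients.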
Webmasters shouldn't fear our automated URL removal tool. We realize that the description says if you use the tool, your site will be removed for 180 days and we are working to get this text updated. What actually happens is that we check to make sure the page you want removed doesn't exist or that you've blocked it using a robots.txt file or META tags (depending on which removal option you choose). If everything checks out OK, we remove the page from our index. If later on, you decide you do want the page indexed and you add it back to your site or unblock it, we won't add it back to our index for at least 180 days. That's what the warning is all about. You need to be sure you really don't want this page indexed.
But I would suggest that webmasters not bother with this tool for pages that no longer exist. They should really only use it for pages that they meant to block from the index with a robots.txt file or META tags originally. For instance, if a hypothetical webmaster has a page on a site that lists sensitive customer information and that hypothetical webmaster is also somewhat forgetful and doesn't remember to add that page to the robots.txt file, the sensitive page will likely get indexed. That hypothetical forgetful webmaster would then probably want to add the page to the site's robots.txt file and then use the URL removal tool to remove that page from the index.
As you mentioned earlier, webmasters might also be concerned about pages listed in the Site Errors tab (under HTTP errors). These are pages that Googlebot tried to crawl but couldn't. These pages are not necessarily listed in the index. In fact, if the page is listed in the Errors tab, it's quite possible that the page isn't in the index or will not be included in the next refresh (because we couldn't crawl it). Even if you were to remove this using the URL removal tool, you still might see it show up in the Errors tab if other sites continue to link to that page since Googlebot tries to crawl every link it finds.
We list these pages just to let you know we followed a link to your site and couldn't crawl the page and in some cases that's informational data only. For these pages, check to see if they are pages you thought existed. In this case, maybe you named the page on your site incorrectly or you listed it in your Sitemap or linked to it incorrectly. If you know these pages don't exist on your site, make sure you don't list them in your Sitemap or link to them on your site. (Remember that if we tried to crawl a page from your Sitemap, we indicate that next to the URL.) If external sites are linking to these non-existent pages, you may not be able to do anything about the links, but don't worry that the index will be cluttered up with these non-existent pages. If we can't crawl the page, we won't add it to the index.
Although the data appearing in the current reports isn't that sensitive (for now), you have a simple and effective procedure in place to make sure that only the site owner can view the statistics. To get access to the stats, one must upload a file with a unique name to the Web server's root level, and you check for its existence. To ensure this lookup can't be fooled by a redirect, you also request a file which should not exist. The verification is considered successful when the verification URL responds with a 200-OK code, and the server responds to the probe request with a 404-Not Found error. To enable verification of Yahoo stores and sites on other hosts with case restrictions, you've recently changed the verification file names to all lower case.
Besides a couple of quite exotic configurations, this procedure does not work well with large sites like AOL or eBay, which generate a useful page on the fly even if the requested URI does not exist, and it locks out all sites on sub-domains where the Webmaster or publisher can't access the root level, for example free hosts, hosting services like ATT, and your very own Blogger Web logs.
Can you think of an alternative verification procedure to satisfy the smallest as well as the largest Web sites out there?
We are actively working on improving the verification process. We know that there are some sites that have had issues with our older verification process, and we updated it to help (allowing lowercase verification filenames), but we know there are still users out there who have problems. We'd love to hear suggestions from users as to what would or wouldn't work for them!
In some cases verification requests seem to stick in the queue, and every once in a while a verified site falls back to pending status. You've posted that's an inconvenience you're working on; can you estimate when delayed verifications will be yesterday's news?
We've been making improvements in this area, but we know that some webmasters are still having trouble and it's very frustrating. We are working on a complete resolution as quickly as we can. We have sped up the time from "pending" to "verified" and you shouldn't see issues with verified sites falling back into pending.
If your site goes from pending to not verified, then we likely weren't able to successfully process the verification request. We are working on adding error details so you can see exactly why the request wasn't successful, but until we have that ready, if your verification status goes from "pending" to "not verified", check the following, as these are the most common problems we encounter:
- that your verification file exists in the correct location and is named correctly
- that your webserver is up and responding to requests when we attempt the verification
- that your robots.txt file doesn't block our access to the file
- that your server doesn't return a status of 200 in the header of 404 pages
Once you've checked these things and made any needed changes, click Verify again.
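You can run the same checks yourself before clicking Verify again. A sketch in Python (function names are mine; note that urlopen follows redirects by default, which is exactly the kind of behavior the probe for a nonexistent file is designed to catch):

```python
import urllib.request
import urllib.error

def http_status(url):
    """Return the HTTP status code for a URL (after any redirects)."""
    try:
        return urllib.request.urlopen(url).getcode()
    except urllib.error.HTTPError as e:
        return e.code

def verification_ok(file_status, probe_status):
    """Mirror the verification rules: the verification file must answer
    200-OK, and a file that should not exist must answer 404-Not Found.
    A 200 on the probe means the server fakes a page for any URL, so
    the existence check proves nothing."""
    return file_status == 200 and probe_status == 404
```

For example, verification_ok(http_status(".../google1234.html"), http_status(".../noexist_1234.html")) would report whether your server passes both halves of the check; the filenames here are placeholders.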
The Google Sitemaps Protocol is sort of a "robots inclusion protocol" respecting the gatekeeper robots.txt, which is standardized in the "robots exclusion protocol". Some Sitemaps users are pretty confused by these opposing standards with regard to Web crawler support. Some of their suggestions, like restricting crawling to the URIs in the Sitemap, make no sense, but adding a way to remove URIs from a search engine's index, for example, is a sound idea.
Others have suggested adding attributes like title, abstract, and parent or level. Those attributes would allow Webmasters to integrate their XML sitemaps (formatted by XSLT stylesheets) into the user interface. I'm sure you've gathered many more suggestions; do you have plans to change your protocol?
Because the protocol is an open standard, we don't want to make too many changes. We don't want to disrupt those who have adopted it. However, over time, we may augment the protocol in cases where we find that to be useful. The protocol is designed to be extensible, so if people find extended uses for it, they should go for it.
You've launched the Google Sitemaps Protocol under the terms of the Attribution-ShareAlike Creative Commons License, and your Sitemaps generator as open source software as well. The majority of the 3rd party Sitemaps tools are free too.
Eight months after the launch of Google Sitemaps, MSN Search silently accepts submissions of XML sitemaps, but does not officially support the Sitemaps Protocol. Shortly after your announcement, Yahoo started to accept mass submissions in plain text format, and added support for RSS and Atom feeds a few months later. In the meantime, each and every content management system produces XML Sitemaps, there are tons of great sitemap generators out there, and many sites have integrated Google Sitemaps individually.
You've created a huge user base, are you aware of other search engines planning to implement the Sitemaps Protocol?
We hope that all search engines adopt the standard so that webmasters have one easy way to tell search engines comprehensive information about their sites. We are actively helping more content management systems and websites to support Sitemaps. We believe the open protocol will help the web become cleaner with regard to crawlers, and are looking forward to widespread adoption.
We greatly appreciate the enthusiastic participation of webmasters around the world. The input continues to help us learn what webmasters would most like and how we can deliver that to them.
Shiva says that "eight months back we released Sitemaps as an experiment, wondering if webmasters and client developers would just ignore us or if they'd see the value in a more open dialogue between webmasters and search engines. We're not wondering that anymore. With the current rate of adoption, we are now looking at how to work better with partners and supporting their needs so that more and more webmasters with all kinds of hosting environments can take advantage of what Sitemaps can offer."
And Michael wants to remind everyone that "We really do read every post on the Sitemaps newsgroup and we really want people to keep posting there."
Thanks, Sebastian, for giving us this opportunity to talk to webmasters. That's what the Sitemaps project is all about.
Thank you very much!
Canonical server name issues
Tuesday, January 03, 2006 by Sebastian
Many Web hosting services configure Web sites so that they are accessible under different addresses. This is meant to please users, but very often the supposed convenience results in all sorts of trouble, because the setup remains half done.
So what is a canonical server name? It's part of the URL of your site, http://www.example.com/. The canonical server name, or host name, consists of two or more components delimited by dots. From right to left, these are the TLD like com or net, the site's name (domain), and one or more optional sub-domain prefixes like www, mail, ftp, www.name.dept or city.state. Naturally, the www prefix is used to serve Web pages, ftp to host downloadable files, and other prefixes stand for segments of a huge site or separate development servers from the production system.
Each server name is a unique address, like a phone number, and points to different content by default. Many sites host each sub-domain on its own computer, or even run multiple server computers per prefix. Those huge sites define the standard for small sites too. To allow future scalability, one should make use of sub-domain prefixes, e.g. www for Web contents.
In real life, however, zillions of Webmasters don't think in large-scale dimensions. They sign up at a hosting service, get the usual small-business setup, and are happy that their sites respond to both example.com and www.example.com. Technically, however, both are still different servers, able to serve different content. For a request from another address, for example by a visitor's browser or a search engine crawler, it's not transparent that both servers pull their content from the same directory on the Web server's hard disk.
Search engines like Google have learned to deal with those incomplete setups. It works fine as long as they don't get confused by links containing both server names in the URL. A URL is a unique address, that is, http://www.example.com/page.html and http://example.com/page.html point to two different pages. Those pages may or may not carry the same contents. Using both variants leads to dilution of link popularity (PageRank), unnecessary problems with duplicate content filters, and all kinds of other troubles resulting in lowered search engine visibility, that is, lost traffic.
So what can I do to avoid those troubles? I'm pretty sure that I don't use both server names in my HTML code, but I cannot control external links and URL drops pointing to my site. Also, I want a visitor who types in the shortened variant to actually land on my site.
First, use only one server name in URLs, business cards, flyers, TV and radio spots ... Google Sitemaps, link submissions and internal links as well. Your decision makes one of the two server names the canonical server name of your Web site. You must stick with the chosen server name, that is, you cannot change your mind later on.
Second, make sure that all URLs containing the unused server name respond with a permanent redirect to the URL containing the canonical server name. If you've opted for the www prefix for example, the URL http://example.com/page.html must respond with a 301 status code (Moved Permanently) telling the client (browser or crawler) that the page has to be requested from http://www.example.com/page.html.
If your site is hosted by a professional hosting service, you can ask for this setup. Unfortunately, many Web hosters have no idea why this is important, and will deny your request or simply tell you it's impossible. Then go get a real host, or do it yourself; it's easy.
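On Apache servers, the do-it-yourself fix is usually a few lines in the .htaccess file of the Web root. A minimal sketch, assuming mod_rewrite is enabled and www.example.com is the chosen canonical name:

```apache
# Answer every request for example.com with a 301 (permanent)
# redirect to the same path on www.example.com.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Verify the result by requesting a page from the non-canonical host and checking that the response is a 301 with a fully qualified Location header.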
Thursday, January 12, 2006 by Sebastian
Having multiple URLs pointing to the same content is a very bad idea. Although search engines try hard to deal with those weird setups, most likely the duplicate content will eventually result in invisibility on the search result pages. Nowadays the engines look at content, and if they find identical or way too similar content under more than one address (URL), they list only one version, or, in some cases, they dump both URLs, because content duplication is a well-known spammer tactic. Collateral damage is unavoidable in their war on index spam, thus legitimate sites should implement a strict one and only one URL per piece of (textual) content policy.
Multiple domain networks consisting of sites and sub-sites providing closely related contents often lack the power of a segmented mega site. Branding one domain is easier than branding multiple domains, and there are more marketing issues. From an SEO perspective, most multiple domain networks come with a higher risk of penalties for artificial linkage and content duplication. Please note that there are many scenarios where multiple domain setups make sound sense; consult an SEO expert for individual advice. For the purpose of this article, let's say the decision to consolidate multiple domains is well founded.
Hosting identical contents on several domains and consolidation of multiple domains serving different contents are comparable tasks, that is you can follow the same checklist.
1. Put up the complete site on the desired domain and choose a canonical site address, that is either example.com or www.example.com. Say you choose www.example.com; the www prefix has advantages. You'll stick with this decision forever (which is a short period of time nowadays). That is, even your business cards and the toilet paper have a bold www.example.com imprint, you link to www.example.com only, and you submit only www.example.com URLs to Web directories and so on.
Segmenting a huge site with sub-domains like topic.example.com sounds like a good plan, but sub-domains come with a bunch of disadvantages and negative side effects too. You really should ask an SEO expert for an individual consolidation plan before you unintentionally cut off your organic search engine traffic.
2. Create a Google Sitemap for the new site, and submit it. Remove all other sitemaps from your Web servers; remove them physically from the hard disk. In your Google account do not delete the verified sites from your site overview page, but delete their sitemap submissions. You'll find the crawler stats from the old domains pretty interesting.
3. From all domains, sub-domains, free hosts and whatever you've used before, implement a permanent (301) redirect to the new main site. If those servers had content, inbound links and other references, you must redirect to the corresponding URI on the new server. The same goes for parked domains registered to protect trademarks, or to capture type-in traffic. Check the HTTP response of each domain. Most 3rd party services use soft redirects (HTTP response code 302), which is not acceptable. Move those domains to your hosting service and make sure they respond with a 301 status code and a fully qualified location like http://www.example.com/.
Sample redirect routes:
· example.com/* => www.example.com/*
where "*" stands for a complete relative URL (path and file name [+ query string [+ fragment identifier]]), and URL components in [brackets] are optional, for example:
· example.com/somepage.html => www.example.com/somepage.html
· example.com/somescript.php?somequerystring => www.example.com/somescript.php?somequerystring
· example.com/somescript.asp?somequerystring#fragment-identifier => www.example.com/somescript.asp?somequerystring#fragment-identifier
· www.example.net/* => www.example.com/*
· example.net/* => www.example.com/*
· www.old-example.com/* => www.example.com/[topic/]*
· old-example.com/* => www.example.com/[topic/]*
· www.older-example.com/* => www.example.com/[topic/]*
· older-example.com/* => www.example.com/[topic/]*
· www.parked-domain.com/* => www.example.com/*
· parked-domain.com/* => www.example.com/*
· yadayadayada.example.com/* => www.example.com/yadayadayada/*
· topic.old-example.com/* => www.example.com/topic/*
4. Do not remove the outdated domains with Google's URL removal tool, which blocks them for 180 days. Just redirect.
5. Find all links pointing to your old sites. Contact the Webmasters/editors and ask them to change the URI to www.example.com. The redirects will take care of passed reputation and traffic, but some sites simply delete their links when they spot a redirect, without notification.
6. Try to stick with www.example.com. Creativity is great, but it results in a huge workload for each move, and with every move you will lose a little traffic.
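As a sketch, the redirect routing from step 3 might be implemented like this. The host names, the topic/ prefix and the sub-domain fallback are illustrative assumptions taken from the sample routes above, not part of any standard:

```python
from urllib.parse import urlsplit

CANONICAL_HOST = "www.example.com"

# Old host -> path prefix to prepend on the canonical site.
# Hosts not listed here are treated as sub-domains of the canonical
# domain, mapped to a folder of the same name (yadayadayada.example.com
# -> www.example.com/yadayadayada/).
HOST_MAP = {
    "example.com": "",
    "www.example.net": "",
    "example.net": "",
    "www.old-example.com": "topic/",
    "old-example.com": "topic/",
    "www.older-example.com": "topic/",
    "older-example.com": "topic/",
    "www.parked-domain.com": "",
    "parked-domain.com": "",
}

def redirect_for(url):
    """Return (status, location) for a request, or None if the URL is already canonical."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == CANONICAL_HOST:
        return None  # already canonical, serve the page
    prefix = HOST_MAP.get(host)
    if prefix is None:
        # unknown host: treat the first label as a sub-domain -> folder
        sub, _, _rest = host.partition(".")
        prefix = sub + "/"
    path = parts.path.lstrip("/")
    query = "?" + parts.query if parts.query else ""
    return 301, "http://%s/%s%s%s" % (CANONICAL_HOST, prefix, path, query)
```

For example, redirect_for("http://example.com/somepage.html") yields a 301 pointing to http://www.example.com/somepage.html.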
Thursday, December 15, 2005 by Vanessa
Submitting a Sitemap for the new site is a great first step, because that helps Google learn about the new pages right away. Make sure you place the new Sitemap in the root directory of the new site as the Sitemap must be located on the same domain as the site URLs contained in it.
Another important thing to do is redirecting visitors from the old site to the new one. Put a 301 (permanent) redirect on every page of the old site to point to the corresponding page on the new site (301 redirect code samples).
Follow Matt's advice on DNS changes. Leave the old server up and running forever, or at least until you're sure that all Web robots request your content from the new domain. If you sell or give away the old domain, make sure that the new owner provides a link to your new home. Even if you contact all Webmasters linking to you during the move, a few links pointing to your content under the old domain will remain.
Google Sitemaps Stylesheets make XML Sitemaps human readable
Friday, December 16, 2005 by Tobias
A Google Sitemaps StyleSheet (GSS) makes use of a technology called XSLT to format structured data stored in an XML file or feed. Basically, XSLT + XML works like CSS + X/HTML: all data are stored in the XML file, and all formatting code goes into the XSL file. A Web browser that supports those technologies displays the data contained in your Google XML Sitemap with a nice layout, and even column sorting (sample).
How can I make use of the Google Sitemaps Stylesheet (GSS)?
As described in the introduction, it's really easy. Add one line to your Google XML Sitemap's header:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="gss.xsl"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
Then copy/upload the GSS XSLT file (which tells the browser how to transform the XML data into HTML on the fly) to the same directory where the sitemap files are located.
How useful is this Google Sitemaps Stylesheet?
Well, it depends. With a rich XML sitemap you can just give users the URL of the sitemap file and they can render it like a normal Web page, so you don't have to maintain both an HTML-based and an XML-based version of your sitemap.
Unfortunately, because not all search engines support the Google Sitemaps protocol, this could introduce problems with their crawlers. Also, not all browsers support XSL transformation of XML files, so a few visitors might see the pure XML data instead of the nicely formatted page.
And of course if you have a bigger Web site, with more than 100 URLs, no user will view this file completely, since it gets really huge and hard to browse. Here it would make sense to use Google Sitemap index files (GSS formatted) managing a hierarchy of topical sitemaps, or to choose a different approach with regard to navigational sitemaps.
Only a few sitemap generators support GSS, for example John Mueller's GSiteCrawler and Tobias Kluge's phpSitemapNG (since version 1.6.1).
But it is really easy to use this feature, even when your preferred tool doesn't support GSS. You just have to add one line to the outputted XML.
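As a sketch, retrofitting that one line could even be automated: the snippet below inserts the xml-stylesheet processing instruction right after the XML declaration of a generated sitemap. The file and stylesheet names are assumptions:

```python
def add_stylesheet(xml_text, href="gss.xsl"):
    """Insert an xml-stylesheet processing instruction after the XML declaration."""
    pi = '<?xml-stylesheet type="text/xsl" href="%s"?>' % href
    # the declaration ends with the first "?>" in the document
    declaration_end = xml_text.find("?>") + 2
    return xml_text[:declaration_end] + "\n" + pi + xml_text[declaration_end:]

# Typical use: rewrite sitemap.xml in place after the generator ran, e.g.
#   text = open("sitemap.xml").read()
#   open("sitemap.xml", "w").write(add_stylesheet(text))
```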
Google's time to index
Monday, December 19, 2005
The most common misunderstanding of the Google Sitemaps program is that a successful download of a sitemap leads to instant indexing. Although exactly that happens with Web sites in good standing, in many cases a sitemap submission alone will not result in (more) indexed Web pages. To understand why Google doesn't owe sitemap submitters unverified instant listings (of possible bulk junk), read these quotes from the Google Sitemaps homepage:
...Google Sitemaps is an experiment in web crawling. By using Sitemaps to inform and direct our crawlers, we hope to expand our coverage of the web and speed up the discovery and addition of pages to our index...
...A Sitemap provides an additional view into your site (just as your home page and HTML site map do). This program does not replace our normal methods of crawling the web...
...A Sitemap simply gives Google additional information that we may not otherwise discover...
...This is a beta program, so we cannot make any predictions or guarantees about when or if your URLs will be crawled or added to our index. Over time, we expect both coverage and time-to-index to improve as we refine our processes and better understand webmasters' needs... [cited 2005-12-19]
It helps to refine your expectations back to the reality outlined above. A Google Sitemap is not a free ticket to the search index. Every URL submitted via XML sitemap has to pass the same quality checks as every URL found in links on the Web. All URLs Google knows about are treated equally.
The Google Sitemaps program is based on a revolutionary concept (for a major search engine), but hasn't yet unfolded its pioneering approach completely. However, it has helped many, many Web sites to gain greatly improved search engine visibility. It is a good idea to provide Google with sitemaps, but don't rely on your sitemaps alone, because such a single-edged approach will not work.
Google doesn't index all pages from a Sitemap
Monday, December 19, 2005 by Sebastian
To understand why a Web site isn't fully indexed by Google, despite a complete Google Sitemap, it helps to classify the pages by link popularity. Note that link popularity is only one minor factor in Google's rankings, and that your analysis will not lead to the same results as Google's highly sophisticated algorithms, but it is a good point to start with, even without proper weighting of your inbound links.
Fire up a backlink checker or site explorer to track Web links pointing to your pages. Ignore useless links from scraper sites like DMOZ clones, low-ranked forums, blogs and the like. Count only links from high-quality sites and popular blogs, and make a note of how many of these links are topically related to the linked page on your site. Internal links from within your site count too.
Sort your spreadsheet by the number of related inbound links. You'll spot a correlation between inbound links and search engine visibility.
- Fairly ranking pages will have attracted the most valuable inbound links, and prominent internal links.
- Pages without external inbound links, but one or more prominent internal links, will be indexed (findable with a site:example.com search, appearing with title and a text snippet on the SERP), but don't rank that well for their desired keywords.
- Pages where the sole inbound links come from navigational page elements on secondary (or deeper) levels (more than one click away from the home page or another high ranking point of entry), most probably were crawled (fetched by Googlebot), but in many cases not (fully) indexed, or they are shown as URL-only listing on the SERPs of a site-search.
- Pages without inbound links get (sometimes) crawled from the Google Sitemap; they may even appear for a week or two on the SERPs, but get dumped into the supplemental index or even removed if they don't attract links.
(Results may vary depending on type, size, age, theme and overall popularity of your site.)
The first option to enhance your unlinked pages' crawlability and popularity is to tweak your site's navigation and other internal linkage. For example create static HTML site maps with no more than 100 links per page. The results will be limited, but it has to be done. Make sure your site is perfectly crawlable, and study Google's SEO guidelines.
The next thing to do is a content review of poorly performing pages. If a page is worth the effort, add more interesting and original content, enhance non-textual content with well-written descriptions, and so on. Then search the Web for other great pages on the page's topic and link to the best ones from within your textual content (more info on link placement). Don't consider PageRank, and don't link to high-ranking pages only. Add value to your pages by recommending great resources which are most likely interesting for your visitors. Webmasters finding your page in their referrer stats will naturally place a link back if they like your content. Be creative.
With a commercial site you'll need to adapt the advice above, or hire a competent SEO.
Google Python Sitemap Generator - Introduction
Wednesday, January 11, 2006 by Cristina
The Free Python Google Sitemap Generator is described in the Google documentation at Using the Sitemap Generator and can be downloaded from SourceForge.net Google-sitemap_gen.
It is important to use the latest version, because of bug fixes and improvements. At the moment (9 January 2006) the latest version is 1.4. The sitemap generator is written in Python and requires Python version 2.2 or newer; it does not work with older versions. Python can be downloaded from python.org.
The sitemap generator collects URLs by walking the file system on the web server and by reading access log files. The resulting sitemap is an XML file, either compressed or uncompressed, in the format specified by the Google Sitemap Protocol, with full XML header.
URLs of dynamically generated pages might not appear in the resulting sitemap if the generator uses only file system walking, since it will find only the URLs of the script files used to generate those pages.
Iterations of the sitemap generator reading access log files can be used to update/enlarge the resulting sitemap. If the number of collected URLs exceeds the maximum of 50,000, the generator will create more sitemap files and a sitemap index file (the sitemap index file will have to be submitted from the Google Sitemaps account panel); see the Google Sitemaps group thread Sitemap gen apache log technique coupled with already existing sitemap, and the description below of the sitemap node of the config.xml file.
When reading access log entries, the sitemap generator will include in the sitemap only the URLs that return HTTP response status 200 (OK). To avoid inclusion of non-existent URLs, it is thus necessary to have a website setup that returns a 404 (Not Found) HTTP response status for non-existent URLs, not a redirection to a page returning HTTP status 200 (OK).
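As a sketch of that rule, filtering a Common Logfile Format access log for URLs that returned 200 might look like this. The regular expression is a simplification and the base URL is an assumption:

```python
import re

# Common Logfile Format: host ident user [date] "METHOD /path HTTP/x.x" status bytes
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3}) ')

def urls_with_status_200(log_lines, base_url="http://www.example.com"):
    """Yield full URLs for log entries that were answered with HTTP 200."""
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group(2) == "200":
            yield base_url + match.group(1)
```

Entries with 404, 302 or any other status are skipped, mirroring the generator's behavior described above.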
When the generator uses only file system walking, the elements included in the sitemap for each URL are, besides the full URL, lastmod with a value given by the file time stamp (GMT), and priority with a default value of 0.5.
If the generator uses access log files, then the priority value is given by the frequency with which a URL appears in the access logs. If the generator uses only access logs, without a file system walk, file time stamps are unavailable and so there are no lastmod elements in the resulting sitemap.
The value for the changefreq element can be specified individually for each URL by using the urllist nodes in the config.xml file; as far as I know, it cannot be specified at once for all URLs in a website.
The information specific to each website, like the name of the sitemap file, the domain URL necessary for building the canonical URLs in the sitemap, etc., is contained in a configuration file in XML format, usually called config.xml. The script obtains the name of the config.xml file from the command line. For example, a command to run the generator from the same directory as sitemap_gen.py can be
$ python sitemap_gen.py --config=/path/config.xml
where /path/config.xml is the path name of the configuration file. The path name of a folder on a UNIX/Linux server can easily be found from a command window with the Unix command pwd. The relative path name can also be used, so if sitemap_gen.py and config.xml are in the same directory, config.xml can be used in the example above as the path name of the configuration file.
Search engine notification and suppression of it for testing
After creating the sitemap file, the generator by default notifies Google using the ping method (the sitemap still has to be submitted from the Google Sitemaps account). It is possible to suppress the search engine notification either from the command line by using the --testing argument, or from the config.xml file by using the suppress_search_engine_notify attribute of the site root node.
An example of suppressing the search engine notification from the command line, run from the same directory as sitemap_gen.py:
$ python sitemap_gen.py --config=/path/config.xml --testing
The config.xml file
The distribution package from SourceForge.net contains an example configuration file, example_config.xml, with very good commentaries and explanations.
The generator script processes the config.xml file using the SAX paradigm. SAX is an acronym for Simple API for XML and refers to sequential, event-based parsing of an XML document: the script processes each XML element as it is encountered in the stream represented by the XML document. The config.xml file has the following nodes with attributes.
The site node is the single root node, which contains all the other nodes, and specifies via its attributes the domain URL and the path name for the resulting sitemap file. The first XML tag in the config.xml file is the opening tag of the site node, and the file ends with the closing tag </site> for this root node.
The site node has two required attributes: base_url, the domain URL used in canonicalization of the URLs collected for the sitemap either from the walk of the web server file system or from scanning access log files, and store_into, the path name of the resulting sitemap XML file. This resulting sitemap file can be uncompressed, with a .xml file name extension, or compressed, with a .xml.gz file name extension.
Note that a bug in generating the compressed sitemap file has been fixed in version 1.4 of this Python generator, so it is important to check that you are using the latest version.
The site node also has some optional attributes, which specify the level of detail in the diagnostic output that the script gives, suppression of notification to search engines (similar to the --testing command-line argument), and the character encoding to use for URLs and file paths.
The directory nodes specify via attributes the path name of the directory where the walking of the file system on the web server starts. If URLs are dynamically generated by a CGI script file, then only the URL of that script file is added to the sitemap, without the URLs dynamically generated by query strings. In this case it is necessary to also use accesslog nodes to scan access log files, if available.
A directory node has two required attributes, for the directory path name and for the URL corresponding to that path name. There is also the optional attribute default_file for the index file or default file for directory URLs. Setting a default file causes URLs of the default files of that name in the specified directory and its subdirectories to be suppressed (when URLs are collected by using only file system walking on the server). URLs to directories will have the lastmod date taken from the default file rather than the directory itself (as explained by a Google employee in the Google Sitemaps group thread from July 2005, Sitemap_gen.py v1.2). If default_file is not specified, then both the URL to the directory and the URL to the default file will be included in the sitemap, even though they represent the same document.
The accesslog nodes tell the script to scan web server log files to extract URLs. Both Common Logfile Format (the Apache default) and Extended Logfile Format (the IIS default) can be read. accesslog nodes have a required attribute for the path name to the log file and an optional attribute for the encoding of the file if it is not US-ASCII. File globbing for access files is possible by using the * wildcard character, for example <accesslog path="/pathname/www/logs/*" encoding="UTF-8" />; see the Google Sitemaps Group threads Feature Request: File Globbing for AccessLogs and Sitemap_gen.py v1.2.
The sitemap nodes tell the script to scan other Sitemap files, there is one required attribute that is the path to the sitemap file. It can help to iterate readings of the access log files to update the resulting sitemap files.
After a first run of the sitemap generator without a sitemap node in the config.xml file, a sitemap node with the path to the current sitemap file as its attribute can be added for further runs of the script that use accesslog nodes to scan the access log files; a feedback loop is created and iterations improve the sitemap. If the collected URLs exceed the maximum number for a sitemap file (50,000), then the sitemap generator script creates new sitemap files and a sitemap index file.
The url and urllist nodes can be used to specify URLs, with their changefreq and other attributes, for addition to the resulting sitemap file. url nodes have one required attribute, that is the URL, and three optional attributes: lastmod, changefreq and priority. urllist nodes name text files with lists of URLs; the nodes have one required attribute, the path to the file.
These text files with URL lists contain one URL per line. A line can consist of several space-delimited columns where, after the mandatory URL, attributes can follow in the form key=value for lastmod, changefreq and priority. There is an example file, example_urllist.txt, included in the distribution package.
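A minimal urllist file following that format might look like this (URLs and values are made up for illustration):

```
http://www.example.com/ lastmod=2006-01-01T12:00:00+00:00 changefreq=daily priority=1.0
http://www.example.com/news.html changefreq=hourly
http://www.example.com/archive.html
```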
The generator discards URLs that do not start with the domain's URL, but it does not check if a URL exists on the server. If urllist nodes specify URLs with the correct base URL that have never been on the server, then these URLs are included in the sitemap.
The filter nodes specify patterns that the script compares against all URLs it finds. There are drop filters that cause exclusion of matching URLs and pass filters that cause inclusion of matching URLs. If no filter at all matches a URL, the URL will be included. Filters are applied in the order specified, and a pass filter shortcuts any other later filters that might also match.
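Putting the nodes together, a config.xml along the lines of the well-commented example_config.xml might look like this. All paths, URLs, attribute values and patterns are illustrative assumptions; check example_config.xml for the authoritative attribute names:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<site base_url="http://www.example.com/"
      store_into="/var/www/html/sitemap.xml.gz"
      verbose="1">
  <!-- walk the file system, suppressing duplicate default-file URLs -->
  <directory path="/var/www/html" url="http://www.example.com/"
             default_file="index.html" />
  <!-- scan the access log for dynamically generated URLs -->
  <accesslog path="/var/log/httpd/access_log" encoding="UTF-8" />
  <!-- hand-maintained URL list -->
  <urllist path="urllist.txt" />
  <!-- exclude editor backup files -->
  <filter action="drop" type="wildcard" pattern="*~" />
</site>
```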
The free Python Google generator is relatively easy to use, no knowledge of Python is necessary. The information and sitemap requirements specific to a website can be easily included in the configuration file by using the well commented
example_config.xml file which comes with the generator.
There are some things in the current version 1.4 that I think could be improved in future versions. For example, non-existent URLs can be included by mistake in the sitemap, as long as they have the correct base URL, via the urllist nodes.
Also, when access logs are used in creating the sitemap, if a URL has been removed during the logged interval, such that it appears in the same access log file at first with HTTP response status 200 (OK) and later with 404 (Not Found), it will still be included in the sitemap.
Another thing is that I cannot see a way of specifying the changefreq at once for all URLs in the sitemap, maybe with globbing. The changefreq element has to be specified, if used, for individual URLs via the url and urllist nodes.
Google Sitemaps for (tens of) thousands of sub-domains
Wednesday, December 14, 2005 by Shawn
Hosts with gazillions of subdomains (user-name.example.com) can implement scripts to generate XML sitemaps for each user on their particular sub-domain (user-name.example.com/sitemap.xml). They can update those sitemaps when a user uploads/creates or changes content.
Because Google accepts sitemaps on a per-server basis only, it's not possible to create a sitemap index file containing all user sitemaps on the domain's root level (example.com or www.example.com). That is, the submission of huge amounts of sub-domain sitemaps can't be done via the official route (manually, using a Google account).
A suitable solution for mass submissions of subdomain-sitemaps is anonymous pinging, because it can run fully automated.
That's what I do for my own hosting clients. When they add new content, Google gets pinged with the URL of the sitemap, which is generated on the fly via either a direct file the user maintains or via the 404 handler (which executes a script that generates a real-time sitemap and sets the status code to 200).
It works perfectly since I don't need stats for them. However, if it were necessary to debug a problem, it is possible to add individual sites to my sitemaps account as well. Also, all clients can verify their sitemaps to view their stats and track down crawling problems themselves.
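The anonymous ping itself is a single GET request. At the time of writing, Google accepted pings at the address below, with the sitemap URL passed as an escaped query parameter; treat the endpoint as an assumption and check the current documentation before relying on it:

```python
from urllib.parse import quote

PING_BASE = "http://www.google.com/webmasters/sitemaps/ping?sitemap="

def ping_url(sitemap_url):
    """Build the ping URL for one user's sub-domain sitemap."""
    # escape everything, including ":" and "/", in the embedded URL
    return PING_BASE + quote(sitemap_url, safe="")

# Sending the ping is then one request, e.g.:
#   from urllib.request import urlopen
#   urlopen(ping_url("http://user-name.example.com/sitemap.xml"))
```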
Which major search engines support the Google Sitemaps Protocol?
Thursday, December 15, 2005 by Sebastian
At the time of writing this piece I'm not aware of other engines supporting Google's Sitemap Protocol. At least not officially. Actually that's a shame, since Google made the sitemaps technology open source, and encouraged other engines to participate.
However, the good news is that MSN and Yahoo both accept mass URL submissions. Yahoo started to process submitted URL lists in plain text format shortly after Google's Sitemaps announcement, and added support for RSS, ATOM, and HTML sitemaps a few months later (details). MSN keeps a very low profile, but has silently started to process XML sitemaps; at least they extract URLs from XML files in general, and crawl those. Neither Yahoo nor MSN tells whether they revisit sitemaps, so it's a good idea to resubmit them every once in a while.
As long as Yahoo doesn't accept XML sitemaps, you should use a sitemap generator which can output text files or feeds, for example GSiteCrawler or SimpleSitemaps.
Update: Yahoo and Microsoft support XML sitemaps. For more information refer to sitemaps.org
Supported formats: XML, RSS, ATOM, TXT
Supported formats: RSS, ATOM, TXT, HTML/XHTML
Supported formats: XML, RSS
(XML means here: compliance to Google's Sitemap Protocol)
Which URLs should I put in my Google Sitemap?
Thursday, January 26, 2006 by Sebastian
In general it's a good idea to populate a Google Sitemap with all URLs; that includes bigger images, video or sound clips, URLs of iFrames and content frames, spreadsheets, manuals in PDF format etc., as long as the particular URL is the one and only spiderable URL pointing to a piece of content.
Say you have a piece of textual content, for example a longish product description. This product description is a unique piece of content which search engines should index under one and only one Web address, that is, one URL. If you serve a piece of content in various formats, e.g. on an X/HTML page, in a PDF and a Word document, and on a printer-friendly page, you end up with four addresses, because each document has its own URL. Only one URL, usually the interlinked X/HTML page, should be accessible and indexable for search engine crawlers.
If you provide your visitors with identical content in different formats, you have several options to make sure that search engines can index only one version, for example robots META tags on X/HTML pages, and the robots exclusion protocol (robots.txt), suitable for X/HTML pages and all other files as well. If your URLs are dynamic, you have other options to prevent search engines from crawling and indexing duplicate content. The one and only indexable URL pointing to a unique piece of content goes into the Google Sitemap.
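For instance, if the duplicate formats live in directories of their own, a robots.txt along these lines keeps crawlers on the X/HTML version (the directory names are made-up examples); printer-friendly X/HTML pages can carry a robots META tag like <meta name="robots" content="noindex,follow"> instead:

```
User-agent: *
Disallow: /pdf/
Disallow: /word/
Disallow: /print/
```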
Apply this rule to all types of content provided in any format.
URLs in FrameSets and iFrames
There are some use cases where frames make sense, but Web site navigation is definitely not on that short list. So if your navigation uses frames, consider a revamp and don't bother search engines with your stuff.
When you use frames to enhance non-HTML documents like spread sheets or presentations with navigational links, then you really should include the URLs of those documents as well as the frameset's URL in your Google Sitemap. Make sure that those URLs are under the sitemap path. If the resource loaded by the frame-set is located on another server or a sub-domain of your site, you must not include its URL. Provide an XML Sitemap on the other server instead.
Most iFrames contain advertisements or dynamic content controlled by client-side scripting, thus search engines usually don't index their contents. If you have informational content worth indexing presented in iFrames, for example topics pulled from a help system, the best way to get it indexed is to provide standard HTML pages as an alternative, and to mark all pages loaded in iFrames as not indexable. Only if that's not an option, include iFrame URLs in your sitemaps, but make sure that those pages come with a link to the parent page when not loaded in an iFrame, so that search engine users don't land in a dead end.
Non-HTML Web objects like images or videos
Getting search engine traffic from image or video searches is a good thing, so you really should include the URLs of all unique non-HTML objects, like bitmaps, podcasts etc., in your Google Sitemap. Make use of descriptive file names, but don't stuff the URL with too many keywords. And don't expect wonders, there is a pitfall.
To determine the contents of binary files a search engine needs META data describing the content. Some image formats come with descriptive text, search engines can read compressed text in PDF documents and text snippets or headings in slides, but the majority of META data is embedded in HTML pages, for example text in alt and title attributes, and contextual information like textual content surrounding those objects or the anchor text of hyperlinks pointing to non-HTML objects.
To rank and visualize an image on the SERPs, a search engine must consolidate textual and binary data from different sources, because the crawler fetching the HTML page embedding an image usually has nothing to do with the crawler fetching the image file. The conventional crawler spotting an image URL notifies the image crawler and provides the indexer with the META data belonging to the image. Later on the image crawler fetches the image file, creates a thumbnail, and hands over this package to the indexer. The indexing process then decides whether the image will be indexed or not.
It may work the other way round too: the image crawler gets alerted to an image, for example by a Sitemaps entry, and can fetch and process the image before the indexer knows about its META data. This could shorten the whole process considerably, because the crawl frequency of image crawlers is rather low. This description is greatly simplified, but it covers the overall principle of most image and video search engines.
XML feeds (RSS, ATOM)
In most cases it makes good sense to include RSS/ATOM feed URLs in the Google Sitemap. Google's spiders don't tell you whether they request your feeds for the Web search or the blog search index, so you have no way to serve different content depending on the user agent name, for example headlines to the Web crawler to avoid possible -- although improbable -- duplicate content issues, and full-content feeds to the feed crawler invited by your pings.
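A feed entry in a Google Sitemap is an ordinary URL entry; the following fragment is a hypothetical sketch (URL and change frequency made up), and would sit inside the usual urlset element:

```xml
<url>
  <loc>http://www.example.com/feed.rss</loc>
  <changefreq>hourly</changefreq>
</url>
```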
Since the RSS URL is marked in the source page's header as an alternate format, the connection between both URLs is obvious, so presumably Google can properly handle duplicated information served from different URLs in different formats. Most probably this goes for PDFs or Word documents linked in the header too, but because there is little value in indexed PDFs when a corresponding HTML page exists, I'd make those unspiderable.
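One way to make such alternative-format documents unspiderable is a robots.txt rule keeping crawlers out of the directory where they live. The path below is a made-up example:

```
User-agent: *
Disallow: /pdf/
```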
Thin pages
A thin page is an X/HTML page with little or no unique and original content, for example a product or item page with a short description (which is reused on index pages of upper levels), an image, and a buy-now link. Most of its textual content is duplicated from the page template: logo/header, footer, advertisements and navigation.
Thin pages aren't considered index-worthy, so it makes no sense to stuff a Google Sitemap with URLs which in most cases will never be crawled and will never make it into the search index, at least not in large numbers. On the other hand, product and item pages are legitimate doorways, capable of ranking for very specific search terms, when they carry a fair amount of unique and original content, for example user reviews or long descriptions which are not used on upper navigation levels. Converting thin pages into content-rich pages is a great way to earn huge amounts of very targeted (read: converting/profitable) search engine traffic.
Author: The Google Sitemaps Group
Last Update: December 10, 2005