Should I include all URLs in my sitemap, even feeds, images, videos and other Web objects without META data? Will a Google Sitemap help to get framed pages indexed? Should I submit thin pages via Google Sitemaps or would this hurt?

In general it's a good idea to populate a Google Sitemap with all of your URLs. That includes larger images, video and sound clips, the URLs of iFrames and content frames, spreadsheets, manuals in PDF format and so on, as long as the particular URL is the one and only spiderable URL pointing to a piece of content.

Say you have a piece of textual content, for example a longish product description. This product description is a unique piece of content which search engines should index under one and only one Web address, that is, one URL. If you serve this content in various formats, e.g. on an X/HTML page, in a PDF, in a Word document, and on a printer-friendly page, you end up with four addresses, because each document has its own URL. Only one URL, usually the interlinked X/HTML page, should be accessible and indexable for search engine crawlers.

If you provide your visitors with identical content in different formats, you have several options to make sure that search engines can index only one version, for example robots META tags on X/HTML pages, and the robots exclusion protocol (robots.txt), suitable for X/HTML pages and all other files as well. If your URLs are dynamic, you have other options to prevent search engines from crawling and indexing duplicate content. The one and only indexable URL pointing to a unique piece of content goes into the Google Sitemap.
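As a rough illustration of that rule, here is a minimal Python sketch. It assumes a made-up product description served at four hypothetical URLs on www.example.com and a hypothetical robots.txt that blocks the alternate formats; the helper name build_sitemap is my own. The standard library's urllib.robotparser decides which URLs a crawler may fetch, and only those go into the sitemap. The namespace used is the one from the current sitemaps.org protocol, which is an assumption on my part and may differ from the schema version you submit.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

# Hypothetical URLs: one piece of content served in four formats.
URLS = [
    "http://www.example.com/products/widget.html",  # interlinked X/HTML page
    "http://www.example.com/pdf/widget.pdf",
    "http://www.example.com/doc/widget.doc",
    "http://www.example.com/print/widget.html",
]

# Hypothetical robots.txt that keeps crawlers away from the alternate formats.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf/
Disallow: /doc/
Disallow: /print/
"""

def build_sitemap(urls, robots_txt, user_agent="Googlebot"):
    """Return sitemap XML listing only the URLs that robots.txt allows."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Assumed namespace (sitemaps.org protocol 0.9).
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        if parser.can_fetch(user_agent, url):
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

SITEMAP_XML = build_sitemap(URLS, ROBOTS_TXT)
print(SITEMAP_XML)
```

Driving the sitemap from the same rules that block the duplicates keeps both in sync: a URL you have made unspiderable can never slip into the sitemap by accident.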

Apply this rule to all types of content provided in any format.

URLs in FrameSets and iFrames

There are some use cases where frames make sense, but Web site navigation is definitely not on that short list. So if your navigation uses frames, consider a revamp instead of asking search engines to index the framed pages.

When you use frames to enhance non-HTML documents like spreadsheets or presentations with navigational links, you really should include the URLs of those documents as well as the frameset's URL in your Google Sitemap. Make sure that those URLs are under the sitemap path, that is, on the same host and at or below the directory where the sitemap file resides. If the resource loaded by the frameset is located on another server or on a sub-domain of your site, you must not include its URL. Provide an XML Sitemap on the other server instead.
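The sitemap path rule is easy to check before you add a URL. The following Python sketch is only an illustration: the function name is_under_sitemap_path and the example.com URLs are assumptions, and it treats "under the sitemap path" as "same scheme and host, and at or below the directory of the sitemap file".

```python
import posixpath
from urllib.parse import urlsplit

def is_under_sitemap_path(sitemap_url, url):
    """True if url is on the sitemap's host and at or below its directory."""
    sm, candidate = urlsplit(sitemap_url), urlsplit(url)
    # Another server or a sub-domain counts as a different site.
    if (sm.scheme, sm.netloc.lower()) != (candidate.scheme, candidate.netloc.lower()):
        return False
    base_dir = posixpath.dirname(sm.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    return candidate.path.startswith(base_dir)

SITEMAP = "http://www.example.com/docs/sitemap.xml"  # hypothetical location
print(is_under_sitemap_path(SITEMAP, "http://www.example.com/docs/frameset.html"))
print(is_under_sitemap_path(SITEMAP, "http://files.example.com/docs/sheet.xls"))
```

A frameset and a spreadsheet below /docs/ pass the check; the same spreadsheet served from files.example.com does not and belongs in a sitemap on that host.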

Most iFrames contain advertisements or dynamic content controlled by client-side scripting, so search engines usually don't index their contents. If you have informational content worth indexing presented in iFrames, for example topics pulled from a help system, the best way to get it indexed is to provide standard HTML pages as an alternative, and to mark all pages loaded in iFrames as not indexable. Only if that's not an option, include iFrame URLs in your sitemaps, but make sure that those pages come with a link to the parent page when they are not loaded in an iFrame, so that search engine users don't land in a dead end.

Non-HTML Web objects like images or videos

Getting search engine traffic from image or video searches is a good thing, so you really should include the URLs of all unique non-HTML objects, like bitmaps, podcasts etc., in your Google Sitemap. Use descriptive file names, but don't stuff the URL with too many keywords. Don't expect wonders, though: there is a pitfall.

To determine the contents of binary files, a search engine needs META data describing the content. Some image formats come with descriptive text, and search engines can read compressed text in PDF documents as well as text snippets or headings in slides. The majority of META data, however, is embedded in HTML pages, for example text in alt and title attributes, and contextual information like textual content surrounding those objects or the anchor text of hyperlinks pointing to non-HTML objects.
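To see what META data your own pages offer for an image, you can collect it the way a crawler might. The Python sketch below is a simplified assumption of mine, not a description of how any search engine really works: the class name ImageMetaCollector and the sample markup are made up, and it gathers only alt text, title text and the anchor text of links pointing to image files.

```python
from html.parser import HTMLParser

IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")

class ImageMetaCollector(HTMLParser):
    """Collect alt/title text and anchor text per image URL."""

    def __init__(self):
        super().__init__()
        self.meta = {}             # image URL -> list of text snippets
        self._link_target = None   # href of the image link being read
        self._link_text = []

    def _add(self, url, text):
        if text:
            self.meta.setdefault(url, []).append(text)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self._add(attrs["src"], attrs.get("alt"))
            self._add(attrs["src"], attrs.get("title"))
        elif tag == "a" and (attrs.get("href") or "").lower().endswith(IMAGE_EXTENSIONS):
            self._link_target = attrs["href"]
            self._link_text = []

    def handle_data(self, data):
        if self._link_target is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._link_target is not None:
            # Normalize whitespace in the collected anchor text.
            self._add(self._link_target, " ".join("".join(self._link_text).split()))
            self._link_target = None

# Hypothetical page fragment.
HTML = """
<p>See the <a href="/img/blue-widget-large.jpg">large photo of the blue widget</a>.</p>
<img src="/img/blue-widget.jpg" alt="Blue widget, front view" title="Blue widget">
"""

collector = ImageMetaCollector()
collector.feed(HTML)
collector.close()
IMAGE_META = collector.meta
print(IMAGE_META)
```

An image URL that ends up with no entry here is one for which the page provides no textual description at all, which is the pitfall mentioned above.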

To rank and display an image on the SERPs, a search engine must consolidate textual and binary data from different sources, because usually the crawler fetching the HTML page embedding an image has nothing to do with the crawler fetching the image file. The conventional crawler spotting an image URL will notify the image crawler, and provide the indexer with the META data belonging to the image. Later on the image crawler will fetch the image file, create a thumbnail, and hand over this package to the indexer. The indexing process decides whether the image will be indexed or not.

It may work the other way round too, that is, the image crawler gets alerted to an image, for example by a Sitemaps entry, and is able to fetch and process the image before the indexer knows about its META data. This could shorten the whole process considerably, because the crawling frequency of image crawlers is rather low. This process description is heavily simplified, but it covers the overall principle of most crawler-based image and video search engines.

XML feeds (RSS, Atom)

In most cases it makes sense to include RSS/Atom feed URLs in the Google Sitemap. Google's spiders don't tell you whether they request your feeds for the Web search index or the blog search index, so you cannot serve different content depending on the user agent name, for example headlines only to the Web crawler (to avoid possible, although improbable, duplicate content issues) and full-content feeds to the feed crawler invited by your pings.

Since the RSS URL is marked in the source page's header as an alternate format, that is, the connection between both URLs is obvious, I guess Google can properly handle duplicated information served from different URLs in different formats. Most probably this goes for PDFs or Word documents linked in the header too, but because there is little value in indexed PDFs when there is a corresponding HTML page, I'd make those unspiderable.
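To find the feed URLs that a page declares as alternate formats, and that are therefore candidates for the sitemap, a small parser is enough. This Python sketch is an illustration under assumptions: the class name FeedLinkFinder and the sample header are made up, and it looks only at link elements with rel="alternate" and an RSS or Atom MIME type.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> elements that point to feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        mime = (attrs.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attrs.get("href"):
            self.feeds.append(attrs["href"])

# Hypothetical page header.
HEAD = """
<head>
  <title>Example blog</title>
  <link rel="stylesheet" type="text/css" href="/style.css">
  <link rel="alternate" type="application/rss+xml" title="RSS" href="/feed.xml">
</head>
"""

finder = FeedLinkFinder()
finder.feed(HEAD)
finder.close()
FEEDS = finder.feeds
print(FEEDS)
```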

Thin pages

A thin page is an X/HTML page with little or no unique and original content, for example a product or item page with a short description (which is reused on index pages of upper levels), an image, and a buy-now link. Most of its textual content is duplicated from the page template, that is, logo/header, footer, advertisements and navigation.

Thin pages aren't considered index-worthy, so it makes no sense to stuff a Google Sitemap with URLs which in most cases will never be crawled and will never make it into the search index, at least not in large numbers. On the other hand, product and item pages are legitimate doorways, able to rank for very specific search terms, when they carry a fair amount of unique and original content, for example user reviews or long descriptions which are not used in upper navigation levels. Converting thin pages to content-rich pages is a great way to earn huge amounts of very targeted (read: converting/profitable) search engine traffic.
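If you want to keep thin pages out of a generated sitemap, you need some measure of how much unique text a page carries. The Python sketch below is a crude heuristic of my own, not anything a search engine publishes: it counts the words on a page that appear neither in the page template nor in text reused on upper-level pages, and the threshold min_unique_words is an arbitrary assumption you would have to tune.

```python
import re

def _words(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def unique_word_count(page_text, template_text, shared_text=""):
    """Count words found in neither the template nor the reused text."""
    boilerplate = set(_words(template_text)) | set(_words(shared_text))
    return sum(1 for word in _words(page_text) if word not in boilerplate)

def is_thin(page_text, template_text, shared_text="", min_unique_words=150):
    """Arbitrary rule of thumb: too few unique words means a thin page."""
    return unique_word_count(page_text, template_text, shared_text) < min_unique_words

# Hypothetical example data.
TEMPLATE = "Home Products Contact Copyright Example Shop"
SHORT_DESCRIPTION = "Blue widget. Buy now."
REVIEW = "I have used this gadget daily for three months and it never failed"

thin_page = TEMPLATE + " " + SHORT_DESCRIPTION
rich_page = thin_page + " " + REVIEW

print(unique_word_count(thin_page, TEMPLATE, SHORT_DESCRIPTION))
print(unique_word_count(rich_page, TEMPLATE, SHORT_DESCRIPTION))
```

Pages that fail the check are candidates for more content, such as user reviews or longer descriptions, rather than for the sitemap.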

Thursday, January 26, 2006 by Sebastian

Author: The Google Sitemaps Group
Last Update: December 10, 2005   Web Feed

Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited · Contact · Privacy