The Unofficial Google Sitemaps Frequently Asked Questions

Google Sitemaps lets you inform Google's crawler about all your pages and helps people discover more of your Web pages. The Unofficial FAQ on Google Sitemaps answers popular questions collected from various forums, email lists, and the Google Sitemaps Group. It's meant as an addition to Google's official FAQ, the Google Sitemaps Blog, and the Google Sitemaps Tutorial.

Manuals and generic information often cannot answer particular questions, because individual circumstances call for dedicated answers from experienced consultants. Google's Sitemap Protocol, along with dozens of excellent 3rd party tools, forms a toolset to enhance crawlability, but developing solutions takes more than a toolset.

This FAQ and the Google Sitemaps Knowledge Base combine detailed knowledge of Google's related protocols, tools and services with experience from experts in the field who have implemented Google Sitemaps with various content management systems (CMS) and Web sites of all types.


Index

Google Sitemaps Rundown

What Google Sitemaps is all about: targeted crawling keeps Google's search results fresh, and Webmasters happy.

Can a Google Sitemap destroy my search engine placements?

No, Google Sitemaps are just (mass) URL submissions. If you feel comfortable using Google's add-url form to submit your URLs, you can submit an XML sitemap too. In any case you should pay attention to what you're submitting. Junk submissions from your site can affect the search engine placements of clean pages.

Can I remove deleted pages in Google's index via XML Sitemap?

No, Google Sitemaps is a robots inclusion protocol lacking any syntax for deletions. Remove deleted URLs from the XML file, and ensure your server responds with a 404 or 410 to Googlebot.

Will a Google Sitemap increase my PageRank?

The official answer is no, PageRank adjustments are unrelated to Google Sitemaps. The unofficial answer is "it depends": in some cases Google Sitemap submissions can affect PageRank calculations.

Can I escape the 'sandbox effect' with a Google Sitemap?

No, Google Sitemaps enhance a site's crawlability, which is a good thing as part of a long-term SEO strategy, but even perfectly crawlable sites can get 'sandboxed'. To escape Google's probation period (a.k.a. 'sandbox') you must tweak other factors.

What is the best tool to create a Google Sitemap for my Web site?

It depends. Before you look for a tool, work out a sitemap strategy suitable for your Web site. Choosing a tool sets the procedure for creating and maintaining the sitemaps in stone, so make a good decision in the first place.

What is the time table of a Google Sitemap submission?

Breaking the Google Sitemaps submission process down into its phases yields a time table, and answers a lot of related questions.

How long does it take to get indexed by Google?

The time to index counted from the Google Sitemaps submission may range from a few hours to never. For Web sites in pretty good shape Google's time to index usually doesn't exceed two days.

I've another question | The FAQs weren't helpful

If your particular question wasn't answered, submit it here.



Google Sitemaps Rundown


Friday, October 28, 2005

Google Sitemaps lets you inform Google's crawler about all your pages and helps people discover more of your Web pages. Smart Webmasters use Google Sitemaps to submit fresh content as well as updates to Google's Web crawlers. With Google Sitemaps the submission process can run on autopilot, that is, each update of the Web site's content database can trigger an immediate crawl of new and changed Web pages. This fully automated procedure can reduce the time to index dramatically, even for small and static Web sites, or blogs, forums, shops ...

Here is how you can make use of Google Sitemaps to get your Web site fully indexed in no time, 24/7/365.


1. Study the Google Sitemaps Tutorial

Now you should know what Google Sitemaps is all about, and you should have a pretty good idea how you can get the most out of it.


2. Download and install a free Google Sitemap Generator

Google's open source tool requires Python 2.2. If that's not an option, look for a suitable 3rd party tool. Selfish recommendation: Simple Sitemaps creates a Google XML sitemap, an HTML sitemap, an RSS site feed and a Yahoo! URL submission list for you.


3. Create and validate your XML sitemap

Once you've created your Google Sitemap(s), you should validate the XML structure before you submit it: here is a free online sitemap validator.


4. Submit and verify your Google Sitemap

First sign up for a Google Account (or sign in to your existing acct.), then submit your Google Sitemap. For resubmissions of your sitemap(s) on content changes you have many options, but the initial submission must be done in your Google Account. Once your sitemap is accepted, verify it to get daily crawler problem reports.

Done. Sounds easy, but there are pitfalls, and options. Depending on the type of Web site you're operating and its size, the optimal sitemap strategy will vary, and your Google Sitemap implementation may turn into a complex task. It may be a good idea to ask for advice or a second opinion.

If you have questions or answers not covered in this FAQ and the tutorial, feel free to contribute.

Surf a few useful Web links for more information on Google Sitemaps. And keep your knowledge current: bookmark the Google News Feed on Google Sitemaps, and subscribe to the Google Sitemap FAQ's RSS Feed to get alerted on updates and Google Sitemap news.



Can a Google Sitemap destroy my search engine placements?


Monday, November 21, 2005

For a while I was struggling with this FAQ article. I knew I couldn't avoid it, but having a zero-bullshit-tolerance it's hard to stay polite in this case. I'm a strong believer that there is no such thing as a dumb question, but this question is the exception to the rule. At least this goes for Web-savvy questioners flooding the forums and user groups with variations of this question. Otherwise the question is valid, and the answer is no. Machine readable mass submissions of questionable content make it easier to apply quality guidelines, but indexing based on regular crawling (following links) leads to the same results as a sitemap based crawl, sooner or later.

Google launched the sitemaps program in June 2005, more or less concurrently with other changes, e.g. improved index-spam and duplicate content filtering (besides some algorithmic changes, probably noticed primarily because existing rules were enforced on a broader range of Web objects). Those changes have produced some collateral damage, but Google's goal to improve the search results was achieved (Google takes feedback seriously and has reinstated sites innocently caught by those filters). As for collateral damage, this lies in the eye of the beholder. Just because a Webmaster claims to operate a so called "legit and valuable site", that does not mean that Google considers it compliant with its guidelines. For example, intentional as well as unintentional content duplication worked for years. Just because it worked and made some folks rich, that does not mean that it was legit or valuable in terms of Google's mission to "organize the world's information and make it universally accessible and useful".

If you can't manage to operate a site compliant with Google's written and unwritten policies (that means you're ignoring Google's Webmaster guidelines, professional advice, and common sense), or if your great and innocent site gets tanked because a Google ranking algorithm or spam filter causes collateral damage (this happens mostly when a Web site provides questionable or duplicated content, or when a Web site is involved in systematic artificial linkage), or if your server was down or unreachable for Googlebot during the scheduled crawl, don't blame the sitemap.

There are a zillion reasons why a site or a bunch of pages can disappear temporarily, or even permanently. Besides connectivity issues, server outages, rare coincidences, and very few cases of proven collateral damage, the reasons can be revealed by an SEO site review. A professional SEO consultant can often help to lift a ban or penalty, or track down the causes of a temporary disappearance.

Junk submissions via Google Sitemaps can tank a site on the SERPs. Misconfigured Web servers and IIS flaws can result in lost rankings. Outdated SEO tactics usually remove a Web site from the search index. Following free but bad advice from dubious sources deprives penny-pinching site owners of free search engine traffic. And so on, and so on; the list of methods to stay ignored by Google and its competitors is endless.

Looking at all causes, there is a lowest common denominator: miserable site owner/SEO/Webmaster failure. Yes, even choosing the wrong Web hosting service or the wrong CMS, or not hiring an SEO consultant or at least an experienced Webmaster with outstanding SEO skills from the beginning, are unprofessional mistakes. Self-SEO is an artifact from the stone age of the Internet; it doesn't work anymore, alas.

If you're keen on organic search engine traffic, you must play by the rules. Thanks to search engine spammers, most of those rules are unwritten. That's why you need to hire search engine experts. In an ideal world, common sense and honesty would do the trick. We don't live in an ideal world. Unfortunately, in the search engine game there is way too much money involved. Nowadays even non-profit organizations and mom and pop sites rely on paid professional expertise when it comes to search engine placements. Doing things right from an ethical POV is not enough to gain top spots on the organic search engine result pages (SERPs); nowadays search engine expertise is the only way to achieve reasonable and fair search engine placements.



Can I remove deleted pages in Google's index via XML Sitemap?


Saturday, October 29, 2005

Deleted and renamed pages must be removed from your Google Sitemap. Having invalid or redirecting URLs in a sitemap burns resources, and blows up the crawler problem reports. Once Google knows a URL, Googlebot will try to fetch it more or less until the server dies.

Google Sitemaps is an instrument to submit new Web objects and content changes to Google's crawler, as an addition to the regular crawling process. Google Sitemaps is in no way a catalogue of all URLs per Web server where Googlebot ignores URLs not included in the sitemap. Thus deleting a URL entry from a sitemap will not keep Googlebot from requesting it again and again.

The only way to tell a search engine crawler that a page has vanished is via HTTP response code. Google provides an additional method to remove URLs from Google's index immediately, that is, before the next (regular) crawl, but the removal procedure has disadvantages, e.g. it does not delete URLs forever.

The HTTP protocol defines return codes to tell a user agent (browser, crawler ...) the status of a resource (URL). If a page is found at the requested address (URL), and its content can be delivered to the user agent, the Web server sends a header containing the return code 200 OK to the user agent before it sends the content. Otherwise it sends an error code. With static Web objects (HTML pages, images ...) this happens in the background; the Webmaster can configure specific return codes for particular areas or resources. With dynamic pages the Webmaster can manipulate the HTTP return code sent to the user agent per page. The most important HTTP error codes usable for moved and deleted resources (URLs) are explained below:


HTTP return code 404 - Not Found

The 404 return code is a generic error code, used if a resource is not available and the server does not know, or does not want to reveal, whether the resource is permanently gone or just temporarily unavailable or blocked. Usually the Web server is configured to send a custom error page to the user agent, served with a 404 error code in the header, which provides the visitor with information (e.g. error messages) and options (e.g. links to related resources). On Apache Web servers this can be done in the .htaccess file located in the root directory:

ErrorDocument 404 /error.htm?errno=404

Because the user agent does not know whether the resource has vanished or not, it might request it again. Therefore the 404 code is not suitable to tell a search engine crawler that it should forget a resource.

Google provides a procedure to delete URLs responding with a 404 code in its databases. Go to Googlebot's URL Console and create an account. You should use the same email address and password as for the Google Sitemaps account. Once the account is active, you can submit dead links found in your Google Sitemaps Stats, server error logs etc. under "Remove an outdated link". It can take five days until the deletion is completed. Every time you log in, you get a status report stating which submitted URLs are not yet removed. Ensure that during this process the URL responds with a "404 not found" error code.


HTTP return code 410 - Gone

The 410 return code tells the user agent that a resource has been removed permanently. Search engine crawlers usually mark resources responding with a 410 code as delisted, and do not request them again. That's not always the case with Google's supplemental index, where dead resources can still appear in search results, even years after their deletion. A 404/410 return code may move a cached resource from the current search index to the supplemental index. However, if a page was deleted and there is no forwarding address (e.g. a new page with similar content), the Web server should send a 410 header. It's good style to make use of a custom error page for human visitors.
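With dynamic pages the 410 header can be sent by the page script itself. Here is a minimal PHP sketch, assuming a hypothetical page_exists() lookup against your content database and a custom error page named gone.htm; adapt both to your own setup:

<?php
// assumption: page_exists() checks your content database for the requested ID
if (!page_exists($_GET['id'])) {
  header("HTTP/1.1 410 Gone");  // tell crawlers the page is gone for good
  readfile("gone.htm");         // custom error page for human visitors
  exit;                         // no further output after the error page
}
?>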


HTTP return code 302 - Found (Elsewhere)

The 302 return code tells the user agent that the requested URL is temporarily unavailable, but the content is available from another address. In the 302 header the server gives the user agent a new location (URL), and the user agent will then request this resource. For various reasons a Webmaster should avoid 302 redirects; they lead to all sorts of trouble with search engines. The most common cause of 302 responses is an invalid URL used in internal links and link submissions, e.g. missing trailing slashes etc. (see valid URLs). Unfortunately, 302 is the default return code for most redirects, for example Response.Redirect(location) in ASP, header("Location: $location") in PHP, RewriteRule as well as ErrorDocument 404 http://www.example.com/page(!!) in Apache's .htaccess directives. All server-side programming languages provide methods to set the redirect response code to 301.


HTTP return code 301 - Moved Permanently

The 301 redirect code tells the user agent that a resource has been moved and will never be available at the old address again. All intentional redirects (e.g. renamed URLs, moved URLs ...) must send the requesting user agent a 301 header with the new permanent address. Many scripts make use of redirects to 'link' to external resources, usually because this is a simple way to track outgoing traffic. That's a lazy and wacky hack, but if it can't be avoided, the script should at least do a permanent redirect.
As for deleted pages, it often makes sense to 301-redirect requests instead of sending a dead page error (404 or 410), especially when there is a page with similar content available on the Web server and other sites link to the deleted page.


Examples of 301 - redirects

To ensure your redirects send a 301 response code to the user agent, you can copy and paste the code examples below. The first examples are for Apache's .htaccess files:

#1 301-redirects a page:
RedirectPermanent /directory/page.html http://www.example.com/other-directory/other-page.html

#2 Alternate syntax:
Redirect 301 /directory/page.html http://www.example.com/other-directory/other-page.html

#3 301-redirects all example.tld/* requests to www.example.tld/*:
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Please note that with both Redirect (#2) and RedirectPermanent (#1) the first location parameter (source) is a path relative to the Web server's root, and the second location parameter (target) is a fully qualified absolute URL. The third .htaccess example makes use of the mod_rewrite module and redirects all requests of URLs on example.com to the corresponding URL on www.example.com.

If you suffer from IIS, go to the "Home Directory" tab and click the option "Redirection to a URL". As "Redirect to" enter the destination, for example "http://www.example.com$S$Q", without a slash after ".com" because the path ($S placeholder) begins with a slash. The $Q placeholder represents the query string. Next check "Exact URL entered above" and "Permanent redirection for this resource", and submit. If you don't have both versions (with and without the "www" prefix) configured, create the missing one first.

Or you can use ASP to redirect page requests:

'VBScript:
Dim newLocation
newLocation = "http://www.example.com/other-directory/other-page.asp"
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", newLocation
Response.End

'JScript:
function RedirectPermanent(newLocation) {
  Response.Clear();
  Response.Status = 301;
  Response.AddHeader("Location", newLocation);
  Response.Flush();
  Response.End();
}
...Response.Buffer = true;...
...RedirectPermanent("http://www.example.com/other-directory/other-page.asp");

The ASP page script must be terminated after sending the 301 header, and you must not output any content, not even a single space, before the header. Everything after Response.End will not be executed.

The same goes for PHP:

$newLocation = "http://www.example.com/other-directory/other-page.php";
header("HTTP/1.1 301 Moved Permanently", TRUE, 301);
header("Location: $newLocation");
exit;


If you are on a free host and you don't care whether your stuff gets banned by search engines or not, you can use META refreshes:

<META HTTP-EQUIV=Refresh CONTENT="0; URL=http://www.example.com/other-directory/other-page.htm">

or JavaScript:

window.location = "http://www.example.com/other-directory/other-page.htm";

and intrinsic event handlers:

<body onLoad="setTimeout (location.href='http://www.example.com/other-directory/other-page.htm', '0')">

to redirect. Again, do not use any client-side redirects if you're keen on search engine traffic, especially not the sneaky methods from the examples above. Google automatically discovers sneaky redirects and deletes all offending pages or even complete domains from its search index, mostly without a warning.


Checklist "Delete a page"

  • Delete the URL entry in the Google Sitemap, then resubmit the sitemap
  • Ensure the server responds with the correct HTTP error code
  • Remove or change all internal links pointing to the deleted page
  • Ask Webmasters of other sites linking to the deleted page to change or remove their links


Before you delete pages, consider an archive. Archiving (outdated) content under the current URL preserves existing traffic and comes with lots of other advantages. Archiving is an easy task: just change the page template and/or the navigation links. Smart content management systems (CMS) will archive a page with one mouse click. Changing the URL is a bad idea, because all incoming links become invalid.


Recap: HTTP error codes 404 · 410 · 302 · 301 · Redirect code snippets · Checklist



Will a Google Sitemap increase my PageRank?


Friday, October 28, 2005

XML URL feeds, a.k.a. sitemaps, are Google's little helpers to enhance the crawling process. Crawlers are Web robots fetching files from foreign Web servers. They get fed URLs from a back end system, download the files, and store them in a database. Some crawlers are somewhat intelligent: they follow links and may do some basic content analysis. However, technically the crawling process ends with the content delivery to the search engine's database. Therefore the official answer is "no, Google Sitemaps have no impact on PageRank calculations". Google's crawler Googlebot doesn't compute the PageRank of fetched Web pages.

Next the indexing process takes care of all the highly sophisticated stuff necessary to include a Web object (document, image, video, sound file ...) in a search engine's index, or not. One part of the indexing process is PageRank adjustment. PageRank is computed by counting weighted internal as well as external inbound links to a Web object. Because Google discovers, fetches and indexes more documents with the help of Google Sitemaps, Google takes formerly unknown inbound links into account. Therefore the unofficial answer is "yes, internal links discovered during sitemap based crawls have impact on the PageRank calculation for each link destination (page) within the Web site". The same goes for sitemap based crawls on other servers. The more sites make use of Google Sitemaps, the more new links Google will discover.

Related information:

PageRank is a property of Web objects reachable via a URL. Each and every Web object which is the destination of a hyperlink has a PageRank value assigned. The name "PageRank" comes from Google's co-founder Larry Page, not from "Web page", by the way. Webmasters using terms like "PR5 site" usually refer to the PageRank assigned to their Web site's root index page. There is no such thing as a site wide PageRank; each and every linked Web object of a Web site has its individual PageRank value.

The PageRank Google displays on the Google Toolbar is not the real PageRank used for page rankings. Toolbar PageRank is an outdated snapshot and gets updated only every 3-4 months. All the PR-tools out there pull the PageRank from the same source, so regardless which method is used to get a page's PageRank, the value is outdated, and pretty much useless. The displayed scale 0/10 - 10/10 is a logarithmic simplification.

PageRank is only one of many factors used for page rankings. It is an important factor, but it usually does not overrule factors used to determine the search query relevancy. Google seems to use a topical PageRank for rankings on the SERPs, because the overall link popularity of a page does not indicate its relevancy with regard to a specific search term, or topic. The value of PageRank is overestimated across the board.



Can I escape the 'sandbox effect' with a Google Sitemap?


Wednesday, November 02, 2005

First of all, the term Google Sandbox is pretty generic and often misleading, because it is used to name a bunch of different things. The most common understanding is that the sandbox effect keeps a new Web site from ranking fairly in Google's search results for at least six months, often longer. Most sandbox theories are pretty vague, because only a few search engine experts understand spam filtering and the process of indexing a new site. 99.9% of everything you can read on the Web about Google sandboxing new Web sites is utter nonsense, written by disappointed Webmasters who lost their Google traffic. It makes no sense to explain such an undefined catchword, so let's sandbox the sandbox for a while and talk facts.

Before Google can crawl a new site, it must discover it. Google Sitemaps can alert Google to a new site, but as far as we know today the deep crawling doesn't start before Google has discovered external links pointing to the new site. That was the preferred method for ages, and it seems Google didn't change this preference.

Next the crawling engine must learn how to communicate with the new site. First it caches DNS information, and determines the canonical name. Then Googlebot tries a few HTTP 1.0 requests, measuring download rates and such stuff. If that works fine for a while, Googlebot-Mozilla comes by doing HTTP 1.1 requests, testing how frequently crawlers can hammer the server. If the new site is on a shared IP address, this process can take a few months, and probably the new site's pages will not get any (or at least not all the deserved) PageRank assigned for a while.

Parallel to the technical evaluation, Google tries to figure out what the new site is all about. As a matter of fact, nowadays a search engine cannot trust the on-the-page content (delivered to crawlers) and anchor text from internal linkage to determine a Web site's overall theme(s) and its sub-topics. On the other hand Google has a pretty good idea about the authorities on the Web, so it can use inbound links to double check the new content's quality. It makes no sense for Google to speculate about actual theming before the new site has attracted a reasonable amount of links from authority pages on its desired topic. Until a few pages have earned a trust bonus via inbound links from topical authorities, Google cannot and will not use untrusted and unchecked stuff in search results, for quality reasons.

There are a lot of other signs of quality checked by Google before a new Web site makes it onto the SERPs. For example the uniqueness of its contents, since flooding the search results with duplicates, near duplicates or variants of well known stuff doesn't fit Google's mission of "organizing the world's information and making it universally accessible and useful". Natural linkage is another very important criterion. Statistical anomalies in a new site's linkage data prolong the probation period. Google can detect topical phenomena producing spikes, and in most cases should handle those huge amounts of fresh links earned in a short period of time accordingly. What Google likes best are smoothly growing, user friendly Web sites providing unique content and linking to original content from within the content.

Go ahead and create a time table from the above. Look at your site and honestly date milestones like naturally earned authority links, reaching a 'critical mass' of unique content, and so on. Count the weeks or months to get the estimated time to index. If you've done a good job, that's the duration of your new site's probation period. If not, you've entered an extended probation period, also known as 'the sandbox', and nobody except Google can tell you how long you'll live without free traffic from Google.

Unfortunately, caused by hardcore spammers flooding search indexes with crap, this procedure comes with a lot of pitfalls. Even some experienced Webmasters still crank out new Web sites with thousands of pages, donate a few high PageRank links from within their network, submit the site to a few directories, send out a couple of press releases etc., and wonder why the new site ranks fine at Yahoo!, Ask and MSN, but stays invisible at Google, despite its increasing overall PageRank. It pays to hire an experienced consultant before the launch, because most of the old fashioned (or rather established) SEO tactics, repeated to death by each and every search engine marketing resource out there, simply don't work anymore, and the few experts in the field don't hand out free advice any more.

However, if a smart and experienced Webmaster, following Google's written and unwritten guidelines, does everything right, her or his new site can gain fair rankings at Google almost instantly. Not all new sites suffer from mysterious 'sandbox symptoms'. If a new site does not appear in searches, the Webmaster should look for homemade flaws and fix them, instead of whining about an unfair and oh so unavoidable sandbox penalty. Sitting back in resignation prolongs the probation period, since inactivity doesn't eliminate its causes.

Recap:

  • Get professional advice from the beginning.
  • Get a fast and reliable server and a dedicated IP address.
  • Get a search engine friendly content management system (CMS), e-shopping mall ...
  • Soft launch your site to shorten the technical evaluation period.
  • Provide a whole bunch of original content to attract natural inbound links, even with e-commerce sites.
  • Don't run wild, let your site's traffic (measured by inbound links) grow naturally, keeping a reasonable new links / fresh content ratio.
  • Link out to great original content from within your content, not from a links page.

Disclaimer: This article is not meant as a guide to escape Google's 'sandbox' or so, because it omits important details.



What is the best tool to create a Google Sitemap for my Web site?


Thursday, November 10, 2005

Google Sitemaps is a URL submission tool, currently feeding only one search engine. Despite the label, it doesn't have much to do with a classic site map at first sight. Google's sitemap protocol lacks attributes like title, abstract, topic/category or parent node (hierarchy level and position of a node in the tree structure), which are necessary to create a site map that helps human users navigate a site. A flat Google Sitemap file becomes a structured site map in Google's search index, which is built and maintained by regular crawling on the Web, and completed by sitemap based crawls as an additional method of inclusion and renewal.

A well thought out sitemap strategy takes care of navigation enhancements, and of other methods and targets of URL submissions too. From the data gathered, reviewed and completed to create a Google Sitemap, one can easily make other useful URL collections, e.g. a Yahoo URL submission list, a hierarchical HTML site map, an RSS site feed and so on. Unfortunately, most Google Sitemap related tools don't come with clever add-ons reusing the data. Another important criterion is the desired (automated) handling of page updates. Most Google Sitemaps related tools aren't suitable for Web sites providing their visitors with frequent updates.

In the following I've listed a few types of Web sites along with appropriate procedures to make use of Google Sitemaps. Please note that most of the tools linked below haven't been evaluated or even tested; they are just examples of a particular approach. "Static" refers to the method of page creation; it means "stored on the Web server, not dynamically created by a script".


Regardless of the method used to create a Google Sitemap, it must be double checked before the submission. If for some reason duplicated pages, unprotected printer friendly versions, or even questionable stuff like outdated link swap pages find their way into the sitemap, the Google Sitemaps submission will function as an index wiper and tank the site in Google's search results. Also, after the submission you should monitor the downloads and --at least the very first-- crawler problem reports.


One pager and tiny, static Web sites

If such a site needs a Google Sitemap at all, the tool of your choice is a text editor. Just grab an XML syntax example, edit the URLs in <loc>, delete all other tags within <url> except <lastmod>, save as UTF-8 text, upload and submit the file, done. However, ensure all pages are linked from the root index page.
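A minimal hand-edited sitemap might look like the sketch below; the URLs are placeholders, and the xmlns value is the 0.84 schema URI Google published for the sitemap protocol at the time of writing, so double check it against the tutorial before you submit:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
 <url>
  <loc>http://www.example.com/</loc>
  <lastmod>2005-10-28</lastmod>
 </url>
 <url>
  <loc>http://www.example.com/about.htm</loc>
  <lastmod>2005-10-28</lastmod>
 </url>
</urlset>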


Small static Web sites, never updated

If the only page ever updated is the guestbook, you need a simple one-time sitemap generation, that is, an online tool like SitemapsPal. Enter your root URL in the form, disable all other sitemap attributes, and press submit. Copy the results and paste them into your favorite text editor, save the file as UTF-8 text, remove the usual duplicate entry http://www.yourdomain.com/index.html but leave http://www.yourdomain.com/ intact, upload and submit the file, done. On the Google Sitemap Tools Page you'll find a style sheet to display the XML sitemap in a Web browser, and a Google Sitemap editor you can use for minor updates.


Small static Web sites, frequently updated

You can use any tool for editing or regenerating the XML sitemap on content changes, but there is a smart way to do it. Download Simple Sitemaps from this site, if you're able to do minor edits in well documented scripts. Simple Sitemaps uses a plain text file containing a list of URLs to create XML, HTML, TXT, and RSS sitemaps. The tool itself is free; for a low fee you can purchase the initial URL list spidered from your site. On page updates just change the date of last modification, or add a new line for a new page, upload the text file, and all four types of sitemaps get updated for Googlebot.


Medium sized and large static Web sites

For this type of Web site there are basically three different approaches:

1. You can add a few lines of code to every page and use an aggregator like AutoSiteMap's proxy script to maintain the Google Sitemap dynamically. The principle is simple: when a user visits a page on your site, its crawling priority gets updated, or, if the page is new, it's added to the sitemap. You rely on a third party service, so if the foreign server is down, incredibly slow or unreachable, your sitemap is broken and page loads become slower. If you have multiple URLs pointing to the same page, e.g. affiliate links, you're toast.

2. Desktop tools like GSiteCrawler crawl your site, parse your server log files and even grab search results on the fly to compile a list of URLs. Those tools come with filters to suppress session IDs and other ugly query string elements, URLs excluded by robots.txt, particular pages or directories and so on, so there is at least a minimum of control over a sitemap's content. Usually the sitemap generation process is pretty much automated, and there are options to recreate sitemaps by project, thus desktop tools are a good choice for Webmasters maintaining multiple static sites.

3. Server-side sitemap generators like Google's Python script or phpSitemapNG extend the capabilities of desktop tools by scanning the Web server's file system for URLs. Especially with large sites those tools are preferable, because they don't burn that much bandwidth. Like desktop tools, server-side solutions provide filters, crawling, stored page attributes, and even fully automated recurring sitemap generation and (re)submission via cron job.

Caution: URL lists harvested from the file system may contain stuff you do not want to submit to a search engine. For example scripts producing garbage without input parameters, forgotten backups of old versions or experimental stuff, and sneaky SEO spam like doorway pages from the last century.


Blogs

Web log software and similar content management systems like Drupal usually come with a built-in sitemap generator. If not, there are good plug-ins for WordPress, Movable Type, and others available. Some of them didn't make it onto the sitemap related link lists, so simply search Google for [google sitemap generator "your blog software or CMS here"]. A blog's XML sitemap should be updated in the background, for example when the blog software pings Pingomatic, Weblogs, etc., announcing new posts to the blogosphere. Even with trusted software it's a good idea to check the XML sitemap for duplicates, that is, different URLs pointing to the same post or archive index page. URLs with IDs in the query string are suspect; mostly they have an equivalent address (permanent link). If there is no filter to suppress useless URLs, dump the software.


Forums

Some forum software vendors offer built-in functionality to create Google Sitemaps; for others like vBulletin you need plug-ins. The crux with forums is that they make use of multiple URLs pointing to the same or similar content. The XML sitemap should contain only URLs to sub-forums and threads. Perhaps even pages per thread, but here it starts to become tricky, because the number of displayed posts per page is --usually-- user dependent, and search engine crawlers like Googlebot don't behave like real users: they don't accept cookies, they don't log in, and they start a new session per 'page view'.

In general, keep any URL with a post-ID out of the XML sitemap, and ensure those pages have a dynamically inserted robots META tag with a NOINDEX value (a sketch follows below). To avoid disadvantages caused by search engines filtering out duplicated content, treat every post as a text snippet, and make absolutely sure that there is no more than ONE indexable URL pointing to a page containing each snippet. To enhance the crawlability of your forum, and to ensure search engines cannot index duplicated content, make creative use of search engine friendly cloaking.
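As an illustration, a dynamically inserted robots META tag could look like the PHP sketch below, placed in the head section of the page template; the postid parameter name is hypothetical, adapt it to your forum software's URL scheme:

<?php
// assumption: single-post views are requested as showpost.php?postid=123
if (isset($_GET['postid'])) {
  // keep post-ID pages out of the index, but let crawlers follow the links
  echo '<META NAME="robots" CONTENT="NOINDEX,FOLLOW">' . "\n";
}
?>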


Dynamic sites of any size

With a dynamic Web site do not use a 3rd party tool to create and update your Google Sitemap. Provide a dynamic database driven XML sitemap instead. If you don't use a content management system with built-in creation of XML sitemap files, write the script yourself, or hire a programmer (see dynamic Google Sitemaps design patterns). It's worth the effort, because with 3rd party sitemap tools you put your search engine placements at risk, and most 3rd party tools handle dynamic content poorly. There are countless good reasons not to use external sitemap generators --not even Google's own script!-- with dynamic sites. I'd have to write a book to list them all, so just trust me and do not use external tools to create a Google Sitemap for your dynamic Web site!
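A database driven sitemap script boils down to a query and a loop. The PHP sketch below is a bare-bones illustration, not a drop-in solution: the pages table, its url, last_modified and indexable columns, and the PDO connection string are assumptions you'd replace with your own schema, and the xmlns value should be checked against Google's documentation:

<?php
// sitemap.php -- emits an XML sitemap straight from the content database
header("Content-Type: text/xml; charset=UTF-8");

// assumption: a 'pages' table with 'url', 'last_modified' and 'indexable' columns
$db   = new PDO("mysql:host=localhost;dbname=cms", "user", "password");
$rows = $db->query("SELECT url, last_modified FROM pages WHERE indexable = 1");

echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">' . "\n";
foreach ($rows as $row) {
  // escape ampersands and other special characters for valid XML
  echo " <url>\n";
  echo "  <loc>" . htmlspecialchars($row['url']) . "</loc>\n";
  echo "  <lastmod>" . date("Y-m-d", strtotime($row['last_modified'])) . "</lastmod>\n";
  echo " </url>\n";
}
echo "</urlset>\n";
?>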


Free hosted stuff

Well, my usual advice is to get a domain and a host, but reputable free hosting makes sense for mom and pop sites and hobby bloggers, and their content often deserves nice search engine placements. Unfortunately, in some situations free hosting and Google Sitemaps are not compatible, that is, you'll have to live without a Google Sitemap.

Free hosted content management systems like Blogger don't allow the upload of XML sitemaps, but they may create a feed. Google's sitemap program accepts most feeds, so just submit your RSS or ATOM feed. This will not cover updates of older posts, but if the main page is popular and provides links to the archives, Googlebot will crawl all posts frequently.

If the free host adds nasty ads to every HTML page, try to create a plain text file with a list of your URLs, each URL on a new line. Google accepts every file extension for text sitemaps, so try different extensions until you find one your free host doesn't touch. I haven't tested whether Google accepts the misuse of common extensions, but perhaps using .gif or .jpeg will do the trick.
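Such a plain text sitemap is nothing more than a UTF-8 encoded file with one fully qualified URL per line and nothing else, for example:

http://www.example.com/
http://www.example.com/about.htm
http://www.example.com/articles/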





What is the time table of a Google Sitemap submission?


Monday, December 05, 2005


Creating the Google Sitemap

It makes sense to choose the XML format, and to develop a suitable procedure to automate sitemap updates on page updates. It is absolutely necessary to double check the sitemap's contents to avoid unintended submissions, e.g. different URLs pointing to the same page like http://www.example.com/ (valid URL) and http://www.example.com/index.html (invalid URL).

The sitemap file(s) should be placed in the Web server's root, that is, they should be reachable via top level URIs like http://www.example.com/sitemap.xml.
Note that www.example.com is technically different from example.com, even when the hosting service has configured delivery of the same content from both the domain and the www subdomain. Stick with one version; that goes for sitemaps, internal links, and link submissions to foreign resources like directories, as well as for business cards, radio spots, and so on.


Tips: If you're keen on image and video search traffic, add your images and movie clips to the sitemap. If you provide alternative formats, for example RSS feeds, add those URLs to the sitemap too. Don't add printer friendly HTML pages and pages optimized for different browsers; those should be non-indexable, that is, they should have a robots META tag with a "NOINDEX,FOLLOW" value in the head section.
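Such a robots META tag, placed in the head section of the printer friendly template, looks like this:

<META NAME="robots" CONTENT="NOINDEX,FOLLOW">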


Validating the Google Sitemap

Open the sitemap in your browser; most browsers detect coding errors. View the source to check for proper UTF-8 encoding and escaping; for example, dynamic URLs must not contain bare ampersands (&), replace those with the entity &amp; (so http://www.example.com/page?id=1&cat=2 becomes http://www.example.com/page?id=1&amp;cat=2 in the sitemap). If the contents look fine, check the XML structure with an XML validator and correct possible errors.


Submitting the Google Sitemap

Log in to Google and add the URL of your new sitemap to your account. When you follow the link in the 'Sitemaps' column of your account's Site Overview, your sitemap should be listed as "pending". If not, for example because the URL is invalid, correct the errors and resubmit.

A pending sitemap is a URL submission received by Google, sitting in a queue awaiting further processing.


Tips: Try to submit your XML sitemap to MSN Search too, and if you have a plain URL list or RSS feed, submit it to Yahoo! Search. Resubmit on changes, because neither search engine promises to revisit periodically. Although Google downloads accepted XML sitemaps periodically, resubmissions after content changes may be a good idea, at least with Web sites which aren't updated very frequently.
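For automated resubmissions Google also accepts a simple HTTP ping carrying the sitemap's URL. A minimal PHP sketch, assuming the ping address Google documents for the sitemaps program (verify it in your account's help pages) and that allow_url_fopen is enabled on your server:

<?php
// resubmit the sitemap after a content update
$sitemap = urlencode("http://www.example.com/sitemap.xml");
$ping    = "http://www.google.com/webmasters/sitemaps/ping?sitemap=" . $sitemap;
// a simple GET request is all it takes; check the response in your logs if it fails
$result  = file_get_contents($ping);
?>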


Waiting for the submission receipt

While you're waiting for the first sitemap check by Google, you should verify your account. Just click the verify link on the overview page and follow the instructions. Ensure that your Web server handles case-sensitive file names and returns a 404 error code for requests of unknown resources. If you're a Web designer, you can add your clients' sites to your account and to the site owner's account as well.

After a few hours, or the next working day at the latest, revisit your account and check the sitemap's status. If it shows "error", click the link and follow the instructions to resolve the problem. If it shows "OK", your sitemap submission was successful.

The OK status means that Googlebot has downloaded the sitemap file and that it has passed a loose validation. At this point the file's contents have not yet been processed; that means Googlebot has not yet begun to crawl your site. Actually, you don't know whether your sitemap submission has made it into the crawler queue or not.


Waiting for Google's crawler Googlebot

Google's crawler schedule is pretty much ruled by PageRank™. That means if the average PageRank™ of your Web pages is very low, Googlebot visits only every once in a while, if ever, and doesn't crawl everything. If your overall PageRank™ is medium to high, Googlebot is the busiest visitor of your site.

So if you have a new site, relax for a few weeks, then check your server logs for visits where the HTTP user agent name matches Googlebot's user agent names, and the IP address resolves to a host name like crawl-66-249-64-44.googlebot.com (the digits represent one of Google's IP addresses and vary depending on your data center's location, because Google tries to crawl from a data center close to your Web server). From the user agent name you can't determine whether a crawl is sitemap based or not, because the regular crawling --by following links on the Web-- leaves identical footprints in your logs.
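User agent names can be spoofed, so a reverse plus forward DNS lookup is a more reliable check. A PHP sketch of the idea (don't hammer DNS on every page view; cache the results):

<?php
// verify that a visitor claiming to be Googlebot really crawls from a googlebot.com host
function is_googlebot($ip) {
  $host = gethostbyaddr($ip);                  // reverse lookup, e.g. crawl-66-249-64-44.googlebot.com
  if (!preg_match('/\.googlebot\.com$/i', $host)) {
    return false;
  }
  return gethostbyname($host) == $ip;          // forward-confirm the host name
}

$isBot = is_googlebot($_SERVER['REMOTE_ADDR']);
?>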

If you really need to know when the sitemap based crawling begins, create a non-indexable page (having a NOINDEX robots META tag) which is not linked from anywhere, and put its URI in your sitemap. When Googlebot fetches this page, you have sure proof that your sitemap's contents were transferred to Google's crawler queues (most probably the sitemap based crawling was invoked earlier).


Monitoring Google's crawling process

You have two instruments to track Google's crawling: your server logs (or a database driven spider tracker) and Google's crawler statistics. Your server logs tell you which URIs Googlebot has fetched, and what your server has delivered. To track down errors, you need the contents of your log files, because tracking software triggered by page views (= crawler fetches) --e.g. via SSI or PHP includes-- cannot log requests which weren't successful, e.g. requests of missing files, and usually fails when it comes to images or movies. If you've verified your Google Sitemap, you get a random list of HTTP errors in Google's crawler stats.

Remember crawling and indexing are different processes. The Googlebot-Sisters fetch your content and hand it over to the indexer, and they report new URIs harvested from your links back to the crawling engine. Crawlers cannot add Web objects to a search engine's index. Monitoring crawling can tell you what Google knows about your site, and Google's spiders can help with debugging to some degree, that's it.


Waiting for the results of Google's indexing process

If your popular and well ranked Web site is crawled daily, you can expect that Google's index reflects updates within a few hours, and new pages should be visible within two days at the latest. Otherwise wait a few weeks before you get nervous.

With a site search like site:example.com you can find pages from a particular domain, subdomain (prefix www., search., blog. ...), or top level domain (postfix .edu, .org, .gov, .mil ...) in Google's index. Google shows a list of pages from the site(s) matching the search term, and an estimated number of total results, which is refined on every following SERP. Google, like other search engines, doesn't show more than 1,000 results per search query. If you add a space after the colon (site: example.com), you get a snapshot of pages linking to you, ordered by site. To view suppressed (near-) duplicates, add &filter=0 to the URL in your browser's address bar. To search for URLs in a particular path you can add the subdirectory, e.g. site:www.example.com inurl:/directory/, or site:search.example.com inurl:/cgi-bin/view-result (omit the script's file extension, that is search for view-result instead of view-result.pl). Refer to "Index stats" in your sitemaps account's stats area for more examples.

Those URL searches can tell you which pages made it into the search index, but just because a URL is indexed does not mean it receives traffic from Google's SERPs. Search engine users usually don't bother with URL searches; they perform searches for keywords and keyword phrases. Having a page indexed by Google does not mean it ranks for any keyword found within its indexable contents.


Monitoring the results with Google's search query engine

The query engine is the most visible part of a search engine. It receives the submitted search query, tries to figure out what the heck a user is searching for, and delivers what it thinks are the most relevant results for the given search term. The query engine makes use of attributes stored by the indexer, for example keywords extracted from links pointing to a page, ordered word/phrase lists, assigned PageRank™, trust status, topic relevancy with regard to the search query's identified or guessed context, and so on. Also, the query engine performs a lot of filtering, e.g. omission of near duplicates, suppressing results caused by penalties for cheating or similar results hosted on related servers, and it sorts the results ordered by a lot of different criteria. Actually, it does way more neat things, and one can't say which parts of Google's ranking algorithms are run by the indexer, the query engine, or both.

Basically there are two reasons why a page attracts little or no search engine traffic for its desired keywords. Mostly the search engine optimization fails, that is, the page simply doesn't provide content matching its targeted keywords and lacks suitable recommendations - or nobody searches for the targeted keywords. To fix content optimization issues and architectural problems as well as marketing failures, hire an experienced SEO/SEM consultant. You can't gain those skills by reading freely available advice on the Web.

The second cause, unskillful SERP CTR optimization, is easy to fix. Write sexy title tags and a natural META description acting as an eye catcher on the SERPs. If your META description matches the page's contents, chances are Google uses it on the SERPs instead of a text snippet gathered from anywhere on the page, or even from a directory listing or a link pointing to the page. Descriptive titles with interesting and related text below the linked page title attract way more clicks on the SERPs than gibberish titles packed with stop words and branding, made even worse by machine generated text snippets.
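For example, a descriptive head section might read like this (the wording is made up for illustration, write your own):

<title>Google Sitemaps FAQ - Submission, Crawling and Indexing Explained</title>
<META NAME="description" CONTENT="Answers to frequently asked questions about Google Sitemaps: submission time tables, crawling, indexing, and common pitfalls.">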



To work out the time table for your Web site, decide honestly whether your site meets the prerequisites for each step, then add the appropriate processing time estimated above, and sum up all phases.



How long does it take to get indexed by Google?


Tuesday, November 01, 2005

First of all, the Google Sitemaps program is by no means a free ticket to Google's search index. It helps Google to discover new resources, and it leads Google's crawlers to updates faster than ever before, but it does not guarantee crawling or indexing. If Google refuses to crawl and/or index a Web page during the regular crawling process, a sitemap submission will not change this fully automated decision. Regardless of how Google gets alerted to a URL, be it via sitemap submission or by following links on the Web, the same rules apply. That goes for spam filters, duplicate content handling, ignoring unpromoted sites, etc. etc. - Googlebot does not eat unpopular spider food.

A new Web site is considered unpopular because it lacks link popularity. That is, if no other page on the Web donates a link to a new page, Google usually will not index it. Googlebot may fetch it every once in a while, but it will not be visible on the SERPs. This makes sound sense. First, if nobody except the site owner considers the site's content important enough to link to it, why should it be important enough to debut on the Web via Google's search results? Second, because of the overwhelming amount of spam Google has to deal with, many (not all!) new sites will not be placed on the SERPs during a probation period, which can last 6 months, a year, or even longer. This probation period is also known as the sandbox; the only escape is reputation. A Web site gains reputation when it receives well formed natural links from trusted resources.

Maintaining a Google Sitemap for a new Web site makes sense, but it will not result in indexing until Google has collected enough reputable votes for the new site. During the probation period, a Webmaster should publish fresh and unique content all day long, and acquire valuable links.

Established sites must ensure they provide (at least navigational) links to all (new) pages in a Google Sitemap, because the chances of getting unlinked pages indexed are like a snowball's chance in hell. Given reasonable linkage and unique content, updated pages can make it into the index within a few hours, and new pages get indexed within two days at the latest.

If a Google Sitemap contains a bunch of URLs which are very similar, for example differing only in a query string variable's value, sitemap based crawls omit a few URLs every now and then, but those usually get fetched later on (probably during the regular crawls).

For a brand new Google Sitemap it may take a while until the Googlebot sisters donate the first full deep crawl. The duration of this initialization seems to depend on a site's popularity, reputation, size as per sitemap vs. number of indexed pages, and other factors (no detailed research done here). Once the sitemap(s) are downloaded frequently without resubmissions, the submit machine runs smoothly. Adding URLs and changing the lastmod attribute then results in prompt fetches and near-instant indexing.



I've another question | The FAQs weren't helpful


Monday, November 14, 2005 by Sebastian

If the official Google Sitemaps FAQ and other information provided by Google, the Google Sitemaps Tutorial, and this unofficial FAQ didn't provide an answer, please submit your question here.

If your question is of common interest and the answer makes it onto this FAQ, we will not charge you for the service request (please mark those service requests with "Google Sitemaps FAQ"; we'll contact you before we start to work on your request).

Thank you.



Author: Sebastian
Last Update: Friday, October 28, 2005 [DRAFT]














Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited