Breaking down the anatomy of the Google Sitemaps submission process yields a rough timetable, and answers a lot of related questions.




Creating the Google Sitemap

It makes sense to choose the XML format, and to develop a suitable procedure which automates sitemap updates whenever pages change. It is absolutely necessary to double-check the sitemap's contents to avoid unintended submissions, e.g. different URLs pointing to the same page, like http://www.example.com/ (valid URL) and http://www.example.com/index.html (invalid URL).
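For reference, a minimal sitemap in the XML format carries one <url> element per page; the following sketch uses the 0.84 schema, which was the current Google Sitemaps format at the time of writing, and only <loc> is mandatory:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2005-10-28</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>

Note that the <loc> element carries the canonical URL, not the /index.html variant.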

The sitemap file(s) should be placed in the Web server's root, that is, they should be reachable via top-level URIs like http://www.example.com/sitemap.xml.
Note that www.example.com technically differs from example.com, even when the hosting service delivers the same content from both the domain and the www subdomain. Stick with one version; that goes for sitemaps, internal links, and link submissions to external resources like directories, as well as for business cards, radio spots, and so on.


Tips: If you're keen on image and video search traffic, add your images and movie clips to the sitemap. If you provide alternative formats, for example RSS feeds, add those URLs to the sitemap too. Don't add printer-friendly HTML pages or pages optimized for particular browsers; those should be non-indexable, that is, they should carry a robots META tag with a "NOINDEX,FOLLOW" value in their head section.
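For example, the head section of a printer-friendly page would carry:

    <meta name="robots" content="NOINDEX,FOLLOW">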


Validating the Google Sitemap

Open the sitemap with your browser; most browsers detect encoding errors. Click view source to check for proper UTF-8 encoding. Also remember that in XML dynamic URLs must not contain bare ampersands (&); replace those with the entity &amp;. If the contents look fine, check the XML structure with an XML validator and correct possible errors.
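For example, a dynamic URL like http://www.example.com/view.pl?cat=2&id=7 (a made-up script URL) must appear in the sitemap with the ampersand escaped:

    <loc>http://www.example.com/view.pl?cat=2&amp;id=7</loc>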


Submitting the Google Sitemap

Log in to Google and add the URL of your new sitemap to your account. When you follow the link in the 'Sitemaps' column of your account's Site Overview, your sitemap should be listed as "pending". If not, for example because the URL is invalid, correct the errors and resubmit.

A pending sitemap is a URL submission received by Google, sitting in a queue awaiting further processing.


Tips: Submit your XML sitemap to MSN Search too, and if you have a plain URL list or an RSS feed, submit it to Yahoo! Search as well. Resubmit on changes, because neither search engine promises to revisit periodically. Although Google downloads accepted XML sitemaps periodically, resubmissions after content changes may be a good idea, at least for Web sites which aren't updated very frequently.


Waiting for the submission receipt

While you're waiting for Google's first sitemap check, you should verify your account: just click the verify link on the overview page and follow the instructions. Ensure that your Web server can store files with case-sensitive names and returns a 404 status code for requests of unknown resources. If you're a Web designer, you can add your clients' sites to your account as well as to the site owners' accounts.
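A quick way to test the 404 behavior is to request a URL which cannot exist. Here's a minimal sketch in Python; the probe URL is made up, adapt it to your domain:

    import urllib.request, urllib.error

    # Request a resource that should not exist. A properly configured
    # server answers with HTTP 404, not with a "friendly" 200 error page.
    probe = "http://www.example.com/this-page-does-not-exist-0x29a.html"
    try:
        response = urllib.request.urlopen(probe)
        print("Got %d - the server does NOT return 404 for unknown resources"
              % response.getcode())
    except urllib.error.HTTPError as error:
        print("Got %d - %s" % (error.code,
              "fine" if error.code == 404 else "check your error handling"))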

After a few hours, or the next working day at the latest, revisit your account and check the sitemap's status. If it shows "error", click the link and follow the instructions to resolve the problem. If it shows "OK", your sitemap submission was successful.

The OK status means that Googlebot has downloaded the sitemap file and that it has passed a loose validation. At this point the file's contents have not yet been processed, which means Googlebot has not yet begun to crawl your site based on the sitemap. Actually, you don't know whether your sitemap submission has made it into the crawler queue or not.


Waiting for Google's crawler Googlebot

Google's crawler schedule is pretty much ruled by PageRank™. That means, if the average PageRank™ of your Web pages is very low, Googlebot visits every once in a while, if ever, and doesn't crawl everything. If your overall PageRank™ is medium to high, Googlebot is the busiest visitor of your site.

So if you have a new site, relax for a few weeks, then check your server logs for visits where the HTTP user agent matches one of Googlebot's user agent names, and the IP address resolves to a host name like crawl-66-249-64-44.googlebot.com (the digits stand for one of Google's IP addresses and vary with the data center, because Google tries to crawl from a data center close to your Web server). From the user agent name alone you can't determine whether a crawl is sitemap based or not, because regular crawling, by following links on the Web, leaves identical footprints in your logs.
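You can, however, verify that a crawler claiming to be Googlebot really is one: do a reverse DNS lookup of the IP address, check that the host name ends in .googlebot.com, then forward-resolve that host name and compare it against the original IP. A minimal sketch, assuming Python is available on your box:

    import socket

    def is_real_googlebot(ip):
        # Reverse lookup: genuine Googlebot IPs resolve to *.googlebot.com,
        # e.g. crawl-66-249-64-44.googlebot.com
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith(".googlebot.com"):
            return False
        # Forward-confirm: the host name must resolve back to the same IP,
        # otherwise the reverse DNS entry could be spoofed.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

    print(is_real_googlebot("66.249.64.44"))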

If you really need to know when the sitemap based crawling begins, create a non-indexable page (carrying a NOINDEX robots META tag) which is not linked from anywhere, and put its URI in your sitemap. When Googlebot fetches this page, you have proof that your sitemap's contents were transferred to Google's crawler queues (most probably sitemap based crawling was invoked even earlier).
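To spot the fetch of such a beacon page, scan your raw access log for its URI; a minimal sketch, where both the log path and the beacon URI are made-up examples:

    # Print every request for the unlinked beacon page.
    with open("/var/log/apache/access.log") as logfile:
        for line in logfile:
            if "/sitemap-beacon.html" in line:
                print(line.rstrip())

Combine this with the Googlebot verification above if you've submitted the URI to other search engines too and want to know which spider fetched it.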


Monitoring Google's crawling process

You have two instruments to track Google's crawling: your server logs (or a database driven spider tracker), and Google's crawler statistics. Your server logs tell you which URIs Googlebot has fetched and what your server has delivered. To track down errors, you need the contents of your raw log files, because tracking software triggered by page views (that is, crawler fetches), e.g. via SSI or PHP includes, cannot log requests which weren't successful, e.g. requests of missing files, and it usually fails when it comes to images or movies. If you've verified your Google Sitemap, you get a random list of HTTP errors in Google's crawler stats.
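For a first overview of what your server delivered to Googlebot, a simple tally of HTTP status codes from the raw log will do; this sketch assumes the common combined log format and an example log path:

    import re
    from collections import Counter

    # Combined log format: ... "GET /path HTTP/1.1" 200 ...
    request = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+" (\d{3})')
    codes = Counter()
    with open("/var/log/apache/access.log") as logfile:
        for line in logfile:
            if "Googlebot" not in line:
                continue
            match = request.search(line)
            if match:
                codes[match.group(2)] += 1  # tally by status code
    for code, count in codes.most_common():
        print(code, count)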

Remember that crawling and indexing are different processes. The Googlebot-Sisters fetch your content and hand it over to the indexer, and they report new URIs harvested from your links back to the crawling engine. Crawlers cannot add Web objects to a search engine's index. Monitoring crawling can tell you what Google knows about your site, and Google's spiders can help with debugging to some degree; that's it.


Waiting for the results of Google's indexing process

If your popular and well ranked Web site is crawled daily, you can expect that Google's index reflects updates within a few hours, and new pages should be visible within two days at the latest. Otherwise wait a few weeks before you get nervous.

With a site search like site:example.com you can find pages from a particular domain, subdomain (prefix www., search., blog. ...), or top-level domain (suffix .edu, .org, .gov, .mil ...) in Google's index. Google shows a list of pages from the matching site(s) and an estimated total number of results, which is refined on every following SERP. Google, like other search engines, doesn't show more than 1,000 results per search query.

If you add a space after the colon (site: example.com), you get a snapshot of pages linking to you, ordered by site. To view suppressed (near-)duplicates, add &filter=0 to the URL in your browser's address bar. To search for URLs in a particular path, add the subdirectory, e.g. site:www.example.com inurl:/directory/, or site:search.example.com inurl:/cgi-bin/view-result (omit the script's file extension, that is, search for view-result instead of view-result.pl). Refer to "Index stats" in your sitemaps account's stats area for more examples.
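A few of those searches side by side (example.com stands for your domain):

    site:example.com                                      all indexed pages of the domain and its subdomains
    site:www.example.com inurl:/directory/                indexed pages under /directory/ on the www host
    site:search.example.com inurl:/cgi-bin/view-result    indexed URLs of a particular script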

Those URL searches can tell you which pages made it into the search index, but just because a URL is indexed doesn't mean it receives traffic from Google's SERPs. Search engine users usually don't bother with URL searches; they search for keywords and keyword phrases. Having a page indexed by Google does not mean it ranks for any keyword found within its indexable contents.


Monitoring the results with Google's search query engine

The query engine is the most visible part of a search engine. It receives the submitted search query, tries to figure out what the heck a user is searching for, and delivers what it thinks are the most relevant results for the given search term. The query engine makes use of attributes stored by the indexer, for example keywords extracted from links pointing to a page, ordered word/phrase lists, assigned PageRank™, trust status, topic relevancy with regard to the search query's identified or guessed context, and so on. Also, the query engine performs a lot of filtering, e.g. omission of near duplicates, suppression of results caused by penalties for cheating or of similar results hosted on related servers, and it sorts the results by a lot of different criteria. Actually, it does way more neat things, and one can't say which parts of Google's ranking algorithms are run by the indexer, the query engine, or both.

Basically there are two reasons why a page attracts no or little search engine traffic for its desired keywords. Mostly the search engine optimization fails, that is, the page simply doesn't provide contents matching its targeted keywords and lacks suitable recommendations, or nobody searches for the targeted keywords. To fix content optimization issues and architectural problems as well as marketing failures, hire an experienced SEO/SEM consultant; you can't gain those skills by reading freely available advice on the Web.

The second cause, unskillful SERP CTR optimization, is easy to fix. Write sexy title tags and a natural META description acting as an eye catcher on the SERPs. If your META description matches the page's contents, chances are Google uses it on the SERPs instead of a text snippet gathered from anywhere on the page, or even from a directory listing or a link pointing to the page. Descriptive titles with interesting, related text below the linked page title attract way more clicks on the SERPs than gibberish titles packed with stop words and branding, made even worse by machine generated text snippets.
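An illustrative head section; the title and description below are made up:

    <title>How Long Does Google Take to Index a New Page?</title>
    <meta name="description" content="A step by step timetable from sitemap submission to crawling and indexing, with tips to speed up each phase.">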



To work out the timetable for your Web site, decide honestly whether your site meets the prerequisites for each step, then add the appropriate processing time estimated above, and sum up all phases.


Monday, December 05, 2005




Author: Sebastian
Last Update: Friday, October 28, 2005 [DRAFT]




Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited