How to Make Use of Google SiteMaps


Scheduling batch jobs to generate RSS feeds and similar artifacts like the sitemap.xml file is far too complex a procedure for such a simple task, and this approach is fault-prone. Better to implement your sitemap generator as a dynamic XML file, that is, a script reflecting the current state of your web site on each request [1]. After submitting a sitemap to Google, you don't know when Googlebot will find the time to crawl your web site. Most probably you'll release a lot of content changes between the resubmission and Googlebot's visit. Also, crawlers of other search engines may become interested in your XML sitemap in the future. There are other advantages too, so you really should ensure that your sitemap reflects the current state of your web site every time a web robot fetches it.

You can use any file name for your sitemap. Google accepts whatever you submit; 'sitemap.xml' is just the default. So you can go with 'sitemap.php', 'sitemap.asp', 'mysitemap.xhtml' or whatever fits your scripting language, as long as the content is valid XML. However, there are good reasons to stick with the default 'sitemap.xml'. Here is an example for Apache/PHP:


Configure your web server to parse .xml files as PHP, e.g. by adding this directive to your root's .htaccess file:

AddType application/x-httpd-php .htm .xml .rss

Now you can use PHP in all .php, .htm, .xml and .rss files, and http://www.yourdomain.com/sitemap.xml behaves like any other PHP script. Note: static XML files will now produce a PHP parse error, because with short open tags enabled the literal XML declaration `<?xml` is mistaken for a PHP open tag.
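If a static .xml file must live alongside parsed ones, the clash can be side-stepped by emitting the XML declaration from PHP instead of writing it literally; a minimal sketch:

```php
<?php
// When .xml files are parsed by PHP and short_open_tag is enabled,
// a literal "<?xml" is mistaken for a PHP open tag. Printing the
// declaration from PHP avoids the parse error:
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
?>
```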



You don't need XML software to produce the pretty simple XML of Google's sitemap protocol. The PHP example below should be easy to understand, even if you prefer another programming language. Error handling and elegant structure were omitted to keep the hierarchical XML structure transparent and understandable.


$isoLastModifiedSite = "";
$newLine = "\n";
$indent = " ";
if (empty($rootUrl)) $rootUrl = "http://www.yourdomain.com";

$xmlHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>$newLine";


$urlsetOpen = "<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"
xsi:schemaLocation=\"http://www.google.com/schemas/sitemap/0.84
http://www.google.com/schemas/sitemap/0.84/sitemap.xsd\">$newLine";
$urlsetValue = "";
$urlsetClose = "</urlset>$newLine";

function makeUrlString ($urlString) {
    return htmlentities($urlString, ENT_QUOTES, 'UTF-8');
}

function makeIso8601TimeStamp ($dateTime) {
    if (!$dateTime) {
        $dateTime = date('Y-m-d H:i:s');
    }
    if (is_numeric(substr($dateTime, 11, 1))) {
        $isoTS = substr($dateTime, 0, 10) ."T"
                 .substr($dateTime, 11, 8) ."+00:00";
    }
    else {
        $isoTS = substr($dateTime, 0, 10);
    }
    return $isoTS;
}

function makeUrlTag ($url, $modifiedDateTime, $changeFrequency, $priority) {
    global $newLine;
    global $indent;
    global $isoLastModifiedSite;
    $urlOpen = "$indent<url>$newLine";
    $urlValue = "";
    $urlClose = "$indent</url>$newLine";
    $locOpen = "$indent$indent<loc>";
    $locValue = "";
    $locClose = "</loc>$newLine";
    $lastmodOpen = "$indent$indent<lastmod>";
    $lastmodValue = "";
    $lastmodClose = "</lastmod>$newLine";
    $changefreqOpen = "$indent$indent<changefreq>";
    $changefreqValue = "";
    $changefreqClose = "</changefreq>$newLine";
    $priorityOpen = "$indent$indent<priority>";
    $priorityValue = "";
    $priorityClose = "</priority>$newLine";

    $urlTag = $urlOpen;
    $urlValue     = $locOpen .makeUrlString("$url") .$locClose;
    if ($modifiedDateTime) {
        $urlValue .= $lastmodOpen .makeIso8601TimeStamp($modifiedDateTime) .$lastmodClose;
        if (!$isoLastModifiedSite) { // last modification of web site
            $isoLastModifiedSite = makeIso8601TimeStamp($modifiedDateTime);
        }
    }
    if ($changeFrequency) {
        $urlValue .= $changefreqOpen .$changeFrequency .$changefreqClose;
    }
    if ($priority) {
        $urlValue .= $priorityOpen .$priority .$priorityClose;
    }
    $urlTag .= $urlValue;
    $urlTag .= $urlClose;
    return $urlTag;
}
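A quick usage sketch (the URL and timestamp are made up; makeUrlTag() and its helpers come from the listing above):

```php
// Hypothetical page data, passed through the functions defined above.
print makeUrlTag('http://www.yourdomain.com/page.htm?id=1&lang=en',
                 '2005-06-04 08:31:12', 'weekly', '0.8');
// Emits a <url> block with the ampersand entity-encoded in <loc>
// and the timestamp converted to 2005-06-04T08:31:12+00:00.
```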


Now fetch the URLs from your database. It's a good idea to have a boolean attribute to exclude particular pages from the sitemap. Also, you should have an indexed date-time attribute storing the last modification. Your content management system should enable the attributes ChangeFrequency, Priority, PageInSitemap and perhaps even LastModified on the user interface. Example query: "SELECT pageUrl, pageLastModified, pagePriority, pageChangeFrequency from pages WHERE pages.pageSiteMap = 1 AND pages.pageActive = 1 AND pages.pageOffsite <> 1 ORDER BY pages.pageLastModified DESC". Loop:


$urlsetValue .= makeUrlTag ($pageUrl, $pageLastModified, $pageChangeFrequency, $pagePriority);
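Putting the query and the loop together might look like this sketch; the connection credentials are placeholders, and the column names follow the example query above:

```php
// Sketch only: hypothetical credentials and the example 'pages' table.
$db = new mysqli('localhost', 'user', 'password', 'cms');
$result = $db->query(
    "SELECT pageUrl, pageLastModified, pagePriority, pageChangeFrequency
     FROM pages
     WHERE pageSiteMap = 1 AND pageActive = 1 AND pageOffsite <> 1
     ORDER BY pageLastModified DESC");
while ($row = $result->fetch_assoc()) {
    $urlsetValue .= makeUrlTag($row['pageUrl'], $row['pageLastModified'],
                               $row['pageChangeFrequency'], $row['pagePriority']);
}
```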


After the loop you can add a few templated pages/scripts which are not stored as content pages, for example a "what's new" page that changes with every page modification:


if (!$isoLastModifiedSite) { // last modification of web site
    $isoLastModifiedSite = makeIso8601TimeStamp(date('Y-m-d H:i:s'));
}
$urlsetValue .= makeUrlTag ("$rootUrl/what-is-new.htm", $isoLastModifiedSite, "daily", "1.0");


Now write the complete XML. When dealing with a large number of pages, you should print each <url> tag as it is generated, followed by a flush(). If you publish tens of thousands of pages, provide multiple sitemaps and a sitemap index. Each sitemap file you provide must contain no more than 50,000 URLs and must be no larger than 10MB.
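A sitemap index is a small XML file of its own that lists the individual sitemap files; the file names below are made-up placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
    <sitemap>
        <loc>http://www.yourdomain.com/sitemap-archive.xml</loc>
        <lastmod>2005-05-29T00:00:00+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.yourdomain.com/sitemap-fresh.xml</loc>
        <lastmod>2005-06-04T08:31:12+00:00</lastmod>
    </sitemap>
</sitemapindex>
```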


header('Content-type: application/xml; charset="utf-8"',true);
print "$xmlHeader
$urlsetOpen
$urlsetValue
$urlsetClose
";
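For very large sites, the buffered print above can be replaced by streaming output; a sketch, assuming the `$result` set and the functions from the listings above are in scope:

```php
// Streaming variant (sketch): emit each <url> tag as soon as it is built,
// instead of collecting everything in $urlsetValue first.
header('Content-type: application/xml; charset="utf-8"', true);
print $xmlHeader;
print $urlsetOpen;
while ($row = $result->fetch_assoc()) {
    print makeUrlTag($row['pageUrl'], $row['pageLastModified'],
                     $row['pageChangeFrequency'], $row['pagePriority']);
    flush(); // push the tag to the client rather than buffering the whole file
}
print $urlsetClose;
```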


Google will process all <url> entries where the URL begins with the location of the sitemap file. If your web site is distributed over many domains, provide a sitemap per domain. Subdomains and the 'www' prefix are treated as separate domains: URLs like 'http://www.domain.us/page' are not valid in a sitemap located on 'http://domain.us/'. The script's output should look something like this:


<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
    <url>
        <loc>http://www.smart-it-consulting.com/</loc>
        <lastmod>2005-06-04T00:00:00+00:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.6</priority>
    </url>
    <url>
        <loc>http://www.smart-it-consulting.com/database/progress-database-design-guide/</loc>
        <lastmod>2005-06-04T00:00:00+00:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>1.0</priority>
    </url>
    <url>
        <loc>http://www.smart-it-consulting.com/catindex.htm?node=2</loc>
        <lastmod>2005-05-31T00:00:00+00:00</lastmod>
        <priority>0.5</priority>
    </url>
    <url>
        <loc>http://www.smart-it-consulting.com/what-is-new.htm</loc>
        <lastmod>2005-06-04T08:31:12+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
</urlset>


Feel free to use and customize the code above. If you do so, put this comment into each source code file that contains our code:


COPYRIGHT (C) 2005 BY SMART-IT-CONSULTING.COM
* Do not remove this header
* This program is provided AS IS
* Use this program at your own risk
* Don't publish this code, link to http://www.smart-it-consulting.com/ instead







[1] On large sites it may be a good idea to run the script querying the database on another machine to avoid web server slowdowns. Using the sitemap index file creatively can also help: reserve one or more dynamic sitemap files for fresh content and provide static sitemaps, updated weekly or so, containing all URLs. The sitemap tag of the sitemap index offers a lastmod tag to tell Google which sitemaps were modified since the last download. Use this tag to avoid downloads of unchanged static sitemaps.




Author: Sebastian
Last Update: Saturday, June 04, 2005





Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited · Contact · Privacy