How to Make Use of Google SiteMaps
Scheduling batch jobs to generate RSS feeds and similar artifacts like the sitemap.xml file is way too complex a procedure for such a simple task, and this approach is fault-prone. Better to implement your sitemap generator as a dynamic XML file, that is, a script reflecting the current state of your web site on each request [1]. After submitting a sitemap to Google, you don't know when Googlebot will find the time to crawl your web site. Most probably you'll release a lot of content changes between the resubmission and Googlebot's visit. Also, crawlers of other search engines may be interested in your XML sitemap in the future. There are other advantages too, so you really should ensure that your sitemap reflects the current state of your web site every time a web robot fetches it.
You can use any file name for your sitemap. Google accepts whatever you submit; 'sitemap.xml' is just a default. So you can go for 'sitemap.php', 'sitemap.asp', 'mysitemap.xhtml' or whatever scripting language you prefer, as long as the content is valid XML. However, there are good reasons to stick with the default 'sitemap.xml'. Here is an example for Apache/PHP:
Configure your web server to parse .xml files for PHP, e.g. by adding this statement to your root's .htaccess file:

AddType application/x-httpd-php .htm .xml .rss

Now you can use PHP in all .php, .htm, .xml and .rss files, and http://www.yourdomain.com/sitemap.xml behaves like any other PHP script. Note: static XML files will then produce a PHP parse error, because PHP (with short_open_tag enabled) mistakes the '<?xml' declaration for a short open tag.
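To dodge that clash in your own script, emit the XML declaration from PHP instead of writing it literally into the file; a minimal sketch, and exactly what the $xmlHeader variable further below does:

<?php
// Printing the declaration keeps a literal '<?xml' out of the parsed
// source, so PHP never mistakes it for a short open tag.
print '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
?>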
You don't need XML software to produce the pretty simple XML of Google's sitemap protocol. The PHP example below should be easy to understand, even if you prefer another programming language. Error handling and elegant programming have been omitted to keep the hierarchical XML structure transparent and understandable.
$isoLastModifiedSite = "";
$newLine = "\n";
$indent = " ";
if (!$rootUrl) $rootUrl = "http://www.yourdomain.com";

// XML declaration and the <urlset> wrapper of the sitemap protocol 0.84
$xmlHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>$newLine";
$urlsetOpen = "<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\"
 xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"
 xsi:schemaLocation=\"http://www.google.com/schemas/sitemap/0.84
 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd\">$newLine";
$urlsetValue = "";
$urlsetClose = "</urlset>$newLine";

// Escape a URL for use as XML element content
function makeUrlString ($urlString) {
    return htmlentities($urlString, ENT_QUOTES, 'UTF-8');
}

// Convert a 'Y-m-d H:i:s' date-time (or a date-only 'Y-m-d' value)
// to the W3C datetime format; empty input defaults to now
function makeIso8601TimeStamp ($dateTime) {
    if (!$dateTime) {
        $dateTime = date('Y-m-d H:i:s');
    }
    if (is_numeric(substr($dateTime, 11, 1))) {
        // full timestamp: keep the time part, assume UTC
        $isoTS = substr($dateTime, 0, 10) . "T"
               . substr($dateTime, 11, 8) . "+00:00";
    }
    else {
        // date only
        $isoTS = substr($dateTime, 0, 10);
    }
    return $isoTS;
}

// Build one complete <url> element; empty optional values are omitted
function makeUrlTag ($url, $modifiedDateTime, $changeFrequency, $priority) {
    global $newLine;
    global $indent;
    global $isoLastModifiedSite;
    $urlOpen         = "$indent<url>$newLine";
    $urlClose        = "$indent</url>$newLine";
    $locOpen         = "$indent$indent<loc>";
    $locClose        = "</loc>$newLine";
    $lastmodOpen     = "$indent$indent<lastmod>";
    $lastmodClose    = "</lastmod>$newLine";
    $changefreqOpen  = "$indent$indent<changefreq>";
    $changefreqClose = "</changefreq>$newLine";
    $priorityOpen    = "$indent$indent<priority>";
    $priorityClose   = "</priority>$newLine";

    $urlTag   = $urlOpen;
    $urlValue = $locOpen . makeUrlString("$url") . $locClose;
    if ($modifiedDateTime) {
        $urlValue .= $lastmodOpen . makeIso8601TimeStamp($modifiedDateTime) . $lastmodClose;
        if (!$isoLastModifiedSite) { // last modification of the web site
            $isoLastModifiedSite = makeIso8601TimeStamp($modifiedDateTime);
        }
    }
    if ($changeFrequency) {
        $urlValue .= $changefreqOpen . $changeFrequency . $changefreqClose;
    }
    if ($priority) {
        $urlValue .= $priorityOpen . $priority . $priorityClose;
    }
    $urlTag .= $urlValue;
    $urlTag .= $urlClose;
    return $urlTag;
}
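For example, makeIso8601TimeStamp() maps a database-style timestamp to the W3C datetime format the protocol expects, and passes date-only values through unchanged:

print makeIso8601TimeStamp("2005-06-04 08:31:12"); // 2005-06-04T08:31:12+00:00
print makeIso8601TimeStamp("2005-05-31");          // 2005-05-31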
Now fetch the URLs from your database. It's a good idea to have a boolean attribute that excludes particular pages from the sitemap, and you should have an indexed date-time attribute storing the last modification. Your content management system should expose the attributes ChangeFrequency, Priority, PageInSitemap and perhaps even LastModified in the user interface. Example query:

SELECT pageUrl, pageLastModified, pagePriority, pageChangeFrequency
FROM pages
WHERE pages.pageSiteMap = 1
  AND pages.pageActive = 1
  AND pages.pageOffsite <> 1
ORDER BY pages.pageLastModified DESC

Loop over the result set:
$urlsetValue .= makeUrlTag ($pageUrl, $pageLastModified, $pageChangeFrequency, $pagePriority);
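Putting the query and the loop together, here is a minimal sketch assuming the classic mysql_* API and that pageUrl stores fully qualified URLs; adjust both to your environment:

$query = "SELECT pageUrl, pageLastModified, pagePriority, pageChangeFrequency"
       . " FROM pages WHERE pages.pageSiteMap = 1 AND pages.pageActive = 1"
       . " AND pages.pageOffsite <> 1 ORDER BY pages.pageLastModified DESC";
$result = mysql_query($query);
while ($row = mysql_fetch_assoc($result)) {
    $urlsetValue .= makeUrlTag($row['pageUrl'], $row['pageLastModified'],
                               $row['pageChangeFrequency'], $row['pagePriority']);
}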
After the loop you can add a few templated pages or scripts that are not stored as content pages, whether or not they change on each page modification:
if (!$isoLastModifiedSite) { // last modification of the web site
    $isoLastModifiedSite = makeIso8601TimeStamp(date('Y-m-d H:i:s'));
}
$urlsetValue .= makeUrlTag("$rootUrl/what-is-new.htm", $isoLastModifiedSite, "daily", "1.0");
Now write the complete XML. When dealing with a larger number of pages, you should print each <url> tag as it is generated, followed by a flush(), instead of buffering the whole document in a string (a sketch follows the print statements below). If you publish tens of thousands of pages, you should provide multiple sitemaps and a sitemap index. Each sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 10MB.
header('Content-type: application/xml; charset="utf-8"', true);
print $xmlHeader;
print $urlsetOpen;
print $urlsetValue;
print $urlsetClose;
Google will process all <url> entries where the URL begins with the URL of the sitemap file's location. If your web site is distributed over many domains, provide a sitemap per domain. Subdomains and the 'www' prefix are treated as separate domains: URLs like 'http://www.domain.us/page' are not valid in a sitemap located at 'http://domain.us/'. The script's output should look something like this:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
<url>
<loc>http://www.smart-it-consulting.com/</loc>
<lastmod>2005-06-04T00:00:00+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>http://www.smart-it-consulting.com/database/progress-database-design-guide/</loc>
<lastmod>2005-06-04T00:00:00+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>http://www.smart-it-consulting.com/catindex.htm?node=2</loc>
<lastmod>2005-05-31T00:00:00+00:00</lastmod>
<priority>0.5</priority>
</url>
<url>
<loc>http://www.smart-it-consulting.com/what-is-new.htm</loc>
<lastmod>2005-06-04T08:31:12+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
</urlset>
Feel free to use and customize the code above. If you do so, put this comment into each source code file containing our stuff:
COPYRIGHT (C) 2005 BY SMART-IT-CONSULTING.COM
* Do not remove this header
* This program is provided AS IS
* Use this program at your own risk
* Don't publish this code, link to http://www.smart-it-consulting.com/ instead
[1] On large sites it may be a good idea to run the script that queries the database on another machine, to avoid web server slowdowns. Using the sitemap index file creatively can help, too: reserve one or more dynamic sitemap files for fresh content and provide static sitemaps, updated weekly or so, containing all URLs. The <sitemap> tag of the sitemap index offers a <lastmod> tag to tell Google which sitemaps were modified since the last download. Use this tag to avoid downloads of unchanged static sitemaps.
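A matching sitemap index might look like this; the file names are made up for illustration:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
 <sitemap>
  <loc>http://www.yourdomain.com/sitemap-fresh.xml</loc>
  <lastmod>2005-06-04T08:31:12+00:00</lastmod>
 </sitemap>
 <sitemap>
  <loc>http://www.yourdomain.com/sitemap-archive.xml</loc>
  <lastmod>2005-05-29T00:00:00+00:00</lastmod>
 </sitemap>
</sitemapindex>

Since the static archive sitemap's <lastmod> changes only on its weekly rebuild, downloads of the unchanged file can be avoided in between.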
Author: Sebastian
Last Update: Saturday, June 04, 2005