The free Python Google sitemap generator can be used to create Google Sitemaps in XML format by walking the file system on the web server and scanning access logs. It requires Python version 2.2 (or compatible newer versions) installed on your server.
The free Python Google Sitemap Generator is described in the Google documentation page Using the Sitemap Generator, and can be downloaded from the SourceForge.net project Google-sitemap_gen.
It is important to use the latest version, because of bug fixes and improvements. At the moment (9 January 2006) the latest version is 1.4. The sitemap generator is written for Python version 2.2 and does not work with older versions. Python can be downloaded from python.org.
The sitemap generator collects URLs by walking the file system on the web server and by reading access log files. The resulting sitemap is an XML file, either compressed or uncompressed, in the format specified by the Google Sitemap Protocol, with full XML header.
URLs of dynamically generated pages might not appear in the resulting sitemap if the generator uses only file system walking, since it will find only the URLs of the script files used to generate those pages.
Repeated runs of the sitemap generator over the access log files can be used to update and enlarge the resulting sitemap. If the number of collected URLs exceeds the maximum of 50,000, the generator creates additional sitemap files and a sitemap index file (the sitemap index file has to be submitted from the Google Sitemaps account panel). See the Google Sitemaps group thread Sitemap gen apache log technique coupled with already existing sitemap, and the description below of the sitemap node of the config.xml file.
When reading access log entries, the sitemap generator will include in the sitemap only the URLs that return HTTP response status 200 (OK). It is thus necessary, in order to avoid inclusion of non-existent URLs, to have a website set-up that will return 404 (not found) HTTP response status for non-existent URLs, not a redirection to a page returning HTTP status 200 (OK).
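The status check described above can be sketched roughly as follows. This is only an illustration, not the code of sitemap_gen.py: the regular expression, the function name `urls_with_status_200` and the sample log lines are my own assumptions, and only the Common Logfile Format is handled.

```python
import re

# Common Logfile Format: host ident user [time] "METHOD path PROTO" status size
# (hypothetical pattern written for this sketch, not taken from sitemap_gen.py)
CLF_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

def urls_with_status_200(log_lines, base_url):
    """Return the set of full URLs whose logged response status was 200."""
    found = set()
    for line in log_lines:
        match = CLF_LINE.match(line)
        if match and match.group("status") == "200":
            found.add(base_url.rstrip("/") + match.group("path"))
    return found

sample_log = [
    '127.0.0.1 - - [09/Jan/2006:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [09/Jan/2006:10:00:05 +0000] "GET /missing.html HTTP/1.0" 404 209',
    '127.0.0.1 - - [09/Jan/2006:10:00:09 +0000] "GET /moved.html HTTP/1.0" 302 0',
]
print(urls_with_status_200(sample_log, "http://www.example.com/"))
```

Only /index.html survives here. If the server answered the request for /missing.html with a redirect to a page returning 200, the redirect target would be logged with status 200 on its own request, which is why a proper 404 set-up matters.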
When the generator uses only file system walking, the elements included in the sitemap for each URL are, besides the full URL,
lastmod with a value given by the file time stamp (GMT), and
priority with a default value of 0.5.
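To make the file system walk more concrete, here is a minimal sketch of how a URL, a GMT lastmod and the default priority could be derived for each file. The function name `walk_to_entries`, the dictionary layout and the demonstration directory are hypothetical; the real generator may build its entries differently.

```python
import os
import tempfile
import time

def walk_to_entries(root_dir, base_url, default_priority=0.5):
    """Map each file under root_dir to a dict with loc, lastmod and priority."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root_dir).replace(os.sep, "/")
            mtime = os.path.getmtime(full_path)
            entries.append({
                "loc": base_url.rstrip("/") + "/" + rel_path,
                # File time stamp converted to GMT (UTC).
                "lastmod": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(mtime)),
                "priority": default_priority,
            })
    return entries

# Demonstration on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as docroot:
    os.makedirs(os.path.join(docroot, "articles"))
    for rel in ("index.html", os.path.join("articles", "one.html")):
        with open(os.path.join(docroot, rel), "w") as handle:
            handle.write("<html></html>")
    demo_entries = walk_to_entries(docroot, "http://www.example.com/")

for entry in demo_entries:
    print(entry)
```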
If the generator uses access log files, then the priority value is given by the frequency with which a URL appears in the access logs. If the generator uses only access logs, without file system walk, file time stamps are unavailable and so there are no
lastmod elements in the resulting sitemap.
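The exact formula the generator uses to turn log frequency into a priority is not given here, so the following is only one plausible scheme, assumed for illustration: each URL's hit count is divided by the hit count of the most requested URL. The function name `priorities_from_hits` is hypothetical.

```python
from collections import Counter

def priorities_from_hits(hit_urls):
    """Scale hit counts to priorities in (0, 1], relative to the busiest URL.

    Assumed normalization for this sketch; sitemap_gen.py may compute it differently.
    """
    counts = Counter(hit_urls)
    if not counts:
        return {}
    busiest = max(counts.values())
    return {url: round(count / busiest, 2) for url, count in counts.items()}

hits = ["/a.html", "/a.html", "/a.html", "/a.html", "/b.html", "/b.html", "/c.html"]
priorities = priorities_from_hits(hits)
print(priorities)
```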
The value for the changefreq element can be specified individually for each URL by using the urllist nodes in the config.xml file. As far as I know, it cannot be specified at once for all URLs in a website.
The information specific to each website, such as the name of the sitemap file and the domain URL necessary for building the canonical URLs in the sitemap, is contained in a configuration file in XML format, usually called config.xml.
The script obtains the name of the config.xml file from the command line. For example, a command to run the generator from the same directory as
sitemap_gen.py can be
$ python sitemap_gen.py --config=/path/config.xml
Here /path/config.xml is the path name of the configuration file. On a UNIX/Linux server, the path name of the current directory can easily be found from a command window with the pwd command. A relative path name can also be used, so if sitemap_gen.py and config.xml are in the same directory, config.xml can be used in the example above as the path name of the configuration file.
Search engine notification and suppression of it for testing
After creating the sitemap file, the generator notifies Google by default using the ping method (the sitemap has to be submitted from the Google Sitemaps account). The script's search engine notification can be suppressed either from the command line by using the
--testing argument, or from the
config.xml file by using the
suppress_search_engine_notify attribute of the
site root node.
An example of suppressing search engine notification from the command line, run from the same directory as sitemap_gen.py:
$ python sitemap_gen.py --config=/path/config.xml --testing
The config.xml file
The distribution package from SourceForge.net contains an example configuration file, example_config.xml, with very good comments and explanations.
The generator script processes the
config.xml file using the SAX paradigm. SAX is an acronym for Simple API for XML, and refers to sequential, event-based parsing of an XML document: the script processes each XML element as it is encountered in the stream represented by the XML document. The
config.xml file has the following nodes with attributes.
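For readers unfamiliar with SAX, the sketch below shows event-based parsing of a small configuration with Python's standard xml.sax module. It is not the generator's own handler. The configuration content is a made-up example, and the directory attribute names `path` and `url` are my recollection of example_config.xml, so check them against the file in the distribution.

```python
import xml.sax

# Hypothetical minimal configuration; verify attribute names against example_config.xml.
CONFIG_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<site base_url="http://www.example.com/" store_into="/var/www/docroot/sitemap.xml.gz">
  <directory path="/var/www/docroot" url="http://www.example.com/" default_file="index.html" />
  <accesslog path="/var/log/apache/access.log" encoding="UTF-8" />
</site>
"""

class ConfigHandler(xml.sax.ContentHandler):
    """Record each element and its attributes in the order they are encountered."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def startElement(self, name, attrs):
        # Called once per opening tag, as the parser streams through the document.
        self.nodes.append((name, dict(attrs.items())))

handler = ConfigHandler()
xml.sax.parseString(CONFIG_XML, handler)
for name, attributes in handler.nodes:
    print(name, attributes)
```

The site element is reported first because its opening tag is the first one in the stream; the nodes it contains follow in document order.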
The site node is the single root node, which contains all the other nodes, and specifies via its attributes the domain URL and the path name for the resulting sitemap file. The first XML tag in the
config.xml file is the opening tag of the
site node and the file ends with the closing tag
</site> for this root node.
The site node has two required attributes,
base_url for the domain URL used in canonicalization of the URLs collected for the sitemap either from the walk of the web server file system or from scanning access log files, and the
store_into attribute for the path name of the resulting sitemap XML file. This resulting sitemap file can be uncompressed, with a
.xml file name extension, or compressed, with a
.xml.gz file name extension.
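How the file name extension could select compressed or uncompressed output is sketched below with Python's gzip module. The helper name `open_sitemap_for_writing` is hypothetical and the written content is only a placeholder, not a valid sitemap.

```python
import gzip
import os
import tempfile

def open_sitemap_for_writing(store_into):
    """Open store_into as gzip text when it ends in .gz, as plain text otherwise."""
    if store_into.endswith(".gz"):
        return gzip.open(store_into, "wt", encoding="utf-8")
    return open(store_into, "w", encoding="utf-8")

with tempfile.TemporaryDirectory() as out_dir:
    for file_name in ("sitemap.xml", "sitemap.xml.gz"):
        target = os.path.join(out_dir, file_name)
        with open_sitemap_for_writing(target) as handle:
            # Placeholder body; a real sitemap has a namespaced urlset with url entries.
            handle.write('<?xml version="1.0" encoding="UTF-8"?>\n<urlset></urlset>\n')

    # The compressed file starts with the gzip magic bytes and decompresses cleanly.
    with open(os.path.join(out_dir, "sitemap.xml.gz"), "rb") as raw:
        gz_magic = raw.read(2)
    with gzip.open(os.path.join(out_dir, "sitemap.xml.gz"), "rt", encoding="utf-8") as handle:
        round_trip = handle.read()

print(gz_magic, round_trip)
```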
Note that a bug in generating the compressed sitemap file was fixed in version 1.4 of this Python Google generator, so it is important to check that you are using the latest version.
The site node also has some optional attributes, which specify the level of detail in the diagnostic output that the script gives, suppression of notification to search engines (similar to the
--testing command-line argument), and the character encoding to use for URLs and file paths.
The directory nodes specify via attributes the path name of the directory where the walk of the file system on the web server starts. If URLs are dynamically generated by a CGI script file, then only the URL of that script file is added to the sitemap, without the URLs dynamically generated by query strings. In this case it is also necessary to use
accesslog nodes to scan access log files, if available.
The directory node has two required attributes, for the directory path name and for the URL corresponding to that path name. There is also the optional attribute
default_file for the index file or default file for directory URLs. Setting a default file (for example
<directory) causes URLs of the default files of that name in the specified directory and its subdirectories to be suppressed (when URLs are collected by using only file system walking on the server).
URLs to directories will have the lastmod date taken from the default file rather than from the directory itself (as explained by a Google employee in the July 2005 Google Sitemaps group thread Sitemap_gen.py v1.2).
If default_file is not specified, then both the URL to the directory and the URL to the default file will be included in the sitemap, even though they represent the same document.
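The effect of default_file on the URLs built for walked files can be illustrated with a small sketch. The function `url_for_file` is hypothetical and only models the file side of the behaviour described above: with a default file set, that file is reported under its directory URL, and without it the file keeps its own URL (in addition to the directory URL the walker reports).

```python
def url_for_file(rel_path, base_url, default_file=None):
    """Build the URL for a file path given relative to the walked directory."""
    directory, _, name = rel_path.rpartition("/")
    if default_file is not None and name == default_file:
        # The default file is represented by its directory URL only.
        rel_path = directory + "/" if directory else ""
    return base_url.rstrip("/") + "/" + rel_path

with_default = url_for_file("docs/index.html", "http://www.example.com/", "index.html")
other_page = url_for_file("docs/page.html", "http://www.example.com/", "index.html")
without_default = url_for_file("docs/index.html", "http://www.example.com/")
print(with_default, other_page, without_default, sep="\n")
```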
The accesslog nodes tell the script to scan webserver log files to extract URLs. Both Common Logfile Format (Apache default logfile) and Extended Logfile Format (IIS default logfile) can be read.
accesslog nodes have a required attribute for the path name to the log file and an optional attribute for encoding of the file if not US-ASCII.
Access log files can be globbed by using the * wildcard character, for example
<accesslog path="/pathname/www/logs/*" encoding="UTF-8" />, see the Google Sitemaps Group threads Feature Request: File Globbing for AccessLogs and Sitemap_gen.py v1.2.
The sitemap nodes tell the script to scan other Sitemap files. There is one required attribute, the path to the sitemap file. This can help when repeatedly reading the access log files to update the resulting sitemap files.
The first run of the sitemap generator is done without a sitemap node in the config.xml file. On further runs of the script that use accesslog nodes to scan the access log files, a sitemap node is added whose attribute is the path to the current sitemap file. This creates a feedback loop, and each iteration improves the sitemap. If the collected URLs exceed the maximum number for a sitemap file (50,000), then the sitemap generator script creates new sitemap files and a sitemap index file.
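The splitting at the 50,000 URL limit can be sketched as below. The output file names (sitemap1.xml, sitemap_index.xml and so on) and the function `plan_sitemap_files` are hypothetical; the names the generator actually chooses may differ.

```python
MAX_URLS_PER_SITEMAP = 50000

def plan_sitemap_files(urls, base_name="sitemap", limit=MAX_URLS_PER_SITEMAP):
    """Return (files, index_name), where files maps a file name to its URLs.

    index_name is None when everything fits into a single sitemap file.
    """
    if len(urls) <= limit:
        return {base_name + ".xml": list(urls)}, None
    files = {}
    for start in range(0, len(urls), limit):
        file_name = "%s%d.xml" % (base_name, start // limit + 1)
        files[file_name] = urls[start:start + limit]
    return files, base_name + "_index.xml"

urls = ["http://www.example.com/page%d.html" % n for n in range(120000)]
files, index_name = plan_sitemap_files(urls)
print({name: len(chunk) for name, chunk in files.items()}, index_name)
```

With 120,000 URLs this plans three sitemap files plus an index file; the index file is the one that would be submitted.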
The url and urllist nodes can be used to specify URLs with their
changefreq attributes for addition to the resulting sitemap file.
url nodes have one required attribute, that is the URL, and three optional attributes, lastmod, changefreq and priority.
urllist nodes name text files with lists of URLs and the nodes have one required attribute, the path to the file.
These text files with URL lists contain one URL per line. A line can consist of several space-delimited columns, where after a URL that is mandatory, attributes can follow in the form
key=value for lastmod, changefreq and priority.
There is an example_urllist.txt example file included in the distribution package.
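A line of such a URL list could be parsed along these lines. This is a sketch of the format as described above, not the generator's parser; the function name `parse_urllist_line` and the sample line are made up for illustration.

```python
def parse_urllist_line(line):
    """Split 'URL key=value ...' into (url, attributes); return None for a blank line."""
    columns = line.split()
    if not columns:
        return None
    url, attributes = columns[0], {}
    for column in columns[1:]:
        key, sep, value = column.partition("=")
        # Only the three attributes named in the text are kept.
        if sep and key in ("lastmod", "changefreq", "priority"):
            attributes[key] = value
    return url, attributes

sample_line = (
    "http://www.example.com/page.html "
    "lastmod=2006-01-09T10:00:00Z changefreq=monthly priority=0.8"
)
parsed = parse_urllist_line(sample_line)
print(parsed)
```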
The generator discards URLs that do not start with the domain's URL, but it does not check if a URL exists on the server.
If urllist nodes specify URLs that have the correct base URL but have never been on the server, then these URLs are included in the sitemap.
The filter nodes specify patterns that the script compares against all URLs it finds. There are
drop filters that cause exclusion of matching URLs and
pass filters that cause inclusion of matching URLs.
If no filter at all matches a URL, the URL will be included. Filters are applied in the order specified, and a
pass filter short-circuits any later filters that might also match.
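The ordering rule can be illustrated with a short sketch in which the first matching filter decides. It uses shell-style wildcard patterns from Python's fnmatch module as an assumption; the function `is_included` and the patterns are hypothetical and not copied from the generator.

```python
from fnmatch import fnmatchcase

def is_included(url, filters):
    """Apply (action, pattern) filters in order; the first match decides."""
    for action, pattern in filters:
        if fnmatchcase(url, pattern):
            # A matching pass filter includes the URL and skips later filters;
            # a matching drop filter excludes it.
            return action == "pass"
    # No filter matched at all: the URL is included.
    return True

filters = [
    ("pass", "*/archive/index.html"),
    ("drop", "*/archive/*"),
    ("drop", "*~"),
]
candidates = [
    "http://www.example.com/archive/index.html",
    "http://www.example.com/archive/old.html",
    "http://www.example.com/about.html",
    "http://www.example.com/notes.txt~",
]
decisions = {url: is_included(url, filters) for url in candidates}
for url, keep in decisions.items():
    print(keep, url)
```

The archive index page is kept because the pass filter matches first, even though the later drop filter for the archive directory would also match it.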
The free Python Google generator is relatively easy to use; no knowledge of Python is necessary. The information and sitemap requirements specific to a website can be easily included in the configuration file by using the well-commented
example_config.xml file which comes with the generator.
There are some things in the current version 1.4 that I think could be improved in future versions. For example, non-existent URLs can be included by mistake in the sitemap, as long as they have the correct base URL, via the urllist nodes.
Also, when access logs are used in creating the sitemap, if a URL has been removed during the logged interval, such that it appears in the same access log file at first with HTTP response status 200 (OK) and later with 404 (Not Found), it will still be included in the sitemap.
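One way such a case could be handled, sketched here as a suggestion and not as something the generator does, is to let the most recent status logged for each URL decide. The function `urls_ok_at_end_of_log` and the sample entries are hypothetical.

```python
def urls_ok_at_end_of_log(entries):
    """Keep only URLs whose most recent logged status is 200.

    entries is a chronologically ordered sequence of (path, status) pairs.
    """
    last_status = {}
    for path, status in entries:
        last_status[path] = status  # later entries overwrite earlier ones
    return sorted(path for path, status in last_status.items() if status == 200)

log_entries = [
    ("/old.html", 200),
    ("/index.html", 200),
    ("/old.html", 404),  # the page was removed during the logged interval
]
still_valid = urls_ok_at_end_of_log(log_entries)
print(still_valid)
```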
Another thing is that I cannot see a way of specifying the changefreq at once for all URLs in the sitemap, maybe with globbing. The changefreq element has to be specified, if used, for individual URLs via the url or urllist nodes.
Wednesday, January 11, 2006 by Cristina
Author: The Google Sitemaps Group
Last Update: December 10, 2005