Selected blog posts on search engine friendly Web development and other topics, some of them scraped from Sebastian's Pamphlets. Subscribe to the feed to get alerted on new posts:


Unrelated Links in Ads are Dangerous

A popular Web site selling links (in ads) can see its link value devalued with regard to search engine rankings. Google applies nofollow logic to sites selling (unrelated) ads, which means links from those sites do not pass PageRank, reputation, or topical relevancy via anchor text. [Shrink]

About Directory Structures and File Names in URLs

Should I provide meaningful URLs? Absolutely, if your information architecture and its technical implementation allow the use of keyword-rich hyphenated URLs. But bear in mind that URLs are 'unchangeable', so first develop a suitable information architecture and a flexible Web site structure. You'll learn that folders and URLs are the last thing to think of. [Shrink]

How to buy and sell (text) links

Text link brokers do a good job by connecting related Web sites and handling their traffic deals. It looks like a win-win situation, but there are pitfalls with regard to search engines, which try to devalue all non-editorial links because their ranking algorithms aren't perfect. Commercial link trades aren't evil, but they are risky without a link condom. [Shrink]

The Top-5 Methods to Attract Search Engine Spiders

There is no such thing as a backdoor to a search engine's index. Opening the front door on the other hand is labor intensive. [Shrink]

Pingable Fresh Content is King

The ability to ping major search engines may have future impact on search engine placements. [Shrink]

Examples of Legitimate Cloaking

Cloaking per se is not penalized by search engines. Search engines consider intentions. Search engines even encourage Webmasters to cloak for improved spider-friendliness. Here is a tiny guide to search engine friendly cloaking. [Shrink]

How to Gain Trusted Connectivity

The bad news of the search year 2005 is that link development has become a very expensive and labor intensive task, requiring outstanding knowledge and experience. The good news is that, post Jagger, a handful of trusted inbound links count for more than the gazillions of artificially traded links of the past. [Shrink]

Avoid Unintended Delivery of Duplicated Content

Common navigation links on dynamic pages can produce (partial) duplicated content (identical body text served from different URLs). To prevent search engine algos from filtering or even penalizing these URLs, eliminate the overlapping content. [Shrink]

Thoughts on Duplicate Content Issues with Search Engines

What looks like a GoogleBug may just be a sign of a few more GoogleBots applying well known dupe filters to a larger data set. Or why unique content does not equal unique information. [Shrink]

Feed Duplicate Content Filters Properly

Search engines produce way too much collateral damage in their war on index spam. Especially eCommerce sites suffer from harsh duplicate content filtering. Here is a concept which may help to make SE filters act as they are supposed to: in the interest of both the search engine user and the site owner. [Shrink]

Revamp Your Framed Pages

Shoving "Frames are evil" and "Fixed navigation is user friendly" under one brilliant hat. [Shrink]

Take Free SEO Advice With a Grain of Salt

About bad, misleading, and mistakable free advice on search engine optimization. [Shrink]

Green Tranquilizes

About Google's Toolbar-PageRank. [Shrink]

Googlebots go Fishing with Sitemaps

How the Googlebot sisters go fishing. [Shrink]

Mozilla-Googlebot Helps with Debugging

Tracking Googlebot-Mozilla is a great way to discover bugs in CMS scripts. [Shrink]

Bait Googlebot With RSS Feeds

How to use Google's Personalized Home Page as RSS submitter leading Googlebot to fresh content. [Shrink]

Automated Link Swaps Decrease SE Traffic

Stay away from automated link exchange services. Don't trust sneaky sales pitches trying to talk you into risky link swaps. Systematic link patterns get your Web site penalized or even banned by search engines. [Shrink]

The value of links from a search engine's perspective

What is a link worth with regard to search engine rankings, and what kind of links should you hunt for if you're not the WSJ? [Shrink]

Prevent Your Unique Content From Scraping

You can't protect your content technically once it's available on the Internet. However, besides enforcing the law to its full extent, you have other options, which can even become amusing. [Shrink]

Spam Detection is Unethical

About search engines tolerating hardcore cloaking. [Shrink]

A SEO Strategy for Consulting Firms

This article outlines a search engine marketing strategy for consultants, tax advisors, lawyers, audit firms, engineers and similar service enterprises. It tells the reader how to drive highly targeted organic search engine traffic to a consultant's Web site. Asking why most consulting firms lack search engine visibility leads to a simple conclusion: they hide themselves on the Web, and actively prevent search engines from ranking their Web sites in top spots on the search result pages. Once this mentality gets recognized, its unwanted consequences can be eliminated with ease. [Shrink]

Optimizing the Number of Words per Page

About so-called 'content sites' optimized for contextual advertising. [Shrink]

Yahoo! Site Explorer BETA - First Impressions

Yahoo's site explorer is a great tool for folks keen on linkage data. Here is a quick rundown on its Web interface and the API. [12/06/2005: Y!SE was updated and greatly improved] [Shrink]

Unrelated Links in Ads are Dangerous


Friday, August 26, 2005

At O'Reilly Radar, Google engineer Matt Cutts confirms that paid links (in unrelated advertisements) reduce the trust rating Google assigns to a site's linkage, which results in powerless links.

Links from an untrusted site do not pass PageRank, reputation, or theme/topic relevancy via anchor text. This applies not only to external links; most probably internal links are devalued too. The effect is hard to discover, because one must validate the PageRank of all involved pages, as well as their rankings for particular search terms, to locate the dead ends.

Matt Cutts states: "Google's view on this is ...selling links muddies the quality of the web and makes it harder for many search engines (not just Google) to return relevant results. The rel=nofollow attribute is the correct answer: any site can sell links, but a search engine will be able to tell that the source site is not vouching for the destination page."

So what should a site owner do if a paid ad leads to a great resource which is related to the site's overall theme? Labeling this link with rel=nofollow is no option. Unfortunately, Matt Cutts doesn't explain this special case in his post, but I think he would agree. It looks like Google must get alerted in some way to reduce a site's linking trust status. So if all unrelated ads are dead ends, it should be safe to vouch for the related ads.

My personal statement is a plain "Don't sell links for passing PageRank™. Never. Period.", but the intention of ad space purchases isn't always that clear. If an ad wasn't related to my content, I used to put client-side promotional links on my sites, because search engine spiders didn't follow them for a long time. Well, it's not that easy any more. I guess I have to switch to 'rel=nofollow', although I dislike it pretty much, because it's not precise and does not carry a message. Remember, 'rel=nofollow' was introduced to fight comment spam on blogs and in guestbooks.
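The switch described above can be automated in the ad-rendering code. The sketch below shows the idea in Python; the domain whitelist and the sample ad are made up for illustration, and a real site would of course keep this list in its configuration or database:

```python
# Sketch: mark unrelated paid ad links with rel="nofollow" when
# rendering them, so search engines know the site is not vouching
# for the destination. Domains and ad data are illustrative only.

RELATED_DOMAINS = {"example-seo-blog.com"}  # related ads we vouch for

def render_ad_link(href: str, anchor_text: str) -> str:
    """Return an <a> element; add the link condom unless the ad is on-topic."""
    domain = href.split("/")[2] if "//" in href else href.split("/")[0]
    if domain in RELATED_DOMAINS:
        return f'<a href="{href}">{anchor_text}</a>'
    # unrelated paid ad: castrate the link
    return f'<a href="{href}" rel="nofollow">{anchor_text}</a>'

print(render_ad_link("http://casino-pills.example", "Cheap pills"))
print(render_ad_link("http://example-seo-blog.com/tips", "SEO tips"))
```

Only the unrelated ad gets the condom; the related ad remains a full vote.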

About Directory Structures and File Names in URLs


Wednesday, October 19, 2005

"What is the best directory structure?", "Should I use keywords and hyphens in file names?", "Is .html better than .php?" and "Should I stick with trailing slashes or not?" are only a few of the most popular questions in BBS threads on search engine friendly Web site architectures.

Why are those good questions totally and utterly inopportune in the first place? Because it makes no sense to paint a house before the architect has finished the blueprints. Yes, Web sites do need a suitable and future-safe architecture; to be exact, they need an information architecture and an underlying technical architecture.

A well thought out information architecture requires a very flexible technical infrastructure. I won't think about directory structures, trailing slashes, and details like that in the first place, because the physical structure is (or at least should be) totally independent of the logical structure. Also, the logical structure is subject to many changes during the life cycle of a site or network, whilst for tons of good reasons the physical structure must not be changed; especially, URLs should never change.

Before I talk to code monkeys or evaluate CMSs, I should have a detailed model of a suitable information architecture, and a pretty good idea of future growth in different scenarios. While designing the IA, I must not look at technical issues like tactical SEO. Geeky thinking in this project phase is counterproductive, because it leads to unnecessary restrictions and limitations, which as a rule will result in expensive change requests or even failed projects.

It's nice to have meaningful file names in a hierarchical directory structure, but file names and directory structures aren't needed at all to create a user friendly Web site. That does not mean one shouldn't try to provide meaningful URLs. It means that the underlying physical structure must not dictate the form of content presentation and navigation, or even influence the IA in other ways. Stiff hierarchies do not allow natural growth and do not support ever changing business processes, because they are unscalable and lack flexibility.

It makes sense to organize content topically, but every attempt to define a future-safe global hierarchy beforehand will fail in the end, so why try it? The storage management and physical structuring of a Web site or network of sites should function like a file system, where we start with an empty disk, allocate space when needed, and organize files in an organic manner. For the IA it plays no role whether the content is ordered by type, topic, size, or date of creation; whether the content is stored in files on disks of different servers, in databases, or dynamically requested from 3rd-party services. That's achieved with data access layers and dynamic content structures. By the way, a multi-layer architecture is applicable to small and large sites alike; just the complexity differs depending on the project's size and goals.

Implementing the IA is done by assigning topically structured content to nodes and connecting nodes to related nodes, giving the (multiple) logical structure(s) and the user interface(s). URLs are persistent properties of nodes, while a node's location(s) in the structure is (are) transient; that is, topical connections can be changed, removed, and added if necessary, giving new or alternative logical views of the content without the need to make structural changes. Hierarchical components of URLs do not necessarily represent the navigation. Technically they are totally meaningless; even UUIDs would do the job (if they were spider friendly or would remind a bookmarking user of the page's content). However, if it's achievable, the URL should give the user a pretty good idea of the page and its position in the logical structure.
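The node model above can be sketched in a few lines of code. This is a minimal illustration, not a CMS design; the node names and URLs are invented:

```python
# Sketch of the node model: a URL is a persistent property of a
# content node, while topical connections between nodes are
# transient and can be rewired without touching any URL.

class Node:
    def __init__(self, url, title):
        self.url = url          # persistent: never changes
        self.title = title
        self.related = set()    # transient logical connections

    def connect(self, other):
        self.related.add(other)

    def disconnect(self, other):
        self.related.discard(other)

home = Node("/seo-basics", "SEO Basics")
article = Node("/meaningful-urls", "Meaningful URLs")
home.connect(article)

# Later the site is reorganized: the node is rewired under a new
# topical hub, yet its URL survives untouched.
home.disconnect(article)
hub = Node("/information-architecture", "Information Architecture")
hub.connect(article)
assert article.url == "/meaningful-urls"  # the persistent property
```

The logical view changed completely, but no URL had to change, which is exactly the property the paragraph above demands.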

Bottom line is, it's possible to provide meaningful and search engine friendly URLs which are not wedged in stiff directory structures. A Web site can be developed without a given global directory structure, it can grow over time and organize its content in ever changing dynamic structures (which can be represented by static-looking URLs too). This principle is valid for Web sites and networks of any size and any type. That does not mean it can be realized with every past-paradigm-toolset.

If you think the above is dull as dirt: in my article Web Site Structuring, a section of Anatomy and Deployment of Links, I provide illustrated design patterns and examples, so it should make for a livelier read.

How to buy and sell (text) links


Tuesday, February 21, 2006

Since this site now carries advertisements for a text link broker, it's time to explain the role of ad/link brokers in traffic management from a search engine's point of view, and to discuss risks and benefits from a site owner's perspective. To make a long story short, well done link trades are not penalized by search engines, but there are risks. Many professional link/traffic brokers offer extremely valuable services, and paying for links is not evil.

Links are an important ranking factor for all major search engines. Natural links boost the ranking of the destination page, thus the engines try to discover each link's intention. Their goal is to identify advertisements and other types of bought links, to prevent their ranking algorithms from counting those as editorial links.

To find out whether a particular link is bought or not, search engines make use of highly sophisticated algorithms like block analysis, and there are reviews by humans. Neither the artificial intelligence nor the human-driven components of their judgement can determine the intention of each and every link out there. However, they do know a lot more than Joe Webmaster can imagine, and they silently devalue links; that is, they take away a page's -- or even a complete Web site's -- ability to pass reputation in its links if the site doesn't devalue paid non-editorial links itself.

The search engines try to create an ideal world for their link based ranking algorithms by requesting link condoms for all non-editorial links. By adding rel="nofollow" to a paid link's A element, this particular link is marked as castrated. A castrated link is powerless with regard to SE rankings, that means it will not pass theme relevancy via keywords in its anchor text, it will not pass topical relevancy from its context, it will not pass PageRank (link popularity), TrustRank or anything other than human traffic to the destination page. Search engine crawlers do follow castrated links to fetch the destination page, and they show them in backlink searches, but links with condom -- and links from pages where the SE has taken away the ability to pass any reputation -- have no impact on the destination page's SE rankings.

The heated "Are paid links evil or not, and may search engines penalize sold links or not" debate is pretty much useless. The engines can enforce any rule they want, because it's their search index. I don't think penalizing pages carrying paid links is the right thing to do, but since the link condom policy is out, I have to recommend the usage of the notional monster [1] rel="nofollow". Sure, there are still ways to hide paid links from the search engines, but is it really worth it to take the risks for the sole purpose of selling PageRank? Getting penalized for uncastrated paid links means that internal links become powerless, too. Losing the ability to pass reputation and relevancy in internal (navigational) links cannot outweigh the slightly higher revenues from sold links that may or may not pass PageRank etc.

When it comes to link trades handled by a broker, the risks of uncastrated linkage are way higher than with other link purchases. Like any potential customer, search engine staff can browse the broker's publisher lists and harvest the links of sites offering link spots. Also, automatically inserted links managed by a 3rd party may leave footprints in the source code.

That does not mean you should not use the services of link brokers! It means that you should tell the broker that you sell and/or purchase only links with condom. Once the deal is set up, check the source code to make sure that the condom is in place.
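Checking that the condom is in place can be scripted instead of eyeballed. Here is a small sketch using Python's standard HTML parser; the page markup and the buyer URL are invented for the example:

```python
# Sketch: verify that a sold link carries rel="nofollow" in the
# served HTML. The sample markup and URL are illustrative only.
from html.parser import HTMLParser

class LinkChecker(HTMLParser):
    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.has_condom = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href") == self.target_href:
            # record whether this particular link is castrated
            self.has_condom = "nofollow" in (a.get("rel") or "")

page = '<p>Sponsors: <a href="http://buyer.example/" rel="nofollow">widgets</a></p>'
checker = LinkChecker("http://buyer.example/")
checker.feed(page)
print("condom in place" if checker.has_condom else "uncastrated link!")
```

In practice you would fetch the publisher's page and feed its source to the checker after every deal is set up.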

The services offered by text link brokers, or better traffic brokers, are extremely valuable, because they find link partners for you and review all involved sites to make sure that your links won't appear on unrelated places, and that you don't link out to unwanted neighborhoods. Also, they monitor each and every deal and refund the fees if the other site takes your links down.

In the sales pitches on most brokers' Web sites you'll find statements promising increased search engine rankings due to bought links. That's true, although obviously bought links will not increase the PageRank (or TrustRank ...) of your pages. Don't confuse page rankings with PageRank. Popularity can be measured by techniques other than PageRank. More human visitors coming to your Web pages via bought text links will increase their SE rankings naturally (here is how it works). Human visitors leave their footprints in SE databases, and trafficked pages do get a ranking boost.

Unfortunately, some text link brokers don't tell you how bought links benefit you on search result pages, and some may even mention link popularity. However, since they do know how it works, their prices should have factored in the fact that bought links (usually) don't boost search engine rankings directly (by passing PageRank...). Just because a link appears in Google's backlink results or Yahoo's linkdomain results does not mean that it counts for link popularity. Traffic counts, so when you look at prices for static HTML links on related Web sites, always remember that you're buying traffic, not link popularity or PageRank [2].


[1] The value nofollow of the REL attribute creates misunderstandings because it is, hmmm, hapless. In fact, it means passNoReputation and nothing more. That is, search engines shall follow those links, they shall index the destination page, and they shall show those links in reversed citation results. There were microformats better suited to achieve the goal, for example Technorati's VoteLinks, but unfortunately the search geeks chose a value adapted from the robots exclusion standard, which is plainly misleading because it has absolutely nothing to do with its functionality.


[2] Patrick Gavin, CEO of Text Link Ads (that's the link broker advertising on this site), likes to add:
We recommend only purchasing links on websites that have a good chance of sending you targeted traffic that converts for you. If you are getting your money's worth in targeted traffic you don't have to worry about how the search engines treat the link and any benefit will be a bonus.

The Top-5 Methods to Attract Search Engine Spiders


Thursday, July 21, 2005

Folks on the boards and in news groups waste man-years speculating on the best bait to entrap search engine spiders.

Stop posting, listen to the ultimate advice and boost your search engine traffic to the sky within a few months. Here are the five best methods to get a Web site crawled and indexed quickly:

5 Laying out milk and cookies attracts the Googlebot sisters [1].
4 Creating a Google Sitemap supports the Googlebot sisters.
3 Providing RSS feeds and adding them to MyYahoo decoys Slurp.
2 Placing bold dollar signs '$$' nearby the copyright or trademark notice drives the MSN bot crazy [1].
1 Spreading deep inbound links all over the Internet encourages all spiders to deep and frequent crawling and fast indexing as well.

Listen, there is only one single method that counts: #1. Forget everything you've heard about search engine indexing. Concentrate all your efforts on publishing fresh content and acquiring related inbound links to your content pages instead.

Link out to valuable pages within the body text and ask for a backlink. Keep your outbound links up, even if you don't get a link back. Add a links page to each content page and use it to trade links on the content page's topic. Don't bother with home page link exchanges.

Ignore tricky 'backdoor' advice. There is no such thing as a backdoor to a search engine's index. Open your front door widely for the engines by actively developing deep inbound links. Once you're indexed and ranked fairly, fine tune your search engine spider support.


[1] Joking. You can't store milk and cookies on your server's hard disk, and although M$ likes $$ pretty much, the M$ crawler doesn't index your site faster when it finds bold dollar signs.
I've added this footnote, because a few days after publishing this article I discovered that some webmasters make use of the '$$-trick'.

Pingable Fresh Content is King


Wednesday, August 03, 2005

Old news:
A bunch of unique content and a high update frequency increases search engine traffic.
Quite new:
Leading crawlers to fresh content becomes super important.
Future news:
Dynamic Web sites optimized to ping SE crawlers outrank established sites across the board.

Established methods and tools to support search engine crawlers are clever internal linkage, sitemap networks, 'What's new' pages, inbound links from high-ranking and frequently changed pages, etc. To a limited degree they still lead crawlers to fresh content and to old content not yet spidered. Time-to-crawl and time-to-index are unsatisfying, because the whole system is based on pulling and depends on the search engine backend's ability to guess.

Look at Google: Google News, Froogle, Sitemaps, and rumors about blog search indicate a change from progressive pulling of mass data to proactive, event-driven picking of fewer, fresher data. Google will never stop crawling based on guessing and will continue to follow links, but it has learned how to localize fresh content in no time by making use of submissions and pings.

Blog search engines more or less perfectly fulfil the demand for popular fresh content. The blogosphere pings blog search engines; that is why they are that up to date. The blogosphere is huge and the number of blog posts is enormous, but it is just a tiny part of the Web. Even more fresh content is still published elsewhere, and elsewhere is the playground of the major search engines, not even touched by blog search engines.

Google wants to dominate search, and currently it does. Google cannot ignore the demand for fresh and popular content, and Google cannot lower the relevancy of search results. Will Google's future search results be ranked by some sort of 'recent relevancy' algo? I guess not in general, but 'recent relevancy' is not an oxymoron, because Google can learn to determine the type of the requested information and deliver more recent or more relevant results depending on the query context and tracked user behavior. I'm speculating here, but it is plausible, and Google has already developed all the components necessary to assemble such an algo.

Based on the speculation above, investments in RSS technology and alike should be a wise business decision. If 'ranking by recent relevancy' or something similar comes true, dynamic Web sites with the bigger toolset will often outrank the established but more static organized sources of information.

Examples of Legitimate Cloaking


Friday, September 16, 2005

Hardcore cloaking is still a very effective way to generate huge amounts of very targeted search engine traffic. Search engines dislike it, because it screws their organic results. You'll find a "Do not cloak" statement in any engine's Webmaster guidelines. "Do not cloak" addresses so-called black hat SEO tactics.

Despite the fact that the search engines cloak the hell out of their own pages, notwithstanding the fact that all major sites cloak and still rank fine on the SERPs, and ignoring good professional advice stating that some legitimate usability goodies aren't achievable without cloaking, many fearful site owners suffer from a cloaking paranoia hindering them from getting the most value out of their Web sites.

Cloaking is defined as "delivering one version of a Web page to Web robots, and another version to human users". Black hat cloaking means feeding crawlers with a zillion keyword optimized content rich pages, while human visitors get redirected to a sales pitch when they click the links on the SERPs. Examples of white hat cloaking are geo targeting, internationalization, browser specific page optimization, links and other content visible to registered users while robots and unregistered visitors get a sign-up form, crawler-friendly URLs with shortened query strings and so on. Search engines do not penalize legitimate cloaking, and they cloak themselves:

Look at Google's home page as a spider and as a user, then compare the pages. The homebred page delivered to bots contains the logo and the search form, and provides links to Google's advertising programs, business solutions, the about page, and currently a link to Hurricane Katrina Resources. Say you're located in Europe and you have a Google Mail account. The page served to your browser contains your GMail address, a link to your personalized home page, a link to your account settings, and more above the logo. The Katrina link is missing, but you get a link "Go to Google YourCountryHere". Different content served to a robot and a user is cloaking, but the URL is not penalized (though Google once banned its own pages for prohibited cloaking); it shows a PageRank™ of 10 and appears in search results.

Back to the fearful site owners' concerns. Cloaking per se is not penalized by search engines. Search engines consider intentions. Search engines even encourage Webmasters to cloak for improved spider-friendliness. E.g. a Google rep posted "... allowing Googlebot to crawl without requiring session ID's should not run afoul of Google's policy against cloaking. I encourage webmasters to drop session ID's when they can. I would consider it safe." at WebmasterWorld on Dec 4, 2002.

What do search engines consider allowed cloaking for spider-friendliness? A few examples:

Truncating session IDs and similar variable/value pairs in query strings, as described in this tutorial. If the script supposed to output a page discovers a session ID in its query string, and if the user agent is a search engine crawler, it returns an HTTP header with a permanent redirect response (301) to its own URI without the session ID and other superfluous arguments, and quits. The crawler will then request the page from the URI provided in the 301 redirect header. Again the script identifies the user agent as a crawler and behaves differently than for a user request. It will not make use of cookies, because bots don't accept cookies. It will not start a session, and it will not prompt for 'missing' user specific values. Instead it prints out useful default values where suitable, and provides spider-friendly internal links without a session ID or similar user dependent arguments.
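The session-ID truncation just described can be sketched in a few lines. This is a minimal illustration of the logic, not production code: the crawler tokens and parameter names are common examples, and a real handler would sit inside the site's request dispatcher:

```python
# Minimal sketch: if a crawler requests a URL carrying a session ID,
# answer with a 301 to the clean URI; otherwise serve the page.
# Crawler tokens and parameter names are illustrative examples.
from urllib.parse import urlencode, urlsplit, parse_qsl

CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")
SUPERFLUOUS = {"sid", "sessionid", "phpsessid"}

def is_crawler(user_agent):
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)

def handle_request(url, user_agent):
    """Return (status, location): 301 to the clean URI for crawlers."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query)
    clean = [(k, v) for k, v in params if k.lower() not in SUPERFLUOUS]
    if is_crawler(user_agent) and len(clean) != len(params):
        location = parts.path + ("?" + urlencode(clean) if clean else "")
        return 301, location
    return 200, None  # serve the page as usual

print(handle_request("/catalog?node=17&sid=abc123",
                     "Mozilla/5.0 (compatible; Googlebot/2.1)"))
# → (301, '/catalog?node=17')
```

A human visitor with the same URL gets a normal 200 response, so sessions keep working while crawlers only ever see clean URIs.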

Reducing the number of query string arguments, that is, forcing search engines to fetch and index URIs with the absolute minimum number of variables necessary to output the page's content. For example, crawlers get redirected to URIs with a query string like "?node=17&page=32" whilst, depending on previous user actions, the query string in a browser's address bar might look like "?node=17&page=32&user=matt&stylesheet=whitebg&navigation=lefthanded&...".

Stripping affiliate IDs and referrer identifiers, that is, hiding user tracking from search engines. If a site has traffic deals with other places and needs to count incoming traffic by referrer ID, or if a site provides an affiliate program, search engines will find links containing IDs in the query string during their crawling of the 'Net, and their spiders follow those links. At the destination site, crawlers get redirected to 'clean' URIs: a crawler requesting a URL carrying an affiliate or referrer ID gets redirected to the corresponding URI without the tracking arguments.

Preventing search engines from indexing duplicated content. User friendly Web sites offer multiple navigation layers, resulting in multiple pages providing the same content along different menu bars. The script detecting a crawler request knows one indexable version per page and puts 'INDEX' in its robots META tag, otherwise it populates the tag with 'NOINDEX'. This handling is way more flexible, and elegant, than hosting different scripts or aliases per navigation layer to control crawling via robots.txt.
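The robots META handling above amounts to a tiny decision function. The sketch below illustrates it; the navigation layer names are invented, and the crawler detection is assumed to happen elsewhere in the script:

```python
# Sketch: one indexable navigation layer per page; all other layers
# get NOINDEX when a crawler requests them. Layer names are made up.

INDEXABLE_LAYER = "topical"   # the single version crawlers should index

def robots_meta(layer, is_crawler):
    """Return the robots META tag to emit for this page version."""
    if not is_crawler:
        return ""  # human visitors need no robots META tag
    content = "INDEX,FOLLOW" if layer == INDEXABLE_LAYER else "NOINDEX,FOLLOW"
    return f'<meta name="robots" content="{content}">'

print(robots_meta("topical", True))   # → <meta name="robots" content="INDEX,FOLLOW">
print(robots_meta("by-date", True))   # → <meta name="robots" content="NOINDEX,FOLLOW">
```

FOLLOW stays on in both cases, so crawlers can still traverse every navigation layer while only one version of the body text gets indexed.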

There are lots of other good reasons for search engine friendly cloaking. As long as the intention to cloak is not spamdexing, and the well-meant intention is obvious, search engines tolerate cloaking. In some rare cases the intention to cloak is not obvious, for example on membership sites: inserting HTML comments providing a short explanation and outlining the different versions helps to pass human reviews by search engine staff (every once in a while competitors search for cloaking and report competing sites to the engines).

How to Gain Trusted Connectivity


Thursday, November 24, 2005

Trust and contextually dependable linkage has become one of the most important ranking factors. TrustRank, Google's sandbox, or reciprocal link penalties are just more or less misleading catchwords used in countless publications and discussions on major changes to Google's ranking algorithms. Although those topics do deal with a part of the story, they distract our attention by trying to isolate particular symptoms and phenomena from the big picture.

The worst example is the discussion of the infamous 'sandbox', which indeed does not stand for an initial aging delay or another sneaky attempt to punish new Web sites, because established sites can get 'sandboxed' too. The causes preventing a new site from instant rankings for competitive search terms, or even the site name (which is a money term in many cases), can under certain circumstances be applied to established sites too. Those circumstances include changes in the way a site is promoted, and improved algorithms in conjunction with more computing power, allowing Google to enforce well known rules on a broader set of data (which perfectly explains the "I've changed nothing but my established and well ranking site got tanked" reports).

To gain trust rank and to avoid 'sandboxing', (reciprocal) link devaluation resulting in decreasing search engine traffic, and other barriers as well, one needs to look at the big picture. Creating workarounds to outsmart particular filter components may be a nice hobby for SEO addicts, and it is a valuable tactical instrument for SE experts, but it is by no means a long term strategy.

A search engine's mission is providing its users with relevant commercial or informational results. A search engine marketer's mission is placing Web pages at the top of the SERPs. There are only so many top spots on the SERPs, and way more pages trying to get a position among the top ten search results. Outsmarting the ranking algorithms was (and is, at least in the short term) a cheap method to make it onto the first SERP.

The war between smart search engine optimizers and just as smart search engine engineers has lasted since crawling search engines first generated organic search traffic. The crux is that search engines relying on clever algorithms and unbeatable computing power can get outsmarted by just as clever algorithms and way less computing power. To stop the AI escalation, search engines had to integrate human considerations into their ranking algorithms.

Unfortunately, SE engineers weren't able to connect human brains to their computer clusters, because homo sapiens still comes without an RJ45 connector and lacks a TCP/IP implementation. The next best solution was statistical utilization of structured and trustworthy editorial work stored in machine-readable form.

Discovering trustworthy resources on the Web is an easy task for a search engine. Gathering popular sites with an extremely low percentage of linkage from and to known bad neighborhoods, a high but reasonable number of outgoing links at all, and a natural linked/unlinked text ratio, gives a neat list to start with. Handing out a checklist to a Web savvy jury surfing those popular resources leads to a detailed estimation of trust factors applied to topical authorities.

The most important trust (and quality) factor is a resource's linking attitude, probably followed by resistance to cheating, editorial capability and timeliness, and topical competence, devotion, and continuity. The discovery of (more) trusted topical authorities by exploring the neighborhood of established and monitored trustworthy resources is an ongoing process in a search engine's quality assurance department.
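The seed-and-propagate idea outlined above is public in the academic literature (TrustRank, by Gyöngyi, Garcia-Molina and Pedersen): trust starts at a hand-reviewed seed set and flows along outgoing links. As an illustration only (the graph, seed set, damping factor, and iteration count below are made-up assumptions, not any search engine's actual implementation), a minimal propagation could look like this:

```python
# Minimal sketch of seed-based trust propagation (TrustRank-style).
# The link graph, seeds, damping factor, and iteration count are
# illustrative assumptions, not a search engine's real algorithm.

def propagate_trust(outlinks, seeds, damping=0.85, iterations=20):
    """outlinks: {page: [linked pages]}, seeds: hand-reviewed trusted pages."""
    pages = set(outlinks) | {p for links in outlinks.values() for p in links}
    # Seed pages start with equal shares of the total trust mass.
    seed_score = 1.0 / len(seeds)
    trust = {p: (seed_score if p in seeds else 0.0) for p in pages}
    for _ in range(iterations):
        incoming = {p: 0.0 for p in pages}
        for page, links in outlinks.items():
            if links:  # a page's trust splits evenly among its outlinks
                share = trust[page] / len(links)
                for target in links:
                    incoming[target] += share
        # Damping: most trust flows along links, the rest returns to the seeds.
        trust = {p: damping * incoming[p]
                    + (1 - damping) * (seed_score if p in seeds else 0.0)
                 for p in pages}
    return trust

# Hypothetical mini-Web: a directory seed, two linked sites, an isolated spam page.
graph = {"dmoz": ["siteA", "siteB"], "siteA": ["siteB"], "siteB": [], "spam": ["spam"]}
scores = propagate_trust(graph, seeds={"dmoz"})
```

Note how the self-linking "spam" page, unreachable from the seed, never accumulates any trust, while pages in the seed's neighborhood do.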

Topical PageRank and TrustRank are related approaches, and should not be interpreted as mutually exclusive ranking factors. Much simplified, a naturally earned related link passes PageRank, TrustRank, and topical reputation (authority, relevancy), whilst a naturally earned unrelated link passes only PageRank and TrustRank. I've stressed naturally earned, because at least Google has quite an accurate judgement of the intent behind links.

Artificial linkage for the sole purpose of manipulating rankings becomes more or less useless for Joe Webmaster, and risky. Actually very risky, because along with TrustRank comes a set of filters reliably detecting and penalizing artificial linkage like systematic reciprocal link patterns, randomized triangular linkage, and similar link schemes. Whether massive loads of artificial 3rd party links can reduce or even eliminate a site's TrustRank and result in decreased rankings is a subject of speculation, although some trusted sources report that link mobbing does work. I can think of effective methods to close this loophole, and hopefully the SE engineers have implemented appropriate safety measures.

So what a Web site needs to rank on the SERPs is trusted authority links, preferably on-topic recommendations, amplified and distributed by clever internal linkage, and a critical mass of trustworthy, unique, and original content. Enhancing usability and crawler friendliness helps too. Absolutely no shady tactics applied. In other words, back to the roots of the Internet, where links were used to send visitors to great resources, not crawlers to promotional targets. Search engines don't honor conventional site promotion any more; instead they reward honestly earned page recommendations from trusted places, in conjunction with decent site branding.

The vital question is "How to get on-topic authority links passing TrustRank?", and the answer is reengineering (or rather reinventing) TrustRank to some degree, that is, making use of the procedure the engines (at least Google) have most probably applied to identify trusted Web resources. This approach includes analyzing identified trusted resources in order to adapt their valuable linkage architecture and behavior on one's own sites, and it goes far beyond making trusted link acquisition the sole beatified SEO tactic.

Let me close with a warning. Submitting a resource to DMOZ and Yahoo's directory, as well as inserting a Wikipedia link, are well known promotional activities. That is, if a site carrying those inbound links lacks trusted inbound links from other sources, they may not be seen as sufficient signs of quality and proof of content relevancy. I mean, everybody can buy a Yahoo directory listing, and many SEOs are ODP editors ... those links are way too easy to get, and sometimes not earned naturally. It may be a wise decision to hold off on submitting to DMOZ and Yahoo until the engines have picked up and processed links from other sources of TrustRank, like universities, government sites, and well established topical authorities. Patience and pertinacity are the key to success. Google has nullified all SEO Blitzkrieg strategies.

Related information:
The Google 'Sandbox' demystified
Defining natural and artificial linkage
Linking is all about traffic, popularity, and authority
Unrelated non-devalued links are dangerous
Automated and/or artificial link promotion has disadvantages
Enhancing a Web site's crawlability
Matt Cutts stating that TrustRank is Google's secret sauce

Avoid Unintended Delivery of Duplicated Content


Monday, August 15, 2005

Does Google systematically wipe out duplicated content? If so, does it affect partial dupes too? Will Google apply site-wide 'scraper penalties' when a particular dupe-threshold gets reached or exceeded?

Following many 'vanished page' posts with links on message boards and usenet groups, and monitoring sites I control, I've found that there is indeed a pattern of sorts. It seems that Google is actively wiping dupes out. Those pages get deleted or stay indexed as 'URL only'; they are not moved to the supplemental index.

Example: I have a script listing all sorts of widgets pulled from a database, where users can choose how many items they want to see per page (the values for #of widgets/page are hard coded and all linked), combined with prev/next-page links. This kind of dynamic navigation produces tons of partial dupes (content overlaps with other versions of the same page). Google has indexed way too many permutations of that poorly coded page, and foolishly I didn't take care of it. Recently I got alerted as Googlebot-Mozilla requested hundreds of versions of this page within a few hours. I quickly changed the script, putting a robots NOINDEX meta tag on pages where the content overlaps, but probably too late. Many of the formerly indexed URLs (cached, appearing with title and snippets on the SERPs) have vanished, respectively became URL-only listings. I expect that I'll lose a lot of 'unique' listings too, because I changed the script in the middle of the crawl.

I'm posting this before I have solid data to back up a finding, because it is a pretty common scenario. This kind of navigation is used on online shops, article sites, forums, SERPs ... and it applies to aggregated syndicated content too.

I've asked Google whether they have a particular recommendation, but no answer yet. Here is my 'fix':

Define a straight path through the dynamic content, where not a single displayed entry overlaps with another page. For example, if your default value for items per page is 10, the straight path would be start=1&items=10, start=11&items=10, start=21&items=10, and so on.
Then check the query string before you output the page. If it is part of the straight path, put an INDEX,FOLLOW robots meta tag; otherwise (e.g. start=16&items=15) put NOINDEX.
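The check described above can be sketched in a few lines. The parameter names (start, items) follow the example query string in the text, and the default of 10 items per page is the one used in the example; adapt both to your own script.

```python
# Sketch of the "straight path" robots meta tag logic described above.
# Parameter names (start, items) and the default page size of 10 are
# taken from the post's example; adjust them for your own navigation.

DEFAULT_ITEMS = 10

def robots_meta(start, items):
    """Return a robots meta tag for a paginated listing URL.

    The straight path is start=1&items=10, start=11&items=10, ...;
    every other permutation overlaps with one of those pages."""
    on_straight_path = (items == DEFAULT_ITEMS
                        and (start - 1) % DEFAULT_ITEMS == 0)
    content = "INDEX,FOLLOW" if on_straight_path else "NOINDEX,FOLLOW"
    return '<meta name="robots" content="%s">' % content
```

For example, robots_meta(11, 10) yields an INDEX,FOLLOW tag, while the off-path robots_meta(16, 15) yields NOINDEX,FOLLOW.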

I don't know whether this method can help with shops using descriptions pulled from a vendor's data feed, but I doubt it. If Google can determine and suppress partial dupes within a site, it can do that with text snippets from other sites too. One question remains: how does Google identify the source?

Thoughts on Duplicate Content Issues with Search Engines


Monday, October 03, 2005

Search engines filtering content to avoid useless repetitions on the SERPs confuse the hell out of most Webmasters. Those approaches to deliver better search results are often called duplicate content penalties, as mentioned here. Referring to (not 100% precise) filtering as a penalty is counterproductive however, because it hinders objective analysis. If I think of a phenomenon as a penalty, I'm less likely to search for causes at my side too. I'll blame the evil search engine and end up changing useless things, making the whole thing even worse, more complex, and less understandable.

A relatively new phenomenon on Google's SERPs is that Google trashes all similar pages; not a single page from a bunch of near-duplicates survives. It seems that Google hates duplicated content so much that it wipes it out completely, without preserving the source, that is, the page that provided the content in question before the duplicates appeared. Here is an example [1]:

Say you've a group of pages about a blonde C-cup celeb on a paparazzi site for men. An index page, a tour calendar, a bio, some stories and pictorials, an image gallery and a video clip index. All those pages provide unique content, get indexed fine and rank well.

Then you collect a bunch of content about a blonde D-cup celeb. You're using your existing pages as templates, changing only what's significantly different. You'll end up with at least a few near-duplicate pages. Then repeat this procedure with an a-cup celeb and a b-cup hottie. The result will be four groups of pages with an identical structure and very similar content, say 60% of structure and on-the-page-text is identical.

Until recently, all four groups of pages would get indexed, and they might even rank pretty well for each celeb's name and a few other keywords.

Now say Google's new duplicate content threshold is 60% (it is not 60%, that's just an example!). Google's expected behavior would be to keep the oldest pages (about the C-cup celeb) in the index, and to suppress the pages about the clones with smaller or bigger breasts.

Here comes the 'bug'. Google trashes all four groups of pages.

But is it really a bug? Probably not. It would be a bug if the assumption "All those pages have unique content" above were true. The content may be unique within the scope of the site, but it is not unique on the Web. Many other authors have written their stories about the four blonde celebs, that is, the information is spread all over the Web, slightly reworded and often quoted. Even the images and vids are available on tons of other pages out there.

If Google compares text content not page by page, all text counted, but snippet by snippet, extracting the core information even from reworded text, then very probably all four versions are considered duplicates, and thus have to disappear because pages on other sites got the source bonus.
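Snippet-by-snippet comparison of this kind is usually described in the literature via word shingles (Broder's w-shingling): two texts are near-duplicates when their sets of overlapping word n-grams are largely the same. The sketch below is an illustration of that published technique, not Google's actual dupe filter, and the shingle width of 4 words is an arbitrary assumption.

```python
# Illustrative snippet-by-snippet comparison using word shingles
# (w-shingling); this is a textbook technique, not Google's filter.

def shingles(text, w=4):
    """Return the set of overlapping w-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def resemblance(a, b, w=4):
    """Jaccard similarity of the two texts' shingle sets (0.0 .. 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)
```

A resemblance near 1.0 marks near-duplicates; a copy-pasted celeb bio with only the cup size swapped would score high against its template, while an article written from scratch would not.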

Another factor of consideration is, that Google permanently improves its filters and capacity. In many cases those near-duplicate pages have slipped thru, because in the past Google performed those comparisons based on a smaller set of pages, that is not with all pages of every site out there. In fact the logic used in those dupe filters is not that new. It's in the news now, because Google bought a few clusters of new machines to apply the filters to way more Web sites than ever before.

Well, as always we don't know for sure whether the theory outlined above is accurate. At least it's an educated assumption, and it is plausible. So what can one do to escape the filtering, except avoiding popular themes? It should be safe to write each copy from scratch. It should help to forbid copy, paste and modify operations, and to make use of shorter quotes. Promoting fresh content immediately should help to gain the source bonus. Nothing except a reinclusion request can help if a site gets trashed by accident, because search engine filters will always produce collateral damage. But promoting outstanding unique content results in popularity and reputation, which is the best protection against lost search engine placements.


[1] Why not a 'widgetized' example about bees? Nobody would believe that a lot of Web sites provide content about bees, so the celeb example makes it easier to include the whole 'Net in the analysis, and to draw wider conclusions based on the expanded scope.

Feed Duplicate Content Filters Properly


Tuesday, October 04, 2005

Search engines make use of duplicate content filters to avoid repetition on their SERPs. That should be a good thing for search engine users. As a matter of fact, in some cases this filtering does not lead to relevant results, because the filters suppress all relevant pages by accident. This weird behavior may be a persistent bug, but heavily spammed search engines won't ease up in their war on spam, regardless of the collateral damage those filters produce with regard to zillions of legitimate Web sites getting trashed along with questionable stuff. Thus Webmasters have to react, but how?

(Image: Example of a hierarchically organized product structure)

To get an idea of the problem, think of a hierarchically organized product structure. Let's say an eCommerce site sells all sorts of widgets. The online shop's product pages are organized three levels deep in structures like widgets / colored widgets / green widgets. The uppermost page tells a story about widgets, listing their general attributes and behavior, and links to widget categories. The second level pages provide information per widget category, expanding the list of widget properties, e.g. with attributes and behavior of colored widgets, and link to product pages. The third level pages complete the product description by adding product specific details, e.g. colors and sizes, giving the full description of a green, red or blue widget, along with prices and shipping details.

A search engine user searching for [green widgets] [1] is supposed to land on the product page, which provides all product information on green widgets. Sounds easy, and it worked fine for ages. Unfortunately, caused by search engines filtering out way too much 'duplicate content', it doesn't work anymore. That is, the search engine user will not get the page about green widgets on the SERPs for [green widgets]. The frustrated user clicks on an advertisement on the SERP or goes out to buy a green widget at a brick and mortar retailer.

(Image: Black = duplicated | White = unique; example of a poorly generated product page)

What causes a search engine to suppress the green widgets product page on the SERPs for [green widgets]? The reason is a chain of duplicated text snippets (black text on the image):
1. General properties of widgets from the 1st level page, duplicated on all category pages under widgets and all their product pages.
2. Properties of colored widgets from the 2nd level category page, duplicated on all product pages under colored widgets.
3. Shipping details shared with all 3rd level product pages.
In this example only the color "green" is a unique piece of information (white text on the image); even the list of available sizes can be found on many other product pages. That's not enough to consider the page useful from a search engine's point of view, although thanks to the big picture of a green widget, the page is useful for visitors and looks unique.

Well, you can argue that in real life the 3rd level, used to separate colors, is superfluous. Sorry, invalid plea. Most eCommerce applications force a separate product page per SKU. Also, even if green widgets are supposed to get used only in the meadows, blue widgets in the sky, and red widgets in the fire ... that's not enough to make the page unique in the eyes of a search engine eagerly fighting index spam.

So what can be done to slip through the duplicate content filter? One could display the duplicated text in iFrames, or make it an image. Both are bad moves, because this way keywords get removed from the green widgets page which are necessary to trigger search queries like [green widgets "other colored widget property"]. Another alternative is short summaries, linked to the relevant text snippets on the upper level pages, which tell the whole story. This method improves the unique/duplicated ratio to some degree, but it devalues the on-page content for surfers, and snips out a lot of keywords too, so it's far away from the desired solution.

Game over? Nope. There is another approach to escape the dilemma, but it requires intensive site specific testing before it gets used on production systems. The idea is to feed the duplicate content filter properly, that is, forcing the dupe filter to work like it should work in the best interest of both the site owner and the search engine user. The method outlined below is no bullet-proof procedure, because its success is highly dependent on the raw content. It will not work on (affiliate) sites where the product pages are generated from a vendor's data feed with no value/content added, or where the contents aren't unique to the site for other reasons. If the product descriptions aren't normalized (e.g. the duplication of text happens in a description field of the products table), the coding becomes tricky.

Search engines analyze Web pages block by block to extract contents from templates (see VIPS). That's why large sites with heavily repeated headers, navigation elements, footers etc. aren't downranked, and their pages rank for keywords provided in the page body, not for keywords from templated page segments or repetitive menus (more info here).

(Image: Black/gray = duplicated | White = unique; example of an improved product page)

Much simplified, page areas belonging to the template aren't considered in rankings, and usually they don't trigger duplicate content filters. Thinking the other way round, duplicated content from upper levels put into templated page blocks is safe. Assuming that works, does block label manipulation alone prevent dupe filtering? Well, if the unique/duplicated ratio is very poor, it's necessary to throw in some unique text on the SKU level, serving as fodder for the spiders. Even if the restructured page then passes the dupe testing, search engines don't consider a page carrying a tiny amount of unique spiderable text content (a thin page) important. If a thin page carries affiliate links, it's considered a thin affiliate page, and that's even worse than getting hit by dupe filters. Fine tuning the unique/duplicate ratio requires an experienced SEO if the portion of unique text is low in relation to the number of words in the page's body. It sure helps to avoid new systematic patterns, so don't reword the added content on the SKU level over and over. Write it from scratch instead, and in different text lengths per SKU.

So how can one declare the duplicated text as part of the template? To get started, it helps to know how search engines make use of HTML block level elements (e.g. table/row/cell, heading, paragraph, lists) to partition Web pages, and what kind of neat algorithms beyond those simple methods their engineers play with in the labs. The next step would be to analyze the own templates, and some more on popular sites. Look at attributes like class, id and name in HTML block level elements, font attributes, HTML comments, visual lines, different back- and foreground colors, borders or even just whitespace used to draw visible or invisible rectangles around templated page elements. Get a feeling for the code behind rendered content positioning. Search for unique words and phrases found in different blocks to determine how much weight the engines give on particular blocks.

Then consolidate your notes and try to create a product page template, where product information duplicated from upper levels is clearly part of templated blocks, for example the footer. Put the unique content at the top of the page body, separate it from the 'template blocks' with an image, a thin line or other objects, which don't break the user's coherent impression of the product's content blocks as one prominent part of the page.

Although the non-unique text in 'templated' blocks can be formatted similarly or even identically to the unique text, it must reside in separate HTML block level elements, which have all the signs and attributes of real templated blocks, and which are clearly zoned (even compliant HTML comments like 'start footer template' or 'end body area' may help). The goal is not to trick the engines, but to point their dupe filters to the fact that those blocks are repeated on a bunch of related pages, thus they are part of the template and not a legitimate target of duplicate content filtering.
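Putting the pieces together, a product page template along these lines could be generated as sketched below. The class names, comment markers, and element layout are illustrative assumptions, not a recipe any engine has confirmed; the point is only the separation of unique SKU copy (first, in its own block) from duplicated upper-level text (clearly zoned as a template block).

```python
# Hypothetical sketch of the layout discussed above: unique SKU-level
# copy at the top of the body, duplicated upper-level text wrapped in
# a clearly zoned 'template' block. All names here are illustrative.

def product_page(unique_text, category_props, widget_props, shipping):
    """Assemble a product page body with zoned template blocks."""
    return "\n".join([
        '<div id="product-body">',
        "<p>%s</p>" % unique_text,        # unique SKU-level copy comes first
        "</div>",
        "<!-- start footer template -->",  # comment markers zone the template
        '<div class="footer-template">',   # duplicated text lives in here
        "<p>%s</p>" % category_props,
        "<p>%s</p>" % widget_props,
        "<p>%s</p>" % shipping,
        "</div>",
        "<!-- end footer template -->",
    ])
```

A page built this way keeps the duplicated category and shipping text spiderable (so the keywords still count on-page), while visually and structurally flagging it as boilerplate repeated across the whole product group.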

Try to place all important and unique 3rd level attributes like sizes and colors in the anchor text of internal links (and external inbound links, if possible). Optimizing off-the-page factors to emphasize the uniqueness of title tags, headings and highlighted keywords on-the-page can make the difference between a search engine's trash can or supplemental index, and fair placements on the SERPs.

Remove all generic stuff to lower the amount of non-unique text. For example, display shipping details, general slogans, trademark notices and disclaimers in iFrames, or use text on images. Outputting text and unimportant links client-side (with JavaScript) prevents some search engines from indexing, but that's not a very smart long term strategy, because the crawlers become more and more like reengineered human users, that is, they do render JavaScript output, or will soon.

Go test the new layout for a while with a few products wiped out of SE indexes by duplicate content filters. Tweak the code until the pages reappear in searches. If all code tweaking doesn't help, add more unique text on the SKU level, and repeat. If you participate in the Google Sitemaps program, give your test pages the highest crawling priority and ensure the date of last modification for those pages is accurate. Track Googlebot's visits and search for altered results two days after crawling, that's the average time to index.

Please don't understand the method outlined above as a bullet-proof SEO tactic. Whether it can lead to success or not depends on so many site specific factors (e.g. content quality and structuring, the overall Web site architecture and its linking policy, the Webmaster's experience and SEO skills ...) that any generic prognosis or even guarantee would be foolish. However, a revamp aiming at the proper feeding of duplicate content filters should result in improved usability and more search engine friendly pages, which is an improvement in any case, and worth a try.


[1] Expressing search queries in brackets has some advantages, as Matt Cutts points out here. It allows quotes and parentheses to be used in the query string. For example a search query like ["search term" +(seo | sem) -spam] is 'unquotable'. Brackets on the other hand have no syntactical meaning in search queries.

Revamp Your Framed Pages


Friday, January 20, 2006

Phoenix posted a great tip at SEO Roundtable:
Creating a Framed Site Without The Drawbacks to SEO

With regard to search engine crawling and indexing, frames are the SEO's nightmare. Some brilliant people have taken the time to develop a CSS solution for fixed site navigation, examples:

Here is the tutorial from Webreference:
Summary: "HTML Frames have been used so far on the Web to provide sections of a Web page that scroll independently of each other, but they cause a lot of hassle, making linking difficult and breaking the consistency of our documents. CSS fixed positioning helps us work around this by positioning parts of one document relative to the viewport. The overflow property can be used to control their scrolling appropriately. By being careful about how you position these elements, you can have your layout fall back to the default rendering on Navigator 4 and Internet Explorer, making this technique usable in a production environment."

However, if you make use of this technique, test the results with a few browsers. Browser Shots is a neat tool automating the rendering of a page in different environments, give it a try.

Why is framing a bad idea?

Well, the usual answer would be that framing stands against the concept that each resource on the Web can be reached via one and only one unique address, that is, the URL. On a framed site the contents, although they have a URL assigned, are masked by the site's main URL, the frameset. I call this hiding content from visitors, and here is why.

If a visitor finds a nice article on a framed site and tries to bookmark it, that attempt fails, and the visitor doesn't even recognize that s/he bookmarked everything except the half read or skimmed stuff earmarked for a second visit. Yes, some browsers store the current URLs of all frames in bookmarks, but even if that worked with all browsers, it surely doesn't work with most social bookmarking services, "blog this" plug-ins and the like. Chances are good that the visitor remembers the bookmark, comes back one day, lands on the home page, is pissed, and is therefore not willing to surf through the site's navigation to find the supposedly bookmarked content again ... another potential customer lost in cyberspace.

When it's that easy to confuse and lose visitors, the trick should work even better with Web robots like search engine crawlers. Indeed, framing hides content from search engine users too, because URL references in framesets don't count as much as navigational links in SE ranking algorithms, and because nobody places deep links pointing to URLs in a frameset. To lower the negative effects of the first issue, one can use the noframes element to repeat the URLs as real links, but that's not as good as navigational links which are always visible to all user agents (browsers, crawlers ...). Thoughtlessly giving away the ability to get deep pages ranked with the help of linkage from outside the site is the more serious issue. Home page links to a framed site lose their power long before they reach the real content buried deep inside a suboptimally crawlable structure.

Here is a real world example, which by the way brought up the idea to update this article (initially posted on July 20, 2005). In January 2006 I received a link request from a smart publisher who runs a few nice academic papers on eCommerce topics, all of them hidden by a frameset on the root level and not indexed by any search engine. He noticed that the source code of this site doesn't validate that well at the W3C (a known issue I couldn't care less about for some reasons, at least at the moment), so I think he's smart enough to revamp the architecture of his site (and destroy my example).

Take Free SEO Advice With a Grain of Salt


Monday, August 01, 2005

On the Internet jokes become rumors easily. It happens every day and sometimes it even hurts. In my post The Top-5 Methods to Attract Search Engine Spiders I was joking about bold dollar signs driving the MSN bot crazy. A few days later I discovered the first Web site making use of the putative '$$-trick'. To make a sad story worse, the webmaster had put the dollar signs in as hidden text.

This reminds me of the spreading of the ghostly robots revisit META tag. This tag was used by a small regional Canadian search engine for local indexing in the stone age of the Internet. Today every free META tag generator on the net produces a robots revisit tag. Not a single search engine makes use of this META tag. It was never standardized. It's present on billions of Web pages, however.

That's how bad advice becomes popular. Folks read nasty tips and tricks on the net (or simply don't understand what a tip is all about) and don't bother applying common sense when implementing them. Then they add a myth or two when posting crap on the boards themselves.

There is no such thing as free and good advice on the net. Even good advice on a particular topic can result in astonishing effects when applied outside its context or implemented by a newbie. It's impossible to learn SEO from free articles and posts on message boards. Go see an SEO, it's worth it.

Green Tranquilizes


Friday, July 15, 2005

Widely ignored by savvy webmasters and search engine experts, every 3-4 months the publishing Internet community celebrates a spectacular event: Google updates its Toolbar-PageRank!

Site owners around the globe hectically visit all their pages to view the magic green presented by Google's Toolbar. If the green bar grew by a pixel or even two, they hurry to the next webmaster board praising their genius. Otherwise they post 'Google is broke'.

Once the Toolbar-PR update has settled, folks in fear of undefined penalties by the almighty Google check all their outgoing links, removing everything with a PR less than 4/10. Next they add a REL=NOFOLLOW attribute to all their internal links where the link target shows a white or gray bar on the infallible toolbar. Trembling, they hide in the 'linking is evil' corner of the world wide waste of common sense universe for another 3-4 months.

Where is the call for rationality leading those misguided lemmings back to mother earth? Hey folks, Toolbar-PR is just fun, it means next to nothing. Green tranquilizes, but white or gray is no reason to panic. The funny stuff Google shoves into your toolbar is an outdated snapshot without correlation to current rankings or even real PageRank. It is by no means an indicator of how valuable a page is in reality, so please LOL (link out loud) if a page seems to provide value for your visitors.

As a matter of fact, all of the above will change nothing. Green bar fetishists don't even listen to GoogleGuy posting "This is just plain old normal toolbar PageRank".

Googlebots go Fishing with Sitemaps


Monday, July 18, 2005

I've used Google Sitemaps since it was launched in June. Six weeks later I say 'Kudos to Google', because it works even better than expected. Making use of Google Sitemaps is definitely a must, at least for established Web sites (it doesn't help much with new sites).

From my logging I found some patterns, here is how the Googlebot sisters go fishing:
· Googlebot-Mozilla downloads the sitemaps 6 times per day, two fetches every 8 hours like clockwork (or every 12 hours lately, now up to 4 fetches within a few minutes from the same IP address). Since this behavior is not documented, I recommend implementing automated resubmit-pings anyway.
· Googlebot fetches new and updated pages harvested from the sitemap at the latest 2 days after their inclusion in the XML file, respectively after providing a current last modified value. Time to index is consistently two days at most. There is just one fetch per page (as long as the sitemap doesn't submit another update), resulting in complete indexing (title, snippets, and cached page). Sometimes she 'forgets' a sitemap-submitted URL, but fetches it later by following links (this happens with very similar new URLs, especially when they differ only in a query string value). She crawls and indexes even (new) orphans (pages not linked from anywhere).
· Googlebot-Mozilla acts as a weasel in Googlebot's backwash and is suspected to reveal her secrets to AdSense.
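A resubmit-ping, as recommended in the first point, is just an HTTP GET against Google's Sitemaps ping URL with your sitemap's address as a parameter. The endpoint shown below is Google's documented ping URL of that era; treat it as an assumption and verify it is still current before relying on it.

```python
# Sketch of an automated sitemap resubmit-ping. The ping endpoint is
# Google's documented Sitemaps ping URL of the era described in this
# post; verify it before use. A plain GET signals "sitemap changed".
from urllib.parse import urlencode
from urllib.request import urlopen

PING_BASE = "http://www.google.com/webmasters/sitemaps/ping"

def ping_url(sitemap_url):
    """Build the resubmit-ping URL for a given sitemap address."""
    return PING_BASE + "?" + urlencode({"sitemap": sitemap_url})

def resubmit(sitemap_url):
    """Fire the ping; returns the HTTP status code (200 = accepted)."""
    return urlopen(ping_url(sitemap_url)).status
```

Call resubmit() from the same job that rewrites your sitemap, so the ping always follows a real change.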

Mozilla-Googlebot Helps with Debugging


Monday, July 18, 2005

Tracking Googlebot-Mozilla is a great way to discover bugs in a Web site. Try it for yourself, filter your logs by her user agent name:

Mozilla/5.0 (compatible; Googlebot/2.1; +

Although Googlebot-Mozilla can add pages to the index, I see her mostly digging in 'fishy' areas. For example, she explores URLs where I redirect spiders to a page without a query string to avoid indexing of duplicate content. She is very interested in pages with a robots NOINDEX,FOLLOW tag when she knows another page carrying the same content, available from a similar URL but stating INDEX,FOLLOW. She goes after unusual query strings like 'var=val&&&&' resulting from a script bug fixed months ago, but still represented by probably thousands of useless URLs in Google's index. She fetches a page using two different query strings, checking for duplicate content and alerting me to a superfluous input variable used in links on a forgotten page. She fetches dead links to read my very informative error page ... and her best friend is the AdSense bot, since they seem to share IPs as well as an interest in page updates before Googlebot is aware of them.
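Filtering the log as suggested can be sketched in a few lines. The sketch assumes a combined-format access log; the sample line in the usage note uses the commonly documented Mozilla-flavored Googlebot user agent string, so adjust the marker if your logs show a variant.

```python
# Minimal sketch: pull Googlebot-Mozilla requests out of a
# combined-format access log. The log line layout is an assumption;
# adapt the regex if your server logs a different format.
import re

# Combined log format: ... "GET /path HTTP/1.1" status bytes "referrer" "ua"
_LINE = re.compile(
    r'"(?:GET|HEAD) (\S+) [^"]*"'                       # request path
    r'.*"([^"]*Mozilla/5\.0 \(compatible; Googlebot[^"]*)"')  # UA string

def mozilla_googlebot_hits(log_lines):
    """Yield (path, user_agent) for each Googlebot-Mozilla request."""
    for line in log_lines:
        m = _LINE.search(line)
        if m:
            yield m.group(1), m.group(2)
```

Feed it an open log file and you get exactly the 'fishy' URLs she is probing, query strings and all.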

Bait Googlebot With RSS Feeds


Tuesday, August 02, 2005

Seeing Ms. Googlebot's sister running wild on RSS feeds, I'm going to assume that RSS feeds may become a valuable tool to support Google's fresh and deep crawls. Since I've not yet gathered enough log data, test it for yourself:

Create an RSS feed with a few unlinked or seldom spidered pages which are not (yet) included in your XML sitemap. Add the feed to your personalized Google Home Page ('Add Content' -> 'Create Section' -> Enter Feed URL -> Go). Track spider accesses to the feed and to the included pages as well. Most probably Googlebot will request your feed more often than Yahoo's FeedSeeker and similar bots. Chances are that Googlebot-Mozilla is nosy enough to crawl at least some of the pages linked in the feed.

That does not help a lot with regard to indexing and ranking, but it seems to be a neat procedure for helping the Googlebot sisters spot fresh content. In real life, add the pages to your Google XML Sitemap, link to them, and acquire inbound links...

To test the waters, I've added RSS generation to my Simple Google Sitemaps Generator. This tool reads a plain page list from a text file to generate a dynamic XML sitemap, an RSS 2.0 site feed, and a hierarchical HTML site map. It is suitable for smaller Web sites with no more than 100 pages.

Update: Google's bot fetching feeds for the personalized home page now identifies itself as Feedfetcher. It doesn't obey robots.txt, because it is a component of a user-driven RSS reader. To make Googlebot crawl your feeds, you need to ping Pingomatic and Weblogs.
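Pinging is a one-line XML-RPC call per service. Below is a hedged sketch; the endpoint URLs are the ones Ping-O-Matic and Weblogs.com advertised at the time, so verify them before relying on this:

```python
import xmlrpc.client

# Well-known ping endpoints at the time of writing; check before use.
PING_ENDPOINTS = [
    "http://rpc.pingomatic.com/",
    "http://rpc.weblogs.com/RPC2",
]

def ping_all(blog_name, blog_url):
    """Send a weblogUpdates.ping to each endpoint; collect the responses."""
    results = {}
    for endpoint in PING_ENDPOINTS:
        try:
            server = xmlrpc.client.ServerProxy(endpoint)
            results[endpoint] = server.weblogUpdates.ping(blog_name, blog_url)
        except Exception as exc:  # network failures, malformed responses
            results[endpoint] = {"flerror": True, "message": str(exc)}
    return results

# The wire format is plain XML-RPC; this is the payload a ping sends:
payload = xmlrpc.client.dumps(
    ("My Blog", "http://www.example.com/"), "weblogUpdates.ping"
)
```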

Automated Link Swaps Decrease SE Traffic


Monday, July 25, 2005

Years ago, Google started a great search engine, ranking Web pages by PageRank within topical matches. Altavista was a big player, and a part of its algo ranked by weighted link popularity. Even Inktomi and a few others began to experiment with link popularity as a ranking criterion.

Search engine optimizers and webmasters launched huge link farms, where thousands of Web sites were linking to each other. From a site owner's point of view, those link farms, aka spider traps, 'helped search engine crawlers to index and rank the participating sites'. For a limited period of time, Web sites participating in spider traps were crawled more frequently and, thanks to their link popularity, gained better placements on the search engine result pages.

From a search engine's point of view, artificial linking for the sole purpose of manipulating search engine rankings is a bad thing. Their clever engineers developed link spam filters, and the engines began to automatically penalize or even ban sites involved in systematic link patterns.

Back in 2000, removing the artificial links and asking for reinclusion worked for most of the banned sites. Nowadays it's not that easy to get a banned domain back into the index. Savvy webmasters and serious search engine optimizers have found better, honest ways to increase search engine traffic.

However, there are still a lot of link farms out there. Newbies following bad advice still join them, and get caught eventually. Spider trap operators are smart enough to save their ass, but thousands of participating newbies lose the majority of their traffic when a spider trap gets rolled up by the engines. Some spider traps even charge their participants. Google has just begun to act on a link spam network whose operator earns $46,000 monthly for putting his customers at risk.

Stay away from any automated link exchange 'service', it's not worth it. Don't trust sneaky sales pitches trying to talk you into risky link swaps. Approaches to automatically support honest link trades are limited to administrative tasks. Hire an experienced SEO Consultant for serious help on your link development.

The value of links from a search engine's perspective


Tuesday, January 31, 2006

The most valuable links are hard to get. That does not mean that it is impossible for an eCommerce site or a site providing niche services to get well linked, and to get ranked by search engines based on those inbound links.

If you can submit your URL to a Web site and you get linked automatically, that's a useless link, regardless of the PageRank your Google Toolbar indicates for the source page. If you can drop your links in a forum's signature lines and posts, that's in most cases a powerless link, regardless whether the forum software adds a link condom or not.

Does that mean you should not spread low valued links? No, just don't rely on cheap links. URL drops and a handful of directory submissions can and should be part of your natural mixture of incoming linkage.

Does that mean you shouldn't drop your links in forums, user groups, blogs and the like? No, it means you shouldn't bother with me-too posts; enhance your links instead. Write an outstanding post, and link to your deep pages and other resources as well to point to related information. An informative and well thought out post will attract replies, so the forum thread gets more content and ranks higher in the thread list thanks to many views and replies. Perhaps the thread even gets linked from pages outside the forum's domain.

Having links in a popular thread helps you in two ways. First, the more links point to the thread, the more reputation (link popularity, PageRank, topic relevancy, TrustRank...) the thread will pass to your pages. Reputation is the underlying payload of a link from a search engine's perspective, and the engines have developed a couple of ranking formulas to measure the reputation a link passes to the destination page.

The PageRank formula, for example, looks at the score of the source page, divides the source page's PageRank by the number of outgoing links, multiplies the result by a damping factor to take into account that a visitor of the source page will not follow every link, and passes the remaining portion of PageRank to the destination page. From this -- pretty much simplified! -- explanation it's obvious that PageRank is easy to manipulate by artificial linkage, but don't try it, you cannot outsmart Google over the long haul. The same goes for TrustRank and other ranking algorithms.
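The simplified description above can be turned into a toy power iteration. Google's real algorithm differs in many details (dangling links, personalization, and much more), so treat this purely as an illustration of the mechanics: each page's score is split among its outgoing links, damped, and redistributed until the scores settle.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to.
    Every page must have at least one outgoing link in this toy version."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - damping) / len(pages)
            + damping * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return pr

# A symmetric three-page cycle: every page ends up with the same score.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})
```

Add one extra inbound link to any page of the cycle and rerun it to watch the scores shift, which is exactly why artificial linkage moves the needle at all.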

The above said leads to the second and more important reason why well placed deep links in a popular forum thread do help your site's SE rankings: they generate traffic to the landing pages. Forum visitors landing on your interior pages may bookmark your stuff and come back; that's extremely valuable recurring traffic. A minority of those visitors may even link to your pages, for example from their community blogs, by adding your pages to a social bookmarking service like delicious or furl, or just by dropping your URL into their own forum posts.

By clicking your links and surfing your site following navigational links, and by bookmarking your pages, a few of your visitors leave tracks in search engine databases, and the engines do use those in their rankings: popular pages do get a ranking boost. The engines capture such information via their toolbars and by spidering social bookmarking services (more information).

To summarize: posting great content on forums generates human traffic during the life cycle of a thread, and it boosts your search engine rankings to some degree. Due to the nature of the beast, this works only if you constantly participate in public discussions, because even popular threads get buried in the archives sooner or later. Forum posts can build a good basis of inbound links before trusted and more static links from authority sites come into the mix, or when the nature of your content makes it impossible to get links from authority sites like the W3C and high ranked hubs like the LII.

If you do it right, a search for your page titles and related terms will bring your URLs on the SERPs eventually. If you fail, and in the very beginning of your campaign, the forums will outrank you even for your domain name.

Keep this rule of thumb in mind: Valuable links generate human traffic. Even if only a handful of visitors land on your pages, the link is valuable. Don't bother with links placed where no human visitor clicks them. Although the search engines will discover most of these non-prominent links, and may even spider the destination page following them, unclicked links do not improve your search engine rankings, and they will not attract (new) visitors.

Besides active participation in communities like forums, there is another way to build a stable base of recurring visitors: establish your own community. You can add a blog to every type of Web site; even an e-Commerce site or a niche directory will profit from an integrated, active blog. Blog interesting stuff frequently, tag all posts properly, ping all blog search engines, reply to comments in a timely manner, repeat.

Once you've produced a fair amount of good blog posts, go out and search for related blog posts. If you can add value to another blog post, write a comment and drop a link to one of your posts in a non-spammy way. In most cases those links don't pass reputation to your blog, but nosy bloggers will follow them and perhaps blog about your blog if they like it. Do not spam other bloggers' comments; that is, don't comment on each and every post, and don't always include a link to your blog in the message body. Politely build reputation, try to become an authority in your field, and you will be cited more and more; that is, link love from the blogosphere floods in continuously.

Another way to create a user base is to interact with visitors, for example allow your visitors to post comments and product reviews, do weekly polls and so on, if it makes sense even add a forum. Be creative. Try new things frequently, and maintain working concepts actively. Search engines follow their users, so if you're able to generate traffic, you'll gain fair search engine placements naturally.

Prevent Your Unique Content From Scraping


Friday, July 22, 2005

Protect your unique content! Yesterday CopyScape alerted me to a content thief reprinting my stuff at XXX1. This moron scraped a few paragraphs from my tutorial on Google Sitemaps, replaced a link to Google's SEO page with a commercial call to action, and uploaded the plagiarism as a sales pitch for his dubious and pretty useless SEO tools.

As usual, I've documented the case and sent it over to my lawyer. Then I thought I could do more with all the screen shots, WHOIS info etc., and developed a template for a page of evidence1. Now it takes me only a few minutes to publish everything others should know about a content thief. Entering a few variables and pushing a button creates a nice page documenting the copyright infringement.

Unfortunately I can't post the template, because it works with my CMS only, but you'll get the idea. Be creative yourself: put the thief's name, company and personal data prominently nearby terms like 'evil' and 'thief' all over the page, including the META tags. Then link to the page and submit it to all search engines. After a while do a search for the thief and check whether you've outranked the offending site. If not, consider reading a few of my articles on search engine optimization ;)


My content was removed after my outing page was picked up by the engines and ranked fine within a few days. Thus I've removed links and names from this post.
Here is another example of an outing page: Content Theft: Tahir J. Farooque's plagiarism at CRESOFT.COM. This page outranked the business name and more on all engines, before I got an apology from a guy stating the company was sold to him after the copyright infringement, thus I've removed it from all search engine indexes.

Spam Detection is Unethical


Saturday, July 16, 2005

While releasing a Googlebot Spoofer allowing my clients to check what their pages serve to search engine crawlers versus browsers, I was wondering again why major search engines tolerate hardcore cloaking to a great degree. I can handle my clients' competition flooding the engines with zillions of doorway pages and the like, so there are no emotions involved here. I just cannot understand why the engines don't enforce compliance with their guidelines. That's beyond any logic, thus I'm speculating:
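A user agent spoofer of that kind boils down to very little code. Here is a minimal Python sketch (this is not the tool mentioned above; the browser user agent string is arbitrary, and note the caveat in the comment): fetch the same URL twice, once as a browser and once as Googlebot, and compare the responses.

```python
import urllib.request

# Example user agent strings. The Googlebot string is the well-known one;
# the browser string is just an arbitrary example.
BROWSER_UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/1.0"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def fetch_as(url, user_agent):
    """Fetch a URL pretending to be the given user agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()

def looks_cloaked(url):
    """True if the page serves different content to Googlebot's user agent.
    Caveat: this only detects agent-based cloaking. IP-delivery cloaking,
    which keys on the crawler's IP address, is invisible to a UA spoofer."""
    return fetch_as(url, BROWSER_UA) != fetch_as(url, GOOGLEBOT_UA)
```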

They don't care. If they went after spamindexing, they would lose a few billion indexed pages. That would be very bad PR, absolutely unacceptable.

They have other priorities. Focusing on local search, they guess the problem solves itself, because it's not very probable that a spammer resides close to the search engine user seeking a pizza service and landing in a PPC popup hell. Just claim it ain't broke, so why fix it?

They believe spam detection is unethical. 'Don't be evil' can be interpreted as 'You can cheat us using black hat methods. We won't make use of your own medicine to strike back'. Hey, this interpretation makes sense! Translation for non-geeks: 'Spoofing is as evil as cloaking, cloaking cannot be discovered without spoofing, and since we aren't evil, we encourage you to cloak'.

Great. Tomorrow I'll paint my white hat black and add a billion or more sneaky pages to everyone's index.

Seriously, I've a strong gut feeling that the above will belong to the past pretty soon. The engines, which have already changed their crawlers' user agent names to 'Mozilla...', could learn to render their spider food and to pull it from unexpected IP addresses. With all respect to successful black hat SEOs, I believe that white hat search engine optimization is a good business decision, probably even in competitive industries over the long haul.

A SEO Strategy for Consulting Firms


Tuesday, December 13, 2005

Consultancy is hard to sell to Internet search engines, because search engines rank by popularity and unique, original and focused text content. Clients and competitors most likely will not link to a consulting firm's Web site, so where to gain topical link popularity? Links from the local chamber of commerce plus a few Web directories aren't enough.

Creation of unique and original content is the second major issue. The consultant usually cannot reveal details about consulting projects, and every project has a different focus, so without interesting stories there is nothing to write about?

Wait! The consultant's broad competence as well as extensive knowledge and experience should lead to a few pages of text content. Not really: most consultants are afraid to publish knowledge which may be sellable some day.

As a matter of fact, mission statements, visions, and similar sales pitches based on a finite number of catchwords will not attract organic search engine traffic. There are only so many ways to praise a consulting firm, none of them is unique any more. The search engines' indexes are flooded with variations of the same yada yada yada, they consider it (near) duplicated text content, and refuse to deliver the consultant's Web pages on their first search result pages (SERPs).

How to escape the dilemma? You really want to reach the potential clients querying their preferred search engine with questions like [how can I achieve "insert goal" in a "insert industry" shop?] or [what exactly is "your special subject here" and how can I implement it?]. You cannot attract this highly targeted search engine traffic by featuring phrases like "our Business Management Consultants are skilled management consultants with extensive combined experience gained at first addresses" on your Web site.

Search engine users don't search for a consultant's philosophy or academic background. They do search for solutions to particular problems, and they do use natural search queries as well as industry specific terms, including brands and abbreviations.

If you don't change your attitude, those potential clients will never join your clientele. You must provide tons of freebies to attract converting search engine traffic. You must plaster your Web site with valuable information, for example knowledge bases, state-of-the-art tutorials and guides, (anonymized) case studies, up-to-date articles and white papers, free tools like branded checklists or spreadsheets, online services like KPMG's Alumni or weekly tax briefings ... be creative in revealing all your business secrets.

Actually, besides your payroll and such stuff you can't reveal crucial business secrets, because the toolset of a consultant consists of knowledge which is available to the public (even if buried in university libraries or obscure periodicals), common sense, and experience. Publishing knowledge in the context of consulting experience --applying all the phrases search engine users may search for-- will not make the reader a good consultant, but it pleases the reader's mind. Pleased visitors are more likely to subscribe to your RSS feeds or newsletters, and they bookmark your Web pages. Chances are good that a recurring visitor will get used to your site, making use of your service request form --or at least the contact page-- one day.

Above all, frequent online publishing is a great way to become noticed as an expert on a topic, which leads to increased popularity, and popularity on the Internet gets expressed as link love. Outstanding content combined with the power of link love --that is naturally earned inbound links-- leads to nicely targeted search engine traffic. It doesn't hurt to consult a smart search engine optimizer for some fine tuning of your SEO strategy, and to take care of the technical aspects like crawlability and Web site structuring, but basically a good Web developer/designer will be able to get your thoughts on the World Wide Web.

Optimizing the Number of Words per Page


Sunday, July 17, 2005

I'm just in the mood to reveal a search engine optimizing secret, which works fine for Google and other major search engines as well.

The optimal number of words per page gets determined as follows:

  • Resize your browser window to fit 800*600 resolution.
  • Create a page with three columns below the page title in an H1 tag. In the first column put your menu. In the third column put an AdSense 'Wide Skyscraper' ad (160*600).
  • Write your copy and paste it into the second column. Reload the page. If the ads match your topic, fine. If not, rewrite your text, or add more, or remove fillers.
  • You have written too many words if the middle column exceeds the height of the right ad.
  • Link to this page and leave it alone until Googlebot has fetched it 2-4 times. Time to index is 2 days, so reload the page 2 days after the last Googlebot visit to check whether the ads still match your content (respectively the keyword phrase you're after). If not, tweak your wording.
  • Then re-arrange the ads and move on. You've achieved the optimal number of words per page for your keyword phrase.

Actually, the gibberish above is a persiflage on AdSense optimized content sites.

Seriously, a loooong copy with fair link popularity attracts way more SE traffic, especially if it is supported by a few tiny pages which are naturally optimized for particular phrases, for example footnote pages pointing out details, or one-page definitions of particular terms used in the long copy. This structure is comfortable for all users, for experts on the topic and interested newbies alike, thus search engines honor it. There is no such thing as an optimal number of words per page.

Yahoo! Site Explorer BETA - First Impressions


Monday, December 12, 2005

On September/30/2005 the Yahoo! Site Explorer (BETA) was launched. It's a nice tool showing a site owner all indexed pages per domain, and it offers subdomain filters. Inbound links get counted per page and per site. The tool provides links to the standard submit forms.

The number of inbound links seems to be way more accurate than the estimates available from linkdomain: and link: searches. Unfortunately there is no simple way to exclude internal inbound links. So if one wants to check only 3rd party inbounds, a painful procedure begins:
1. Export of each result page to TSV files, that's a tab delimited format, readable by Excel and other applications.
2. The export goes per SERP with a maximum of 50 URLs, so one must delete the two header lines per file and append file by file to produce one sheet.
3. Sorting the work sheet by the second column gives a list ordered by URL.
4. Deleting all URLs from the own site gives the list of 3rd party inbounds.
5. Wait for the bugfix "exported data of all result pages are equal" (each exported data set contains the first 50 results, regardless from which result page one clicks the export link).
Since December/06/2005 Yahoo provides a filter to exclude internal links (per domain and sub-domain).
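Steps 1 to 4 of the procedure above can be scripted. The sketch below assumes the column layout described here (URL in the second column of each tab-delimited export, two header lines per file); adjust if Yahoo changes the format:

```python
import csv

def merge_exports(tsv_texts, own_domain):
    """tsv_texts: list of Site Explorer TSV export contents as strings.
    Returns rows of third-party inbound links, sorted by URL."""
    rows = []
    for text in tsv_texts:
        lines = text.splitlines()[2:]  # drop the two header lines per file
        rows.extend(csv.reader(lines, delimiter="\t"))
    # keep only rows whose URL (second column) is not on the own domain
    external = [r for r in rows if len(r) > 1 and own_domain not in r[1]]
    return sorted(external, key=lambda r: r[1])
```

Feed it the contents of each exported file and your domain name, and you get the deduplicated, sorted list of 3rd party inbounds in one pass.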

The result pages are assorted lists of all URLs known to Yahoo. The ordering does not represent the site's logical structure (defined by linkage), and not even the physical structure seems to be part of the sort order. It looks like the first results are ordered by popularity, followed by an unordered list. The URL listings contain fully indexed pages, with known-but-not-indexed URLs mixed in; the latter can be identified by the missing cached link.

Desired improvements:
1. A filter "with/without internal links".
2. An export function outputting the data of all result pages to one single file.
3. A filter "with/without" known but not indexed URLs.
4. Optional structural ordering on the result pages.
5. Operators like filetype: and
6. Removal of the 1,000 results limit.
7. Revisiting of submitted URL lists a la Google sitemaps.
8. [Added December/06/2005] Filtering of AdSense scraper sites like ODP and Wikipedia clones.

Overall, the site explorer is a great tool and an appreciated improvement. The most interesting part of the new toy is its API, which allows querying for up to 1,000 results (page data or link data) in batches of 50 to 100 results, returned in a simple XML format (max. 5,000 queries per IP address per day).
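Paging through the API in batches up to the 1,000 result ceiling is straightforward. In the sketch below, the endpoint URL and parameter names follow the V1 pageData web service as I recall it; treat them as assumptions and check the API documentation (the appid is a placeholder):

```python
import urllib.parse

# Presumed V1 endpoint; parameter names are assumptions, verify in the docs.
BASE = "http://search.yahooapis.com/SiteExplorerService/V1/pageData"

def batch_urls(appid, domain, batch_size=100, limit=1000):
    """Yield one request URL per batch of up to batch_size results."""
    for start in range(1, limit + 1, batch_size):
        params = {"appid": appid, "query": domain,
                  "results": batch_size, "start": start}
        yield BASE + "?" + urllib.parse.urlencode(params)

# Ten batches of 100 cover the 1,000 result ceiling.
urls = list(batch_urls("my-app-id", "example.com"))
```

Fetch each URL, parse the returned XML ResultSet, and stay under the 5,000 queries per IP per day quota.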

Good news for site and mass submission addicts: as of December/06/2005 Yahoo accepts RSS/Atom feeds and HTML pages in addition to the already supported plain URL lists in text files -- which, unfortunately, were dumped after the one fetch triggered by a manual submission.

Author: Sebastian
Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited · Contact · Privacy