Web Log Archive · Index · Part 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12 · 13 · 14 · 15 · 16 · 17 · 18 · 19 · 20 · 21 · 22 · 23 · Expand · Web Feed

Search engines produce way too much collateral damage in their war on index spam. eCommerce sites in particular suffer from harsh duplicate content filtering. Here is a concept that may help to make search engine filters act as they are supposed to: in the interest of both the search engine user and the site owner.

Search engines use duplicate content filters to avoid repetition on their SERPs. That should be a good thing for search engine users. In some cases, however, this filtering does not lead to relevant results, because the filters accidentally suppress all relevant pages. This weird behavior may be a persistent bug, but heavily spammed search engines won't lower the bar in their war on spam, regardless of the collateral damage: zillions of legitimate Web sites get trashed along with the questionable stuff. Thus Webmasters have to react, but how?

[Image: Example of a hierarchically organized product structure]

To get an idea of the problem, think of a hierarchically organized product structure. Let's say an eCommerce site sells all sorts of widgets. The online shop's product pages are organized three levels deep in structures like widgets / colored widgets / green widgets. The uppermost page tells a story about widgets, listing their general attributes and behavior, and links to widget categories. The second level pages provide information per widget category, expanding the list of widget properties, e.g. with attributes and behavior of colored widgets, and link to product pages. The third level pages complete the product description by adding product specific details, e.g. colors and sizes, giving the full description of a green, red or blue widget, along with prices and shipping details.

A search engine user searching for [green widgets] (see note 1) is supposed to land on the product page, which provides all product information on green widgets. Sounds easy, and it worked fine for ages. Unfortunately, because search engines filter out way too much 'duplicate content', it doesn't work anymore. That is, the search engine user will not get the page about green widgets on the SERPs for [green widgets]. The frustrated user clicks on an advertisement on the SERP or goes out to buy a green widget at a brick and mortar retailer.

[Image: Example of a poorly generated product page. Black = duplicated | White = unique]

What causes a search engine to suppress the green widgets product page on the SERPs for [green widgets]? The reason is a chain of duplicated text snippets (black text on the image):
1. General properties of widgets from the 1st level page, duplicated on all category pages under widgets and all their product pages.
2. Properties of colored widgets from the 2nd level category page, duplicated on all product pages under colored widgets.
3. Shipping details shared with all 3rd level product pages.
In this example only the color "green" is a unique piece of information (white text on the image). Even the list of available sizes can be found on many other product pages. That's not enough to consider the page useful from a search engine's point of view, although the page is useful for visitors and looks unique, thanks to the big picture showing a green widget.
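To make the unique/duplicated ratio of such a page tangible, here is a minimal sketch that measures it with word shingles (overlapping word sequences). This is only an illustration: how search engines actually detect duplicates is not public, and the function names, the shingle size of four words and the sample texts are assumptions made for this example.

```python
import re


def shingles(text, size=4):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Texts shorter than one shingle count as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def unique_ratio(page_text, other_texts, size=4):
    """Share of the page's shingles that appear on none of the other pages."""
    own = shingles(page_text, size)
    if not own:
        return 0.0
    seen_elsewhere = set()
    for text in other_texts:
        seen_elsewhere |= shingles(text, size)
    return len(own - seen_elsewhere) / len(own)


# Hypothetical three-level example: each level repeats the text above it.
widgets = "Widgets are sturdy and easy to clean."
colored = widgets + " Colored widgets keep their paint for years."
green = colored + " This widget is green."

ratio = unique_ratio(green, [widgets, colored])
print(round(ratio, 2))  # prints 0.27
```

A low value for a product page compared with its upper level pages and its siblings hints at the chain of duplicated snippets described above.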

Well, you can argue that in real life the 3rd level used to separate colors is superfluous. Sorry, invalid plea. Most eCommerce applications force a separate product page per SKU. Also, even if green widgets are supposed to get used only in the meadows, blue widgets in the sky, and red widgets in the fire ... that's not enough to make the page unique in the eyes of a search engine eagerly fighting index spam.

So what can be done to slip thru the duplicate content filter? One could display the duplicated text in iFrames, or make it an image. Both are bad moves, because this way keywords get removed from the green widgets page, and those keywords are necessary to match search queries like [green widgets "other colored widget property"]. Another alternative is short summaries, linked to the relevant text snippets on the upper level pages, which tell the whole story. This method improves the unique/duplicated ratio to some degree, but it devalues the on-page content for surfers, and snips out a lot of keywords too, so it's far away from the desired solution.

Game over? Nope. There is another approach to escape the dilemma, but it requires intensive site specific testing before it gets used on production systems. The idea is to feed the duplicate content filter properly, that is, to force the dupe filter to work like it should: in the best interest of both the site owner and the search engine user. The method outlined below is not a bulletproof procedure, because its success is highly dependent on the raw content. It will not work on (affiliate) sites where the product pages are generated from a vendor's data feed with no value/content added, or where the contents aren't unique to the site for other reasons. If the product descriptions aren't normalized (e.g. the duplication of text happens in a description field of the products table), the coding becomes tricky.

Search engines analyze Web pages block by block to extract contents from templates (see VIPS). That's why large sites with heavily repeated headers, navigation elements, footers etc. aren't downranked, and their pages rank for keywords provided in the page body, not for keywords from templated page segments or repetitive menus (more info here).

[Image: Example of an improved product page. Black/gray = duplicated | White = unique]

Put very simply, page areas belonging to the template aren't considered in rankings, and usually they don't trigger duplicate content filters. Thinking the other way round, duplicated content from upper levels put into templated page blocks is safe. Assuming that works, does block label manipulation alone prevent dupe filtering? Well, if the unique/duplicated ratio is very poor, it's necessary to throw in some unique text on the SKU level, serving as fodder for the spiders. Even if the restructured page then passes the dupe testing, search engines don't consider a page carrying a tiny amount of unique spiderable text content (a thin page) important. If a thin page carries affiliate links, it's considered a thin affiliate page, and that's even worse than getting hit by dupe filters. If the portion of unique text is low in relation to the number of words in the page's body, fine tuning the unique/duplicate ratio requires an experienced SEO. It sure helps to avoid new systematic patterns, so don't reword the added content on the SKU level over and over. Write it from scratch instead, and in different text lengths per SKU.

So how can one declare the duplicated text as part of the template? To get started, it helps to know how search engines make use of HTML block level elements (e.g. table/row/cell, heading, paragraph, lists) to partition Web pages, and what kind of neat algorithms beyond those simple methods their engineers play with in the labs. The next step would be to analyze your own templates, and some more on popular sites. Look at attributes like class, id and name in HTML block level elements, font attributes, HTML comments, visual lines, different back- and foreground colors, borders or even just whitespace used to draw visible or invisible rectangles around templated page elements. Get a feeling for the code behind rendered content positioning. Search for unique words and phrases found in different blocks to determine how much weight the engines give to particular blocks.
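The template analysis can be supported with a small script. The following sketch uses Python's standard html.parser to list the block level elements of a page with their id and class attributes, and to find block texts repeated across pages, which are likely template candidates. It is a crude stand-in for what search engines do, not a reimplementation of their partitioning. It assumes well-formed markup with explicit closing tags; the tag list, the names and the sample pages are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed subset of block level tags; extend it for your own templates.
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3"}


class BlockCollector(HTMLParser):
    """Collect (tag, id, class, text) for each block level element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open block elements, innermost last
        self.blocks = []  # finished blocks, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            a = dict(attrs)
            self.stack.append(
                {"tag": tag, "id": a.get("id"), "class": a.get("class"), "words": []}
            )

    def handle_data(self, data):
        # Text belongs to the innermost open block element.
        if self.stack:
            self.stack[-1]["words"].extend(data.split())

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1]["tag"] == tag:
            block = self.stack.pop()
            text = " ".join(block["words"])
            if text:
                self.blocks.append((block["tag"], block["id"], block["class"], text))


def blocks_of(html):
    """Return the non-empty blocks found in an HTML string."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def repeated_texts(pages):
    """Return block texts that occur on more than one of the given pages."""
    counts = Counter()
    for html in pages:
        counts.update({text for _, _, _, text in blocks_of(html)})
    return {text for text, n in counts.items() if n > 1}


page_green = '<div id="product">Green widget, sizes S to XL.</div><div class="footer">Ships in two days.</div>'
page_red = '<div id="product">Red widget, sizes S to M.</div><div class="footer">Ships in two days.</div>'
print(repeated_texts([page_green, page_red]))
```

Running it over a handful of product pages shows which blocks carry repeated text and which attributes surround them.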

Then consolidate your notes and try to create a product page template, where product information duplicated from upper levels is clearly part of templated blocks, for example the footer. Put the unique content at the top of the page body, and separate it from the 'template blocks' with an image, a thin line or another object that doesn't break the user's coherent impression of the product's content blocks as one prominent part of the page.

Although the non-unique text in 'templated' blocks can be formatted similarly or even identically to the unique text, it must reside in separate HTML block level elements, which have all signs and attributes of real templated blocks, and which are clearly zoned (even matching HTML comments like 'start footer template' or 'end body area' may help). The goal is not to trick the engines, but to point their dupe filters to the fact that those blocks are repeated on a bunch of related pages, thus they are part of the template and not a legitimate subject of duplicate content filtering.
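As a rough illustration of such zoning, the sketch below renders a product page with the unique text at the top of the body and the inherited text in a clearly marked footer block. The function name, the id and class values and the sample texts are hypothetical. The HTML comments follow the examples named above. Whether a given engine honors this kind of markup is exactly what needs site specific testing.

```python
from html import escape


def render_product_page(title, unique_text, inherited_texts):
    """Return page HTML with unique text in the body and inherited text in footer blocks."""
    footer = "\n".join(
        f'  <p class="tpl-inherited">{escape(text)}</p>' for text in inherited_texts
    )
    return "\n".join([
        "<!-- start body area -->",
        '<div id="product-body">',
        f"  <h1>{escape(title)}</h1>",
        f"  <p>{escape(unique_text)}</p>",
        "</div>",
        "<!-- end body area -->",
        "<hr>",  # thin line separating unique content from the templated blocks
        "<!-- start footer template -->",
        '<div id="footer" class="template">',
        footer,
        "</div>",
        "<!-- end footer template -->",
    ])


page = render_product_page(
    "Green widgets",
    "Made for meadows & lawns.",
    ["Widgets are sturdy.", "Colored widgets keep their paint."],
)
print(page)
```

The same footer block, with the same attributes and comments, would be emitted on every product page under a category, so the repetition shows up in one consistent place.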

Try to place all important and unique 3rd level attributes like sizes and colors in the anchor text of internal links (and external inbound links, if possible). Optimizing off-the-page factors to emphasize the uniqueness of title tags, headings and highlighted keywords on-the-page can make the difference between ending up in a search engine's trash can or supplemental index and getting fair placements on the SERPs.

Remove all generic stuff to lower the amount of non-unique text. For example, display shipping details, general slogans, trademark notices and disclaimers in iFrames, or use text on images. Outputting text and unimportant links client-side (with JavaScript) prevents some search engines from indexing them, but that's not a very smart long term strategy, because the crawlers behave more and more like reengineered human users, that is, they render JavaScript output, or will do so soon.

Go test the new layout for a while with a few products that were wiped out of the search engine indexes by duplicate content filters. Tweak the code until the pages reappear in searches. If all code tweaking doesn't help, add more unique text on the SKU level, and repeat. If you participate in the Google Sitemaps program, give your test pages the highest crawling priority and ensure the date of last modification for those pages is accurate. Track Googlebot's visits and search for altered results two days after crawling, which is the average time to index.
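For the sitemap part, a minimal sketch of generating entries with the highest priority and an accurate last modification date could look like this. It assumes the sitemaps.org 0.9 namespace of the current sitemap protocol, which postdates this article; the function name and the URL are placeholders.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Assumption: namespace of the current sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified_date, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical test page, flagged with the highest priority (1.0).
sitemap_xml = build_sitemap([
    ("https://www.example.com/widgets/colored/green.html", date(2005, 10, 4), 1.0),
])
print(sitemap_xml)
```

In practice the last modification date should come from the page's real change timestamp, not from the time the sitemap was generated.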

Please don't take the method outlined above as a bulletproof SEO tactic. Whether it can lead to success or not depends on so many site specific factors (e.g. content quality and structuring, the overall Web site architecture and its linking policy, the Webmaster's experience and SEO skills ...) that any generic prognosis or even guarantee would be foolish. However, a revamp aimed at properly feeding duplicate content filters should result in improved usability and more search engine friendly pages, which is an improvement in any case, and worth a try.

Tuesday, October 04, 2005



Note 1: Expressing search queries in brackets has some advantages, as Matt Cutts points out. It allows quotes and parentheses to be used in the query string. For example, a search query like ["search term" +(seo | sem) -spam] is 'unquotable'. Brackets, on the other hand, have no syntactical meaning in search queries.

Author: Sebastian
Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited · Contact · Privacy