
What looks like a GoogleBug may just be a sign of a few more GoogleBots applying well-known dupe filters to a larger data set. Or why unique content does not equal unique information.

Search engines filter content to avoid useless repetitions on the SERPs, and that confuses the hell out of most Webmasters. These approaches to delivering better search results are often called duplicate content penalties, as mentioned here. Referring to (not 100% precise) filtering as a penalty is counterproductive, however, because it hinders objective analysis. If I think of a phenomenon as a penalty, I'm less willing to look for causes on my side too. I'll blame the evil search engine and end up changing useless things, making the whole situation even worse, more complex, and less understandable.

A relatively new phenomenon on Google's SERPs is that Google trashes all similar pages: not a single page from a bunch of near-duplicates survives. It seems that Google hates duplicated content so much that it wipes it out completely, without preserving the source, that is, the page that provided the content in question before the duplicates appeared. Here is an example1:


Say you have a group of pages about a blonde C-cup celeb on a paparazzi site for men: an index page, a tour calendar, a bio, some stories and pictorials, an image gallery, and a video clip index. All those pages provide unique content, get indexed fine, and rank well.

Then you collect a bunch of content about a blonde D-cup celeb. You use your existing pages as templates, changing only what is significantly different, and end up with at least a few near-duplicate pages. Then you repeat this procedure with an A-cup celeb and a B-cup hottie. The result is four groups of pages with an identical structure and very similar content, say 60% of the structure and on-the-page text is identical.
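To get a feeling for what a figure like "60% identical" means, here is a rough, hypothetical gauge using Python's difflib on the visible text of two templated pages. The page texts are made up for illustration; this is not how any search engine measures similarity.

```python
# Rough gauge of how much text two templated pages share (illustration only).
from difflib import SequenceMatcher

# Two pages built from the same template, differing only in the swapped-out details.
page_one = "Tour calendar for celeb C: she kicks off her world tour in Berlin on May 5."
page_two = "Tour calendar for celeb D: she kicks off her world tour in Hamburg on June 12."

# ratio() returns a similarity score between 0.0 (nothing shared) and 1.0 (identical).
similarity = SequenceMatcher(None, page_one, page_two).ratio()
print(f"The two pages share roughly {similarity:.0%} of their text")
```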

Until recently, all four groups of pages would get indexed, and they might even rank pretty well for each celeb's name and a few other keywords.

Now say Google's new duplicate content threshold is 60% (it is not 60%, that's just an example!). The expected behavior would be that Google keeps the oldest pages (about the C-cup celeb) in the index and suppresses the pages about the clones with smaller or bigger breasts.
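A minimal sketch of that expected behavior, as I assume it would work (this is my guess, not Google's actual code): within a cluster of near-duplicates that exceeds the threshold, only the page indexed first survives. The URLs, dates, and the 0.60 threshold are illustrative only.

```python
# Sketch of the expected filter behavior (assumption, not Google's code).
from datetime import date

THRESHOLD = 0.60  # illustrative figure only, as stated in the text

cluster = [
    {"url": "/c-cup-celeb/", "first_seen": date(2004, 5, 1)},
    {"url": "/d-cup-celeb/", "first_seen": date(2004, 9, 12)},
    {"url": "/a-cup-celeb/", "first_seen": date(2005, 1, 3)},
    {"url": "/b-cup-celeb/", "first_seen": date(2005, 2, 20)},
]

def filter_cluster(pages, similarity):
    """Keep all pages below the threshold; above it, keep only the oldest one."""
    if similarity < THRESHOLD:
        return pages
    return [min(pages, key=lambda p: p["first_seen"])]

# With 72% similarity, only the oldest (C-cup) page would survive.
print(filter_cluster(cluster, similarity=0.72))
```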

Here comes the 'bug'. Google trashes all four groups of pages.


But is it really a bug? Probably not. It would be a bug if the assumption above, "All those pages provide unique content", were true. The content may be unique within the scope of the site, but it is not unique on the Web. Many other authors have written their own stories about the four blonde celebs, that is, the information is spread all over the Web, slightly reworded and often quoted. Even the images and vids are available on tons of other pages out there.

If Google compares text content not page by page, with all text counted, but snippet by snippet, extracting the core information even from reworded text, then very probably all four versions are considered duplicates and have to disappear, because pages from other sites got the source bonus.
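A minimal sketch of what snippet-by-snippet comparison could look like, assuming it works roughly like classic near-duplicate detection: overlapping word n-grams ("shingles") compared via Jaccard overlap. This is an illustration under that assumption, not Google's algorithm; the example snippets are invented.

```python
# Shingle-based snippet comparison (illustration, not Google's algorithm).

def shingles(text, n=4):
    """Return the set of overlapping word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

# The same information, reworded: the word order changes, yet most shingles survive.
original = "the blonde celeb announced her world tour in a press release today"
rewrite = "in a press release today the blonde celeb announced her world tour"

print(f"Snippet similarity: {jaccard(shingles(original), shingles(rewrite)):.2f}")
```

Reworded copies keep a large share of their shingles, so a filter working at this granularity can flag them as carrying the same information even though no two pages are literal copies.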

Another factor to consider is that Google constantly improves its filters and capacity. In many cases those near-duplicate pages slipped through because, in the past, Google performed these comparisons on a smaller set of pages, that is, not on all pages of every site out there. In fact, the logic used in those dupe filters is not that new. It's in the news now because Google bought a few clusters of new machines to apply the filters to way more Web sites than ever before.

Well, as always, we don't know for sure whether the theory outlined above is accurate. At least it's an educated assumption, and it is plausible. So what can one do to escape the filtering, other than avoiding popular themes? It should be safe to write each copy from scratch. It should help to forbid copy, paste, and modify operations and to make use of shorter quotes. Promoting fresh content immediately should help to gain the source bonus. Nothing except a reinclusion request can help if a site gets trashed by accident, because search engine filters will always produce collateral damage. But promoting outstanding unique content results in popularity and reputation, which is the best protection against lost search engine placements.


Monday, October 03, 2005







1 Why not a 'widgetized' example about bees? Nobody would believe that a lot of Web sites provide content about bees, so the celeb example makes it easier to include the whole 'Net in the analysis and to draw wider conclusions based on the expanded scope.




Author: Sebastian




Copyright © 2004, 2005 by Smart IT Consulting · Reprinting except quotes along with a link to this site is prohibited · Contact · Privacy