First, let me say that I don’t work for Google, so the analysis on this page is a thoughtful hypothesis rather than definitive proof. I am only speaking from personal experience. With that little disclaimer out of the way, let me explain what I see happening with Panda’s duplicate content filter.
Google can’t really penalize a website for having one piece of duplicate content on it. Well, they could, but they don’t, because plenty of high quality sites have some amount of duplicate content on them. Instead, what they appear to be doing is measuring how much duplicate content you have on your website. If it makes up too high a percentage of your site compared to the unique content, then you are going to get whacked by Panda. I don’t know what that percentage is.
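To make the idea concrete, here is a tiny sketch of that site-level check. The page flags and the threshold are entirely made up, since nobody outside Google knows the real cutoff (or whether a simple ratio is even how it works):

```python
# Hypothetical per-page flags: True means the page duplicates content found elsewhere.
pages = {
    "/home": False,
    "/about": False,
    "/article-1": True,
    "/article-2": True,
    "/article-3": True,
    "/contact": False,
}

# PANDA_THRESHOLD is an invented number purely for illustration.
PANDA_THRESHOLD = 0.40

# Fraction of the site's pages flagged as duplicate.
duplicate_ratio = sum(pages.values()) / len(pages)
print(f"{duplicate_ratio:.0%} of pages duplicated")

if duplicate_ratio > PANDA_THRESHOLD:
    print("site-wide demotion likely")
```

The point is simply that the penalty would be a function of the site-wide ratio, not of any single duplicate page.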
Google knows when you have duplicate content on your site, and yes, it probably knows when you have spun content too. Spotting spun content may not be definitive without a manual review, but you can bet that even a really basic algorithm could be written to flag probable spun content. How would they do that?
You have to remember that Google’s computers break a page down into words and into blocks of content. They count how many times each word appears and how many times each phrase appears, and they use those counts to determine a page’s relevancy for a search. When they sort results by relevancy for a given query, they can see when twenty pages on the web all have identical word counts for every word, or very close to it. They all use the same words the same number of times. That is really easy to spot. Then what do they do? Obviously those pages contain duplicate content.
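The word-count comparison described above can be sketched in a few lines. This is just my illustration of the general idea, not Google’s actual method: build a word-frequency profile for each page and compare the profiles with cosine similarity, where 1.0 means the two pages use exactly the same words in exactly the same proportions:

```python
from collections import Counter
import math

def word_counts(text):
    # Lowercase the text and count occurrences of each word.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Compare two word-count profiles; 1.0 = identical word usage.
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy dog"
page_c = "an entirely different article about search engines"

print(cosine_similarity(word_counts(page_a), word_counts(page_b)))  # 1.0
print(cosine_similarity(word_counts(page_a), word_counts(page_c)))  # 0.0
```

Twenty pages scoring at or near 1.0 against each other would stand out immediately, which is exactly why exact duplicates are so easy to spot.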
They rank the duplicate content pages according to which one has the most Google PageRank. Actually, it is probably Topic-Sensitive PageRank. They may show more than one page of duplicate content if more than one page has a reasonable amount of authority, which is why you sometimes see a list of pages in the search results that are all duplicates of one another.
If your page has less authority than the other pages carrying the same duplicate content, then your page is not likely to appear in the results for that search. What if a high percentage of your website’s pages fall into that bucket? You get hit by Panda and your pages quit showing up at all. Your entire site, or a certain section of it, drops down in the search results except for maybe an exact domain search. The penalty does not have to be applied to the entire website, but it can be, and usually is.
What About Spun Content?
Spun content is harder to spot because the word counts are scrambled somewhat. You can bet that Google can still see when a bunch of pages share too many similarities, like word counts being within fifty words of one another and certain phrases matching from page to page. If you sorted these pages into a list or a graph, spun pages would tend to bunch up around characteristics like these. Writing an algorithm that filters such pages accurately would be difficult, though, because some innocent pages would certainly get caught in the fray. You could just spin a competitor’s article and get them penalized, assuming you had more authority or PageRank than they did. This is why Google can’t easily identify those pages with high accuracy using a purely mathematical algorithm.
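One common way to measure the kind of phrase-level overlap I’m describing is shingling: break each page into overlapping word triples and compare the sets with Jaccard similarity. Again, this is my own hedged sketch of the technique, not anything Google has confirmed; the sample texts are invented:

```python
def shingles(text, n=3):
    # Break the text into overlapping n-word phrases ("shingles").
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Fraction of shingles the two pages share: 0.0 = nothing, 1.0 = identical.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

original = "panda looks at how much duplicate content your site has overall"
spun = "panda examines how much duplicate content your website has overall"
unrelated = "a short guide to baking sourdough bread at home every weekend"

print(jaccard(shingles(original), shingles(spun)))       # noticeably above zero
print(jaccard(shingles(original), shingles(unrelated)))  # 0.0
```

A spun page scores well below an exact duplicate but well above an unrelated page, which is precisely the “bunching up” in the middle that makes a clean cutoff so hard to draw without penalizing innocent pages.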
So a spun content filter would have to take many more things into consideration. The quality of a site’s inbound links might come into play. Maybe a trust score combined with these on-page coincidence factors could be applied. Maybe they would just use the algorithm to identify suspicious websites that are then scheduled for a manual review. All of these are possibilities.
One thing is apparent. Google is discovering these spun websites eventually. How can you tell? You can tell because most of them are not ranking in the top ten even for low competition searches. If you were to create a simple ten page website with unique higher quality articles on it, you would find that you could get a new article ranked for a low competition search pretty darn easily. That doesn’t happen so much with those spun sites. So, they are picking them out. Over time they will get better and better at it. However, you can always find some exceptions to the rule.
What can you do about Panda?
Well, you could submit and obey Google’s rules like they want you to do.
- Make every page on your website substantially unique compared to the rest of the pages on your site.
- Make every page on your website substantially unique compared to the rest of the pages across the web.
- Make sure every page on your website has enough content that your website template (header, sidebar, footer) does not make every page appear cookie cutter.
At least then you’ll quit feeling the sting of their whip. Or you could continue on your wannabe super-hacker ways, trying to defeat an increasingly sophisticated algorithm. It is quite challenging. I’m going to do a little of both because it is fun. You have got to have some Yin with your Yang, if you know what I mean.