I have known G-Man for a couple of years, and for the most part we agree on 99.9999% of everything that we talk about in SEO, SEM, etc.
Over the last week G-Man and I got on the topic of duplicate content, and for the first time we were not agreeing on how the search engines find and filter duplicate content.
In a nutshell G-Man’s idea of how to get around the dupe content filter was to add more text in or around the dupe content itself, but not to shuffle the content.
Please take a minute to read G-Man’s full article on his idea of duplicate content, and subscribe to his feed, as he always has great bits of information on BlackHat and SEO in general.
Before I can get started on my view of duplicate content filters we must first talk about shingles and types of keywords.
First, the general perception is that search engines use something called shingles. A shingle is a set of words in a contiguous sequence in a document; search engines are thought to use these blocks of text to identify content, and they are closely related to how websites rank for given keyword(s).
Click here to learn more about Google’s patent on duplicate content and shingles.
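To make the idea of a shingle concrete, here is a minimal Python sketch that slides a fixed-size window over a text one word at a time. The function name `shingles`, the sample sentence, and the window size of 3 are my own choices for illustration; this is not taken from the patent or from anything Google has published.

```python
def shingles(text, size):
    """Return every run of `size` consecutive words in `text`, in order."""
    words = text.split()
    # A text shorter than `size` words yields no shingles at all.
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


# Illustrative sentence and window size, chosen only for this sketch.
sample = "the quick brown fox jumps over the lazy dog"
three_word_shingles = shingles(sample, 3)

for s in three_word_shingles:
    print(" ".join(s))
```

Each shingle overlaps the previous one by all but one word, which is why a short text still produces a fair number of them.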
Additionally, when looking at words, one must break them down into groups. There are three types of words:
- Stop words are general words like get, I, me, the and you.
- Filler words carry more weight than stop words, but they only imply the meaning of the action word and have less value, as they do not define what the document is about.
- Action words on the other hand complete the document and define how or what the document is about. Words like rankings, slipped, page, and penalty could be action words.
The full extent of which words are filler words and which are action words is unknown to us at this time; however, there is strong evidence that Google has selected words that it does count as action words and words that it does not.
If you have ever heard of a website called CopyScape.com, then you already know that it’s a free tool that helps you check a website’s page for duplicate content on other sites.
Over the years I have played with their tool, and being me, I like to figure out how things work. With that, I have spent much time working out how to manually get Google to show me duplicate content on other sites, and in some cases I have posted the results in other posts like this one.
In the past I found that the 150 characters Google showed per site in its normal search results contained 15 action words, which would make up the shingle.
In recent months the total description for each site has shrunk to 142 characters, which holds around 12-13 action words to complete a shingle.
In either case the size of the current shingle of 12-15 action words will not matter for explaining how Google can hunt down and almost stop duplicate content from ranking.
In theory let’s say Google gives each page 100% good faith on being unique before it starts to index and filter each page on a site.
As Google starts the indexing and filtering process, it would apply its duplicate content filter by applying the shingles step by step.
For my duplicate content filter example I will use the following text: “On some website that some webmaster owns he could have the following content, that he is worried about getting dupe content penalty for. But if he did not copy large quantities of content from other sites then he would not have to worry about getting such a penalty to such an extent that their webpage or website would not rank in Google.”
As Google starts the filtering process for duplicate content, it would break the above text down by first removing all stop words and filler words from the document, which would leave only the action words and create a simple yet unique fingerprint.
Since I do not know what words Google fully counts as stop words or filler words I will use all the text in the above quote for my example.
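For what the stripping step might look like if the lists were known, here is a rough Python sketch. The stop words are the examples given earlier in this post; the filler list is purely hypothetical and invented for the sketch, since nobody outside Google knows the real lists. The function name `action_words` is also my own.

```python
import string

# Stop-word examples from this post; the real list is unknown.
STOP_WORDS = {"get", "i", "me", "the", "you"}
# Hypothetical filler words, made up for this sketch only.
FILLER_WORDS = {"a", "and", "could", "have", "some", "that"}


def action_words(text):
    """Keep only the words that are in neither list, lowercased, in order."""
    kept = []
    for raw in text.split():
        # Drop surrounding punctuation so "slipped," matches "slipped".
        word = raw.strip(string.punctuation).lower()
        if word and word not in STOP_WORDS and word not in FILLER_WORDS:
            kept.append(word)
    return kept


remaining = action_words("The rankings slipped, and you could get a penalty")
print(remaining)  # ['rankings', 'slipped', 'penalty']
```

Whatever survives this step is what the shingles would be built from.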
In my above example text I have a total of 62 words. If I assume that Google uses 15 words per shingle, sliding the 15-word window along one word at a time, then we would be able to produce 48 shingles (62 - 15 + 1).
Now that we know the above document has 48 shingles, we also know that each shingle is worth around 2.08% of the document’s total 100% (100 / 48).
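The arithmetic for the 62-word example can be checked with a few lines of Python. The 15-word shingle size is the assumption stated above, not a confirmed Google figure, and the variable names are mine.

```python
# The example text from this post, split across lines for readability.
EXAMPLE = (
    "On some website that some webmaster owns he could have the following "
    "content, that he is worried about getting dupe content penalty for. "
    "But if he did not copy large quantities of content from other sites "
    "then he would not have to worry about getting such a penalty to such "
    "an extent that their webpage or website would not rank in Google."
)
SHINGLE_SIZE = 15  # assumed shingle length, as in the text above

words = EXAMPLE.split()
# One shingle starts at each word position that still has 15 words ahead of it.
shingle_count = len(words) - SHINGLE_SIZE + 1
# Every shingle gets an equal share of the page's 100%.
share = 100 / shingle_count

print(len(words), shingle_count, round(share, 2))  # 62 48 2.08
```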
As Google moves along, it starts comparing each shingle to the shingles on other websites, or on other pages of the same site. Each time it finds a matching shingle elsewhere, it subtracts the 2.08% from the webpage’s total quality value.
As more and more shingles are found to have a match the quality score is reduced more and more till at some point the page’s quality score drops below a threshold that Google has defined.
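Here is a minimal Python sketch of that scoring idea, as I picture it. Everything in it is an assumption for illustration: the function names, the tiny 3-word shingle size, the two sample texts, and the `THRESHOLD` value, which is only a placeholder since the real cut-off, if one exists, is known only to Google.

```python
def shingles(text, size):
    """Return every run of `size` consecutive words in `text`, in order."""
    words = text.split()
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def quality_score(page_text, other_texts, size):
    """Start at 100% and subtract an equal share per shingle found elsewhere."""
    page = shingles(page_text, size)
    if not page:
        return 100.0  # nothing to compare, so the page keeps its good faith
    seen_elsewhere = set()
    for other in other_texts:
        seen_elsewhere.update(shingles(other, size))
    matched = sum(1 for s in page if s in seen_elsewhere)
    return 100.0 * (1 - matched / len(page))


THRESHOLD = 80.0  # placeholder cut-off, not a known Google value

page = "cheap blue widgets shipped free today"
other = "buy cheap blue widgets shipped now"
score = quality_score(page, [other], 3)

print(score, score < THRESHOLD)  # 50.0 True
```

In this toy case two of the page's four shingles also appear on the other page, so half of the score is subtracted and the page falls under the placeholder threshold.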
This same quality score could be applied to the entire domain, as Google could take every page’s quality score and produce a score for the entire domain that reflects the domain’s level of quality with respect to duplicate content.
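How the page scores would be rolled up is anyone's guess. As one possible sketch, the Python below simply averages them; the plain average, the function name `domain_quality`, and the sample paths and scores are all hypothetical choices of mine, not anything Google has confirmed.

```python
def domain_quality(page_scores):
    """Average the per-page scores; a site with no pages keeps the full 100%."""
    if not page_scores:
        return 100.0
    return sum(page_scores) / len(page_scores)


# Hypothetical per-page quality scores for an imaginary site.
scores = {"/": 100.0, "/about": 95.0, "/articles/copied": 45.0}
domain = domain_quality(list(scores.values()))

print(domain)  # 80.0
```

A weighted average or a worst-page rule would behave quite differently, so treat the plain mean as just one option among several.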
This is where G-Man and I part ways in our view of the subject.
G-Man feels the search engines cannot use the above method unless they keep a record of each shingle’s location in the document. However, in my view it is not a necessity to keep locations, as no document should ever hold a quality score less than, let’s say, 80%.
Also, G-Man feels that by adding content in or around the duplicate content one would throw the search engines off from giving a penalty.
IMO, adding more unique content only reduces the percentage that each shingle contributes to the total quality score. Even then, one is unable to reduce the percentage for all the shingles to a point that would give the page a quality score higher than 80% without making more duplicate content.
Also by scrambling the words* you may avoid getting hit with a duplicate penalty, but then you really don’t have anything to rank with, as your document would not produce a quality search string from the action words.
I agree to disagree on duplicate content!