Issues with people scraping content

LiamZing · May 18, 2009

I recently posted a question about the benefits of article writing for SEO. As always seems to be the case with SEO, the response was mixed but still very useful, so thank you! To test things out I've posted some articles on article directory websites with do-follow links and now a whole new issue has popped up...

A number of websites are scraping my articles from those article directories and posting them as their own (even changing the author name) and deleting the links back to my site.

What I want to know is, is this an issue in terms of SEO and duplicate content?

How does Google know that I'm the original author, if at all?

How does Google choose which page to show, seeing as some of my articles are now listed on several pages?

Finally, is this even a problem or am I suffering from acute SEO paranoia?!

Thanks

Liam

MadWeb · May 18, 2009

Google will know which page it cached first. Assuming the website you originally posted your article on "pinged" Google to say it had new content, then it will know yours is the original.

I would send a polite email to the copiers and ask if they could correct their articles with the correct author link. Probably won't do any good, but worth a try.
Their website will be viewed negatively for having duplicate content anyway.

david64 · May 18, 2009

LiamZing said:
What I want to know is, is this an issue in terms of SEO and duplicate content?

Yes. If people are stealing your content, unless they have a lot of other content on the page, the content is going to be seen as duplicate.
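To illustrate the point about a copy sitting alongside a lot of other content: one textbook way to measure overlap between two pages is word shingling with Jaccard similarity. The sketch below is only that. It is not Google's method (which is not public), and the function names, sample text and shingle size are assumptions chosen for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of overlapping k-word windows ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)

# Hypothetical sample text, not taken from any real article.
article = "Writing articles for directories can bring links back to your own site"
padding = "plus a long unrelated review of garden furniture and patio heaters for sale"

straight_copy = article                  # scraped verbatim
padded_copy = article + " " + padding    # scraped, then surrounded by other content

print(jaccard(article, straight_copy))   # identical shingle sets
print(jaccard(article, padded_copy))     # lower: extra content dilutes the overlap
```

A verbatim copy scores as identical, while the same copy surrounded by other text scores lower, which is one way a page with "a lot of other content" could look less like a duplicate.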

LiamZing said:
How does Google know that I'm the original author, if at all?

They don't.

LiamZing said:
How does Google choose which page to show, seeing as some of my articles are now listed on several pages?

They choose whichever one they trust the most. If any of the pages has PageRank, it is probably going to be that one.

LiamZing said:
Finally, is this even a problem or am I suffering from acute SEO paranoia?!

Yes it is a problem, but there's not much you can do about it without spending all your time writing DMCA complaints. If you are doing article marketing this is the sort of thing that is going to happen. Gutter breeds gutter.

As mentioned in your last post, there are a number of article marketing services you can use, which are designed to create a relationship between authors and directories. Article writers want links, and the directories want content to plaster with ads and affiliate schemes.

Here they are again for you:

http://www.articlemarketingautomation.com/
http://www.freetrafficsystem.com/
http://www.uniquearticlewizard.com/

Those are the ones I know of. There are probably more.

Toffsworld · May 18, 2009

As a company who publishes hundreds of articles a month, we work on the assumption that everything we publish will be copied by someone, somewhere.

In our case, that can be advantageous, because we are usually writing about clients or products that we want to promote so we don't really care how many people copy our articles - we're still winning.

And whenever I post articles to article sites, I always give permission to be copied - that way I don't lie awake at night worrying over something I have no control over. Once you publish anything on the Internet, you have to assume that you will be plagiarised - and start writing accordingly.

Very rarely have we found that we've been completely 'Copied and Pasted'. The majority of plagiarism that goes on involves people picking bits from two or three articles and calling it their own. While this can be upsetting to the original authors, legally, this is usually considered to be within 'Copyright' rules.

So - publish like you want to be copied - and you might be able to turn this negative to your advantage.

MadWeb · May 18, 2009

Google will still be aware of which version it spidered first though - then that will be considered the original, and any other duplicate article spidered at a later date is considered the duplicate content.

awebapart.com · May 18, 2009

It is possible that some of the unscrupulous people copying your content are smart enough SEO-wise to know that if they copy your content and get indexed by Google before you do, then Google will see them as the originator of that content, and you as the duplicator.

The best way to avoid this situation is to ensure that Google has indexed your content before you start promoting it. The problem with some content management systems and blogging systems is that, by default, they are set to inform the world (and the content copiers) about your new content before Google gets a chance to index it. Some content copiers have more regularly updated sites, because it is easier to copy other people's hard work than to put in that hard work themselves, and they can end up getting indexed by Google before you.

david64 · May 18, 2009

EssexWeddingServices said:
Google will still be aware of which version it spidered first though - then that will be considered the original, and any other duplicate article spidered at a later date is considered the duplicate content.

This is not the case. I recently took some articles from directories and placed them on the homepages of some PR2 websites, and my sites are all considered the authoritative version by Google.

Hamlet Batista has some good articles on duplicate content: http://hamletbatista.com/tag/duplicate-content/

Toffsworld · May 18, 2009

I agree that Google is aware of which article it spidered first, but I'm not aware that it does anything particularly useful with this information other than to give a boost to 'New' content.

By default, Google will know which article was published first because they collect the publishing dates from the server on which it's hosted. How useful this information is to them afterwards, I really can't tell.

I have a few instances out there where 2, 3, or even 4 of the listings on Google are the same article but published from different sites.

Sometimes these articles are copied from me, sometimes they are the result of RSS feeds that multiply content across the Net, and sometimes they are deliberate attempts by me to get my own clients maximum exposure on Google by having stories about them repeated as often as possible.

However, I am not aware of any specific effort on the part of Google to 'Delete' these copies.

Other than RSS feeds and Blogs that seem to have a built in shelf life with Google of about 7-12 weeks, after which they seem to 'Drop out' of existence anyway, Web pages can remain in situ for years without ever being removed by Google - copies or not.

This is my experience and I'm happy to be corrected if I'm wrong.

UKSBD · May 18, 2009

Toffsworld said:
Other than RSS feeds and Blogs that seem to have a built in shelf life with Google of about 7-12 weeks, after which they seem to 'Drop out' of existence anyway, Web pages can remain in situ for years without ever being removed by Google - copies or not.

This is my experience and I'm happy to be corrected if I'm wrong.

I tend to agree, so with particular posts I simply edit the published date every now and then and re-ping Google.

david64 · May 18, 2009

Toffsworld said:
By default, Google will know which article was published first because they collect the publishing dates from the server on which it's hosted. How useful this information is to them afterwards, I really can't tell.

This information is only useful for static HTML pages. Or are you referring to the date as text on the site?
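For context on the static-page point: a web server will typically report a static file's modification time in the HTTP `Last-Modified` response header, whereas dynamically generated pages often omit it or report the time of the request. The sketch below, in Python, compares header values that have already been collected and picks the earliest. The function name and the example.com / example.org URLs are hypothetical, and nothing here reflects how Google actually uses such dates.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def earliest_last_modified(headers_by_url):
    """Return the URL with the earliest parseable Last-Modified value, or None."""
    dated = {}
    for url, value in headers_by_url.items():
        if not value:
            continue  # header missing, as is common for dynamic pages
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            continue  # unparseable header: ignore rather than guess
        if when.tzinfo is None:
            # Treat dates without a usable zone as UTC so they stay comparable.
            when = when.replace(tzinfo=timezone.utc)
        dated[url] = when
    if not dated:
        return None
    return min(dated, key=dated.get)

# Hypothetical header values for illustration only.
seen = {
    "https://example.com/original-article.html": "Mon, 18 May 2009 09:00:00 GMT",
    "https://example.org/scraped-copy.html": "Tue, 19 May 2009 14:30:00 GMT",
    "https://example.org/dynamic-page.php": None,
}
print(earliest_last_modified(seen))
```

The header only reflects when the file was last saved, so re-saving or re-uploading a page moves the date forward, and a copier's server can report whatever date it likes.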

Toffsworld · May 18, 2009

On one server, I use a content management system called Joomla. Like many Content Management Systems, this stores a 'Creation Date' for each page that I publish. Even without this reference, the server keeps a log of FTP (file transfer protocol) updates as a reference. We are aware that Google spiders and collects some of this data.

As UKSBD commented, it's quite easy to re-save an article and re-publish it, so over time I have come to question what use these dates are to Google if they can be so easily manipulated.

I'm not aware that they are of any use. Hence, to my knowledge, copying content won't be stopped by publication date alone, or by being first online with an article. That is where this thread started, and it is supported by your own comments (david64).

Secondly, if one is feeding dynamic content into a page from say a SQL server then the page could be two years old but the content in it only 1 minute old - how would Google handle this discrepancy? I don't know and I'm not sure that they try to - that's all I'm saying.

I think Google's algorithm leans back to Page Rank and Backward Links to identify what to list - not the date of publication or whether the content is duplicated...

GNU · May 18, 2009

Toffsworld said:
I think Google's algorithm leans back to Page Rank and Backward Links to identify what to list - not the date of publication or whether the content is duplicated...

Yup, pretty sure that's the way it works. As David said, you can lift content and be credited as the authoritative source fairly easily.

Another "trick" is for scrapers to get the content indexed before the originating site... which is easily done if the originating site is spidered infrequently... and the scraper is publishing "fresh original" content every few hours.
