What Is Web Scraping?
Web scraping, also called screen scraping or content scraping, is the process of extracting data from websites, usually with an automated program. It is a systematic process in which data is gathered page by page. Scrapers generally use the extracted data to generate content on their own websites, without the consent of, or any attribution to, the owner. Creating useful, unique, share-worthy content is not easy; at minimum it takes your valuable time, and often it costs a lot of money, so you should definitely protect it.
Why Do They Do It?
The obvious reason for web scraping is to generate content for a site with minimal time and effort. Running a tiny script can harvest thousands of pages in a matter of hours. Why do they want your content? Some websites just need basic information about certain topics so that search results point to them. By creating a mashup of a large number of articles on a single topic, they attract a significant amount of traffic, and that traffic makes the ads on their websites generate money.
Another possible reason is generating money through affiliate marketing. Let’s say you wrote about your new phone and its great features. An affiliate marketer would like to take your content and add an affiliate link to Amazon so readers can buy the phone if they like your review. If a reader buys the product through that link, the affiliate marketer earns as much as 10% of the price in fees!
Yet another reason for content scraping is plagiarism: copying someone else’s work and presenting it as your own. Some people are happy to show your work as their own in their portfolios, and not everyone cross-checks whether a piece is plagiarised. If the plagiarist is lucky, they may well get away with it.
How Do They Scrape Your Content?
You may already know that web pages are delivered as HTML files, which are essentially text files containing the content as well as information about how the page should look. The browser interprets these files to render the page. By the same token, if you can send HTTP requests from a program, you receive these same HTML files.
HTML files are highly structured: the content generally sits within specific tags. By running the received HTML through a parsing library, the desired content can therefore be extracted very easily.
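As a sketch of how little code this takes, here is a minimal example using only Python’s standard-library html.parser. The sample HTML string and the choice of extracting paragraph tags are illustrative assumptions, not taken from any real site.

```python
from html.parser import HTMLParser

# A minimal scraper-style extractor: collects the text inside <p> tags.
# The sample HTML below is a made-up stand-in for a fetched page.
class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

html = "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second one.</p></body></html>"
parser = ParagraphExtractor()
parser.feed(html)
print(parser.paragraphs)  # ['First paragraph.', 'Second one.']
```

A real scraper would feed in pages fetched over HTTP instead of a literal string, but the extraction step is exactly this simple.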
In the case of WordPress, you might have noticed that the default URL format is www.mysite.com/?p=[page_no], where “page_no” is a positive number. Even if you have changed the URL format to use title slugs, WordPress still redirects the old URL format to the real pages. A scraper can therefore increment the page number from 1 upwards and extract every post on a WordPress website.
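The enumeration itself is trivial; here is a sketch of the loop a scraper might run. “www.mysite.com” is a placeholder, and this snippet only builds the URLs rather than fetching anything.

```python
# Sketch of the enumeration a scraper might run against the default
# WordPress URL scheme. "www.mysite.com" is a placeholder; this only
# builds the URLs -- a real scraper would fetch each one in turn.
def wordpress_urls(base, start=1, stop=5):
    return [f"{base}/?p={n}" for n in range(start, stop + 1)]

urls = wordpress_urls("http://www.mysite.com", 1, 3)
print(urls)
# ['http://www.mysite.com/?p=1', 'http://www.mysite.com/?p=2', 'http://www.mysite.com/?p=3']
```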
A web scraper could also get your content from your RSS feeds. Many people publish only a summary of each post in their feeds, which puts scrapers at a disadvantage, but if you provide full posts in your feeds, you are vulnerable.
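Feeds are even easier to scrape than pages, because RSS is well-formed XML. A minimal sketch with Python’s standard-library ElementTree, using a made-up two-item feed:

```python
import xml.etree.ElementTree as ET

# A made-up two-item RSS feed standing in for a real one. A feed
# scraper only has to walk the <item> elements to lift every post.
feed = """<rss version="2.0"><channel>
  <item><title>My New Phone</title><description>Full post text here...</description></item>
  <item><title>Another Post</title><description>More full text...</description></item>
</channel></rss>"""

root = ET.fromstring(feed)
posts = [(item.findtext("title"), item.findtext("description"))
         for item in root.iter("item")]
print(posts)
```

If your feed carries full posts in the description (or content) elements, this is all it takes to republish them.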
How to Find Them?
The easiest way to find them is through Google. If you suspect that one of your posts has been copied and posted elsewhere, you can perform two kinds of searches. The first is to search for the title of the post with the “allintitle:” operator, so your search term would be “allintitle: The Title of Your Post”. The allintitle operator tells Google to match those words in page titles only. The second, more effective way is to search for some text from within your post, with the search term in double quotes. The double quotes tell Google to search for that exact text.
You may get false positives with the title search, since someone else might legitimately use the same title, but the quoted-text search is far more reliable: it is highly unlikely that someone else wrote the exact same sentences or paragraphs.
You could also use a plagiarism checker like PlagSpotter or Small SEO Tools to find copies of your work online. Paid options are available that search your whole site instead of checking one URL at a time.
You could also use internal links and trackbacks to find out whether your posts have been copied. If your content is copied wholesale, the internal links still point to your website, which helps you discover where your content has ended up. WordPress, like many other blogging platforms, supports trackbacks. To further understand how trackbacks work, I suggest you read this post.
If your site is registered on Google Webmaster Tools, you have a feature that shows the backlinks to your site. In your dashboard, select “Links to your site”. If you use internal linking, this feature will put your scrapers right at the top of the list!
A fourth way of finding scrapers is available if you use FeedBurner. In your dashboard, select “Uncommon Uses” under “Analyze” and you should be able to see whether someone has picked up your content.
What Should You Do With Content Scrapers?
Before I discuss things you can do to make life hard for content scrapers, let me remind you that if something is visible on the internet, it can be scraped no matter what you do. Every solution I propose can be overcome by a scraper with a more sophisticated algorithm for processing your data.
The Panda update to Google’s search algorithm flagged a huge number of websites as content scrapers, and if you were the first to post the content, you have nothing to worry about in terms of rankings. Even so, you could still ask the scrapers politely to take down your content.
If they do not take the content down, you can file a DMCA report with Google to have the site removed from its listings. Due to the high number of requests, Google can take some time to respond. You can also file a complaint with the scraper’s hosting provider. Note that you need to be the original copyright holder to file a complaint.
Anti-Feed Scraper Message
You could use this plugin, which adds a message with the original author’s information to your feeds. If someone uses your feed to lift your content, your information will appear on their site anyway!
Another way of tackling content scrapers is to add inline ads to your posts. That way, the scraper ends up displaying your ads, and you generate revenue from the scraper’s website!
Block IPs of scrapers
Once you have identified unusual traffic from a web scraper, you can block its IP address on your server using .htaccess files if you use Apache as your web server. Create an .htaccess file in your root directory (after enabling .htaccess) and add the following line.
Deny from <IP_Address>
Alternatively, you could use .htaccess to redirect requests from that IP address to another page. A detailed tutorial on its usage can be found here.
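Note that the Deny directive above is the older Apache 2.2 syntax. On Apache 2.4, the equivalent is done with Require directives; a minimal sketch, using placeholder documentation addresses in place of your actual scraper’s IPs:

```apache
# Block two hypothetical scraper IPs (Apache 2.4 syntax).
# 203.0.113.45 and 198.51.100.0/24 are placeholders -- substitute
# the addresses you actually see in your logs.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
    Require not ip 198.51.100.0/24
</RequireAll>
```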
Prevent hotlinking of images
When your content is copied and displayed on another site, the images shown are often still served from your server. You can, however, create an .htaccess file that prevents this hotlinking and serves either no image or an image of your choice (informing the reader of the copyright infringement). Here is an .htaccess generator to prevent hotlinking.
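For reference, a typical hotlink-protection rule set looks roughly like the sketch below. It assumes mod_rewrite is enabled, and “mysite.com” is a placeholder for your own domain:

```apache
RewriteEngine On
# Allow requests with no referer (direct visits, some proxies)
# and requests coming from your own site.
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?mysite\.com/ [NC]
# Any other site embedding your images gets a 403 instead.
RewriteRule \.(jpe?g|png|gif)$ - [F,NC]
```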
Each post takes a lot of time and care to create, and when you find someone copying it, you have every right to be upset. We have shared the best ways to prevent and tackle web scrapers. What are your thoughts? Have you ever been a victim of web scraping? Do share your comments below.