I do a lot of data scraping on the web, and one of the first things I look for is an API. Even if the platform doesn't provide a publicly documented API, it usually has some sort of undocumented "private" API to facilitate client-server communication, like search queries or other fun AJAX stuff that happens without reloading the page.
In fact, because it's undocumented, it may carry a lot of security issues nobody thought about, simply because it's not intended to be consumed by the public (or, probably more likely, because who likes to think about security?).
One thing I've noticed as a web scraper is how easy it generally is to make these API requests myself.
A Simple Example
Let's say you have a blog or other content system and you want to implement a search function: the user enters a search query, and your server uses it to return a list of relevant results from the database.
The Request
After building your API, you might make a request like this on the client-side:
POST https://api.myblog.com/search
query=scrape&type=tag
This is intended to return all posts tagged with "scrape". You then proceed to test your API, verify that it works, and commit your code.
Fantastic, job well done. You release your feature, and now you can tag your posts and let people find them using tag search.
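For reference later, here's a minimal sketch of what that endpoint might look like on the server. This is purely hypothetical (Flask, SQLite, and the posts table schema are my own assumptions, not anything from a real blog), but it's representative of a naive first implementation:

# search_api.py - hypothetical, naive implementation of the /search endpoint
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/search", methods=["POST"])
def search():
    query = request.form.get("query", "")
    search_type = request.form.get("type", "tag")

    conn = sqlite3.connect("blog.db")
    if search_type == "tag":
        # Naive substring match on a comma-separated tags column
        rows = conn.execute(
            "SELECT id, title FROM posts WHERE tags LIKE ?",
            ("%" + query + "%",),
        ).fetchall()
    else:
        rows = []
    conn.close()
    return jsonify([{"id": r[0], "title": r[1]} for r in rows])

Note that the query is parameterized, so this isn't vulnerable to SQL injection, but the user's input still ends up inside a LIKE pattern. Keep that in mind for later.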
The Scraper
So now I come along, and I want a list of all the posts on your site. Perhaps I want to build my own site that basically mirrors all of your content (think of all those Instagram clones) so that I can enjoy some extra traffic without doing any work. All I really have to do is run a scraper that periodically checks for new content and downloads it to my own server.
To figure out how your site works, I would come to your blog, type in a search query, and hit submit. I would notice that you make an API request, and then take a look at how the request and response are constructed:
- the headers, like cookies, origin, host, referer, user-agent, custom headers, etc
- the body, to see what data is sent
- any security features like CSRF tokens or authorization
Then I would replicate this request and send it from my own server. That bypasses CORS, because CORS doesn't mean anything if I can spoof the origin, and also because a lot of people don't really understand CORS and set Access-Control-Allow-Origin: * on their server anyway, since half the answers on StackOverflow recommend it as a solution. Conceptually, CORS isn't that difficult to understand, and I highly recommend reading about it, maybe over at MDN.
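To make that concrete, here's roughly what the replayed request looks like as a standalone Python script. The endpoint is the hypothetical one from above, and the header values are just placeholders for whatever you'd copy out of the browser's network tab:

# scrape.py - replaying the browser's search request from my own machine
import requests

# Headers copied from the browser's network tab. Origin and Referer are
# whatever I want them to be; nothing outside a browser enforces them.
headers = {
    "Origin": "https://myblog.com",
    "Referer": "https://myblog.com/search",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

resp = requests.post(
    "https://api.myblog.com/search",
    data={"query": "scrape", "type": "tag"},
    headers=headers,
)
print(resp.json())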
Some Bad Queries
I would start by trying different things. Maybe something as simple as an empty string:
POST https://api.myblog.com/search
query=&type=tag
Maybe your search engine will match on EVERY tag and then give me everything in one shot.
Or I might try some wildcards, maybe % or *, hoping you don't sanitize your parameters (which can potentially open up a different world of hurt!)
POST https://api.myblog.com/search
query=%&type=tag
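With a naive handler like the sketch earlier, both of these queries are a problem even without SQL injection, because the user's input lands inside a LIKE pattern, where % and _ act as wildcards. A quick self-contained demonstration (the table is still my hypothetical one):

# Why the empty string and % both dump everything out of the naive handler:
#   query=""   ->  WHERE tags LIKE '%%'    (matches every row)
#   query="%"  ->  WHERE tags LIKE '%%%'   (also matches every row)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, title TEXT, tags TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, "Scraping 101", "scrape,web"), (2, "Cooking rice", "food")],
)

for query in ["scrape", "", "%"]:
    rows = conn.execute(
        "SELECT id FROM posts WHERE tags LIKE ?", ("%" + query + "%",)
    ).fetchall()
    print(repr(query), "->", [r[0] for r in rows])

# Output:
# 'scrape' -> [1]
# ''       -> [1, 2]
# '%'      -> [1, 2]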
Solutions???
Generally, depending on what country you're operating in, the law is on your side, since your terms of service will include something about unauthorized access to APIs. And if they don't, they probably should! But of course not everyone will respect the law, and depending on the kind of data you're working with, sometimes you might want to be a bit more proactive in preventing scrapers from taking all your data too quickly.
Just as getting your database hacked and exposing millions of customer records would be a massive blow to your business even if you have the right to sue the hackers, you might not want to wait until it actually happens before you get to play your legal cards.
Unlike securing a database, you can't just stop people from making requests to your server. After all, how does one distinguish between a request from your website, and a request from a 3rd party client that I wrote in Ruby or Python or Java or straight-up curl?
I believe the goal in this case is to make scrapers work hard. The harder they have to work, the more requests they need to make, the slower the data collection process becomes, and the easier it is for you to flag it as suspicious activity and then take action automatically.
Depending on the nature of your content, you might, for example, enforce a minimum character limit and sanitize the inputs to avoid wildcard operations on the server side. Where you do your checks is important: because I build my own requests, any front-end validation is basically useless. Relying on the client to be honest is like giving me the key to your safe and hoping that I don't open it.
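As a rough sketch of what those server-side checks could look like (continuing the hypothetical handler from earlier; the minimum length is an arbitrary choice):

# Server-side input checks for the hypothetical /search endpoint
MIN_QUERY_LENGTH = 3

def clean_query(raw: str) -> str:
    """Reject too-short queries and neutralize LIKE wildcards."""
    query = raw.strip()
    if len(query) < MIN_QUERY_LENGTH:
        raise ValueError("query too short")
    # Escape LIKE metacharacters so they match literally.
    # The SQL then needs an ESCAPE clause: ... WHERE tags LIKE ? ESCAPE '\'
    return (
        query.replace("\\", "\\\\")
             .replace("%", "\\%")
             .replace("_", "\\_")
    )

The handler would call clean_query() before building the LIKE pattern, and return a 400 (or an empty result set) when it raises, so an empty or wildcard-only query never reaches the database.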
Other common measures I've seen in other applications include:
- Limiting the number of requests per time interval (i.e. request cooldowns). If your app's users aren't meant to make 100 requests a second, don't let them (a minimal sketch follows this list).
- Paginating your results. This is a pretty common strategy for various performance-related purposes anyway (for better or worse), and combined with request cooldowns it can be pretty effective.
- Geofencing, where search results are limited to a provided location, which could be the name of a region or a latitude/longitude pair. This might not apply to you, but when it does, it really makes life hard for scrapers.
- Rate limiting, where you cap the total number of API requests a client can make before further requests are rejected. This is useful when requests must be authenticated with a token, possibly tied to a user account, though it won't be effective if I'm hitting the server directly with the same token that your own client uses.
- Compiling to native code. This one is actually quite effective, because unlike plain old JavaScript that anyone can read like a book, reverse engineering native code requires more than a basic understanding of code beautifiers and browser debuggers. It can still be done given enough time and effort, but the vast majority of people don't have that kind of skill set.
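As an example of the cooldown idea above, here's a minimal in-memory sliding-window limiter keyed by token or IP. The window and limit are arbitrary numbers, and in a real deployment you'd likely back this with something like Redis so it survives restarts and works across multiple servers:

# ratelimit.py - minimal sliding-window request limiter, keyed by token or IP
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30

_history = defaultdict(deque)  # key -> timestamps of that client's recent requests

def allow_request(key: str) -> bool:
    """Return True if this client may make another request right now."""
    now = time.monotonic()
    timestamps = _history[key]
    # Drop requests that have fallen out of the window
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS_PER_WINDOW:
        return False
    timestamps.append(now)
    return True

The search endpoint would call allow_request() with the client's token (or IP address) and respond with 429 Too Many Requests when it returns False.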
By effectively using filters and cooldowns, you can force scrapers to work hard to obtain your data instead of just coming in and then walking away with everything in 5 seconds!
Feedback
Are there any techniques that you like to use to "secure" your data from scrapers? Or perhaps it's not necessary for the average app developer to think about?
Top comments (10)
I love your suggestions here, thank you for this post.
While my practical experience is not that high, it seems that the session token can be a source of security. Monitoring the requests and timing against the session token (and really any unique identifier) can help. Perhaps sequential (minus pagination) requests, or even broad requests (more than X results per query), could be vectors to detect and limit scraping.
Thanks for the feedback. Session tokens are a fantastic tool and so common that I forgot about it, given that most apps require some form of authentication and therefore my activity can be easily logged and flagged.
What do you mean by sequential requests?
Basically checking the last N requests to see what their coverage would be (especially how broad the requested content is). Depending on user privilege (an auditor, for example, would be an exception), the queries themselves probably shouldn't span a wide range of content. For example, sequential(ish) requests would be if someone requested all content for one month, then the previous month, then the third. Having such broad requests could be used to detect a scraper. A real user would likely be a bit more specific in what they are looking for. An occasional prompt (helpful hint?) could be provided to the user to be more specific, or even offer a suggestion. If such a prompt is ignored too often, it could again be a +1 on the suspicious-o-meter.
This could also apply to something like TikTok, Instagram, etc., where a user can just scroll through a never-ending list. Each list is still governed by some criteria, even if it's generally handled server-side, but the user can select something to view another user/category/tag/etc. The server can keep track of the hits/changes and limit how much content is being provided based on how fast the requests are being switched. A robot could be searching multiple queues at once, but a person is going to "enjoy" their content.
Ultimately, I agree with your philosophy... A strong goal is to make displaying further content take correspondingly more time, both to protect server resources and server content.
Oh I see what you mean. User behaviour definitely is a good indicator based on how you've described it.
Devs probably will be using some sort of analytics framework to try and understand how users use the app. This can also be used to establish "regular" usage vs "irregular" usage, so it can serve multiple purposes!
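For example, a very rough version of that "irregular usage" check might just score each session on how chatty and how broad its queries are. All of the thresholds here are made up for illustration:

# Toy "suspicious-o-meter" for a session, based on the ideas above
def suspicion_score(requests_last_hour: int,
                    avg_results_per_query: float,
                    distinct_months_requested: int) -> int:
    score = 0
    if requests_last_hour > 100:          # unusually chatty client
        score += 1
    if avg_results_per_query > 500:       # very broad queries
        score += 1
    if distinct_months_requested > 6:     # sweeping the archive month by month
        score += 1
    return score  # e.g. throttle or flag the session when score >= 2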
Thank you for the awesome post
I can see another way: if your API only responds to your website, or at least to web-based applications, a CORS strategy would be a good way to prevent scraping, because you only accept requests from certain domains. I can see that CORS is not bulletproof, since we can design a bot that opens the website and then types queries like a regular user, but that would make the process very tedious and hard to do.
Thanks for the feedback. Using browser based bots like puppeteer or selenium to simulate user behavior is quite effective, especially if the website is kind of annoying to scrape because it runs a lot of client side processing that you don't want to reverse engineer.
Though CORS really only protects you from browsers, which also includes WebKit or webview-based applications (e.g. apps built with React), because the browser devs don't let you tamper with the Origin header. Outside of regular browser contexts, CORS doesn't mean anything, since you can supply your own Origin header.
I've devised workflows for getting around CORS while still using development tools like React to build cross-platform apps, and I agree that CORS is generally quite effective for certain kinds of applications.
Nice post. I think nowadays it's very easy to scrape, with so many API-centric websites, JS-heavy pages, etc. Even the AMP initiative makes pages easier to scrape.
Thanks for the feedback. It's interesting to see how the web has changed and what kinds of trends become popular. I hadn't heard of AMP before (besides seeing it appear in the URL when I read news on Google lol), but that would probably be a good example of how scraping can be used to make it easier to share data. RSS is cool, for example, but not everyone wants to bother with that kind of thing.
Yes, totally agree with that. RSS is cool, and WebSub makes it cooler. Back to AMP: I think AMP is a good initiative for users (in the name of speed) and also good for Google, because it's easier to scrape clean content for their indexing purposes hahahah
Bringing up Google and AMP is interesting, because they both effectively are scrapers. I wonder if you could test anti-scraper measures against AMP conversion?