Contrary to popular belief, web scraping is not a shady or illegal activity. That is not to say that any form of web scraping is legal. Like any other human activity, it has to stay within certain boundaries.
Personal data and intellectual property regulations are the most important boundaries in web scraping, but other factors, such as the website’s terms of service, can also play a role.
Continue reading to learn more about the legality of web scraping. We will go over the most common points of confusion one by one and provide you with some helpful hints to keep your scrapers compliant and ethical.
If you scrape data that is publicly available on the internet, web scraping is legal. However, some types of data are protected by international regulations, so be cautious when scraping personal information, intellectual property, or confidential information. To create ethical scrapers, respect your target websites and use empathy.
Common myths related to web scraping
Before we begin, let’s clear up a few misconceptions. We sometimes hear that “web scrapers operate in a legal grey area.” Or that “web scraping is illegal, but no one enforces it because it is difficult.” Sometimes we even hear that “web scraping is hacking” or that “web scrapers steal our data.” We have heard all of these from clients, friends, interviewees, and other businesses. The problem is, none of this is true.
Myth 1: Web scraping is illegal
It all comes down to what you scrape and how you scrape it. It’s a lot like taking pictures with your phone. In most cases, it is perfectly legal, but photographing an army base or confidential documents may land you in hot water. Web scraping is essentially the same thing. There is no law or rule that prohibits web scraping. However, this does not imply that you can scrape everything.
Myth 2: Web scrapers operate in a grey area of law
No, not at all. Legitimate web scraping companies are regular businesses that follow the same rules and regulations as every other business. True, web scraping is not heavily regulated. However, the lack of specific regulation does not make it illegal. Quite the contrary.
Myth 3: Web scraping is hacking
Although the term “hacking” can refer to a variety of activities, it is most commonly used to describe gaining unauthorized access to a computer system and exploiting it. Web scrapers use websites in the same way that a legitimate human user would. They do not exploit vulnerabilities and only access publicly available data.
Myth 4: Web scrapers are stealing data
Web scrapers only collect information that is freely available on the internet. Is it possible to steal public data? Assume you see a nice shirt in a store and make a note of the brand and price on your phone. Did you steal the information? Of course not. Yes, some types of data are protected by various regulations, which we’ll discuss later, but other than that, there’s nothing to worry about when gathering information such as prices, locations, or review stars.
How to make ethical scrapers
Even if the majority of the negative things you hear about scraping are untrue, you should still exercise caution. To be honest, you should exercise caution when conducting any type of business. Web scraping is no different. Personal data is the most important type of data to avoid scraping before consulting with a lawyer, with intellectual property a close second.
This is not to say that web scraping is risky. Yes, there are rules, but you can use empathy to determine whether your scraping will be ethical and legal. Amber Zamora suggests the following characteristics for an ethical scraper:
- The data scraper behaves like a good web citizen, not attempting to overburden the targeted website.
- The copied information was public and not protected by a password authentication barrier.
- The information copied was primarily factual in nature, and the taking did not infringe on another’s rights, including copyrights.
- The information was used to create a transformative product, not to steal market share from the target website by luring away users or creating a product that was significantly similar.
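To make the first two points a little more concrete, here is a minimal Python sketch of the bookkeeping a "good web citizen" scraper might do before each request. It is an illustration under our own assumptions, not part of Zamora's list: the class name `PoliteScheduler`, the user agent `example-bot`, the one-second default delay, and the idea of treating the site's robots.txt as a signal of what the owner is comfortable with are all hypothetical choices. The sketch only decides whether and when a URL may be fetched; the HTTP request itself is left out.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Decides whether and when a URL may be fetched (hypothetical helper)."""

    def __init__(self, robots_txt: str, user_agent: str, min_delay: float = 1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay  # assumed minimum pause between requests, in seconds
        self._parser = RobotFileParser()
        # Parse robots.txt content that was downloaded beforehand.
        self._parser.parse(robots_txt.splitlines())
        self._last_request: Optional[float] = None

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt does not disallow this URL for our user agent."""
        return self._parser.can_fetch(self.user_agent, url)

    def wait_time(self, now: float) -> float:
        """Seconds to wait before the next request so the site is not overburdened."""
        if self._last_request is None:
            return 0.0
        elapsed = now - self._last_request
        return max(0.0, self.min_delay - elapsed)

    def record_request(self, now: float) -> None:
        """Remember when the latest request was sent."""
        self._last_request = now
```

In a real scraper you would call `allowed(url)` first, sleep for `wait_time(time.monotonic())` seconds, send the request, and then call `record_request(time.monotonic())`. Passing the clock value in as an argument is just a design choice that keeps the sketch easy to test.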
Think twice before scraping personal data
Not long ago, few people were concerned about personal data. There were no rules, and anyone was free to use our names, birthdays, and shopping preferences. In the European Union (EU), California, and other jurisdictions, this is no longer the case.
If you scrape personal data, you should definitely educate yourself on the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and your local laws.
Because regulations differ from country to country, you must carefully consider where and whose data you scrape. In some countries, it may be perfectly acceptable, whereas, in others, personal data should be avoided at all costs.
How do you know whether the GDPR, the CCPA, or another regulation applies to you? This is a simplification, but the GDPR will apply if you are from the EU, do business in the EU, or want data about people from the EU. It is a comprehensive regulation. The CCPA, on the other hand, only applies to California businesses and residents. We use it as a point of comparison and because it is ground-breaking legislation in the United States. Wherever you are, you should always check the privacy laws of your home country.
What is personal information?
The GDPR defines personal data as “any information relating to an identified or identifiable natural person.” That’s a little difficult to read, but it gives us an idea of how broad the definition is. If it relates to a specific human being, almost anything can be considered personal data. The definition in the CCPA is similar, but it refers to personal information. To keep things simple, we’ll only use the term “personal data.”
Publicly available personal data
A sizable portion of the web scraping community believes that only private personal data is protected, whatever that means, and that scraping personal data from publicly available sources — websites — is perfectly legal. It all depends.
All personal data is protected under GDPR, and it makes no difference where the data comes from. A European Union company was fined a hefty sum for scraping public data from the Polish business register. The fine was later overturned by a court, but the ban on scraping publicly available data was explicitly upheld.
The CCPA considers information made available by the government, such as business register data, to be “publicly available” and thus unprotected. HiQ vs. LinkedIn is a significant case in the United States involving the scraping of publicly available data from social networks. We’re still waiting for the final decision, but preliminary results support the idea of scraping personal information that the person made public.
The California Privacy Rights Act (CPRA) will take effect in 2023, broadening the CCPA’s definition of publicly available information. Data that the subject previously made public will no longer be protected. This effectively allows the scraping of personal data from websites where people freely share their personal data, such as LinkedIn or Facebook, but only in California. We anticipate that other US states will be inspired by the CCPA and CPRA in developing their own privacy legislation.
How to scrape personal data ethically
Once you are certain that you are not harming anyone with your scraping, you need to analyze which regulations apply to you. If you are a business in the EU, the GDPR applies to you even if you want to collect personal data from people elsewhere in the world. As an EU business, you need to do your research.
Sometimes it’s okay to go ahead on the basis of a legitimate interest, but more often than not you’ll need to pass this personal data collection project on to your non-EU partners or competitors. On the other hand, if you’re not an EU company, you’re not doing business in the EU, and you’re not targeting people in the EU, you’ll be fine. Also be sure to check local regulations, such as the CCPA.
Finally, you should program your scrapers to collect as little personal data as possible and to keep it only temporarily. Creating a database of people and their information (e.g., for lead generation) is very hard to justify in jurisdictions with strict privacy laws, whereas scraping reviewers’ names from Google Maps reviews to automatically identify fake reviews and then deleting the personal data could easily pass the legitimate interest test.
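As a rough illustration of "collect as little as possible and keep it only temporarily", the following Python sketch trims a scraped review down to an allow-list of fields and drops records once they are older than a retention window. Everything specific here is our own assumption for the sake of the example: the field names (`rating`, `text`, `author_name`), the helper names `minimize` and `purge_expired`, the salted hash used to group reviews by the same author, and the retention period. Note that a hashed identifier may still count as personal data, so treat this as a sketch of the idea, not as a compliance recipe.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical allow-list: the only review fields a fake-review detector needs.
KEEP_FIELDS = ("rating", "text")


def minimize(record: dict, salt: str, collected_at: datetime) -> dict:
    """Return a copy of the record without the reviewer's name or profile data."""
    slim = {field: record[field] for field in KEEP_FIELDS if field in record}
    # Store a short salted hash instead of the raw name, so reviews by the
    # same author can still be grouped without keeping the name itself.
    author = record.get("author_name", "")
    digest = hashlib.sha256((salt + author).encode("utf-8")).hexdigest()
    slim["author_key"] = digest[:16]
    slim["collected_at"] = collected_at
    return slim


def purge_expired(records: list, now: datetime, max_age: timedelta) -> list:
    """Keep only the records collected within the retention window."""
    return [r for r in records if now - r["collected_at"] <= max_age]
```

A scraper could call `minimize` on every review as soon as it is parsed, so the raw name never reaches storage, and run `purge_expired` on a schedule once the analysis is done.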
Scraping copyrighted content
Almost everything on the internet is protected by copyright in some way. Some things stand out more than others. Music, movies, or photos? Sure, those are protected. Articles in the news, blog posts, social media posts, or research papers? Also protected. HTML code for websites, database structure and content, images, logos, and digital graphics? All of these things are copyrighted. The only thing that is not protected by copyright is simple facts. But what does this have to do with web scraping?
If a piece of content is copyrighted, it means that you can’t make copies of it without the author’s permission (a license) or a legal permission. Because scraping involves copying content, and you almost never have the author’s explicit consent, legal permissions are your best bet. As usual, laws differ from one country to the next. We will only talk about EU and US regulations.
Conclusion
So, is it legal to scrape websites? It’s a complicated question, but we’re convinced that it is, and we hope this brief and daringly simplified legal analysis has persuaded you as well. We also believe that web scraping has a promising future. We are witnessing a gradual but steady paradigm shift in the acceptance of scraping as a useful and ethical tool for gathering information and even creating new information on the internet.
In the end, it’s nothing more than the automation of work that would normally be performed by humans. Web scraping simply accelerates and improves the process. Best of all, it frees up people’s time to devote to more pressing matters.
Original blog: https://blog.apify.com/is-web-scraping-legal/
Top comments (2)
It's not the scraping that is usually the problem, but what happens with the data. For example, you can't just republish a website, but a lot of scrapers have done that with various Stack Exchange websites.
It's cynical, usually against copyright, and will end up making money off other people's work.
Thanks Rajat! A great approach to web scraping and a very good reference to read before a new web scraping project!