DEV Community

WebScraping [Part-1]

Sunil Aleti on May 31, 2020

aletisunil / Scraping_IMDB Scrapes the movie title, year, ratings, genre, vot...

cubiclesocial • May 31 '20

You can download most IMDB data without scraping their website. They provide bulk download options in tab-separated format (TSV) files:

datasets.imdbws.com/

The only glaring thing missing from the dataset is MPAA/TV rating codes. But everything you show in your final spreadsheet is readily available in those dataset files.
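For anyone who wants to try the download route, here is a minimal sketch of reading one of those TSV files with Python's standard library. It assumes the file has a header row and marks missing values with `\N`; the helper name `read_tsv_rows`, the column names, and the sample rows are made up for illustration and are not real IMDB data.

```python
import csv
import io

def read_tsv_rows(handle):
    """Yield each TSV row as a dict, mapping the '\\N' placeholder to None."""
    # QUOTE_NONE treats quote characters as literal text (assumption: the
    # files are plain tab-separated with no quoting rules).
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == r"\N" else value) for key, value in row.items()}

# Illustrative sample standing in for a downloaded file (not real data).
SAMPLE = (
    "tconst\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tExample Movie\t1925\tComedy,Family\n"
    "tt0000002\tAnother Example\t\\N\t\\N\n"
)

rows = list(read_tsv_rows(io.StringIO(SAMPLE)))
for row in rows:
    print(row["primaryTitle"], row["startYear"])
```

For a real gzipped download, the same function should work on a handle opened with `gzip.open(path, "rt", encoding="utf-8", newline="")`, since it only needs an iterable of text lines.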

I recently used the IMDB dataset plus a couple of other datasets to programmatically produce a ranked list of the Top 250 family-friendly movies of all time:

cubicspot.blogspot.com/2020/02/the...

It was an ambitious multi-week project to gather all the data, merge roughly three disparate datasets together without common keys within a 5.4GB SQLite database, and finally generate the list. The oldest movie in the list is from 1925! But, yeah, if you are looking for IMDB's data, they basically make it available without scraping their website.
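The comment above does not say how the datasets were matched, so the following is only a sketch of one common approach when there is no shared key: join on a normalized title plus release year inside SQLite. The table names, columns, sample rows, and the `normalize` helper are all hypothetical.

```python
import sqlite3

def normalize(title):
    """Lowercase a title and keep only letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

conn = sqlite3.connect(":memory:")
# Expose the Python helper to SQL so it can be used in the join condition.
conn.create_function("normalize", 1, normalize)
conn.executescript("""
    CREATE TABLE imdb (title TEXT, year INTEGER, rating REAL);
    CREATE TABLE other (title TEXT, year INTEGER, family_friendly INTEGER);
""")
# Hypothetical rows: the same movie is spelled differently in each dataset.
conn.executemany("INSERT INTO imdb VALUES (?, ?, ?)", [
    ("The Example: Part One", 1999, 7.5),
    ("Unmatched Film", 2001, 6.0),
])
conn.executemany("INSERT INTO other VALUES (?, ?, ?)", [
    ("the example part one", 1999, 1),
])

# Match rows whose normalized titles and years agree.
matches = conn.execute("""
    SELECT imdb.title, imdb.rating, other.family_friendly
    FROM imdb
    JOIN other
      ON normalize(imdb.title) = normalize(other.title)
     AND imdb.year = other.year
""").fetchall()
print(matches)
```

A join like this will miss retitled releases and can confuse different movies that share a title and year, so a real merge would likely need manual review of the leftovers.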

Sunil Aleti • May 31 '20

The main aim of this tutorial is to help people understand what scraping is and how it works. I don't have any intention or need to scrape this data. IMDB is just a popular website, and it is easy to explain scraping through it 🙂

cubiclesocial • May 31 '20

It's against IMDB's Terms of Service (ToS) to scrape their content. Not that their ToS has actually stopped anyone from scraping their site in the past; my response was just to point out an alternative that doesn't violate their ToS.

Scraping websites of private entities is a legal minefield. U.S. government websites, however, are completely legal to scrape as all of the content on them is in the public domain, and they usually have data worth scraping that's more up-to-date than what shows up on data.gov. There are also massive multi-petabyte public datasets on Amazon S3 that require a scraper toolset to properly retrieve and process (e.g. commoncrawl.org/the-data/get-started/), but that might be a tad more advanced than a beginner's tutorial can cover.

Anywho, just a couple of thoughts.

AW A RE • Jan 2 '22

Any public data can legally be scraped. And it should remain legal.

cubiclesocial • Jan 14 '22

Terms of Service are unsigned contracts unless you sign the contract by doing something like creating an account and agreeing to the ToS. Then contract law may apply, and "no scraping" clauses in such contracts might be legally binding. I'm not a lawyer, but the law is a lot more complex than you think, and each jurisdiction differs in how it applies its own laws. Your blanket assertion that scraping anything published on a website is legal is false. If someone has to log in to obtain content (i.e. agree to a ToS), or they knowingly obtain content that was sourced via illegal means, then civil or even criminal action can be taken against that person.

Legal issues aside, web server operators can also block those who make excessive requests to their servers. IMDB has official data dumps of their database. They are not perfect since some information is missing, but they are a good enough starting point for most purposes. Since IMDB makes data dumps available for direct download, which is more efficient than scraping, IMDB has every right to block anyone scraping their main website.

Sunil Aleti • May 31 '20

Thanks 😊

zareefweb • May 31 '20

How do I scrape news posts in bulk when they are loaded dynamically? I searched a lot but was unable to find anything.

Sunil Aleti • May 31 '20 • Edited

Yes, it's very difficult to scrape dynamic sites because the data is loaded dynamically with JavaScript.
In such cases, we can use the following two techniques:

Reverse Engineering JavaScript
Rendering JavaScript

I will definitely make a tutorial on this in the future 🤞
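As a rough illustration of the first technique: "reverse engineering" often just means finding where the page's JavaScript gets its data (a JSON endpoint, or a JSON blob embedded in the HTML) and reading that directly instead of rendering the page. The sketch below assumes the data sits in a `<script type="application/json">` tag; the sample HTML, the `posts` field, and the `JsonScriptParser` class are made up for illustration, and real sites will differ.

```python
import json
from html.parser import HTMLParser

class JsonScriptParser(HTMLParser):
    """Collect the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a JSON script tag.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.payloads.append(json.loads("".join(self._chunks)))
            self._in_json_script = False

# Hypothetical page: the visible list would normally be built by JavaScript.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script type="application/json">{"posts": [{"title": "First headline"}, {"title": "Second headline"}]}</script>
<script>renderApp();</script>
</body></html>
"""

parser = JsonScriptParser()
parser.feed(SAMPLE_HTML)
titles = [post["title"] for post in parser.payloads[0]["posts"]]
print(titles)
```

The second technique, rendering JavaScript, usually means driving a real browser with a tool such as Selenium or Playwright and scraping the page after its scripts have run. It is slower, but it works when the data cannot be found in the raw response.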

villival • Dec 8 '20

Awesome

villival • Dec 10 '20

simple and precise explanation

Sunil Aleti • Dec 10 '20

Thanks