I don't know what it is about web scraping that I find so fascinating. Maybe it's because there's a clear goal and many different ways to get there. I find myself refactoring my code even though the web scraper works and I already have the data I need!
Right now, I'm working on a React app based on the classic Hangman game. In this version of Hangman, I'll be using IMDb's 1,000 highest-rated movies for people to guess. Users will be given the length of the title (obviously), the release year, the genre, and the summary. To shake things up a little, I've also scraped IMDb's top 1,000 rated movie stars. Users will be given the length of the movie star's name, the movie they're most famous for, and 10 portraits (I think I'll only be showing the largest four images). There will be a 50/50 chance of either a movie or a movie star popping up.
Some challenges I'm working through in my head:
Some of the movies' summaries include the title itself, for instance Joker (2019). I'll need to write a function to censor it out (there's a sketch of what I mean after this list).
How will mobile users type? There isn't going to be a form or any input field (I think). So how do I get the keyboard to show?
Can I easily implement a hangman SVG character or should I do something else?
If I show four portraits for, let's say, Tom Hardy, I think it would be best to implement a horizontal scroll for mobile users so they can still see what they're guessing. How can I make it work on the desktop?
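For the first one, here's a minimal sketch of the censoring function I have in mind. The regex-based replacement is just an idea at this point, not code that exists in the app:

import re

def censor_title(title, summary, mask="_"):
    # Escape the title so punctuation like ':' is matched literally,
    # and ignore case so 'JOKER' and 'Joker' are both blanked out.
    pattern = re.compile(re.escape(title), re.IGNORECASE)
    return pattern.sub(mask * len(title), summary)

print(censor_title("Joker", "Joker (2019) follows Arthur Fleck."))
# -> "_____ (2019) follows Arthur Fleck."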
For the movie star scraper, I used Beautiful Soup to gather the names and the movie each star is famous for. On IMDb, viewing their images opens a modal popup, which made the images difficult to extract, so I ended up using the Contextual Web Search API instead. It's simple and free for up to 10,000 requests a month. Great! I only need to use 1,000.
I collected the first 10 results for each movie star. The reason I'm only showing four portraits is that some of the results aren't actual pictures of the actor or actress. For example, here's one of Lynn Shelton, who's known for the movie Humpday. There are also images where the movie star is posing with a group of people.
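Since I plan to show only the largest four, this is roughly how I'm thinking of picking them. It assumes each result in the API's value list carries width and height fields; the exact field names could differ in the real response, so treat this as a sketch.

def largest_portraits(portraits, count=4):
    # Sort by pixel area (width * height) and keep the biggest images.
    by_area = sorted(
        portraits,
        key=lambda p: p.get("width", 0) * p.get("height", 0),
        reverse=True,
    )
    return by_area[:count]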
Here's the scraper code:
import json
import requests
from time import sleep
from bs4 import BeautifulSoup


class Star:
    def __init__(self, name, famous_for, portraits):
        self.name = name
        self.famous_for = famous_for
        self.portraits = portraits

    def __str__(self):
        return f"{self.name}, {self.famous_for}"

    def to_json(self):
        # Serialize the object and its attributes to a JSON string.
        return json.dumps(self, default=lambda o: o.__dict__, indent=4)


page_num = 1
star_list = []

# IMDb paginates the name search 50 results at a time, so start values
# 1, 51, ..., 951 cover the top 1,000 names.
while page_num <= 951:
    page = requests.get(
        f"https://www.imdb.com/search/name/?gender=male,female&start={page_num}&ref_=rlm").text
    soup = BeautifulSoup(page, 'html.parser')
    stars = soup.find_all('div', class_='lister-item')

    for item in stars:
        name = item.h3.a.text.strip()
        famous_for = item.find_all('a')[2].text.strip()

        # Pull the first 10 image results for this name from the
        # Contextual Web Search Image Search API.
        url = "https://contextualwebsearch-websearch-v1.p.rapidapi.com/api/Search/ImageSearchAPI"
        querystring = {"autoCorrect": "false", "pageNumber": "1",
                       "pageSize": "10", "q": f"{name}", "safeSearch": "true"}
        headers = {
            'x-rapidapi-host': "contextualwebsearch-websearch-v1.p.rapidapi.com",
            'x-rapidapi-key': "YOUR_RAPIDAPI_KEY"  # use your own RapidAPI key here
        }
        response = requests.get(url, headers=headers, params=querystring)
        sleep(1)  # be polite to the API

        try:
            portraits = json.loads(response.text)['value']
        except Exception as e:
            print(e)
            portraits = []  # don't reuse the previous star's images on a failure

        star = Star(name, famous_for, portraits)
        star_list.append(star.to_json())
        print(len(star_list))

    page_num += 50

# Write each star's JSON, separated by commas.
with open('stars.json', 'w') as f:
    for star in star_list:
        f.write(star)
        f.write(',')
Once it's done scraping, it exports everything to a JSON file in the same directory. How cool! The one thing I couldn't figure out was how to put all of that data into a single array. So what I ended up doing, for now, is selecting the 1,000 dictionaries in my editor and hitting the bracket key to surround them in brackets. If anyone knows how I can get that to happen automatically, please let me know!
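One thing I might try, building on the scraper above: keep the Star objects themselves in star_list (instead of the strings from to_json()) and let json.dump write the whole thing as one array. Something like this, untested:

# Assumes star_list holds Star objects rather than pre-serialized strings.
with open('stars.json', 'w') as f:
    json.dump([s.__dict__ for s in star_list], f, indent=4)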
Lastly, here's the data :)
Top comments (1)
By automatically, do you mean you're trying to figure out how to have it go endlessly until you stop getting results?
If yes, maybe add a simple check that sees whether there are any star entries on the page that was returned. If there aren't any, simply break out of the loop.
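Something like this, roughly (untested):

stars = soup.find_all('div', class_='lister-item')
if not stars:
    break  # nothing returned, so stop paging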