Saurav Jain

for Apify

Posted on Jun 7, 2024 • Originally published at blog.apify.com

How to scrape dynamic websites with Python

#beginners #tutorial #python #webdev

Scraping dynamic websites that load content through JavaScript after the initial page load can be a pain in the neck, as the data you want to scrape may not exist in the raw HTML source code.

I'm here to help you with that problem.

In this article, you'll learn how to scrape dynamic websites with Python and Playwright. By the end, you'll know how to:

Set up and install Playwright
Create a browser instance
Navigate to the page
Interact with the page
Scrape the data you need

What are dynamic websites?

Dynamic websites load content dynamically using client-side scripting languages like JavaScript. Unlike static websites, where the content is pre-rendered on the server, dynamic websites generate content on the fly based on user interactions, data fetched from APIs, or other dynamic sources. This makes them more complex to scrape compared to static websites.

What's the difference between a dynamic and static web page?

Static web pages are pre-rendered on the server and delivered as complete HTML files. Their content is fixed and does not change unless the underlying HTML file is modified. Dynamic web pages, on the other hand, generate content on-the-fly using client-side scripting languages like JavaScript.

Dynamic content is often generated using JavaScript frameworks and libraries like React, Angular, and Vue.js, which manipulate the Document Object Model (DOM) based on user interactions or data fetched from APIs using technologies like AJAX (Asynchronous JavaScript and XML). This dynamic content is not initially present in the HTML source code and requires additional processing to be captured.
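To make the difference concrete, here is a minimal sketch that counts elements in two HTML strings: the raw source a plain HTTP client would receive, and the DOM after JavaScript has run. Both strings and the product class name are made up for illustration, and only Python's standard library is used.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts elements whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements(html, class_name):
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical raw source: an empty container that a script fills in later
RAW_HTML = '<div id="app"></div><script src="/bundle.js"></script>'

# Hypothetical DOM after the script has run in a browser
RENDERED_HTML = (
    '<div id="app">'
    '<div class="product">Laptop</div>'
    '<div class="product">Phone</div>'
    "</div>"
)

print(count_elements(RAW_HTML, "product"))       # 0
print(count_elements(RENDERED_HTML, "product"))  # 2
```

The raw source contains no product elements at all, which is why a scraper that only downloads HTML comes back empty-handed on pages like this.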

Tools and Libraries for Scraping Dynamic Content

To scrape dynamic content, you need tools that can execute JavaScript and interact with web pages like a real browser. One such tool is Playwright, a browser automation framework with an official Python library that drives Chromium, Firefox, and WebKit. Playwright allows you to simulate user interactions, execute JavaScript, and capture the resulting DOM changes.

In addition to Playwright, you may also need libraries like BeautifulSoup for parsing HTML and extracting relevant data from the rendered DOM.

Step-by-Step Guide to Using Playwright

Setup and Installation:
- Install the Python Playwright library: pip install playwright
- Install the required browser binaries (e.g., Chromium): playwright install chromium

Scraping a Dynamically Loaded Website:
- Import the necessary Playwright modules and create a browser instance.

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()

- Create a new page (Playwright puts it in a fresh browser context automatically).

    page = browser.new_page()

- Navigate to the target website.

    page.goto("https://example.com/infinite-scroll")

- Interact with the page as needed (e.g., scroll, click buttons, fill forms) to trigger dynamic content loading.

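```
# Scroll to the bottom until the page stops growing
previous_height = 0
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
    page.wait_for_timeout(1000)  # give new content a moment to load
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break  # height unchanged, so nothing new was loaded
    previous_height = current_height
```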
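Scrolling is only one kind of interaction. For pages that reveal content behind a button, a sketch like the following may help. The click_load_more helper and any selector you pass to it (for example button.load-more) are hypothetical; locator, count, first, is_visible, click, and wait_for_timeout are calls from Playwright's sync API.

```python
def click_load_more(page, button_selector, pause_ms=1000, max_clicks=20):
    """Click a "load more" button until it disappears or max_clicks is reached.

    `page` is a Playwright sync Page. Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = page.locator(button_selector)
        # Stop when the button is gone or hidden, i.e. nothing is left to load
        if button.count() == 0 or not button.first.is_visible():
            break
        button.first.click()
        # Fixed pause so the newly requested content has time to render
        page.wait_for_timeout(pause_ms)
        clicks += 1
    return clicks
```

Called as click_load_more(page, "button.load-more") after page.goto(...), it returns how many times the button was clicked. The max_clicks cap is there so a button that never goes away cannot keep the scraper running forever. Forms can be handled in the same spirit with page.fill(selector, value) followed by page.click(selector).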

- Wait for the desired content to load using Playwright's built-in wait mechanisms.

```
# Returns the matching element; raises a TimeoutError if none appears within 1 second
new_content_loaded = page.wait_for_selector(".new-content", timeout=1000)
```
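Because wait_for_selector returns immediately when a matching element is already on the page, it cannot by itself tell whether a scroll added anything new. One alternative, sketched here under assumptions, is to compare how many items exist before and after each scroll. The helper names and the item selector are hypothetical; the sketch polls with Playwright's locator(...).count() and wait_for_timeout rather than using a single built-in wait.

```python
def wait_for_more_items(page, item_selector, previous_count,
                        timeout_ms=5000, poll_ms=250):
    """Poll until more than `previous_count` elements match `item_selector`.

    Returns the new count, or `previous_count` if nothing was added in time.
    """
    waited = 0
    while waited <= timeout_ms:
        current = page.locator(item_selector).count()
        if current > previous_count:
            return current
        page.wait_for_timeout(poll_ms)
        waited += poll_ms
    return previous_count


def scroll_until_stable(page, item_selector, max_rounds=50):
    """Scroll to the bottom until no new items appear. Returns the item count."""
    count = page.locator(item_selector).count()
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        new_count = wait_for_more_items(page, item_selector, count)
        if new_count == count:
            break  # a full timeout passed without new items
        count = new_count
    return count
```

With a page already open, scroll_until_stable(page, ".item") would keep scrolling until a full timeout passes with no new items, or until max_rounds is reached.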

- Extract the desired data from the rendered DOM using Playwright's evaluation mechanisms or in combination with BeautifulSoup.

```
content = page.inner_html("body")
```
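Once the rendered HTML is in a string, BeautifulSoup can pull out individual fields. The sketch below assumes the beautifulsoup4 package is installed and uses made-up markup and class names (title, price) in place of a real page; in practice, the string passed in would be the one returned by page.inner_html("body").

```python
from bs4 import BeautifulSoup


def extract_items(html):
    """Return a list of {"title", "price"} dicts from rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select(".new-content"):
        title = card.select_one(".title")
        price = card.select_one(".price")
        items.append({
            # Missing fields become None instead of raising an error
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return items


# Hypothetical stand-in for the string returned by page.inner_html("body")
sample_html = """
<div class="new-content"><h2 class="title">Laptop</h2><span class="price">$999</span></div>
<div class="new-content"><h2 class="title">Phone</h2></div>
"""

print(extract_items(sample_html))
```

Returning None for a missing field, rather than raising, keeps one incomplete card from aborting the whole scrape.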

Here's the complete example of scraping an infinite scrolling page using Playwright:

```
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a new Chromium browser instance (headless by default)
    browser = p.chromium.launch()

    # Create a new page object
    page = browser.new_page()

    # Navigate to the target website with infinite scrolling
    page.goto("https://example.com/infinite-scroll")

    # Scroll to the bottom until the page stops growing
    previous_height = 0
    while True:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight);")

        # Give newly requested content a moment to load (1 second)
        page.wait_for_timeout(1000)

        # If the page height did not change, no new content was loaded
        current_height = page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break
        previous_height = current_height

    # Extract the desired data from the rendered DOM
    content = page.inner_html("body")

    # Close the browser instance
    browser.close()
```

Challenges and Solutions

Web scraping dynamic content can present several challenges, such as handling CAPTCHAs, IP bans, and other anti-scraping measures implemented by websites. Here are some common solutions:

- CAPTCHAs: Playwright has no built-in CAPTCHA solver, but your script can hand a challenge to a third-party solving service or a custom solution. Client libraries such as python-anticaptcha let you call such services programmatically.
- IP bans: Route traffic through rotating proxies so that requests do not all come from the same IP address. Playwright accepts a proxy setting when launching a browser, and tools like requests-html and selenium can likewise be used with proxy services like Bright Data or Oxylabs.
- Anti-scraping measures: Use randomized delays, user agent rotation, and similar tactics to make your scraper less detectable. Libraries like fake-useragent and scrapy-fake-useragent can help with user agent rotation.
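As a rough sketch of the proxy, user agent, and delay ideas above: the helpers below only build option dictionaries and pick pause lengths. The helper names are hypothetical, and the proxy address and user agent strings are placeholders, not working values. Playwright's launch accepts a proxy setting and new_context accepts a user_agent.

```python
import random

# Placeholder values for illustration only; substitute your own
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/1.0",
]
PROXY_SERVER = "http://proxy.example.com:8000"


def launch_options(proxy_server=None):
    """Keyword arguments for p.chromium.launch(**options)."""
    options = {"headless": True}
    if proxy_server:
        options["proxy"] = {"server": proxy_server}
    return options


def context_options(user_agents, rng=random):
    """Keyword arguments for browser.new_context(**options)."""
    return {"user_agent": rng.choice(user_agents)}


def random_delay_ms(low_ms=500, high_ms=2000, rng=random):
    """A randomized pause length, e.g. for page.wait_for_timeout(...)."""
    return rng.randint(low_ms, high_ms)
```

With Playwright running, these would be used as browser = p.chromium.launch(**launch_options(PROXY_SERVER)), then context = browser.new_context(**context_options(USER_AGENTS)) and page = context.new_page(), with page.wait_for_timeout(random_delay_ms()) between actions.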

Summary and Next Steps

Scraping dynamic websites requires tools that can execute JavaScript and interact with web pages like a real browser. Playwright is a powerful Python library that enables you to automate Chromium, Firefox, and WebKit browsers, making it suitable for scraping dynamic content.

However, it's essential to understand that web scraping dynamic content can be more challenging than scraping static websites due to anti-scraping measures implemented by websites. You may need to employ additional techniques like rotating proxies, handling CAPTCHAs, and mimicking real user behavior to avoid detection and ensure successful scraping.

For further learning and additional resources, consider exploring Playwright's official documentation or one of our more in-depth tutorials.
