Scrapfly for Scrapfly

Posted on Sep 18, 2024 • Originally published at scrapfly.io on Sep 18, 2024

How to Track Web Page Changes with Automated Screenshots

#screenshots #python #playwright

There are many different ways to monitor web page changes and one of the most popular techniques is screenshot tracking.

This method tracks only the final visual representation of a page, which makes it a convenient way to catch real, visible changes while ignoring updates to the underlying website code. That makes it a great basis for a web page change detection tool!

Screenshot tracking is used to track product page updates, real estate listing changes, and other visually critical web pages — all of which can be done with automatic screenshots of websites.

In this guide we'll explore automated webpage screenshots using Python and a few of its key packages. We'll use headless browsers to capture screenshots of websites, including styles and dynamic content, and image analysis tools to find the changes. Let's dive in!

Why capture webpage screenshots?

Capturing webpage screenshots programmatically using browser automation tools like Selenium or Playwright can be useful in various scenarios. Some examples of practical use cases are:

Visual Regression Testing: Capture screenshots during automated tests to compare the current UI with a baseline. This helps detect unintended visual changes after updates or code changes.
Cross-Browser Testing: Verify that a website renders correctly across different browsers and devices by capturing screenshots in each environment.
Content Archiving: Preserve the state of a website at a specific point in time, useful for archiving purposes or historical reference.
Competitor Monitoring: Capture screenshots of competitor websites to analyze changes in their marketing strategies, UI/UX design, or product offerings.
Campaign Analysis: Track the appearance and placement of advertisements or promotions on a website over time.
Performance Benchmarking: Visualize the website's appearance under different performance conditions, such as varying network speeds or server response times.

Knowing how to track changes and highlight them programmatically is an essential step for any of the above scenarios. That's why in this guide, we will focus on how to track changes in a webpage by comparing two screenshots of the same website.

Setup

Let's set up our work environment with all the required packages and tools.

Capturing Screenshots

To capture webpage screenshots programmatically, we will be using Playwright which is a web browser automation library (like Puppeteer and Selenium) and has a growing web-scraping community. Playwright is available in multiple programming languages, including Python, which we will be using in this guide.

Make sure Python is installed on your device. If not, you can install it from the official Python website.

To install Playwright with Python's package manager pip, run the following command in your terminal:

$ pip install playwright

Next, install the browser binaries for your preferred browser. We will use Chromium:

$ playwright install chromium 
# alternatively install `firefox` or `webkit` instead of `chromium`

We'll be using Playwright to capture screenshots, but to compare them we need a separate image processing tool.

Comparing Screenshots

To compare screenshots and track changes between them, we will be using ImageMagick through its Python binding - Wand. ImageMagick is a free, open-source software suite, used for editing and manipulating digital images.

To install wand with pip, run the following command in your terminal

$ pip install wand

Note that as Wand is a binding of ImageMagick, you also have to install ImageMagick itself on your device.

To verify your Wand installation, try this basic Wand script that resizes an image:

from wand.image import Image

# Open an existing image
with Image(filename='input.jpg') as img:
    # Print the original size
    print(f'Original size: {img.size}')
    # Resize the image
    img.resize(200, 200)
    # Print the new size
    print(f'Resized size: {img.size}')
    # Save the image
    img.save(filename='output.jpg')

Besides resizing, Wand and ImageMagick support many other image operations. With ImageMagick we'll be able to compare captured screenshots to find the differences between them.

Next we need a tool to schedule the capturing process.

Automating Screenshot Capturing

To automate the screenshot capturing process and monitor changes between screenshots, we will use the schedule library in Python. Schedule is a job scheduling library that allows us to run Python functions periodically using a friendly syntax.

To install schedule with pip, run the following command in your terminal

$ pip install schedule

Now schedule will allow us to run recurring screenshot capturing tasks from a single Python script. See this code to test schedule:

import schedule
import time

def job():
    print("capturing screenshot")

# every 10 seconds
schedule.every(10).seconds.do(job)

# run an endless loop checking for tasks
while True:
    schedule.run_pending()
    time.sleep(1)

Now that we have all the tools required for our screenshot capturing project, we can begin. Let's start with the Playwright screenshot capturing script.

Capturing Screenshots Using Playwright

All browser automation tools support capturing webpage screenshots. In this guide, we will provide a simple example of using Playwright to take webpage screenshots.

If you're more familiar with Selenium or Puppeteer they're mostly the same and you can check our in-depth guides for these alternatives instead.

Playwright runs in headless mode by default. This means that it launches a headless browser instance instead of a normal one.

A headless browser is a browser instance without visible GUI elements. This allows it to run much faster compared to its headful counterpart.

To capture a screenshot of a website using Playwright, we can use the Page.screenshot() method:

from pathlib import Path
from playwright.sync_api import sync_playwright

def get_screenshot(url, path):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()

        # request target web page
        page.goto(url)

        # screenshot as bytes
        image_bytes = page.screenshot()
        Path(path).write_bytes(image_bytes)

# for example:
get_screenshot("https://web-scraping.dev", "./screenshot.png")

Above we made a small function that:

Launches a Playwright-controlled Chromium browser
Starts a new browser tab
Navigates to the target webpage
Captures a screenshot of the webpage and saves it to screenshot.png

Next, let's use the above function to take two screenshots of two product variants from web-scraping.dev:

get_screenshot("https://web-scraping.dev/product/1?variant=orange-small", "product-1.png")

and our second product variant:

get_screenshot("https://web-scraping.dev/product/1?variant=cherry-small", "product-2.png")
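For a comparison to be meaningful, both screenshots should be captured under the same conditions. The variation below is a minimal sketch rather than part of the original script: it pins the viewport size, waits for network activity to settle, and captures the full page. The function name `get_stable_screenshot` and the 1280x720 default viewport are assumptions chosen for illustration.

```python
def get_stable_screenshot(url, path, width=1280, height=720):
    """Capture a full-page screenshot with a fixed viewport (illustrative sketch)."""
    # Imported inside the function so this file can still be imported
    # on machines where Playwright is not installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        # A fixed viewport keeps the page layout identical between runs
        context = browser.new_context(viewport={"width": width, "height": height})
        page = context.new_page()

        # Wait until network activity has settled so late-loading content is rendered
        page.goto(url, wait_until="networkidle")

        # full_page=True captures the whole scrollable page, not just the viewport
        page.screenshot(path=path, full_page=True)
        browser.close()
```

It can be called the same way as get_screenshot, for example `get_stable_screenshot("https://web-scraping.dev/product/1", "product.png")`.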

Now that we have captured the screenshots, we are ready to compare them and highlight the differences between them.

Comparing Screenshots

To compare the two screenshots and highlight changes, we first need to read the screenshot files using Wand's Image constructor.

Wand provides a .compare() method on the Image object that does exactly that! This method compares an image with another, and returns a reconstructed image & computed distortion. The reconstructed image will show the differences highlighted with red color by default.

from wand.image import Image

def compare_images(image1, image2, diff_path):
    # Open the two images you want to compare
    with Image(filename=image1) as img1, Image(filename=image2) as img2:
        # Compare the images; the result is a tuple containing
        # the difference image and the computed difference
        diff_image, diff_value = img1.compare(img2, metric="root_mean_square")

        print(f"The difference between the images is: {diff_value}")

        # Save the highlighted difference image to the given path
        diff_image.save(filename=diff_path)

        return (diff_image, diff_value)

compare_images("product-1.png", "product-2.png", "difference.png")

This prints the computed difference and saves difference.png, in which the changed areas are highlighted in red.

Now that we can compare two screenshots and get the computed difference between them, we need to automate this process to run regularly and notify us with detected screenshot changes.
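To build some intuition for the computed difference, here is a rough pure-Python illustration of a root mean square error over pixel values. It is a simplified sketch, not ImageMagick's implementation: it assumes two equally sized grayscale images given as flat lists of 0-255 values, and it normalizes the result to the 0-1 range. The helper name `rmse` is hypothetical.

```python
import math

def rmse(pixels_a, pixels_b, max_value=255):
    """Normalized root mean square error between two equal-length pixel lists."""
    if not pixels_a or len(pixels_a) != len(pixels_b):
        raise ValueError("images must be non-empty and have the same number of pixels")
    # Scale each difference to 0-1, square it, average, then take the square root
    squared = [((a - b) / max_value) ** 2 for a, b in zip(pixels_a, pixels_b)]
    return math.sqrt(sum(squared) / len(squared))

identical = rmse([0, 128, 255, 64], [0, 128, 255, 64])
one_pixel_flipped = rmse([0, 0, 0, 0], [255, 0, 0, 0])
print(identical)          # 0.0
print(one_pixel_flipped)  # 0.5
```

A value of 0 means the images are identical, and the value grows as more pixels differ or as individual pixels differ more strongly.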

Automating Screenshot Capturing

To streamline the process of capturing and comparing webpage screenshots, you can automate it in Python using the schedule library. You can schedule the script to run at regular intervals (e.g., daily or hourly), capture a new screenshot, compare it with the one from the last run, and send a notification if changes are detected.

Sending Notifications

To send a notification when changes are detected, you can integrate an email or messaging API (like Twilio, SendGrid, or Slack) into your script. For example, you could add the following lines to send an email when a difference is found:

import smtplib
from email.mime.text import MIMEText

def send_change_email(diff_value):
    # diff_value is the difference computed by compare_images(), not a pixel percentage
    msg = MIMEText(f"Changes detected. Computed difference: {diff_value:.4f}")
    msg["Subject"] = "Webpage Change Detected"
    msg["From"] = "your_email@example.com"
    msg["To"] = "recipient_email@example.com"

    # Placeholder host and credentials; most providers require TLS before login
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("your_email@example.com", "your_password")
        server.send_message(msg)

In the example above, we create a small function for sending an email with our screenshot comparison details.
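A notification is often more useful when the highlighted difference image is attached to it. The following is a minimal sketch using the standard library's EmailMessage class; the function name `build_change_email`, the addresses, and the attachment filename are placeholders chosen for illustration. It only builds the message and leaves sending to your SMTP setup.

```python
from email.message import EmailMessage

def build_change_email(diff_value, diff_png_bytes, filename="diff-screenshot.png"):
    """Build (but do not send) a notification email with the diff image attached."""
    msg = EmailMessage()
    msg["Subject"] = "Webpage Change Detected"
    msg["From"] = "your_email@example.com"     # placeholder address
    msg["To"] = "recipient_email@example.com"  # placeholder address
    msg.set_content(f"Changes detected. Computed difference: {diff_value:.4f}")
    # Attach the highlighted difference image so the change is visible in the email
    msg.add_attachment(
        diff_png_bytes, maintype="image", subtype="png", filename=filename
    )
    return msg

# Example with placeholder bytes; in practice, read them from the saved diff image
example = build_change_email(0.0312, b"\x89PNG placeholder")
print(example["Subject"], [part.get_filename() for part in example.iter_attachments()])
```

The returned message can be passed to `server.send_message()` in the same way as the plain-text message.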

Scheduling The Process

Now that we have all the functions needed to capture the screenshots, compare them, and send an email with the difference between them, let's schedule it to run every day.

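import time
from pathlib import Path

import schedule

def job():
    get_screenshot("https://example.com", "new-screenshot.png")
    # on the first run there is no previous screenshot to compare against
    if Path("old-screenshot.png").exists():
        diff_image, diff_value = compare_images(
            "new-screenshot.png", "old-screenshot.png", "diff-screenshot.png"
        )
        if diff_value > 0:
            send_change_email(diff_value)
        else:
            print("No change detected")
    # keep the latest capture as the baseline for the next run
    Path("new-screenshot.png").replace("old-screenshot.png")

# run the job once every day
schedule.every().day.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)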

With this schedule script running we can check for website changes through our page screenshot capture function that runs once every day. If changes are detected, an email will be sent to notify us!
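In practice, a strict `diff_value > 0` check can fire on rendering noise such as anti-aliasing or rotating ads. One hedged way to handle this is to notify only when the difference exceeds a small threshold. The sketch below is an assumption-laden illustration: the `check_for_changes` name, the 0.01 threshold, and the injected `capture`, `compare` and `notify` callables are all hypothetical. Here `compare` is assumed to return only the numeric difference, so a thin wrapper around compare_images would be needed.

```python
from pathlib import Path

def check_for_changes(url, capture, compare, notify,
                      baseline="old-screenshot.png", latest="new-screenshot.png",
                      threshold=0.01):
    """Capture a screenshot, compare it to the baseline and notify on real changes.

    Returns the computed difference, or None on the first run (no baseline yet).
    """
    capture(url, latest)
    diff_value = None
    if Path(baseline).exists():
        diff_value = compare(latest, baseline)
        # Ignore tiny differences that are likely rendering noise
        if diff_value > threshold:
            notify(diff_value)
    # Promote the latest capture to be the baseline for the next run
    Path(latest).replace(baseline)
    return diff_value
```

Passing the capture, compare and notify steps in as arguments keeps the decision logic testable without launching a browser or sending real emails. A suitable threshold depends on the page and is best tuned by observing a few runs.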

This concludes our screenshot tracking project. However, scaling up projects like this can be difficult, and this is where Scrapfly can lend you a hand!

Powering Up with ScrapFly

So far, we've explored how to capture website screenshots using basic headless browser configurations. However, many modern websites employ anti-bot measures to prevent automated screenshot capture, making it challenging to scale your efforts. This is where ScrapFly comes into play!

ScrapFly provides advanced web scraping, screenshot, and data extraction APIs designed for large-scale operations. Here's how ScrapFly can enhance your screenshot automation:

Anti-bot protection bypass - screenshot web pages without blocking!
Rotating residential proxies - prevent IP address and geographic blocks.
JavaScript rendering - screenshot dynamic web pages through cloud browsers.
Full screenshot customization - scroll and capture exact areas.
Comprehensive options - block banners, use dark mode, and more.
Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.

Using ScrapFly’s screenshot API is straightforward and can be done with a simple API request. Here’s an example of how to take screenshots using Python:

from pathlib import Path
import urllib.parse
import requests

# Base URL for ScrapFly's screenshot API
base_url = 'https://api.scrapfly.io/screenshot?'

# Define the parameters for the API request
params = {
    'key': 'Your ScrapFly API key', # Your ScrapFly API key
    'url': 'https://web-scraping.dev/product/1?variant=cherry-small', # URL of the webpage to capture
    'format': 'png', # Desired screenshot format (e.g., png, jpeg)
    'capture': 'fullpage', # Area to capture (specific element, fullpage, viewport)
    'resolution': '1920x1080', # Screen resolution for the capture
    'country': 'us', # Proxy country
    'rendering_wait': 5000, # Wait time in milliseconds before capturing
    'options': [
        'dark_mode', # Enable dark mode
        'block_banners', # Block pop-up banners
        'print_media_format' # Emulate print media format
    ],
    'auto_scroll': True # Automatically scroll down the page before capturing
}

# Convert the list of options into a comma-separated string
params['options'] = ','.join(params['options'])
query_string = urllib.parse.urlencode(params)
full_url = base_url + query_string

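# Make the API request to capture the screenshot
response = requests.get(full_url)
# Stop here on an HTTP error instead of saving an error response as an image
response.raise_for_status()
image_bytes = response.content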

# Save the screenshot to disk
Path("screenshot.png").write_bytes(image_bytes)

Scrapfly provides many of the same capabilities as Playwright in a much more performant, scalable and reliable way!

FAQ

To conclude this guide on tracking webpage changes using automated screenshots, let's address some frequently asked questions.

What is a screenshot API?

A screenshot API is a service that allows you to capture images of websites via HTTP requests, eliminating the need to manage headless browser instances directly. It enables customized screenshots using various headless browser features, such as setting the device viewport, resolution, JavaScript execution, banner blocking, and more.

What are other alternatives to take screenshots using Python?

Other browser automation tools like Selenium can be used to take screenshots in Python. Check out our dedicated guide on how to take screenshots in Python.

Screenshot APIs are also a great alternative. We have already compared them for you and picked out the best screenshot API.

How to capture screenshots in Node.js?

Puppeteer is the go-to tool for browser automation in Node.js. It has a built-in method to capture screenshots. Take a look at our guide on taking screenshots with Puppeteer. Selenium and Playwright are also available in Node.js and can be used to capture screenshots.

Summary

In this tutorial we've created a visual tracking tool using Python that monitors websites for changes.

We started by using Playwright and the screenshot() method to capture screenshots of web pages. Then we loaded the captured screenshots using Wand (which is a binding of ImageMagick) and compared them using the compare() method. Finally, we automated the process using the schedule library to run the script periodically and send email notifications when changes are detected.
