Todd Birchard for Hackers And Slackers

Posted on Aug 8, 2020 • Originally published at hackersandslackers.com on Aug 8, 2020

Scrape Structured Data with Python and Extruct

#python #scraping #dataengineering #scraper

Unless you're entirely oblivious to scraping data in Python (and probably ended up here by accident), you're well aware that scraping data in Python begins and ends with BeautifulSoup. BeautifulSoup is Python's scraping powerhouse: we first demonstrated this in a previous post where we put together a script to fetch site metadata (title, description, preview images, etc.) from any target URL. We were able to build a scraper which fetched a target site's <meta> tags (and various fallbacks) to create a fairly reliable tool to summarize the contents of any URL, which is precisely the logic used to generate link "previews" such as these:

Link Preview: example of a preview link with data fetched via BeautifulSoup.

Perusing the various sites and entities we refer to as "the internet" has traditionally felt like navigating an unstandardized wild west. There's never a guarantee that the website you're targeting adheres to any web standards (despite their own best interests). These situations lead us to write scripts with complicated fallbacks in case the owner of myhorriblewebsite.angelfire.com somehow managed to forget to give their page a <title>, and so forth. Search engines and other big players recognized this. JSON-LD was standardized as a reliable format for site publishers to include machine-readable (and also quite human-readable) metadata to appease search engines and fight for relevancy.

This post is going to build upon the goal of scraping site metadata we previously explored with BeautifulSoup via a different method: by parsing JSON-LD metadata with Python's extruct library.

What's so great about JSON-LD, you might ask? Aside from dodging the hellish experience of traversing the DOM by hand, JSON-LD is a specification with notable advantages over old school HTML <meta> tags. The multitude of benefits can mostly be boiled down into two categories: data granularity and linked data.

Data Granularity

JSON-LD allows web pages to express an impressive amount of granular information about what each page is. For instance, here's the JSON-LD for one of my posts:

{
  "@context": "https://schema.org/",
  "@type": "Article",
  "author": {
    "@type": "Person",
    "name": "Todd Birchard",
    "image": "https://hackersandslackers-cdn.storage.googleapis.com/2020/04/todd@2x.jpg",
    "sameAs": ["https://toddbirchard.com", "https://twitter.com/ToddRBirchard"]
  },
  "keywords": "Golang, DevOps, Software Development",
  "headline": "Deploy a Golang Web Application Behind Nginx",
  "url": "https://hackersandslackers.com/deploy-golang-app-nginx/",
  "datePublished": "2020-06-01T07:30:00.000-04:00",
  "dateModified": "2020-06-01T09:03:55.000-04:00",
  "image": {
    "@type": "ImageObject",
    "url": "https://hackersandslackers-cdn.storage.googleapis.com/2020/05/golang-nginx-3.jpg",
    "width": "1000",
    "height": "523"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Hackers and Slackers",
    "founder": "Todd Birchard",
    "logo": {
      "@type": "ImageObject",
      "url": "https://hackersandslackers-cdn.storage.googleapis.com/2020/03/logo-blue-full.png",
      "width": 60,
      "height": 60
    }
  },
  "description": "Deploy a self-hosted Go web application using Nginx as a reverse proxy. ",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://hackersandslackers.com"
  }
}

Example JSON-LD for a Hackers and Slackers post.

There's significantly more information stored in the above snippet than in all other meta tags on the same page combined. JSON-LD surely supports more attributes than traditional meta tags do, yet representing the data as a JSON hierarchy makes it immediately clear how page metadata is related: we're looking at an object representing an article, written by an author, as part of an "organization."
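For context, JSON-LD typically sits in a page's HTML inside a <script type="application/ld+json"> tag. The sketch below uses only Python's standard library to pull that block out of a trimmed-down, made-up page (SAMPLE_HTML and JsonLdParser are hypothetical names, with values borrowed from the snippet above) and walk the hierarchy. It isn't how extruct works internally; it's just a quick way to see the nesting for yourself.

```python
import json
from html.parser import HTMLParser

# Hypothetical, trimmed-down page used purely for illustration.
SAMPLE_HTML = """
<html><head>
<title>Deploy a Golang Web Application Behind Nginx</title>
<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Article",
  "headline": "Deploy a Golang Web Application Behind Nginx",
  "author": {"@type": "Person", "name": "Todd Birchard"},
  "publisher": {"@type": "Organization", "name": "Hackers and Slackers"}
}
</script>
</head><body></body></html>
"""


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every ld+json script tag."""

    def __init__(self):
        super().__init__()
        self.in_json_ld = False
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when we enter a JSON-LD script tag.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_json_ld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_json_ld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # On the closing tag, parse whatever text we buffered as JSON.
        if tag == "script" and self.in_json_ld:
            self.blocks.append(json.loads("".join(self.buffer)))
            self.in_json_ld = False


parser = JsonLdParser()
parser.feed(SAMPLE_HTML)
article = parser.blocks[0]

# The hierarchy makes relationships explicit: article -> author -> name.
print(article["@type"])
print(article["author"]["name"])
print(article["publisher"]["name"])
```

Once the block is parsed, it's a plain Python dict, so "who wrote this?" becomes a key lookup rather than a DOM hunt.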

Google's explanation of the benefits of structuring metadata goes something like this:

Structured data is a standardized format for providing information about a page and classifying the page content; for example, on a recipe page, what are the ingredients, the cooking time and temperature, the calories, and so on.

Type

The term "web page" is uselessly ambiguous, as web pages are documents that can provide information in any number of forms. Web pages might be articles, recipes, product pages, events, and far more. The official schema of possible page types includes over one thousand possibilities for what "type" or "subtype" a page might be considered to be. Knowing the "type" of a page reduces ambiguity, and declaring a page "type" allows us to attach type-specific metadata to pages as well! For instance, let's compare the attributes of an Episode type to an Article type:

Episode
Property	Description
`actor`	An actor, e.g. in tv, radio, movie, video games etc., or in an event. Actors can be associated with individual items or with a series, episode, clip. Supersedes actors.
`director`	A director of e.g. tv, radio, movie, video gaming etc. content, or of an event. Directors can be associated with individual items or with a series, episode, clip. Supersedes directors.
`episodeNumber`	Position of the episode within an ordered group of episodes.
`musicBy`	The composer of the soundtrack.
`partOfSeason`	The season to which this episode belongs.
`partOfSeries`	The series to which this episode or season belongs. Supersedes partOfTVSeries.
`productionCompany`	The production company or studio responsible for the item e.g. series, video game, episode etc.
`trailer`	The trailer of a movie or tv/radio series, season, episode, etc.

Article
Property	Description
`articleBody`	The actual body of the article.
`articleSection`	Articles may belong to one or more 'sections' in a magazine or newspaper, such as Sports, Lifestyle, etc.
`backstory`	For an Article, typically a NewsArticle, the backstory property provides a textual summary giving a brief explanation of why and how an article was created. In a journalistic setting this could include information about reporting process, methods, interviews, data sources, etc.
`pageEnd`	The page on which the work ends; for example "138" or "xvi".
`pageStart`	The page on which the work starts; for example "135" or "xiii".
`pagination`	Any description of pages that is not separated into pageStart and pageEnd; for example, "1-6, 9, 55" or "10-12, 46-49".
`speakable`	Indicates sections of a Web page that are particularly 'speakable' in the sense of being highlighted as being especially appropriate for text-to-speech conversion. Other sections of a page may also be usefully spoken in particular circumstances; the 'speakable' property serves to indicate the parts most likely to be generally useful for speech.
`wordCount`	The number of words in the text of the Article.

There are obviously data attributes of television shows which don't apply to news articles (such as actors, director, etc.), and vice versa. The level of specificity achievable is nearly unfathomable when we discover that types have subtypes. For instance, our article might be an opinion piece article, which has extended the Article type with even more attributes.
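Since the properties worth reading depend on a page's type, a scraper can branch on @type before digging any further. Here's a rough sketch of that idea. The TYPE_PROPERTIES mapping and the type_specific helper are hypothetical names, the property lists are just a handful of entries from the tables above, and the sample items are made up. Note that @type may be a single string or a list of strings.

```python
from typing import Any, Dict, List

# A few type-specific properties from the tables above; not exhaustive.
TYPE_PROPERTIES: Dict[str, List[str]] = {
    "Episode": ["episodeNumber", "partOfSeason", "partOfSeries", "director"],
    "Article": ["articleSection", "wordCount", "pageStart", "pageEnd"],
}


def type_specific(json_ld: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the properties relevant to the item's declared type."""
    declared = json_ld.get("@type", [])
    # Normalize: "@type" may be one string or a list of strings.
    types = [declared] if isinstance(declared, str) else list(declared)
    wanted = [prop for t in types for prop in TYPE_PROPERTIES.get(t, [])]
    return {prop: json_ld[prop] for prop in wanted if prop in json_ld}


# Made-up items for illustration.
episode = {"@type": "Episode", "episodeNumber": 4, "name": "Hypothetical Episode"}
article = {"@type": ["Article"], "wordCount": 1200, "episodeNumber": 4}

print(type_specific(episode))  # {'episodeNumber': 4}
print(type_specific(article))  # {'wordCount': 1200}
```

The second item carries a stray episodeNumber, which gets ignored because it isn't a property we associate with Article.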

Who

All content has a creator, yet content-creators can take many forms. Authors, publishers, and organizations could simultaneously be considered the responsible party for any given content, as these properties are not mutually exclusive. For instance, here's how my author data is parsed on posts like this one:

Author
Property	Description
@type	Person
name	Todd Birchard
image	https://hackersandslackers-cdn.storage.googleapis.com/2020/04/todd@2x.jpg
sameAs	https://toddbirchard.com/
sameAs	https://twitter.com/ToddRBirchard

What makes this data especially interesting is the values listed under the sameAs attribute, which associates the "Todd Birchard" in question to the very same Todd Birchard of the website https://toddbirchard.com/, and Twitter account https://twitter.com/ToddRBirchard. This undoubtedly assists search engines in making associations between entities on the web. Still, a keen imagination may easily recognize the opportunity to leverage these strong associations to dox or harass strangers on the internet quite easily.
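One practical wrinkle: sameAs can hold a single URL or a list of them, so it helps to normalize the value before following any links. A minimal sketch, assuming the author dict shown above (same_as_links is a hypothetical helper name, not part of any library):

```python
from typing import Any, Dict, List


def same_as_links(entity: Dict[str, Any]) -> List[str]:
    """Return an entity's `sameAs` values as a list of URLs."""
    value = entity.get("sameAs")
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        # Keep only string entries; ignore anything unexpected.
        return [item for item in value if isinstance(item, str)]
    return []


author = {
    "@type": "Person",
    "name": "Todd Birchard",
    "sameAs": ["https://toddbirchard.com", "https://twitter.com/ToddRBirchard"],
}

for link in same_as_links(author):
    print(f"{author['name']} is also found at {link}")
```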

Scrape Something Together

Along with Extruct, we'll be installing our good friend requests to fetch pages for us:

$ pip3 install requests extruct

Install libraries

You already know the drill — pick a single URL for now and loot them for all they've got by returning .text from our request's response:

import requests

def get_html(url):
    """Get raw HTML from a URL."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return req.text

Retrieve a page's HTML.

Simple stuff. Here's where extruct comes in; I'm tossing together a function called get_metadata, which will do precisely what you'd assume. We can take the raw HTML we grabbed with get_html and pass it to our new function to pillage:

"""Fetch structured JSON-LD data from a given URL."""
from pprint import pprint
import requests
import extruct
from w3lib.html import get_base_url

def scrape(url):
    """Parse structured data from a target page."""
    html = get_html(url)
    metadata = get_metadata(html, url)
    pprint(metadata, indent=2, width=150)
    return metadata

...

def get_metadata(html: str, url: str):
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        # Use the page's <base> URL if it declares one, otherwise `url`.
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
        uniform=True
    )['json-ld']
    # extruct returns a list of JSON-LD items; keep the first one.
    if bool(metadata) and isinstance(metadata, list):
        metadata = metadata[0]
    return metadata

Getting structured data with extruct.

Using extruct is as easy as passing raw HTML as a string along with a "base URL" via extruct.extract(html, base_url=url). The base URL is the address that any relative links in the page's markup get resolved against. The page you're on right now is https://hackersandslackers.com/scrape-metadata-json-ld/, so unless the page says otherwise, that URL is the base. Pages can override this with a <base> tag in their <head>, which is why we don't pass the URL along blindly: w3lib (a library extruct already depends on) has a function to handle this exact task, hence our usage of base_url=get_base_url(html, url). It returns the URL from the page's <base> tag when there is one and falls back to the URL we passed in when there isn't.
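One practical reason to care about a base URL is that links found in a page's markup may be relative, and they only become useful once resolved against something absolute. Here's a small standard-library sketch of that resolution step. The relative paths are made up for illustration, and this shows urljoin rather than anything extruct-specific:

```python
from urllib.parse import urljoin

# The page being scraped doubles as the base URL in this sketch.
base_url = "https://hackersandslackers.com/scrape-metadata-json-ld/"

# Hypothetical relative paths, as a page's markup might contain them.
root_relative = urljoin(base_url, "/content/images/logo.png")
page_relative = urljoin(base_url, "cover.jpg")

print(root_relative)  # https://hackersandslackers.com/content/images/logo.png
print(page_relative)  # https://hackersandslackers.com/scrape-metadata-json-ld/cover.jpg
```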

This is what our extract function returns, using one of my posts as an example:

{ '@context': 'https://schema.org/',
  '@type': 'Article',
  'author': { '@type': 'Person',
              'image': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/04/todd@2x.jpg',
              'name': 'Todd Birchard',
              'sameAs': ['https://toddbirchard.com', 'https://twitter.com/ToddRBirchard']},
  'dateModified': '2020-06-11T16:57:57.000-04:00',
  'datePublished': '2018-11-11T08:35:09.000-05:00',
  'description': "Use Python's BeautifulSoup library to assist in the honest act of systematically stealing data without permission.",
  'headline': 'Scraping Data on the Web with BeautifulSoup',
  'image': { '@type': 'ImageObject',
             'height': '523',
             'url': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/06/beautifulsoup-1-1.jpg',
             'width': '1000'},
  'keywords': 'Python, Data Engineering',
  'mainEntityOfPage': {'@id': 'https://hackersandslackers.com', '@type': 'WebPage'},
  'publisher': { '@type': 'Organization',
                 'founder': 'Todd Birchard',
                 'logo': { '@type': 'ImageObject',
                           'height': 60,
                           'url': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/03/logo-blue-full.png',
                           'width': 60},
                 'name': 'Hackers and Slackers'},
  'url': 'https://hackersandslackers.com/scraping-urls-with-beautifulsoup/'}

JSON-LD data for a Hackers and Slackers post.

One of the keyword arguments we passed to extruct was syntaxes, an optional argument where we specify which flavor of structured data we're after (apparently there's more than one). Possible options to pass are 'microdata', 'json-ld', 'opengraph', 'microformat', and 'rdfa'. If nothing is passed, extruct will attempt to fetch all of the above. Either way, the results come back in a dictionary keyed by syntax, which is why we follow up our extruct call by accessing the ['json-ld'] key.
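To make the "dictionary keyed by syntax" idea concrete, here's a sketch that works on a hand-written stand-in for such a result rather than calling extruct itself. The results dict and the first_available helper are assumptions for illustration, and the values are made up:

```python
from typing import Any, Dict, List, Optional

# Hand-written stand-in for a result dict keyed by syntax; values are made up.
results: Dict[str, List[Dict[str, Any]]] = {
    "microdata": [],
    "json-ld": [{"@type": "Article", "headline": "Scraping Data on the Web with BeautifulSoup"}],
    "opengraph": [],
}


def first_available(data: Dict[str, List[Dict[str, Any]]],
                    preferred: List[str]) -> Optional[Dict[str, Any]]:
    """Return the first item from the first syntax that yielded anything."""
    for syntax in preferred:
        items = data.get(syntax) or []
        if items:
            return items[0]
    return None


item = first_available(results, ["microdata", "json-ld", "opengraph"])
print(item["headline"] if item else "No structured data found")
```

A preference list like this is one way to fall back from one syntax to another when a site only publishes some of them.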

Dealing with Inconsistent Results

You might be wondering why we index [0] after getting our results from extruct. This is a symptom of structured data: where traditional <meta> tags are predictably 1-dimensional, the "structure" of structured data is flexible and determined by developers. This level of flexibility gives developers the power to do things like define a site's share image as a list of dicts as opposed to a single dict. This makes the output of any given site's data unpredictable, which poses problems for Python scripts which are unaware of whether they should be searching a list index or accessing a dictionary value.

The way I handle this is by explicitly checking the Python type of data being returned before extracting it:

from typing import Optional
...

def render_json_ltd(url: str, html: str) -> Optional[dict]:
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
        uniform=True
    )['json-ld']
    # The result may be a list of items; if so, keep the first one.
    if bool(metadata) and isinstance(metadata, list):
        metadata = metadata[0]
    return metadata

Check the "type" of structured data

This uncertainty of returned data types occurs everywhere. In the example where a page may have multiple meta images, I might write a function like get_image() below, where I explicitly check the type of data being returned for a given attribute while traversing the data tree:

...

def get_image(_data) -> Optional[str]:
    """Scrape `share image` from parsed metadata."""
    image = None
    if bool(_data):
        # A list of images: work with the first entry.
        if isinstance(_data, list):
            _data = _data[0]
        # An ImageObject dict keeps its URL under `url`.
        if isinstance(_data, dict):
            image = _data.get('url')
        # A plain string is already the image URL.
        elif isinstance(_data, str):
            image = _data
    return image

Extract data depending on type
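Since the same list-or-dict-or-string dance repeats for nearly every attribute, one option is to fold it into a single helper. This is only a sketch: pick is a hypothetical name, the example.com URLs are placeholders, and it assumes you want the first entry whenever a list shows up.

```python
from typing import Any, Optional


def pick(value: Any, key: str) -> Optional[str]:
    """Reduce a str, dict, or list of either down to a single string."""
    if isinstance(value, list):
        # Use the first entry, then fall through to the checks below.
        value = value[0] if value else None
    if isinstance(value, dict):
        value = value.get(key)
    return value if isinstance(value, str) else None


# The same attribute expressed three different ways, plus a missing value.
print(pick("https://example.com/a.jpg", "url"))
print(pick({"@type": "ImageObject", "url": "https://example.com/b.jpg"}, "url"))
print(pick([{"url": "https://example.com/c.jpg"}, {"url": "https://example.com/d.jpg"}], "url"))
print(pick(None, "url"))
```

The tradeoff is that a generic helper hides per-attribute quirks, so explicit functions may still be the better fit when a field needs special handling.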

Put it to Work

A script to fetch and return structured data from a site would look something like this:

"""Fetch structured JSON-LD data from a given URL."""
from pprint import pprint
import requests
import extruct
from w3lib.html import get_base_url


def scrape(url):
    """Parse structured data from a target page."""
    html = get_html(url)
    metadata = get_metadata(html, url)
    pprint(metadata, indent=2, width=150)
    return metadata


def get_html(url):
    """Get raw HTML from a URL."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return req.text


def get_metadata(html: str, url: str):
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        # Use the page's <base> URL if it declares one, otherwise `url`.
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
        uniform=True
    )['json-ld']
    # extruct returns a list of JSON-LD items; keep the first one.
    if bool(metadata) and isinstance(metadata, list):
        metadata = metadata[0]
    return metadata

Scrape a single page for structured data.

Testing our Scraper
Since we're grownups, it's best if we write a simple test or two for a script that could potentially be run on a massive scale. The bare minimum we could do is point our scraper to a site containing structured data and compare the output to the data we'd expect to see. Below is a small test written with Pytest to see that our scrape() function outputs data which matches a hardcoded copy of what I expect to get back:

"""Validate JSON-LD Scrape outcome."""
import pytest
from extruct_tutorial import scrape


@pytest.fixture
def url():
    """Target URL to scrape metadata."""
    return 'https://hackersandslackers.com/creating-django-views/'


@pytest.fixture
def expected_json():
    """Expected metadata to be returned."""
    return {'@context': 'https://schema.org/', '@type': 'Article',
            'author': {'@type': 'Person', 'name': 'Todd Birchard',
                       'image': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/04/todd@2x.jpg',
                       'sameAs': ['https://toddbirchard.com', 'https://twitter.com/ToddRBirchard']},
            'keywords': 'Django, Python, Software Development', 'headline': 'Creating Interactive Views in Django',
            'url': 'https://hackersandslackers.com/creating-django-views/',
            'datePublished': '2020-04-23T12:21:00.000-04:00', 'dateModified': '2020-05-02T13:31:33.000-04:00',
            'image': {'@type': 'ImageObject',
                      'url': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/04/django-views-1.jpg',
                      'width': '1000', 'height': '523'},
            'publisher': {'@type': 'Organization', 'name': 'Hackers and Slackers', 'founder': 'Todd Birchard',
                          'logo': {'@type': 'ImageObject',
                                   'url': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/03/logo-blue-full.png',
                                   'width': 60, 'height': 60}},
            'description': 'Create interactive user experiences by writing Django views to handle dynamic content, submitting forms, and interacting with data.',
            'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://hackersandslackers.com'}}


def test_scrape(url, expected_json):
    """Match scrape's fetched metadata to known value."""
    metadata = scrape(url)
    assert metadata == expected_json

test_scrape.py

Build a Metadata Scraper

Of course, scrape() simply puts data on a silver platter for you - there's still the work of grabbing the values. To give you an example of a fully fleshed-out script to scrape metadata with extruct, I'll share with you my own personal treasure: the script I use to generate link previews:

"""Fetch structured JSON-LD data from a given URL."""
from typing import Optional, List
import requests
import extruct
from w3lib.html import get_base_url


def scrape(url: str) -> list:
    """Parse structured data from a URL."""
    # Fetch the page, then parse its JSON-LD with the helpers below.
    html = get_html(url)
    json_ld = get_metadata(html, url)
    card = ["bookmark", {
                "type": "bookmark",
                "url": get_canonical(json_ld),
                "metadata": {
                    "url": get_canonical(json_ld),
                    "title": get_title(json_ld),
                    "description": get_description(json_ld),
                    "author": get_author(json_ld),
                    "publisher": get_publisher(json_ld),
                    "thumbnail": get_image(json_ld),
                    }
                }
            ]
    return card


def get_html(url):
    """Get raw HTML from a URL."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return req.text


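def get_metadata(html, url):
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        # Use the page's <base> URL if it declares one, otherwise `url`.
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
        uniform=True
    )['json-ld']
    # extruct returns a list of JSON-LD items; keep the first one.
    if bool(metadata) and isinstance(metadata, list):
        metadata = metadata[0]
    return metadata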


def get_title(json_ld: dict) -> Optional[str]:
    """Fetch title via extruct."""
    title = None
    if bool(json_ld) and json_ld.get('headline'):
        if isinstance(json_ld.get('headline'), list):
            title = json_ld['headline'][0]
        elif isinstance(json_ld.get('headline'), str):
            title = json_ld.get('headline')
        if isinstance(title, str):
            return title.replace("'", "")
    if bool(json_ld) and json_ld.get('title'):
        if isinstance(json_ld.get('title'), list):
            title = json_ld['title'][0]
        elif isinstance(json_ld.get('title'), str):
            title = json_ld.get('title')
    return title


def get_image(json_ld: dict) -> Optional[str]:
    """Fetch share image via extruct."""
    image = None
    if bool(json_ld) and json_ld.get('image'):
        image = json_ld['image']
        # A list of images: take the first entry.
        if isinstance(image, list):
            image = image[0]
        # An ImageObject dict: the URL lives under `url`.
        if isinstance(image, dict):
            image = image.get('url')
    # Anything other than a string at this point isn't a usable URL.
    return image if isinstance(image, str) else None


def get_description(json_ld: dict) -> Optional[str]:
    """Fetch description via extruct."""
    if bool(json_ld) and json_ld.get('description'):
        return json_ld['description']
    return None


def get_author(json_ld: dict) -> Optional[str]:
    """Fetch author name via extruct."""
    author = None
    if bool(json_ld) and json_ld.get('author'):
        if isinstance(json_ld['author'], list):
            author = json_ld['author'][0].get('name')
        elif isinstance(json_ld['author'], dict):
            author = json_ld['author'].get('name')
    return author


def get_publisher(json_ld: dict) -> Optional[str]:
    """Fetch publisher name via extruct."""
    publisher = None
    if bool(json_ld) and json_ld.get('publisher'):
        if isinstance(json_ld['publisher'], list):
            publisher = json_ld['publisher'][0].get('name')
        elif isinstance(json_ld['publisher'], dict):
            publisher = json_ld['publisher'].get('name')
    return publisher


def get_canonical(json_ld: dict) -> Optional[str]:
    """Fetch canonical URL via extruct."""
    canonical = None
    if bool(json_ld) and json_ld.get('mainEntityOfPage'):
        if isinstance(json_ld['mainEntityOfPage'], dict):
            canonical = json_ld['mainEntityOfPage'].get('@id')
        elif isinstance(json_ld['mainEntityOfPage'], str):
            return json_ld['mainEntityOfPage']
    return canonical

Metadata scraper with extruct

One More For the Toolbox

Unless you're actually looking to create link previews like the one I included, using extruct as a standalone library without a more extensive plan or toolkit isn't going to deliver much to you other than an easy interface for getting better metadata from individual web pages. Instead, consider looking at the bigger picture of what a single page's metadata gives us. We now have effortless access to information that crawlers can use to move through sites, associate data with individuals, and ultimately create a picture of an entity's entire web presence, whether that entity is a person, organization, or whatever.

If you look closely, one of extruct's main dependencies is actually BeautifulSoup. You could argue that you may have been able to write this library yourself, and you might be right, but that isn't the point. Data mining behemoths aren't nuclear arsenals; they're collections of tools used in conjunction cleverly to wreak havoc upon the world as efficiently as possible. We're getting there.

This has been a quick little script, but if you're interested I've thrown the source up on GitHub here:

hackersandslackers / jsonld-scraper-tutorial

🌎 🖥 Supercharge your scraper to extract quality page metadata by parsing JSON-LD data via Python's extruct library.

Structured Data Scraping Tutorial

This repository contains source code for the accompanying tutorial on Hackers and Slackers: https://hackersandslackers.com/scrape-metadata-json-ld/

Installation

Installation via requirements.txt:

$ git clone https://github.com/hackersandslackers/jsonld-scraper-tutorial.git
$ cd jsonld-scraper-tutorial
$ python3 -m venv myenv
$ source myenv/bin/activate
$ pip3 install -r requirements.txt
$ python3 main.py

Installation via Pipenv:

$ git clone https://github.com/hackersandslackers/jsonld-scraper-tutorial.git
$ cd jsonld-scraper-tutorial
$ pipenv shell
$ pipenv update
$ python3 main.py

Installation via Poetry:

$ git clone https://github.com/hackersandslackers/jsonld-scraper-tutorial.git
$ cd jsonld-scraper-tutorial
$ poetry shell
$ poetry update
$ poetry run

Usage

To change the URL targeted by this script, update the URL variable in config.py.

Hackers and Slackers tutorials are free of charge. If you found this tutorial helpful, a small donation would be greatly appreciated to keep us in business. All proceeds go towards coffee, and…

Until next time.
