Foreword: Where I work, we have these things called Lunch and Learns, where people in the company give a talk to everyone else. Sometimes it's a client overview, other times it's about scuba diving, and sometimes it's just to introduce new people. I gave a talk about web scraping and how it could help your day-to-day business, personal, or other work. This is the presentation I gave; it might not make a ton of sense standalone, but I wanted to share. Link to original presentation.
Web Scraping L&L
I'll take structured data for $100, Alex.
Overview
The purpose of web scraping, or data mining, is to transform web data into structured data you can work with in different formats. There is a huge industry around data mining, web automation, and web scraping. I've put together an example of how to do a simple scrape if you run into data you need to structure yourself.
Presentation Tools
These are the tools I used during the presentation:
https://data-miner.io/ (chrome extension)
https://data-miner.io/quick-guides/menu
https://sheets.google.com (the IMPORTXML() and IMPORTHTML() functions)
Sites we scraped from:
- https://vigilante.pw/
- https://broadbandnow.com/Cable
- https://www.npmjs.com/
- https://www.linkedin.com/mynetwork/invite-connect/connections/
- https://admin.google.com
Challenges
In order to scrape websites using Data Miner, you will save yourself a lot of time by watching the tutorial videos. They show you how to use the tool effectively in basic situations. As you need more advanced features, you may need to learn CSS selectors, jQuery selectors, or XPath selectors. Additionally, for more complex scraping tasks you may need a commercial account from data-miner.io, or to move to an open source framework like Scrapy/Portia.
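As a quick illustration of the selector flavors, here is the same table cell pulled with a CSS selector and with an XPath selector using Python's parsel library (the selector engine behind Scrapy); the HTML fragment is made up for the example:

from parsel import Selector

# A tiny made-up HTML fragment, just to show the two selector styles
html = """
<table>
  <tr><td class="db">000webhost.com Forum</td><td>34,368</td></tr>
  <tr><td class="db">007.no</td><td>5,344</td></tr>
</table>
"""

sel = Selector(text=html)

# CSS selector: the text of every cell with class "db"
print(sel.css("td.db::text").getall())

# The equivalent XPath selector
print(sel.xpath("//td[@class='db']/text()").getall())

Both calls return the same list of strings; which syntax you learn first is mostly a matter of taste.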
JavaScript
One of the biggest challenges in web scraping is dealing with JavaScript. Sites that use Angular, Vue, or React will not render well for a typical request-based web scraper. Data Miner already handles this well for basic use cases, as it uses your browser's post-rendered HTML to scrape. A scraping library needs to deal with the JavaScript first, either via a headless browser or another option. There are commercial options for proxy-loading HTML that will pre-render sites before your parser analyzes the HTML, and there are projects like Puppeteer that let you run a headless Chrome browser natively (not the same as PhantomJS/CasperJS).
The Scrapy ecosystem has a great project called Splash, a Dockerized, API-driven headless web browser. Your spider simply makes requests to the API and Splash handles the rendering. Splash has been very useful in many cases where an automated scraper needs to deal with a login page that requires JavaScript.
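As a rough sketch, assuming you have the Splash Docker image running locally on its default port (docker run -p 8050:8050 scrapinghub/splash), a plain Requests call to Splash's render.html endpoint returns the fully rendered HTML of a JavaScript-heavy page:

import requests

# Assumes a local Splash container, e.g.: docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = "http://localhost:8050/render.html"

# Splash loads the page, runs its JavaScript, and hands back the rendered HTML
response = requests.get(
    SPLASH_URL,
    params={
        "url": "https://example.com/",  # placeholder for the site you actually want
        "wait": 2,                      # seconds to let the JavaScript settle
    },
)

print(response.text[:500])  # post-render HTML, ready for your parser

From there you can feed response.text into whatever parser you prefer; scrapy-splash wraps this same API for spiders.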
Scrapy/Portia
Scrapy and Portia are open source projects with commercial services available if you need them. Scrapy is a Python framework (built on top of Twisted) for building and deploying web scrapers, spiders, and crawlers. Scrapy is easy to start out with, and it scales to very advanced use cases if the need arises. Portia is an open source application that provides a visual way to build scraping recipes. Portia can be self-hosted or hosted as a service. I run a local Portia instance via Docker, and while it's neat, it's problematic and crashes frequently. This would be frustrating for new users.
https://github.com/scrapinghub/portia
https://github.com/scrapy/scrapy
https://github.com/scrapinghub/learn.scrapinghub.com
https://github.com/scrapy-plugins/scrapy-splash
https://django-dynamic-scraper.readthedocs.io/en/latest/
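To give a feel for how little code a basic spider needs, here is a minimal sketch pointed at quotes.toscrape.com, the public practice site used in Scrapy's own tutorial (not one of the sites from the presentation):

import scrapy


class QuotesSpider(scrapy.Spider):
    """Fetch a page, yield structured records, and follow pagination."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote block on the page becomes one structured record
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the "Next" link, if there is one, and parse it the same way
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Save it as quotes_spider.py and run scrapy runspider quotes_spider.py -o quotes.json to get the structured records as a JSON file with no extra plumbing.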
Frameworkless Python
If you would like to write a scraping bot from scratch, with no framework overhead, BeautifulSoup4 and Requests are a great way to go. You can develop multistage scrapers in about 20 lines of code, but you need to understand the libraries and methods ahead of time. BS4 has excellent documentation, as does Requests, and nearly any beginner Pythonista could get started with them. There is also a very handy Python library called Newspaper3k that automatically pulls the core content out of pages (like newspaper article text). If you are looking to pull a large corpus of content for tasks like AI or ML, this is a great module to help you focus not on scraping, but on what to do with the content you are scraping.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
http://docs.python-requests.org/en/master/
https://newspaper.readthedocs.io/en/latest/
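As a minimal sketch (again using the public practice site rather than anything from the presentation), a Requests plus BeautifulSoup scraper fits in a handful of lines:

import requests
from bs4 import BeautifulSoup

# Fetch the page; swap in whatever URL you actually want to scrape
response = requests.get("https://quotes.toscrape.com/")
response.raise_for_status()

# Parse the HTML and pull out structured records
soup = BeautifulSoup(response.text, "html.parser")
records = []
for quote in soup.select("div.quote"):
    records.append({
        "text": quote.select_one("span.text").get_text(strip=True),
        "author": quote.select_one("small.author").get_text(strip=True),
    })

print(records[:3])

And Newspaper3k boils article extraction down to a few calls; the URL below is just a placeholder:

from newspaper import Article

# Newspaper3k figures out the main article content on its own
article = Article("https://example.com/some-news-story")
article.download()
article.parse()

print(article.title)
print(article.text[:300])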
Node Scraping
I haven't done much research on scraping with Node, but I've read a lot of articles about it. The biggest barrier to entry for me was that any request library that didn't use promises was too easily hung up. I tried some, but I really enjoy developing in Python/Jupyter. Here are some resources for starting web scraping in Node.
Framework: https://expressjs.com/
Request library: https://github.com/mikeal/request or https://github.com/axios/axios
HTML Parser: https://github.com/MatthewMueller/cheerio or https://github.com/jsdom/jsdom
Command Line
Sometimes, you just want to grab data directly from the command line. There are two tools that will make this remarkably simple: pup and jq.
Example:
curl -s "https://vigilante.pw/" \
| pup 'table tr json{}' \
| jq '.[] | {"entries": .children[0].text, "database": .children[1].text, "hashing": .children[2].text, "category": .children[3].text, "date": .children[4].text, "acknowledged": .children[5].text}' \
| head -40
{
  "entries": "34,368",
  "database": "000webhost.com Forum",
  "hashing": "vB",
  "category": "Hosting",
  "date": "2015-10",
  "acknowledged": null
}
{
  "entries": "632,595",
  "database": "000webhost.com Mailbox",
  "hashing": "plaintext",
  "category": "Hosting",
  "date": "2015-10",
  "acknowledged": null
}
{
  "entries": "15,311,565",
  "database": "000webhost.com Main",
  "hashing": "plaintext",
  "category": "Hosting",
  "date": "2015-10",
  "acknowledged": null
}
{
  "entries": "5,344",
  "database": "007.no",
  "hashing": "SHA-1 *MISSING SALTS*",
  "category": "Gaming",
  "date": null,
  "acknowledged": null
}
This example uses the vigilante.pw website we looked at earlier. On the command line, curl acts as the requestor, pup extracts just the table's rows and transforms them into JSON, then jq processes the JSON into a workable dataset you could use in any other web application. jq could further remove the commas from numbers and normalize other text if needed.
Bonus Round
Put this in a Google Sheets cell.
=IMPORTHTML("https://vigilante.pw/","table",1)
With the companion IMPORTXML() function you can pull nearly any XPath query into Google Sheets, enabling you to create custom dashboards of web content.
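For example, a sketch with a placeholder page and XPath (adjust both to the site you care about):

=IMPORTXML("https://example.com/","//h2")

That would drop the text of every h2 heading on the page into a column of the sheet.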