My final project for my software engineering bootcamp was a web scraping site that uses image processing to extract images from a given URL. The idea came to me while I was trying to gather a large set of images to use as a machine learning data set. I knew scraping was the way to go for that kind of collection, but I was unhappy with how brittle traditional web scrapers are.
The scraper's back end uses Selenium, Pillow, and ChromeDriver to accomplish its task. First, Selenium opens the page with ChromeDriver in headless mode and injects custom CSS into the DOM. The CSS draws a colored border around each image and places a color key in the top-left corner of the page; additional rules ensure that the images and the color key render on top of everything else. The color key is necessary because Pillow can detect RGB values of pixels differently from how they render on the webpage, so the key gives the scanner a ground-truth sample of the rendered border color. Once the CSS is injected, Selenium takes a screenshot of the entire page.
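This step might be sketched roughly as follows. The marker color, border width, z-index trick, and `<style>`-element injection are my own assumptions for illustration, not the project's actual code:

```python
# Hypothetical marker colour; the pixel scanner must look for the same value.
BORDER_COLOR = "rgb(255, 0, 255)"

# CSS injected into the page: a solid border around every <img>, plus a small
# colour-key swatch fixed to the top-left corner, both forced on top.
HIGHLIGHT_CSS = f"""
img {{
  border: 3px solid {BORDER_COLOR} !important;
  position: relative !important;
  z-index: 2147483647 !important;
}}
body::before {{
  content: "";
  position: fixed;
  top: 0;
  left: 0;
  width: 10px;
  height: 10px;
  background: {BORDER_COLOR};
  z-index: 2147483647;
}}
"""

def capture_page(url, out_path="page.png"):
    """Open the page headlessly, inject the highlight CSS, and screenshot it."""
    # Imported here so the CSS above can be inspected without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        # Append a <style> element so the rules apply to the live DOM.
        driver.execute_script(
            "const s = document.createElement('style');"
            "s.textContent = arguments[0];"
            "document.head.appendChild(s);",
            HIGHLIGHT_CSS,
        )
        driver.save_screenshot(out_path)
    finally:
        driver.quit()
```

The enormous `z-index` is one simple way to keep the borders and the key from being covered by overlays; the real project may handle stacking differently.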
The screenshot is then processed with the Python Pillow library: each pixel is scanned to see whether it matches the color sampled from the key in the top-left corner. When a pixel matches, an algorithm checks whether it is the start of an image's border. If an image is detected, its height and width are measured and used to crop the desired sub-image out of the page screenshot.
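A minimal sketch of that scanning step, using a synthetic screenshot instead of a real one; the marker color, the walk-right/walk-down measurement, and the stray-pixel heuristic are my assumptions, not the project's actual algorithm:

```python
from PIL import Image, ImageDraw

MARKER = (255, 0, 255)  # hypothetical border colour, as read back by Pillow

def find_bordered_regions(img):
    """Scan row by row for the marker colour; the first unvisited match is
    treated as the top-left corner of a border, then the border's top edge
    and left edge are walked to measure the box."""
    px = img.load()
    w, h = img.size
    boxes, seen = [], set()
    for y in range(h):
        for x in range(w):
            if px[x, y] == MARKER and (x, y) not in seen:
                x2 = x
                while x2 + 1 < w and px[x2 + 1, y] == MARKER:
                    x2 += 1  # walk right along the top edge
                y2 = y
                while y2 + 1 < h and px[x, y2 + 1] == MARKER:
                    y2 += 1  # walk down along the left edge
                if x2 - x > 2 and y2 - y > 2:  # skip stray matching pixels
                    boxes.append((x, y, x2 + 1, y2 + 1))
                    seen.update(
                        (xx, yy)
                        for yy in range(y, y2 + 1)
                        for xx in range(x, x2 + 1)
                    )
    return boxes

# Simulate a screenshot: a white page with one bordered "image" on it.
page = Image.new("RGB", (100, 100), "white")
ImageDraw.Draw(page).rectangle([20, 30, 59, 69], outline=MARKER, width=3)
crops = [page.crop(box) for box in find_bordered_regions(page)]
```

A real implementation would presumably inset each crop by the border width so the coloured frame itself is excluded from the saved image.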
The cropped images are then zipped up and sent to the front end, where the user can rename and download them. All images are deleted on the back end after the scrape is complete.
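The zip-and-clean-up step could look something like this; the in-memory archive and the file layout are assumptions for illustration:

```python
import io
import zipfile
from pathlib import Path

def bundle_and_clean(image_dir):
    """Zip every cropped image into an in-memory archive suitable for an
    HTTP response, then delete the files from the back end."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for img in sorted(Path(image_dir).glob("*.png")):
            zf.write(img, arcname=img.name)
            img.unlink()  # delete once it is safely in the archive
    buf.seek(0)
    return buf
```

Building the archive in memory avoids leaving a second artifact on disk that would itself need cleaning up.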
Advantages
The biggest advantage is that Pixel Harvester works on any website that uses image tags, regardless of HTML structure or CSS selectors.
This scraper lets you scrape the entire webpage without visiting each individual image link, which lowers the scraper's overall footprint on the website.
The scraped images are captured at the same resolution at which they appear on the website.
Challenges
The biggest challenge this scraper faces is the one all web scrapers face: bot detection. Many websites refuse to render in headless mode, so their screenshots come back blank and there are no images to capture. This scraper makes no attempt to subvert bot detection, in order to respect the terms of service of websites that do not wish to be scraped.
Given the wide range of ways websites are built, some sites that appear to work at first will fail to produce the desired results. For example, if a website displays images with the CSS `background-image` property instead of HTML `img` tags, the scraper cannot detect those images and no results will be returned.