When I was building PicPerf's page analyzer, I needed to figure out how to identify every image loaded on a particular page. It sounded like a simple task – scrape the HTML for <img> tags, pull off the src attributes, and profit. I'm using Puppeteer, so I started stubbing out something like this:
const browser = await puppeteer.launch(launchArgs);
const page = await browser.newPage();
await page.goto(url, { waitUntil: "networkidle2" });
const images = await page.evaluate(() => {
return Array.from(document.getElementsByTagName("img")).map(
(img) => img.src,
);
});
// Do something w/ an array of URLs...
It didn't take long to realize how insufficient that would be.
First, not every image is loaded via an <img> tag. If I didn't want to miss anything, I'd also need to parse the contents of CSS files, <style> tags, and style attributes.
I started going down this path and it was not pretty. You can't just pluck a src attribute off a blob of CSS. You gotta be willing to make shameful choices, like writing a regular expression to pull URLs out of chunks of markup:
const images = await page.evaluate(() => {
function extractImagesFromMarkup(markup) {
return (
markup?.match(
/(?:https?:\/)?\/[^ ,]+\.(jpg|jpeg|png|gif|webp|avif)(?:\?[^ "')]*)?/gi,
) || []
);
}
return {
imageTags: Array.from(document.querySelectorAll("img"))
.map((el) => {
return extractImagesFromMarkup(el.getAttribute("src"));
})
.flat(),
styleAttributes: Array.from(document.querySelectorAll("*"))
.map((el) => {
return extractImagesFromMarkup(el.getAttribute("style"));
})
.flat(),
styleTags: Array.from(document.querySelectorAll("style"))
.map((el) => {
return extractImagesFromMarkup(el.innerHTML);
})
.flat(),
};
});
const { imageTags, styleAttributes, styleTags } = images;
I'm sorry you had to see that. And it doesn't even cover every case (like <picture> elements or .css file contents). I was bound to miss something.
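Just to illustrate how the scraping approach keeps sprawling: handling <picture> elements alone would mean yet another query inside that same page.evaluate() callback, something roughly like this (a sketch only, reusing the extractImagesFromMarkup() helper above, and still subject to the same regex limitations around relative URLs):
const pictureSources = Array.from(
document.querySelectorAll("picture source"),
)
.map((el) => {
// srcset can list multiple candidates, e.g. "small.jpg 600w, big.jpg 2000w"
return extractImagesFromMarkup(el.getAttribute("srcset"));
})
.flat();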
Second, even if I could reliably find every image in the code, it doesn't mean every one would be downloaded and rendered on page load. Any given website could have a mound of CSS media queries that load images only on certain screen sizes, or responsive images that leave it up to the browser:
<img
src="ur-mom-2000px.jpg"
srcset="ur-mom-600px.jpg 600w, ur-mom-2000px.jpg 2000w"
sizes="(max-width: 600px) 100vw, 2000px"
alt="Your Mother"
>
If I wanted this page analyzer to be reasonably accurate, I needed only the images a real user would actually have to wait on when a real browser was fired up, and I wasn't interested in trading my soul to write an even more clever chunk of code to pull it off.
Don't Scrape. Listen for Requested Images
I eventually realized I wasn't limited to scraping a bunch of cold, hard HTML when using a tool like Puppeteer. I could set up a listener to capture images that were actually downloaded during a browser session.
That's easy enough to set up. First, Puppeteer's request interception feature needed to be enabled when the page was created:
const browser = await puppeteer.launch(launchArgs);
const page = await browser.newPage();
// Enable requests to be intercepted.
await page.setRequestInterception(true);
Then, the request handler could be built out like so:
// Will collect image URLs in here.
const imageUrls = [];
page.on("request", (req) => {
if (req.isInterceptResolutionHandled()) return;
// Do stuff here.
return req.continue();
});
That first line calling isInterceptResolutionHandled() is important – I used it to bow out early if the incoming request had already been handled by a different event listener. (Technically, this isn't critical if you know you're the only one listening, but it's good practice nonetheless.) Between that and req.continue(), I could start collecting images.
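To make that concrete, here's a contrived sketch (not from the real analyzer) of two listeners sharing interception duties. Whichever one resolves the request first wins, and the guard keeps the other from trying to resolve it a second time:
// First listener handles the request...
page.on("request", (req) => {
if (req.isInterceptResolutionHandled()) return;
return req.continue();
});
// ...so this one sees it as already resolved and backs off,
// avoiding a double-resolution error.
page.on("request", (req) => {
if (req.isInterceptResolutionHandled()) return;
return req.continue();
});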
Filtering Out the Junk
I just wanted image requests, but as I filtered, I set things up to abort() requests to domains that didn't impact how the page was rendered (it'd save some analysis time too). For the most part, that meant hefty analytics requests:
const DOMAIN_BLACKLIST = [
"play.google.com",
"ad-delivery.net",
"youtube.com",
"track.hubspot.com",
"googleapis.com",
"doubleclick.net",
// Many, many more...
];
Then, it was a matter of aborting the request when its hostname was found in the list:
page.on("request", (req) => {
if (req.isInterceptResolutionHandled()) return;
const urlObj = new URL(req.url());
// Block requests.
if (DOMAIN_BLACKLIST.includes(urlObj.hostname)) {
return req.abort();
}
return req.continue();
});
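One caveat with an exact includes() check like that: it won't match subdomains, so "youtube.com" in the list wouldn't block a request to "www.youtube.com". If that mattered, a small (hypothetical) suffix-matching helper would cover it:
// Treat any subdomain of a blacklisted domain as blocked.
function isBlacklisted(hostname) {
return DOMAIN_BLACKLIST.some(
(domain) => hostname === domain || hostname.endsWith(`.${domain}`),
);
}
// ...then swap it into the handler:
// if (isBlacklisted(urlObj.hostname)) return req.abort();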
With that out of the way, I could focus on collecting images into that imageUrls variable, but only if they were in my list of permitted extensions.
const imageExtensions = [
"jpg",
"jpeg",
"png",
"gif",
"webp",
"avif",
"svg",
];
I also left out any data: sources, since I wanted only fully qualified image URLs.
page.on("request", (req) => {
if (req.isInterceptResolutionHandled()) return;
const urlObj = new URL(req.url());
if (DOMAIN_BLACKLIST.includes(urlObj.hostname)) {
return req.abort();
}
const fileExtension = urlObj.pathname.split(".").pop();
if (
req.resourceType() === "image" &&
// Must be a permitted extension.
imageExtensions.includes(fileExtension) &&
// No data sources.
!req.url().includes("data:")
) {
imageUrls.push(req.url());
}
return req.continue();
});
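A small defensive tweak worth considering here (not something the snippet above does): pathname extensions can show up uppercase, so normalizing before the lookup avoids missing something like PHOTO.JPG:
// Hypothetical refinement: lowercase the extension before checking it.
const fileExtension = (urlObj.pathname.split(".").pop() || "").toLowerCase();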
That's a much more reliable approach than scraping. But there was still more to be done to collect every image that could possibly be loaded.
Accounting for Scrolling
First up, I wanted to make sure I collected every image loaded throughout the full length of the page. Since any of them might be lazily loaded (natively or otherwise), I needed to trigger a full page scroll to catch them all. So, I used this little function that scrolls the page by 100px every 100ms:
async function autoScroll(page: Page) {
await page.evaluate(async () => {
return await new Promise<void>((resolve) => {
let totalHeight = 0;
const distance = 100;
const timer = setInterval(() => {
window.scrollBy(0, distance);
totalHeight += distance;
if (
totalHeight >=
document.body.scrollHeight - window.innerHeight
) {
clearInterval(timer);
resolve();
}
}, 100);
});
});
}
I then used that to keep the page open until the full page had been scrolled:
const browser = await puppeteer.launch(launchArgs);
const page = await browser.newPage();
// Other page setup stuff...
await page.goto(url, { waitUntil: "networkidle2" });
// page.on("request") handler here.
await autoScroll(page);
await browser.close();
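One optional addition I'd consider at this point (not strictly required): requests for lazy-loaded images triggered near the bottom of the scroll can still be in flight when the scroll finishes, so giving the network a moment to settle before closing the browser helps catch stragglers. Something like this, assuming a reasonably recent Puppeteer version:
await autoScroll(page);
// Give any lazy-loaded requests kicked off by the scroll a chance to land.
// Swallow the timeout if the page never fully quiets down.
await page
.waitForNetworkIdle({ idleTime: 500, timeout: 10000 })
.catch(() => {});
await browser.close();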
That accounted for lazily loaded images, but before this was considered "ready," I needed to tidy up a couple more things.
Viewport & User Agent
For this to resemble a real-life device as much as reasonably possible, it made sense to go with a popular mobile phone size for the viewport. I chose the following:
const browser = await puppeteer.launch(launchArgs);
const page = await browser.newPage();
// Other page setup stuff...
await page.setViewport({ width: 430, height: 932 });
And finally, I used my own user agent for the page as well:
await page.setUserAgent(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
);
With that, I was set up for success.
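For reference, here's roughly how all of those pieces fit together in one place. It's a condensed sketch of the flow described above (not the exact production code), reusing launchArgs, DOMAIN_BLACKLIST, imageExtensions, and autoScroll() from earlier:
import puppeteer from "puppeteer";

async function collectImageUrls(url) {
const browser = await puppeteer.launch(launchArgs);
const page = await browser.newPage();

await page.setViewport({ width: 430, height: 932 });
await page.setUserAgent(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
);
await page.setRequestInterception(true);

const imageUrls = [];

// Register the handler before navigating so no request slips by unhandled.
page.on("request", (req) => {
if (req.isInterceptResolutionHandled()) return;

const urlObj = new URL(req.url());

// Skip domains that don't affect rendering.
if (DOMAIN_BLACKLIST.includes(urlObj.hostname)) {
return req.abort();
}

const fileExtension = urlObj.pathname.split(".").pop();

// Collect real image URLs (no data: sources).
if (
req.resourceType() === "image" &&
imageExtensions.includes(fileExtension) &&
!req.url().includes("data:")
) {
imageUrls.push(req.url());
}

return req.continue();
});

await page.goto(url, { waitUntil: "networkidle2" });
await autoScroll(page);
await browser.close();

return imageUrls;
}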
That'll Do
I wrapped this tool up with a strong feeling of appreciation for headless browsing tools like Puppeteer and Playwright. There's a lot of complexity tucked behind an API that makes it easy to programmatically drive a browser the way a human would. Cheers to the smart people building that stuff.
Try out the tool for yourself, by the way! At the very least, it'll help catch other quirks I've overlooked until now.