As I have previously mentioned, I am rather fond of puppeteer. It's a useful library for all kinds of web automation... but like any open source project, it needs some TLC.
I am not in any way associated with the developers of puppeteer, but if you are looking for a way to contribute, it is open source.
The frustration
I was looking at a somewhat long page (think vertically) and tried to create a screenshot of it. The optimist in me assumed it would simply work, so I went on as usual and planned my approach on the assumption that it would function as intended.
I checked the screenshot and found that it was a tiled image of a fixed-size crop from the top of the page. My first reaction was frustration... but I think it was directed more at myself for not having allowed any margin for error in the experiment.
The insight
There is no reason to point fingers when something is not working, especially in OSS. If you have the chops, fix it for yourself and share it; if it is good enough, it might get adopted upstream. In other words, perfect is the enemy of good.
The bug
Before focusing on hacking my way out of the jam, I scoured the web, since problems are usually not as unique as one might think. I am ashamed to admit it, but I'm not fond of documentation; digging into the different related projects' docs is the last step in my debugging journey.
I found that this was related to an old, still-open bug in the puppeteer repo.
The discussion has continued until quite recently... but the issue remains open.
The consensus I could gather was to either switch to playwright or apply a workaround in the puppeteer layer. The root cause of the bug is a websocket message size limit in the CDP (Chrome DevTools Protocol) for Chromium.
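To make the failure mode concrete, here is a minimal repro sketch; the URL is a hypothetical stand-in for any sufficiently tall page.

// minimal repro sketch: a fullPage screenshot of a very tall page
// can hit the CDP message size limit and come back as a tiled,
// fixed-size crop instead of the whole page
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // hypothetical URL, stands in for any very long document
  await page.goto('https://example.com/very-long-page', {waitUntil: 'networkidle0'});
  await page.screenshot({path: 'full.png', fullPage: true});
  await browser.close();
})();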
I had intended to use playwright, but in some of my tests it failed to load certain pages, so I decided to revisit the puppeteer idea and solve the issue where I could.
Hacking my way through it
I started with a height-based chunking method. A more generic approach was to create a chunker that returns a function, so that the chunk height is configurable via a parameter.
// return a chunker function with the height for each chunk;
// `number` will be the full height of the element you want to
// grab a screenshot of
const chunkBy = (n) => (number) => {
  const chunks = new Array(Math.floor(number / n))
    .fill(n)
    .map((c, i) => ({height: c, start: i * c}));
  // cover whatever is left over (this also handles elements shorter than n,
  // which would otherwise leave the chunk list empty)
  const covered = chunks.length * n;
  const remainder = number - covered;
  if (remainder > 0) {
    chunks.push({height: remainder, start: covered});
  }
  console.log('CHUNKS = ', chunks);
  return chunks;
};
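The screenshot code below uses a chunkBy4k helper, which is simply the chunker specialized to 4000px slices. A quick sketch of what it produces for a hypothetical 9000px-tall element:

// specialize the chunker to 4000px-high slices
const chunkBy4k = chunkBy(4000);

// e.g. chunkBy4k(9000) returns:
// [ { height: 4000, start: 0 },
//   { height: 4000, start: 4000 },
//   { height: 1000, start: 8000 } ]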
Afterwards I wrote the method for grabbing the screenshot, so that it works regardless of the element's height and thereby works around the CDP limitation.
const puppeteer = require('puppeteer');
const crypto = require('crypto');
const path = require('path');
const {readFile, readdir, unlink} = require('fs/promises');
// mergeImg is assumed to come from the merge-img npm package
const mergeImg = require('merge-img');

// urls is a list of string urls
async function grabSelectorScreenshot(urls) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3');
  const results = [];
  for (const url of urls) {
    const hashed = crypto.createHash('sha256').update(url).digest('hex');
    await page.goto(url, {waitUntil: 'networkidle0'});
    // this is where the element is selected
    const element = await page.$("div#document1 div.eli-container");
    // get height and width for the later iteration
    const {width, height} = await element.boundingBox();
    const designatedPathPng = `./screenshots/${hashed}-merged-ss.png`;
    // chunk by 4000px of height
    const heights = chunkBy4k(height);
    // keep track of starting point and height
    // to have a continuous mapping of the image
    const chunks = heights.map((h, i) =>
      element.screenshot({
        clip: {
          x: 0,
          y: h.start,
          height: h.height,
          width,
        },
        path: `./screenshots/${hashed}-${i}-ss.png`,
      })
    );
    // wait for all the part files to be written
    const filesResolved = await Promise.all(chunks);
    // merge all the parts in a vertical layout
    const mergedImage = await mergeImg(filesResolved, {direction: true});
    // this is interesting: mergeImg resolves to an image object,
    // but writing it to disk only worked via a function callback,
    // so wrap the write in a promise to keep the flow sequential
    await new Promise((resolve, reject) => {
      mergedImage.write(designatedPathPng, (err) => (err ? reject(err) : resolve()));
    });
    const dataPng = await readFile(designatedPathPng);
    const b64imgPng = Buffer.from(dataPng).toString('base64');
    // clean up the temporary part files
    await deleteFilesMatchingPattern('./screenshots', new RegExp(`^${hashed}-\\d+-ss\\.png$`));
    results.push(b64imgPng);
  }
  // close the browser only after all urls are processed
  await browser.close();
  return results;
}
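A usage sketch, assuming a hypothetical list of URLs and an existing ./screenshots directory:

// usage sketch: the URL is hypothetical
const urls = ['https://example.com/some-very-long-document'];

grabSelectorScreenshot(urls)
  .then((images) => console.log(`captured ${images.length} base64 PNGs`))
  .catch(console.error);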
Cleaning up temporary files
You probably want to clean up the files. One way to do that:
async function deleteFilesMatchingPattern(dirPath, regex) {
  try {
    const files = await readdir(dirPath); // read all entries in the directory
    for (const file of files) {
      if (regex.test(file)) { // check if the file name matches the pattern
        const filePath = path.join(dirPath, file);
        await unlink(filePath); // delete the file
        console.log(`Deleted: ${filePath}`);
      }
    }
  } catch (error) {
    console.error('Error:', error);
  }
}
In hindsight, a better way to do this would probably be to use actual tmp files and decouple the cleanup, but this was good enough for a barebones script.
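For what it's worth, here is a minimal sketch of that tmp-file idea, assuming the part files are written into a throwaway directory that gets removed in one call (the helper name is mine, not from the script above):

const os = require('os');
const path = require('path');
const {mkdtemp, rm} = require('fs/promises');

// run fn with a fresh temporary directory, then remove it,
// decoupling cleanup from the screenshot logic
async function withTmpDir(fn) {
  const dir = await mkdtemp(path.join(os.tmpdir(), 'screenshots-'));
  try {
    return await fn(dir); // write the chunk files under dir
  } finally {
    await rm(dir, {recursive: true, force: true});
  }
}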
Conclusion
- OSS needs some TLC
- problems are rarely unique
- it's better to hack at it and unblock yourself; switching libraries is more of a PITA, as there are no guarantees