
Jonathan Geiger

Originally published at capturekit.dev

How to Extract All Links from a Website Using Puppeteer

Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.

Method 1: Using Puppeteer

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Install it with npm install puppeteer, which also downloads a compatible browser build. Here's how you can use it to extract all URLs from a website:

const puppeteer = require('puppeteer');

async function extractLinks(url) {
    // Launch the browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    try {
        // Navigate to the URL
        await page.goto(url, { waitUntil: 'networkidle0' });

        // Extract all links
        const links = await page.evaluate(() => {
            const anchors = document.querySelectorAll('a');
            return Array.from(anchors).map((anchor) => anchor.href);
        });

        // Remove duplicates
        const uniqueLinks = [...new Set(links)];

        return uniqueLinks;
    } catch (error) {
        console.error('Error:', error);
        throw error;
    } finally {
        await browser.close();
    }
}

// Usage example
async function main() {
    const url = 'https://example.com';
    const links = await extractLinks(url);
    console.log('Found links:', links);
}

main().catch(console.error);

This code will:

  1. Launch a headless browser using Puppeteer
  2. Navigate to the specified URL
  3. Extract all <a> tags from the page
  4. Read each anchor's href property (the browser resolves it to an absolute URL)
  5. Remove any duplicate links
  6. Return the unique list of URLs

Handling Dynamic Content

If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:

// Wait for specific elements to load
await page.waitForSelector('a');

// Or wait for network to be idle
await page.waitForNetworkIdle();
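
If links only appear as you scroll (infinite feeds, lazy-loaded lists), you can also scroll the page a few times before extracting. The snippet below is a rough sketch rather than part of the original script; the scroll count and one-second pause are arbitrary values you would tune per site:

async function scrollToLoadLinks(page, maxScrolls = 5) {
    for (let i = 0; i < maxScrolls; i++) {
        const previousHeight = await page.evaluate(() => document.body.scrollHeight);

        // Scroll to the bottom to trigger any lazy-loading logic
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

        // Give the page a moment to fetch and render new content
        await new Promise((resolve) => setTimeout(resolve, 1000));

        const newHeight = await page.evaluate(() => document.body.scrollHeight);
        if (newHeight === previousHeight) break; // nothing new loaded, stop early
    }
}

Call it right before the extraction step, e.g. await scrollToLoadLinks(page) after page.goto().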

Filtering Links

You can also filter links based on specific criteria:

const links = await page.evaluate(() => {
    const anchors = document.querySelectorAll('a');
    return Array.from(anchors)
        .map((anchor) => anchor.href)
        .filter((href) => {
            // Filter out external links
            return href.startsWith('https://example.com');
            // Or filter by specific patterns
            // return href.includes('/blog/');
        });
});
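
If a simple prefix check isn't enough, the URL API gives you a sturdier way to keep only same-origin http(s) links and drop things like mailto: or javascript: targets. This is a small sketch that assumes you already have the links array from the extraction step; the baseUrl default is just an example:

function filterSameOrigin(links, baseUrl = 'https://example.com') {
    const origin = new URL(baseUrl).origin;

    return links.filter((href) => {
        try {
            const url = new URL(href);
            // Keep only http(s) links that share the page's origin
            return (url.protocol === 'http:' || url.protocol === 'https:') && url.origin === origin;
        } catch {
            return false; // skip anything that isn't a parseable absolute URL
        }
    });
}

// Usage: const internalLinks = filterSameOrigin(links, 'https://example.com');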

Method 2: Using CaptureKit API (Recommended)

While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.

Here's how to use CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://tailwindcss.com&access_key=YOUR_ACCESS_KEY"

The API response includes categorized links and additional metadata:

{
    "success": true,
    "data": {
        "links": {
            "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
            "external": ["https://tailwindui.com", "https://shopify.com"],
            "social": [
                "https://github.com/tailwindlabs/tailwindcss",
                "https://x.com/tailwindcss"
            ]
        },
        "metadata": {
            "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
            "description": "Tailwind CSS is a utility-first CSS framework.",
            "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
            "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
        }
    }
}
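
If you prefer calling the endpoint from Node.js instead of curl, the same request works with the built-in fetch available in Node 18+. This is a minimal sketch based on the curl call and response shown above; replace YOUR_ACCESS_KEY with your own key:

async function getLinks(targetUrl, accessKey) {
    // Build the same request as the curl example
    const endpoint = new URL('https://api.capturekit.dev/content');
    endpoint.searchParams.set('url', targetUrl);
    endpoint.searchParams.set('access_key', accessKey);

    const response = await fetch(endpoint);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }

    const { data } = await response.json();
    return data.links; // { internal: [...], external: [...], social: [...] }
}

// Usage: const links = await getLinks('https://tailwindcss.com', 'YOUR_ACCESS_KEY');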

Benefits of Using CaptureKit API

  1. Categorized Links: Links are automatically categorized into internal, external, and social links
  2. Additional Metadata: Get website title, description, favicon, and OpenGraph image
  3. Reliability: No need to handle browser automation, network issues, or rate limiting
  4. Speed: Results are returned in seconds, not minutes
  5. Maintenance-Free: No need to update code when websites change their structure

Conclusion

While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.

Choose the method that best fits your needs:

  • Use Puppeteer if you need full control over the scraping process or have specific requirements
  • Use CaptureKit API if you want a quick, reliable solution with additional features
