
Jonathan Geiger

Originally published at capturekit.dev

How to Extract All Links from a Website Using Puppeteer

Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.

Method 1: Using Puppeteer

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Install it with npm install puppeteer, which also downloads a compatible browser build. Here's how you can use it to extract all URLs from a website:

const puppeteer = require('puppeteer');

async function extractLinks(url) {
    // Launch the browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    try {
        // Navigate to the URL
        await page.goto(url, { waitUntil: 'networkidle0' });

        // Extract all links
        const links = await page.evaluate(() => {
            const anchors = document.querySelectorAll('a');
            return Array.from(anchors).map((anchor) => anchor.href);
        });

        // Remove duplicates
        const uniqueLinks = [...new Set(links)];

        return uniqueLinks;
    } catch (error) {
        console.error('Error:', error);
        throw error;
    } finally {
        await browser.close();
    }
}

// Usage example
async function main() {
    const url = 'https://example.com';
    const links = await extractLinks(url);
    console.log('Found links:', links);
}

main().catch(console.error);

This code will:

  1. Launch a headless browser using Puppeteer
  2. Navigate to the specified URL
  3. Extract all <a> tags from the page
  4. Read each anchor's href property (the browser resolves it to an absolute URL)
  5. Remove any duplicate links
  6. Return the unique list of URLs

Handling Dynamic Content

If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:

// Wait for specific elements to load
await page.waitForSelector('a');

// Or wait for network to be idle
await page.waitForNetworkIdle();
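
If links only appear as you scroll (infinite feeds, lazy-loaded lists), you can also scroll the page a few times before extracting. The snippet below is a rough sketch rather than part of the original script; the scroll count and one-second pause are arbitrary values you would tune per site:

async function scrollToLoadLinks(page, maxScrolls = 5) {
    for (let i = 0; i < maxScrolls; i++) {
        const previousHeight = await page.evaluate(() => document.body.scrollHeight);

        // Scroll to the bottom to trigger any lazy-loading logic
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

        // Give the page a moment to fetch and render new content
        await new Promise((resolve) => setTimeout(resolve, 1000));

        const newHeight = await page.evaluate(() => document.body.scrollHeight);
        if (newHeight === previousHeight) break; // nothing new loaded, stop early
    }
}

Call it right before the extraction step, e.g. await scrollToLoadLinks(page) after page.goto().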

Filtering Links

You can also filter links based on specific criteria:

const links = await page.evaluate(() => {
    const anchors = document.querySelectorAll('a');
    return Array.from(anchors)
        .map((anchor) => anchor.href)
        .filter((href) => {
            // Filter out external links
            return href.startsWith('https://example.com');
            // Or filter by specific patterns
            // return href.includes('/blog/');
        });
});
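
If a simple prefix check isn't enough, the URL API gives you a sturdier way to keep only same-origin http(s) links and drop things like mailto: or javascript: targets. This is a small sketch that assumes you already have the links array from the extraction step; the baseUrl default is just an example:

function filterSameOrigin(links, baseUrl = 'https://example.com') {
    const origin = new URL(baseUrl).origin;

    return links.filter((href) => {
        try {
            const url = new URL(href);
            // Keep only http(s) links that share the page's origin
            return (url.protocol === 'http:' || url.protocol === 'https:') && url.origin === origin;
        } catch {
            return false; // skip anything that isn't a parseable absolute URL
        }
    });
}

// Usage: const internalLinks = filterSameOrigin(links, 'https://example.com');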

Method 2: Using CaptureKit API (Recommended)

While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.

Here's how to use CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://tailwindcss.com&access_key=YOUR_ACCESS_KEY"

The API response includes categorized links and additional metadata:

{
    "success": true,
    "data": {
        "links": {
            "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
            "external": ["https://tailwindui.com", "https://shopify.com"],
            "social": [
                "https://github.com/tailwindlabs/tailwindcss",
                "https://x.com/tailwindcss"
            ]
        },
        "metadata": {
            "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
            "description": "Tailwind CSS is a utility-first CSS framework.",
            "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
            "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
        }
    }
}
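
If you prefer calling the endpoint from Node.js instead of curl, the same request works with the built-in fetch available in Node 18+. This is a minimal sketch based on the curl call and response shown above; replace YOUR_ACCESS_KEY with your own key:

async function getLinks(targetUrl, accessKey) {
    // Build the same request as the curl example
    const endpoint = new URL('https://api.capturekit.dev/content');
    endpoint.searchParams.set('url', targetUrl);
    endpoint.searchParams.set('access_key', accessKey);

    const response = await fetch(endpoint);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }

    const { data } = await response.json();
    return data.links; // { internal: [...], external: [...], social: [...] }
}

// Usage: const links = await getLinks('https://tailwindcss.com', 'YOUR_ACCESS_KEY');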

Benefits of Using CaptureKit API

  1. Categorized Links: Links are automatically categorized into internal, external, and social links
  2. Additional Metadata: Get website title, description, favicon, and OpenGraph image
  3. Reliability: No need to handle browser automation, network issues, or rate limiting
  4. Speed: Results are returned in seconds, not minutes
  5. Maintenance-Free: No need to update code when websites change their structure

Conclusion

While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.

Choose the method that best fits your needs:

  • Use Puppeteer if you need full control over the scraping process or have specific requirements
  • Use CaptureKit API if you want a quick, reliable solution with additional features
