Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.
Method 1: Using Puppeteer
Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Here's how you can use it to extract all URLs from a website:
const puppeteer = require('puppeteer');

async function extractLinks(url) {
  // Launch a headless browser
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  try {
    // Navigate to the URL and wait until network activity settles
    await page.goto(url, { waitUntil: 'networkidle0' });

    // Extract the href of every anchor tag on the page
    const links = await page.evaluate(() => {
      const anchors = document.querySelectorAll('a');
      return Array.from(anchors).map((anchor) => anchor.href);
    });

    // Remove duplicates
    const uniqueLinks = [...new Set(links)];
    return uniqueLinks;
  } catch (error) {
    console.error('Error:', error);
    throw error;
  } finally {
    await browser.close();
  }
}

// Usage example
async function main() {
  const url = 'https://example.com';
  const links = await extractLinks(url);
  console.log('Found links:', links);
}

main().catch(console.error);
This code will:

- Launch a headless browser using Puppeteer
- Navigate to the specified URL
- Extract all `<a>` tags from the page
- Get their `href` attributes
- Remove any duplicate links
- Return the unique list of URLs
Handling Dynamic Content
If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:
// Wait for specific elements to load
await page.waitForSelector('a');
// Or wait for network to be idle
await page.waitForNetworkIdle();
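Some pages only render links as you scroll (infinite scroll). In that case you can scroll programmatically before extracting. Here's a minimal sketch; the scroll limit and timeout are illustrative values, not requirements:

// Scroll to the bottom repeatedly so lazily loaded links are rendered
async function autoScroll(page, maxScrolls = 10) {
  for (let i = 0; i < maxScrolls; i++) {
    const previousHeight = await page.evaluate(() => document.body.scrollHeight);
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    try {
      // Wait for the page to grow; give up after 2 seconds
      await page.waitForFunction(
        (height) => document.body.scrollHeight > height,
        { timeout: 2000 },
        previousHeight
      );
    } catch {
      break; // Height stopped growing, so there is no more content to load
    }
  }
}

Call `await autoScroll(page)` after `page.goto()` and before extracting, so the lazily loaded links are in the DOM.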
Filtering Links
You can also filter links based on specific criteria:
const links = await page.evaluate(() => {
  const anchors = document.querySelectorAll('a');
  return Array.from(anchors)
    .map((anchor) => anchor.href)
    .filter((href) => {
      // Filter out external links
      return href.startsWith('https://example.com');
      // Or filter by specific patterns:
      // return href.includes('/blog/');
    });
});
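For a sturdier internal/external split, you can compare hostnames with the URL API instead of string prefixes. Here's a minimal sketch; the protocol check and hostname comparison are illustrative choices, not the only way to define "internal":

const { internal, external } = await page.evaluate(() => {
  const result = { internal: [], external: [] };
  for (const anchor of document.querySelectorAll('a')) {
    // anchor.href is already resolved to an absolute URL by the browser
    const link = new URL(anchor.href, location.href);
    // Ignore non-HTTP schemes like mailto: or javascript:
    if (link.protocol !== 'http:' && link.protocol !== 'https:') continue;
    if (link.hostname === location.hostname) {
      result.internal.push(link.href);
    } else {
      result.external.push(link.href);
    }
  }
  return result;
});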
Method 2: Using CaptureKit API (Recommended)
While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.
Here's how to use CaptureKit API:
curl "https://api.capturekit.dev/content?url=https://tailwindcss.com&access_key=YOUR_ACCESS_KEY"
The API response includes categorized links and additional metadata:
{
  "success": true,
  "data": {
    "links": {
      "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
      "external": ["https://tailwindui.com", "https://shopify.com"],
      "social": [
        "https://github.com/tailwindlabs/tailwindcss",
        "https://x.com/tailwindcss"
      ]
    },
    "metadata": {
      "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
      "description": "Tailwind CSS is a utility-first CSS framework.",
      "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
      "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
    }
  }
}
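If you'd rather call the API from Node.js than from the command line, here's a minimal sketch using the built-in fetch (Node 18+). The endpoint and parameters mirror the curl example above; `getLinks` is just an illustrative helper name, and `YOUR_ACCESS_KEY` is a placeholder:

// Fetch categorized links for a URL via the /content endpoint shown above
async function getLinks(targetUrl, accessKey) {
  const endpoint = new URL('https://api.capturekit.dev/content');
  endpoint.searchParams.set('url', targetUrl);
  endpoint.searchParams.set('access_key', accessKey);

  const response = await fetch(endpoint);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status}`);
  }

  const { data } = await response.json();
  return data.links; // { internal: [...], external: [...], social: [...] }
}

// Usage
getLinks('https://tailwindcss.com', 'YOUR_ACCESS_KEY')
  .then((links) => console.log(links.internal))
  .catch(console.error);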
Benefits of Using CaptureKit API
- Categorized Links: Links are automatically categorized into internal, external, and social links
- Additional Metadata: Get website title, description, favicon, and OpenGraph image
- Reliability: No need to handle browser automation, network issues, or rate limiting
- Speed: Results are returned in seconds, not minutes
- Maintenance-Free: No need to update code when websites change their structure
Conclusion
While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.
Choose the method that best fits your needs:
- Use Puppeteer if you need full control over the scraping process or have specific requirements
- Use CaptureKit API if you want a quick, reliable solution with additional features