Tahsin Abrar

Web Crawler with Puppeteer and React

Overview

This is a web crawler application that lets users fetch the internal links of a given website and retrieve the content of specific pages. The application is split into a backend built with Node.js and Puppeteer, and a React frontend that provides an easy-to-use interface.

Features

  • Crawl websites and gather internal links.
  • Select specific internal pages to crawl.
  • Retrieve and display page content.
  • Simple and clean UI built with React and Tailwind CSS.
  • Graceful server shutdown endpoint.

How Website Crawling Works

  1. Input a URL: You start by entering the URL of the website you'd like to crawl.
  2. Fetch Internal Links: The backend uses Puppeteer to visit the provided URL and extract all the internal links.
  3. Select Pages to Crawl: You can select specific pages from the internal links for further crawling.
  4. Retrieve Page Content: Puppeteer retrieves the content of the selected pages and sends it back to the frontend for display.
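
Steps 2 and 4 map to two backend endpoints, /crawl-site and /start-crawl, which are shown in full in server.js further down. As a rough sketch, you can exercise them directly from Node, assuming the backend is running on http://localhost:8000 and you are on Node 18+ (for the built-in fetch); the target URL below is just a placeholder:

const BASE = "http://localhost:8000";

async function demo() {
  // Step 2: ask the backend for every internal link on the target site
  const crawlRes = await fetch(`${BASE}/crawl-site`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: "https://example.com" }),
  });
  const { pages } = await crawlRes.json();
  console.log("Internal links found:", pages);

  // Step 4: fetch the rendered content of a subset of those pages
  const startRes = await fetch(`${BASE}/start-crawl`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ pages: pages.slice(0, 2) }),
  });
  const { data } = await startRes.json();
  data.forEach((p) =>
    console.log(p.url, p.error || `${p.content.length} characters of HTML`)
  );
}

demo().catch(console.error);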

Installation

Backend Setup

  1. Navigate to the backend directory:

   cd backend

  2. Install Dependencies:

Install all the necessary dependencies for the backend:

   npm install

  3. Run the Backend:

Start the backend server, which listens on port 8000 by default:

   node server.js

Frontend Setup

  1. Navigate to the frontend directory:

   cd frontend

  2. Install Dependencies:

Install the necessary dependencies for the frontend:

   npm install

  3. Run the Frontend:

Start the React development server:

   npm start

The frontend will be accessible at http://localhost:3000.


Usage

  1. Start both the Backend and Frontend:

    • The backend runs on http://localhost:8000.
    • The frontend runs on http://localhost:3000.
  2. Input a URL:

    • Enter a URL in the input field on the frontend UI.
  3. Fetch Pages:

    • Click on Fetch Pages to retrieve all internal links from the URL.
  4. Select Pages to Crawl:

    • A list of internal links will be displayed. Select the pages you want to crawl by checking the checkboxes next to the page URLs.
  5. Start Crawling:

    • Click on Start Crawling to retrieve the content from the selected pages.
    • The crawled data will be displayed on the page.
  6. Reload the Page:

    • You can click Reload Page to reset the application and start fresh with a new URL.

How Crawling Works

  • Backend Logic:

    The backend uses Puppeteer to perform web crawling. It launches a headless browser, visits the given URL, and retrieves all internal links (<a> tags) within the same domain.

  • Frontend Logic:

    The frontend interface allows users to:

    • Input a website URL.
    • Fetch internal links.
    • Select specific pages to retrieve content.
    • Display the crawled data in a formatted layout.

Technologies Used

  • Backend:

    • Node.js: Server-side JavaScript runtime.
    • Express.js: Framework for building the backend.
    • Puppeteer: A headless browser library used for crawling websites.
    • CORS: Middleware to handle cross-origin requests.
  • Frontend:

    • React: Frontend JavaScript library for building user interfaces.
    • Tailwind CSS: Utility-first CSS framework for styling.

Backend: server.js

const express = require("express");
const puppeteer = require("puppeteer");
const app = express();
const cors = require("cors");

app.use(express.json());
app.use(cors());

app.post("/crawl-site", async (req, res) => {
  const { url } = req.body;

  let browser;
  try {
    browser = await puppeteer.launch({
      headless: true,
      args: ["--no-sandbox", "--disable-setuid-sandbox"],
    });

    const page = await browser.newPage();
    await page.setUserAgent(
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    );

    await page.goto(url, { waitUntil: "domcontentloaded", timeout: 60000 });
    await page.waitForSelector("body", { timeout: 10000 });

    const links = await page.evaluate(() => {
      const anchors = Array.from(document.querySelectorAll("a"));
      return anchors
        .map((anchor) => anchor.href)
        .filter((href) => href.includes(window.location.origin));
    });

    await browser.close();
    res.json({ pages: links });
  } catch (error) {
    console.error(`Navigation to ${url} failed:`, error);
    if (browser) await browser.close();
    res
      .status(500)
      .json({ error: `A server error has occurred: ${error.message}` });
  }
});

app.post("/start-crawl", async (req, res) => {
  const { pages } = req.body;
  let browser;
  try {
    browser = await puppeteer.launch({
      headless: true,
      args: ["--no-sandbox", "--disable-setuid-sandbox"],
    });

    const results = [];
    for (const pageUrl of pages) {
      const page = await browser.newPage();
      try {
        await page.goto(pageUrl, { waitUntil: "networkidle2", timeout: 60000 });
        const content = await page.content();
        results.push({ url: pageUrl, content });
      } catch (error) {
        console.error(`Navigation to ${pageUrl} failed:`, error);
        results.push({ url: pageUrl, error: `Failed to load ${pageUrl}` });
      }
      await page.close();
    }

    await browser.close();
    res.json({ data: results });
  } catch (error) {
    console.error("Failed to launch browser:", error);
    if (browser) await browser.close();
    res.status(500).json({ error: "Failed to launch browser" });
  }
});

// Keep a reference to the HTTP server so it can be closed gracefully later
const server = app.listen(8000, () =>
  console.log("Backend server started on port 8000")
);
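The Features list mentions a graceful server shutdown endpoint. It is not included in the listing above, but because the HTTP server handle is stored in the server variable, a minimal sketch could be added to server.js along these lines (the /close-server route name is an assumption, not the project's actual path):

// Hypothetical shutdown route; the path name is an assumption
app.post("/close-server", (req, res) => {
  res.json({ message: "Server is shutting down" });
  // Stop accepting new connections, then exit once in-flight requests finish
  server.close(() => process.exit(0));
});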

The next sections explain the frontend in detail: how the components are structured and how they interact with the backend during crawling.


Frontend

Key Components and Logic

App.js

The root component of the React app. It renders the CrawlPage component, which is responsible for the user interface and the crawling operations.

import React from 'react';
import CrawlPage from './components/CrawlPage';

function App() {
  return (
    <div className="App">
      <CrawlPage />
    </div>
  );
}

export default App;
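The entry file that mounts App is not shown in the post. If the project was bootstrapped with Create React App on React 18 (an assumption based on npm start), a typical src/index.js would look roughly like this; the index.css import is where Tailwind's directives would normally live:

import React from 'react';
import ReactDOM from 'react-dom/client';
import './index.css';
import App from './App';

// Mount the App component into the root element from public/index.html
const root = ReactDOM.createRoot(document.getElementById('root'));
root.render(
  <React.StrictMode>
    <App />
  </React.StrictMode>
);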

CrawlPage.js

This is where the core functionality for interacting with the backend resides. It provides the following features:

  1. URL Input Field: A text input where users enter the website URL they wish to crawl.
  2. Fetch Pages Button: Sends a request to the backend to fetch all internal links from the entered website.
  3. List of Internal Links: Displays the retrieved links with checkboxes to allow users to select specific pages for further crawling.
  4. Start Crawling Button: Sends the selected links back to the backend to retrieve the content of those pages.
  5. Display Crawled Data: Shows the content of each page that was crawled.

Frontend Flow

  • Step 1: The user inputs a URL.
  • Step 2: The URL is sent to the backend, which uses Puppeteer to fetch all internal links.
  • Step 3: The user selects specific pages from the list of links.
  • Step 4: The selected pages are crawled by the backend, and the crawled content is displayed.

Here’s a breakdown of the CrawlPage component:

import React, { useState } from 'react';

const CrawlPage = () => {
  const [url, setUrl] = useState('');
  const [pages, setPages] = useState([]);
  const [selectedPages, setSelectedPages] = useState([]);
  const [crawledData, setCrawledData] = useState([]);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState('');

  // Function to fetch internal links from the backend
  const fetchPages = async () => {
    try {
      setLoading(true);
      const response = await fetch('http://localhost:8000/crawl-site', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ url }),
      });
      const data = await response.json();
      if (response.ok) {
        setPages(data.pages);
        setError('');
      } else {
        setError(data.error || 'Failed to fetch pages');
      }
    } catch (err) {
      setError('Failed to fetch pages');
    } finally {
      setLoading(false);
    }
  };

  // Function to handle the selection of pages
  const handlePageSelection = (pageUrl) => {
    setSelectedPages((prevSelected) =>
      prevSelected.includes(pageUrl)
        ? prevSelected.filter((url) => url !== pageUrl)
        : [...prevSelected, pageUrl]
    );
  };

  // Function to start crawling the selected pages
  const startCrawling = async () => {
    try {
      setLoading(true);
      const response = await fetch('http://localhost:8000/start-crawl', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ pages: selectedPages }),
      });
      const data = await response.json();
      if (response.ok) {
        setCrawledData(data.data);
        setError('');
      } else {
        setError(data.error || 'Failed to crawl pages');
      }
    } catch (err) {
      setError('Failed to crawl pages');
    } finally {
      setLoading(false);
    }
  };

  return (
    <div className="container mx-auto p-6">
      <h1 className="text-3xl font-bold mb-6">Web Crawler</h1>

      <div className="mb-4">
        <input
          type="text"
          value={url}
          onChange={(e) => setUrl(e.target.value)}
          placeholder="Enter website URL"
          className="p-2 border border-gray-300 rounded w-full"
        />
      </div>

      <button
        onClick={fetchPages}
        className="bg-blue-500 text-white p-2 rounded"
        disabled={!url || loading}
      >
        {loading ? 'Fetching...' : 'Fetch Pages'}
      </button>

      {error && <p className="text-red-500 mt-4">{error}</p>}

      {pages.length > 0 && (
        <div className="mt-6">
          <h2 className="text-xl font-semibold">Select Pages to Crawl</h2>
          <ul className="list-disc ml-5 mt-4">
            {pages.map((pageUrl) => (
              <li key={pageUrl}>
                <label>
                  <input
                    type="checkbox"
                    checked={selectedPages.includes(pageUrl)}
                    onChange={() => handlePageSelection(pageUrl)}
                  />
                  <span className="ml-2">{pageUrl}</span>
                </label>
              </li>
            ))}
          </ul>
          <button
            onClick={startCrawling}
            className="bg-green-500 text-white p-2 mt-4 rounded"
            disabled={selectedPages.length === 0 || loading}
          >
            {loading ? 'Crawling...' : 'Start Crawling'}
          </button>
        </div>
      )}

      {crawledData.length > 0 && (
        <div className="mt-8">
          <h2 className="text-xl font-semibold">Crawled Data</h2>
          {crawledData.map((data, index) => (
            <div key={index} className="bg-gray-100 p-4 mt-4 rounded">
              <h3 className="text-lg font-bold mb-2">{data.url}</h3>
              <pre className="whitespace-pre-wrap">
                {data.error || data.content}
              </pre>
            </div>
          ))}
        </div>
      )}
    </div>
  );
};

export default CrawlPage;

Explanation of Key Parts

  1. URL Input Handling:

    • The user enters a URL in the input field, and the fetchPages function is triggered when the "Fetch Pages" button is clicked. This function sends the URL to the backend to retrieve the internal links.
  2. Selecting Pages:

    • The list of internal links is displayed as checkboxes. The user can select which pages to crawl, and their selection is stored in the selectedPages state.
  3. Crawling Pages:

    • When the user clicks "Start Crawling", the selected pages are sent to the backend, which returns the content of each page. The crawled data is then displayed in a readable format.
  4. Error Handling:

    • Errors (such as network issues or invalid URLs) are captured and displayed to the user. This ensures a smooth user experience, even if something goes wrong.
  5. Loading State:

    • A loading indicator is shown when fetching pages or crawling content, so users know that the request is being processed.

How to Run the Frontend

To run the React frontend, follow these steps:

  1. Install Dependencies:

   cd frontend
   npm install

  2. Start the Development Server:

   npm start

The frontend will be available at http://localhost:3000, and it will automatically connect to the backend running at http://localhost:8000.
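
Requests from port 3000 to port 8000 succeed because server.js enables the cors middleware for every origin. If you want to restrict access to the React dev server only, the cors package accepts an origin option; a possible one-line tweak to server.js (not part of the original code):

// Allow requests from the React dev server only, instead of any origin
app.use(cors({ origin: "http://localhost:3000" }));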

