Zied Hosni

🚀 Introducing Page Replica: Web Scraping and Caching Tool

What is Page Replica?

"Page Replica" is a versatile web scraping and caching tool built with Node.js, Express, and Puppeteer. It helps prerender web app pages (React, Angular, Vue, etc.), which can be served via Nginx for SEO or other purposes.

Key Features:

  • Scrape Individual Pages or Entire Sitemaps: Easily scrape and cache individual web pages or entire sitemaps through an API.
  • Remove JavaScript: Optionally remove JavaScript from the scraped pages for better SEO performance.
  • Nginx Configuration: Use our sample Nginx configuration to serve cached pages to search engine bots while regular users still reach your application.

Why Use Page Replica?

  • SEO Optimization: Improve your website's SEO by serving prerendered pages to search engine bots.
  • Caching for Speed: Cache pages to improve load times for your users and reduce server load.
  • Ease of Use: With our new web service, you can start scraping and caching pages without any installation.

Getting Started

Installation (for Self-Hosted Users)

If you prefer to run Page Replica locally, follow these steps:

  1. Clone the Repository:
   git clone https://github.com/html5-ninja/page-replica.git
   cd page-replica
  2. Install Dependencies:
   npm install
  3. Configure Settings: Update index.js with your desired configuration:
   const CONFIG = {
     baseUrl: "https://example.com",
     removeJS: true,
     addBaseURL: true,
     cacheFolder: "path_to_cache_folder",
   };
  4. Start the API:
   npm start
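
To make the removeJS and addBaseURL options more concrete, here is a minimal sketch of the kind of HTML post-processing their names suggest. This is an assumption based on the option names only, not Page Replica's actual implementation, and the helpers stripScripts and addBaseTag are hypothetical.

```javascript
// Hypothetical helpers illustrating what removeJS / addBaseURL could mean.
// They are NOT taken from the Page Replica source.

// Remove <script>...</script> blocks from an HTML string.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// Insert a <base> tag right after the opening <head> tag so that
// relative URLs in the cached copy resolve against baseUrl.
function addBaseTag(html, baseUrl) {
  return html.replace(/<head(\s[^>]*)?>/i, (match) => `${match}<base href="${baseUrl}">`);
}

// A tiny stand-in for the HTML a headless browser would return.
const rendered =
  '<html><head><title>Demo</title></head>' +
  '<body><img src="/logo.png"><script src="/app.js"></script></body></html>';

const processed = addBaseTag(stripScripts(rendered), "https://example.com");
console.log(processed);
```

Regular expressions are enough for a small illustration, but they are fragile on real-world markup; a production tool would more likely edit the DOM inside the headless browser before saving the page.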

Usage

Scraping Individual Pages

To scrape a single page, make a GET request to /page with the url query parameter:

curl "http://localhost:8080/page?url=https://example.com"
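
If you call the endpoint from a script rather than from curl, the target URL should be percent-encoded so that its own query string is not mistaken for parameters of the API call. The sketch below assumes the API listens on http://localhost:8080 as in the example above and that Node.js 18+ is available for the global fetch; buildPageRequest and scrapePage are hypothetical helper names, and the response format is not documented here, so only the HTTP status is returned.

```javascript
// Build the request URL for the /page endpoint, percent-encoding the target URL.
function buildPageRequest(apiBase, targetUrl) {
  const request = new URL("/page", apiBase);
  request.searchParams.set("url", targetUrl);
  return request.toString();
}

// Ask a running instance to scrape and cache one page.
// Requires the API to be started first; returns the HTTP status code.
async function scrapePage(apiBase, targetUrl) {
  const response = await fetch(buildPageRequest(apiBase, targetUrl));
  return response.status;
}

// Only builds and prints the URL; no network call is made here.
const pageRequest = buildPageRequest(
  "http://localhost:8080",
  "https://example.com/products?id=42&lang=en"
);
console.log(pageRequest);
```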

Scraping Sitemaps

To scrape pages from a sitemap, make a GET request to /sitemap with the url query parameter:

curl "http://localhost:8080/sitemap?url=https://example.com/sitemap.xml"
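
Conceptually, scraping a sitemap means reading every <loc> entry and processing each listed page in turn. The following is a rough sketch of that first step under stated assumptions: extractSitemapUrls is a hypothetical helper, the sample XML is made up, and Page Replica's own sitemap handling may differ.

```javascript
// Extract the <loc> entries from sitemap XML.
// A sketch only: it does not decode XML entities or follow sitemap index files.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  for (const match of xml.matchAll(locPattern)) {
    urls.push(match[1]);
  }
  return urls;
}

// Made-up sitemap used purely for illustration.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about </loc></url>
</urlset>`;

const sitemapPages = extractSitemapUrls(sampleSitemap);
console.log(sitemapPages);
```

For large sitemaps, expect the request to take a while, since every listed page has to be rendered in a headless browser.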

Serve Cached Pages with Nginx

Our sample Nginx configuration in nginx_config_sample/example.com.conf helps you efficiently manage traffic:

  • Users: Regular users are routed to the main application server.
  • Bots: Search engine bots are redirected to a dedicated server block for cached HTML delivery.
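
The user/bot split above typically comes down to inspecting the User-Agent header. The sample configuration does this in Nginx; the snippet below only illustrates the same decision in JavaScript. The crawler list is an assumed, non-exhaustive example and chooseUpstream is a hypothetical name, so check the sample file for the patterns it actually uses.

```javascript
// Assumed, non-exhaustive list of crawler User-Agent substrings.
const BOT_PATTERN = /googlebot|bingbot|duckduckbot|yandex|baiduspider/i;

// Decide which backend should answer: the cached HTML or the live application.
function chooseUpstream(userAgent) {
  return BOT_PATTERN.test(userAgent || "") ? "cache" : "app";
}

const botAgent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
const userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36";

console.log(chooseUpstream(botAgent));
console.log(chooseUpstream(userAgent));
```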

Need Assistance?

If you have any questions or need support, we're here to help! Join our GitHub Discussion to get in touch with us.

Folder Structure

  • nginx_config_sample: Sample Nginx configuration for redirecting bot traffic to the cached content server.
  • api.js: Express application handling web scraping requests.
  • index.js: Core web scraping logic using Puppeteer.
  • package.json: Node.js project configuration.

Thank you for choosing Page Replica. We look forward to providing you with the best possible service. Happy scraping! 🕷️
