Running Puppeteer on AWS Lambda can be challenging due to the serverless environment's limitations and Chrome's resource requirements. However, with the right setup and optimizations, it's possible to create a reliable web scraping solution that scales automatically. In this guide, we'll explore how to set up Puppeteer on AWS Lambda and provide a working boilerplate solution.
Why Run Puppeteer on AWS Lambda?
Running Puppeteer on AWS Lambda offers several advantages:
- Serverless Architecture: No need to manage servers or worry about uptime
- Cost-Effective: Pay only for the compute time you use
- Auto-Scaling: Automatically handle varying workloads
- Easy Integration: Works well with other AWS services
However, there are some challenges to consider:
- Lambda's execution time limits (up to 15 minutes)
- Memory constraints (up to 10GB)
- Cold starts affecting performance
- Chrome binary compatibility issues
Setting Up Puppeteer on AWS Lambda
I've created a boilerplate repository that handles these challenges and provides a working solution. Let's go through the setup process:
Prerequisites
- Node.js 18.x (recommended)
- AWS Account with Lambda and S3 access
- AWS CLI configured on your local machine for deployment
Local Development Setup
First, clone the repository and set up your local environment:
```bash
# Install Node.js 18
nvm install 18
nvm use 18

# Install dependencies
npm install

# Create environment file
echo "SECRET=your-secret-key-here" > .env

# Run locally
node index.js
```
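The `SECRET` value in `.env` is what the deployed function later checks on every request. A minimal sketch of how it is typically loaded with `dotenv` (assuming the boilerplate follows the standard dotenv pattern):

```javascript
// Load .env into process.env (standard dotenv usage; assumed to match the boilerplate)
require('dotenv').config();

const SECRET = process.env.SECRET;
if (!SECRET) {
  throw new Error('SECRET is not set - check your .env file or Lambda environment variables');
}
```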
AWS Configuration
- Create an S3 bucket for your Lambda deployment package
- Create a Lambda function with these recommended settings:
  - Runtime: Node.js 18.x
  - Memory: 1024 MB
  - Timeout: 30 seconds
  - Architecture: x86_64
Deployment Options
Manual Deployment
```bash
# Create deployment package
zip -r lambda.zip index.js node_modules

# Upload to S3
aws s3 cp lambda.zip s3://your-bucket-name/lambda.zip
```
Then update your Lambda function through the AWS Console:
- Go to AWS Lambda Console
- Select your function
- Go to Code tab
- Click "Upload from" -> "Amazon S3 location"
- Paste the S3 URL of your uploaded zip file
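If you'd rather script this step than click through the console, the same update can be done from Node.js with the AWS SDK. This is a sketch only: it requires the `@aws-sdk/client-lambda` package (not part of the boilerplate's dependency list), and the region, bucket, and function names are placeholders.

```javascript
// Sketch: point the Lambda function at the zip already uploaded to S3
const { LambdaClient, UpdateFunctionCodeCommand } = require('@aws-sdk/client-lambda');

async function deploy() {
  const client = new LambdaClient({ region: 'us-east-1' });
  const result = await client.send(
    new UpdateFunctionCodeCommand({
      FunctionName: 'your-function-name',
      S3Bucket: 'your-bucket-name',
      S3Key: 'lambda.zip',
    })
  );
  console.log('Deployed version:', result.Version);
}

deploy().catch((err) => {
  console.error('Deployment failed:', err);
  process.exit(1);
});
```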
Automated Deployment with GitHub Actions
The boilerplate includes a GitHub Actions workflow for automated deployment. To set it up:
- Add these secrets to your GitHub repository:
  - `AWS_ACCESS_KEY_ID`
  - `AWS_SECRET_ACCESS_KEY`
- Update the workflow file (`.github/workflows/main.yml`) with your values:
  - Replace `{{your-bucket-name}}` with your S3 bucket name
  - Replace `{{your-function-name}}` with your Lambda function name
- Push to main to trigger deployment
Using the Lambda Function
The function accepts POST requests with this structure:
```json
{
  "url": "https://example.com"
}
```
Required headers:
- `secret`: your-secret-key
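For example, invoking the deployed function from Node.js 18 (which ships a global `fetch`) might look like the sketch below. The endpoint URL is a placeholder for your own API Gateway or Lambda function URL, the `secret` value must match the `SECRET` you configured, and the response is assumed to be JSON.

```javascript
// Hypothetical endpoint - replace with your own API Gateway or Lambda function URL
const ENDPOINT = 'https://your-api-id.execute-api.us-east-1.amazonaws.com/scrape';

async function scrape(url) {
  const response = await fetch(ENDPOINT, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Must match the SECRET configured in .env / the Lambda environment
      secret: 'your-secret-key',
    },
    body: JSON.stringify({ url }),
  });

  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  return response.json();
}

scrape('https://example.com').then(console.log).catch(console.error);
```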
Key Features of the Boilerplate
- Stealth Mode: Uses `puppeteer-extra-plugin-stealth` to avoid detection
- AWS Compatibility: Uses `@sparticuz/chromium` for Lambda compatibility
- Security: Secret key authentication
- Automated Deployment: GitHub Actions workflow included
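To make these pieces concrete, here is a minimal sketch of how a handler typically wires them together on Lambda. It is illustrative rather than the boilerplate's exact code: the header and body parsing assume an API Gateway-style event, and returning the page title is just a stand-in for whatever the real function returns.

```javascript
// Minimal handler sketch - illustrative, not the boilerplate's exact code
const chromium = require('@sparticuz/chromium');
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

exports.handler = async (event) => {
  // Reject requests that don't carry the shared secret
  const headers = event.headers || {};
  if (headers.secret !== process.env.SECRET) {
    return { statusCode: 401, body: JSON.stringify({ error: 'Unauthorized' }) };
  }

  const { url } = JSON.parse(event.body || '{}');
  if (!url) {
    return { statusCode: 400, body: JSON.stringify({ error: 'Missing "url"' }) };
  }

  let browser;
  try {
    // Launch the Lambda-compatible Chromium build shipped by @sparticuz/chromium
    browser = await puppeteer.launch({
      args: chromium.args,
      executablePath: await chromium.executablePath(),
      headless: chromium.headless,
    });

    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 25000 });

    // Return something simple; the real boilerplate may return HTML, a screenshot, etc.
    const title = await page.title();
    return { statusCode: 200, body: JSON.stringify({ title }) };
  } finally {
    // Always close the browser so the Lambda container doesn't leak Chrome processes
    if (browser) await browser.close();
  }
};
```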
Dependencies
The boilerplate uses these key dependencies:
- `@sparticuz/chromium`: ^123.0.1
- `puppeteer-extra`: ^3.3.4
- `puppeteer-core`: 19.6
- `puppeteer-extra-plugin-stealth`: ^2.11.1
- `puppeteer`: ^21.5.0
- `dotenv`: ^16.4.5
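Both `puppeteer` and `puppeteer-core` appear because full Puppeteer (with its bundled Chrome) is convenient for local runs, while `puppeteer-core` plus `@sparticuz/chromium` is what works inside Lambda. A common way to switch between the two is to branch on the Lambda runtime's environment variables; whether the boilerplate does exactly this is an assumption.

```javascript
// Sketch of a local-vs-Lambda launch switch (an assumed pattern, not verified boilerplate code)
const chromium = require('@sparticuz/chromium');

async function launchBrowser() {
  // AWS sets AWS_LAMBDA_FUNCTION_NAME inside the Lambda runtime
  if (process.env.AWS_LAMBDA_FUNCTION_NAME) {
    // On Lambda: puppeteer-core driving the stripped-down Chromium build
    const puppeteer = require('puppeteer-core');
    return puppeteer.launch({
      args: chromium.args,
      executablePath: await chromium.executablePath(),
      headless: chromium.headless,
    });
  }

  // Locally: the full puppeteer package, which bundles its own Chrome
  const puppeteer = require('puppeteer');
  return puppeteer.launch({ headless: true });
}

module.exports = { launchBrowser };
```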
Alternative Solution: CaptureKit
While running Puppeteer on AWS Lambda is powerful, it requires significant maintenance and handling of edge cases. If you're looking for a managed solution that handles all the infrastructure and maintenance, consider using CaptureKit. It provides three powerful APIs in one platform:
Screenshot API
- Reliable screenshot capture with no infrastructure management
- Full-page screenshots with lazy loading support
- Built-in ad and cookie banner blocking
- Multiple output formats (PNG, WebP, JPEG, PDF)
- Direct S3 upload integration
Content Extraction API
- Clean, structured HTML extraction
- Metadata parsing (title, description, OpenGraph & Schema data)
- Link scraping (internal and external)
- Consistent data without maintenance headaches
- Perfect for data pipelines and web scraping
AI Analysis API
- Instant webpage summarization
- Key insights extraction
- AI-powered content analysis
- Scale your web research process
- Focus on creating, not extracting content
All CaptureKit APIs are:
- Developer-first with instant access
- Free to try, with no credit card required
- Backed by lightning-fast support
- Built for production use cases
Best Practices and Tips
- Memory Management
  - Monitor Lambda memory usage
  - Adjust memory allocation based on your needs
  - Clean up resources properly
- Performance Optimization
  - Use Lambda layers for dependencies
  - Implement connection pooling
  - Cache frequently accessed data
- Error Handling
  - Implement proper error logging
  - Set up CloudWatch alarms
  - Handle timeouts gracefully (see the sketch after this list)
- Security
  - Never commit AWS credentials
  - Use environment variables for secrets
  - Implement proper IAM roles
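For the resource-cleanup and timeout points above, a typical pattern (a sketch, not something prescribed by the boilerplate) is to set an explicit navigation timeout and release the page in a `finally` block, so a failed page load doesn't run into the Lambda timeout or leave resources behind in the container:

```javascript
// Sketch: bounded navigation plus guaranteed cleanup (illustrative pattern, not boilerplate code)
async function renderPage(browser, url) {
  const page = await browser.newPage();
  try {
    // Fail fast instead of running into the Lambda timeout (30 seconds in the settings above)
    page.setDefaultNavigationTimeout(25000);

    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content();
  } finally {
    // Close the page even if navigation threw, so memory is reclaimed
    await page.close();
  }
}
```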
Conclusion
Running Puppeteer on AWS Lambda is a powerful solution for serverless web scraping, but it requires careful setup and maintenance. The provided boilerplate handles many common challenges and provides a solid foundation for your projects.
For those who want to focus on their core business logic without managing infrastructure, CaptureKit offers a comprehensive solution that handles all the complexities of web scraping and content extraction.
Choose the approach that best fits your needs:
- Use the Puppeteer Lambda boilerplate if you need full control and customization
- Use CaptureKit if you want a managed solution with additional features