Running Puppeteer on AWS Lambda can be challenging due to the serverless environment's limitations and Chrome's resource requirements. However, with the right setup and optimizations, it's possible to create a reliable web scraping solution that scales automatically. In this guide, we'll explore how to set up Puppeteer on AWS Lambda and provide a working boilerplate solution.
Why Run Puppeteer on AWS Lambda?
Running Puppeteer on AWS Lambda offers several advantages:
- Serverless Architecture: No need to manage servers or worry about uptime
- Cost-Effective: Pay only for the compute time you use
- Auto-Scaling: Automatically handle varying workloads
- Easy Integration: Works well with other AWS services
However, there are some challenges to consider:
- Lambda's execution time limits (up to 15 minutes)
- Memory constraints (up to 10GB)
- Cold starts affecting performance
- Chrome binary compatibility issues
Setting Up Puppeteer on AWS Lambda
I've created a boilerplate repository that handles these challenges and provides a working solution. Let's go through the setup process:
Prerequisites
- Node.js 18.x (recommended)
- AWS Account with Lambda and S3 access
- AWS CLI configured on your local machine for deployment
Local Development Setup
First, clone the repository and set up your local environment:
```bash
# Install Node.js 18
nvm install 18
nvm use 18

# Install dependencies
npm install

# Create environment file
echo "SECRET=your-secret-key-here" > .env

# Run locally
node index.js
```
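The `SECRET` value in `.env` is what the deployed function later checks on every request. A minimal sketch of how it is typically loaded with `dotenv` (assuming the boilerplate follows the standard dotenv pattern):

```javascript
// Load .env into process.env (standard dotenv usage; assumed to match the boilerplate)
require('dotenv').config();

const SECRET = process.env.SECRET;
if (!SECRET) {
  throw new Error('SECRET is not set - check your .env file or Lambda environment variables');
}
```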
AWS Configuration
- Create an S3 bucket for your Lambda deployment package
- Create a Lambda function with these recommended settings:
  - Runtime: Node.js 18.x
  - Memory: 1024 MB
  - Timeout: 30 seconds
  - Architecture: x86_64
Deployment Options
Manual Deployment
```bash
# Create deployment package
zip -r lambda.zip index.js node_modules

# Upload to S3
aws s3 cp lambda.zip s3://your-bucket-name/lambda.zip
```
Then update your Lambda function through the AWS Console:
- Go to AWS Lambda Console
- Select your function
- Go to Code tab
- Click "Upload from" -> "Amazon S3 location"
- Paste the S3 URL of your uploaded zip file
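If you'd rather script this step than click through the console, the same update can be done from Node.js with the AWS SDK. This is a sketch only: it requires the `@aws-sdk/client-lambda` package (not part of the boilerplate's dependency list), and the region, bucket, and function names are placeholders.

```javascript
// Sketch: point the Lambda function at the zip already uploaded to S3
const { LambdaClient, UpdateFunctionCodeCommand } = require('@aws-sdk/client-lambda');

async function deploy() {
  const client = new LambdaClient({ region: 'us-east-1' });
  const result = await client.send(
    new UpdateFunctionCodeCommand({
      FunctionName: 'your-function-name',
      S3Bucket: 'your-bucket-name',
      S3Key: 'lambda.zip',
    })
  );
  console.log('Deployed version:', result.Version);
}

deploy().catch((err) => {
  console.error('Deployment failed:', err);
  process.exit(1);
});
```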
Automated Deployment with GitHub Actions
The boilerplate includes a GitHub Actions workflow for automated deployment. To set it up:
- Add these secrets to your GitHub repository:
  - `AWS_ACCESS_KEY_ID`
  - `AWS_SECRET_ACCESS_KEY`
- Update the workflow file (`.github/workflows/main.yml`) with your values:
  - Replace `{{your-bucket-name}}` with your S3 bucket name
  - Replace `{{your-function-name}}` with your Lambda function name
- Push to main to trigger deployment
Using the Lambda Function
The function accepts POST requests with this structure:
```json
{
  "url": "https://example.com"
}
```
Required headers:
- `secret`: your-secret-key
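For example, invoking the deployed function from Node.js 18 (which ships a global `fetch`) might look like the sketch below. The endpoint URL is a placeholder for your own API Gateway or Lambda function URL, the `secret` value must match the `SECRET` you configured, and the response is assumed to be JSON.

```javascript
// Hypothetical endpoint - replace with your own API Gateway or Lambda function URL
const ENDPOINT = 'https://your-api-id.execute-api.us-east-1.amazonaws.com/scrape';

async function scrape(url) {
  const response = await fetch(ENDPOINT, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Must match the SECRET configured in .env / the Lambda environment
      secret: 'your-secret-key',
    },
    body: JSON.stringify({ url }),
  });

  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  return response.json();
}

scrape('https://example.com').then(console.log).catch(console.error);
```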
Key Features of the Boilerplate
- Stealth Mode: Uses `puppeteer-extra-plugin-stealth` to avoid detection
- AWS Compatibility: Uses `@sparticuz/chromium` for Lambda compatibility
- Security: Secret key authentication
- Automated Deployment: GitHub Actions workflow included
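To make these pieces concrete, here is a minimal sketch of how a handler typically wires them together on Lambda. It is illustrative rather than the boilerplate's exact code: the header and body parsing assume an API Gateway-style event, and returning the page title is just a stand-in for whatever the real function returns.

```javascript
// Minimal handler sketch - illustrative, not the boilerplate's exact code
const chromium = require('@sparticuz/chromium');
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

exports.handler = async (event) => {
  // Reject requests that don't carry the shared secret
  const headers = event.headers || {};
  if (headers.secret !== process.env.SECRET) {
    return { statusCode: 401, body: JSON.stringify({ error: 'Unauthorized' }) };
  }

  const { url } = JSON.parse(event.body || '{}');
  if (!url) {
    return { statusCode: 400, body: JSON.stringify({ error: 'Missing "url"' }) };
  }

  let browser;
  try {
    // Launch the Lambda-compatible Chromium build shipped by @sparticuz/chromium
    browser = await puppeteer.launch({
      args: chromium.args,
      executablePath: await chromium.executablePath(),
      headless: chromium.headless,
    });

    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 25000 });

    // Return something simple; the real boilerplate may return HTML, a screenshot, etc.
    const title = await page.title();
    return { statusCode: 200, body: JSON.stringify({ title }) };
  } finally {
    // Always close the browser so the Lambda container doesn't leak Chrome processes
    if (browser) await browser.close();
  }
};
```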
Dependencies
The boilerplate uses these key dependencies:
- `@sparticuz/chromium`: ^123.0.1
- `puppeteer-extra`: ^3.3.4
- `puppeteer-core`: 19.6
- `puppeteer-extra-plugin-stealth`: ^2.11.1
- `puppeteer`: ^21.5.0
- `dotenv`: ^16.4.5
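Both `puppeteer` and `puppeteer-core` appear because full Puppeteer (with its bundled Chrome) is convenient for local runs, while `puppeteer-core` plus `@sparticuz/chromium` is what works inside Lambda. A common way to switch between the two is to branch on the Lambda runtime's environment variables; whether the boilerplate does exactly this is an assumption.

```javascript
// Sketch of a local-vs-Lambda launch switch (an assumed pattern, not verified boilerplate code)
const chromium = require('@sparticuz/chromium');

async function launchBrowser() {
  // AWS sets AWS_LAMBDA_FUNCTION_NAME inside the Lambda runtime
  if (process.env.AWS_LAMBDA_FUNCTION_NAME) {
    // On Lambda: puppeteer-core driving the stripped-down Chromium build
    const puppeteer = require('puppeteer-core');
    return puppeteer.launch({
      args: chromium.args,
      executablePath: await chromium.executablePath(),
      headless: chromium.headless,
    });
  }

  // Locally: the full puppeteer package, which bundles its own Chrome
  const puppeteer = require('puppeteer');
  return puppeteer.launch({ headless: true });
}

module.exports = { launchBrowser };
```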
Alternative Solution: CaptureKit
While running Puppeteer on AWS Lambda is powerful, it requires significant maintenance and handling of edge cases. If you're looking for a managed solution that handles all the infrastructure and maintenance, consider using CaptureKit. It provides three powerful APIs in one platform:
Screenshot API
- Reliable screenshot capture with no infrastructure management
- Full-page screenshots with lazy loading support
- Built-in ad and cookie banner blocking
- Multiple output formats (PNG, WebP, JPEG, PDF)
- Direct S3 upload integration
Content Extraction API
- Clean, structured HTML extraction
- Metadata parsing (title, description, OpenGraph & Schema data)
- Link scraping (internal and external)
- Consistent data without maintenance headaches
- Perfect for data pipelines and web scraping
AI Analysis API
- Instant webpage summarization
- Key insights extraction
- AI-powered content analysis
- Scale your web research process
- Focus on creating, not extracting content
All CaptureKit APIs are:
- Developer-first with instant access
- Free to try, with no credit card required
- Backed by lightning-fast support
- Built for production use cases
Best Practices and Tips
- Memory Management
  - Monitor Lambda memory usage
  - Adjust memory allocation based on your needs
  - Clean up resources properly
- Performance Optimization
  - Use Lambda layers for dependencies
  - Implement connection pooling
  - Cache frequently accessed data
- Error Handling
  - Implement proper error logging
  - Set up CloudWatch alarms
  - Handle timeouts gracefully (see the sketch after this list)
- Security
  - Never commit AWS credentials
  - Use environment variables for secrets
  - Implement proper IAM roles
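For the resource-cleanup and timeout points above, a typical pattern (a sketch, not something prescribed by the boilerplate) is to set an explicit navigation timeout and release the page in a `finally` block, so a failed page load doesn't run into the Lambda timeout or leave resources behind in the container:

```javascript
// Sketch: bounded navigation plus guaranteed cleanup (illustrative pattern, not boilerplate code)
async function renderPage(browser, url) {
  const page = await browser.newPage();
  try {
    // Fail fast instead of running into the Lambda timeout (30 seconds in the settings above)
    page.setDefaultNavigationTimeout(25000);

    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content();
  } finally {
    // Close the page even if navigation threw, so memory is reclaimed
    await page.close();
  }
}
```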
Conclusion
Running Puppeteer on AWS Lambda is a powerful solution for serverless web scraping, but it requires careful setup and maintenance. The provided boilerplate handles many common challenges and provides a solid foundation for your projects.
For those who want to focus on their core business logic without managing infrastructure, CaptureKit offers a comprehensive solution that handles all the complexities of web scraping and content extraction.
Choose the approach that best fits your needs:
- Use the Puppeteer Lambda boilerplate if you need full control and customization
- Use CaptureKit if you want a managed solution with additional features