Do you remember the Firefox Reader View? It's a feature that removes all unnecessary components like buttons, menus, images, and so on, from a website, focusing on the readable content of the page. The library powering this feature is called Readability.js, which is open source.
Motivation
For one of my personal projects, I needed an API that returns the readable content from a given URL. Initially, that seemed like a straightforward task: just fetch the HTML and feed it into the library. However, it turned out to be a bit more complicated due to the complexity of modern web pages filled with lots of JavaScript.
First of all, in order to retrieve the actual content of a page, a browser is needed to execute all scripts and render the page. And since we're talking Serverless, it has to run on Lambda, of course. Sounds fun?
Stack
I'm usually a Serverless Framework guy, but for this project, I wanted to try something new. So I decided to give the AWS CDK a try and I really liked the experience – more on that at the end. Let's walk through the interesting bits and pieces.
Lambda Layer
The most crucial question was, of course, how to run Chrome on Lambda. Fortunately, much of the groundwork for running Chrome on Lambda had been laid by others. I used the @sparticuz/chromium package to run Chromium in headless mode. However, Chromium is a rather big dependency, so to speed up deployments, I created a Lambda Layer.
const chromeLayer = new LayerVersion(this, "chrome-layer", {
  description: "Chromium v111.0.0",
  compatibleRuntimes: [Runtime.NODEJS_18_X],
  compatibleArchitectures: [Architecture.X86_64],
  code: Code.fromAsset("layers/chromium/chromium-v111.0.0-layer.zip"),
});
The corresponding .zip file was downloaded as an artifact from one of the releases.
Lambda Function
The function runs on Node.js v18 and is compiled via ESBuild from TypeScript. There are a few things to note here. I increased the memory to 1600 MB as recommended, and the timeout to 30 seconds to give Chromium enough space and time to start.
I added a reserved concurrency of 1 to prevent this function from scaling out of control due to too many requests.
const handler = new NodejsFunction(this, "handler", {
  functionName: "lambda-readability",
  entry: "src/handler.ts",
  handler: "handler",
  runtime: Runtime.NODEJS_18_X,
  timeout: cdk.Duration.seconds(30),
  memorySize: 1600,
  reservedConcurrentExecutions: 1,
  environment: {
    NODE_OPTIONS: "--enable-source-maps --stack-trace-limit=1000",
  },
  bundling: {
    externalModules: ["@sparticuz/chromium"],
    nodeModules: ["jsdom"],
  },
  layers: [chromeLayer],
});

const lambdaIntegration = new LambdaIntegration(handler);
When bundling this function, the @sparticuz/chromium package has to be excluded because we provide it as a Lambda Layer. On the other hand, the jsdom package can't be bundled, so it has to be installed as a normal node module.
REST API
The function is invoked by a GET request from a REST API and receives the URL as a query string parameter. The url request parameter is marked as mandatory. Moreover, I made use of the new defaultCorsPreflightOptions to simplify the CORS setup.
const api = new RestApi(this, "lambda-readability-api", {
  apiKeySourceType: ApiKeySourceType.HEADER,
  defaultCorsPreflightOptions: {
    allowOrigins: Cors.ALL_ORIGINS,
    allowMethods: Cors.ALL_METHODS,
    allowHeaders: Cors.DEFAULT_HEADERS,
  },
});

api.root.addMethod("GET", lambdaIntegration, {
  requestParameters: { "method.request.querystring.url": true },
  apiKeyRequired: true,
});
Furthermore, I created an API key and assigned it to a usage plan to limit the maximum number of calls per day to 1000.
const key = api.addApiKey("lambda-readability-apikey");

const plan = api.addUsagePlan("lambda-readability-plan", {
  quota: {
    limit: 1_000,
    period: Period.DAY,
  },
  throttle: {
    rateLimit: 10,
    burstLimit: 2,
  },
});

plan.addApiKey(key);
plan.addApiStage({ api, stage: api.deploymentStage });
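Because the API key is read from the request header, a client has to send it as x-api-key along with the url query string parameter. Here is a minimal sketch of how such a request could be put together. The helper name buildReadabilityRequest, the endpoint, and the key value are placeholders I made up for illustration and are not part of the project.

```typescript
// Hypothetical helper (not from the project): builds the URL and headers
// for a GET request against the REST API.
function buildReadabilityRequest(
  endpoint: string,
  apiKey: string,
  articleUrl: string,
): { url: string; headers: Record<string, string> } {
  const requestUrl = new URL(endpoint);
  // searchParams.set percent-encodes the article URL for us
  requestUrl.searchParams.set("url", articleUrl);
  return {
    url: requestUrl.href,
    // API Gateway expects the API key in the x-api-key header
    headers: { "x-api-key": apiKey },
  };
}

// Placeholder endpoint and key (assumptions, replace with your own values)
const request = buildReadabilityRequest(
  "https://example.execute-api.us-east-1.amazonaws.com/prod/",
  "my-api-key",
  "https://example.com/article?id=1",
);
```

The resulting object could then be passed on with fetch(request.url, { headers: request.headers }).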
Implementation
Let's take a look at the full implementation first and then go into the interesting parts step by step:
let browser: Browser | undefined;

export const handler: APIGatewayProxyHandlerV2 = async (event) => {
  let page: Page | undefined;

  try {
    const { url } = parseRequest(event);

    if (!browser) {
      browser = await puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath(),
        headless: chromium.headless,
        ignoreHTTPSErrors: true,
      });
    }

    page = await browser.newPage();
    await page.goto(url);

    const content = await page.content();
    const dom = new JSDOM(content, { url: page.url() });

    const reader = new Readability(dom.window.document);
    const result = reader.parse();

    return formatResponse({ result });
  } catch (cause) {
    const error =
      cause instanceof Error ? cause : new Error("Unknown error", { cause });
    console.error(error);

    return formatResponse({ error });
  } finally {
    await page?.close();
  }
};
First, we declare the browser outside of the handler function to be able to re-use the browser instance on subsequent invocations. Launching a new instance on a cold start accounts for most of the execution time.
We parse the url query string parameter from the API Gateway event and validate that it is a real URL. Then, we use Puppeteer to launch a new browser instance and open a new page. This new page is closed at the end of the function, while the browser instance stays open until the Lambda is terminated.
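The parseRequest helper isn't shown above, so here is a rough sketch of what the parsing and validation could look like. This is an assumption on my part and not necessarily how the project does it: the QueryEvent type is a trimmed-down stand-in for the API Gateway event, and restricting the protocol to http and https is an extra check I added for illustration.

```typescript
// Minimal stand-in for the part of the API Gateway event used here (assumption)
type QueryEvent = {
  queryStringParameters?: Record<string, string | undefined> | null;
};

// Hypothetical sketch: extracts and validates the url query string parameter
function parseRequest(event: QueryEvent): { url: string } {
  const raw = event.queryStringParameters?.url;
  if (!raw) {
    throw new Error("Missing query string parameter: url");
  }

  let parsed: URL;
  try {
    parsed = new URL(raw); // the URL constructor throws on malformed input
  } catch {
    throw new Error(`Invalid URL: ${raw}`);
  }

  // Only accept web pages, not schemes like file: or javascript:
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }

  return { url: parsed.href };
}
```

Any error thrown here ends up in the catch block of the handler and is returned as an error response.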
Readability.js requires a DOM object to parse the readable content from a website. That's why we create a DOM object with JSDOM and provide the HTML from the page and its current URL. By the way, the browser may have had to follow HTTP redirects, so the current URL doesn't necessarily have to be the one we provided initially. The parse function of the library returns the following result:
type Result = {
  title: string;
  content: string;
  textContent: string;
  length: number;
  excerpt: string;
  byline: string;
  dir: string;
  siteName: string;
  lang: string;
};
Some meta information is also available in the result object, but since we are returning raw HTML content, we are only interested in the content property. However, we have to add the Content-Type header with text/html; charset=utf-8 to the response object to ensure the browser renders it correctly.
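The formatResponse helper is also not shown above. A minimal sketch of how it could build the response might look like this. The types are simplified stand-ins, and the status codes and the HTML-escaped error body are my own assumptions rather than the project's actual behavior.

```typescript
// Simplified stand-ins for the Readability result and the Lambda proxy response (assumptions)
type ReadabilityResult = { title: string; content: string } | null;

type ProxyResponse = {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
};

// Escapes the characters that would otherwise be interpreted as HTML markup
const escapeHtml = (text: string): string =>
  text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// Hypothetical sketch: turns either a result or an error into an HTML response
function formatResponse(input: {
  result?: ReadabilityResult;
  error?: Error;
}): ProxyResponse {
  // Tell the browser to render the body as UTF-8 encoded HTML
  const headers = { "Content-Type": "text/html; charset=utf-8" };

  if (input.error) {
    // Status code 500 is an assumption for this sketch
    return { statusCode: 500, headers, body: `<p>${escapeHtml(input.error.message)}</p>` };
  }

  if (!input.result) {
    // Readability's parse() can return null if no article was found
    return { statusCode: 500, headers, body: "<p>No readable content found</p>" };
  }

  // Only the content property is returned, as raw HTML
  return { statusCode: 200, headers, body: input.result.content };
}
```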
Application
Now comes the fun part. I have created a simple web app with React, Tailwind, and Vite to demonstrate this project. Strictly speaking, you could call the REST API directly from a browser as the Lambda function returns real HTML that renders just fine. However, I thought it would be nicer to use it as a real application.
The following articles are curated examples showcasing the Readability version on the left and the Original article on the right. Of course, you can also try your own article and start here: zirkelc.github.io/lambda-readability
So without further ado, let's read some articles:
Cloud Development Kit
I've got to say, my initial dive into AWS CDK has been quite a pleasant surprise. What impresses me most is the ability to code up my infrastructure using good old JavaScript or TypeScript, the very languages I already use to develop my application. No more fumbling with meta languages or constantly referring to documentation just to figure out how to do this or that – CDK simplifies everything.
The beauty of it all is that I can utilize the fundamental building blocks: if-conditions and for-loops, objects and arrays, classes and functions. I can put my coding skills to work in the same way I always do, without the need for any special plugins or hooks. That’s what Infrastructure as Code should really feel like – a truly great developer experience.
Conclusion
It's pretty amazing how far the Serverless world has come, enabling us to effortlessly run a Chrome browser inside a Lambda function. If you are interested in the mechanics of this project, you can view the full source code on GitHub. I'd really appreciate your feedback, and if you like it, give it a star on GitHub!
zirkelc / lambda-readability
Reader View built with Lambda and Readability
Lambda Readability
Lambda Readability is a Serverless Reader View to extract readable content from web pages using AWS Lambda, Chromium, and the Readability.js library.
For more information, read my article on DEV.to: Building a Serverless Reader View with Lambda and Chrome
Features
- Serverless project built with AWS CDK
- Runs a headless Chrome browser on AWS Lambda
- Uses the Readability.js library to extract readable content from a web page
- Simple REST API for requests
- Frontend built with React, Tailwind, and Vite
Application
Visit zirkelc.github.io/lambda-readability and enter a URL for a website. Here are some examples:
Maker's Schedule, Manager's Schedule by Paul Graham.
Understanding AWS Lambda’s invoke throttling limits by Archana Srikanta on the AWS Compute Blog.
Advice for Junior Developers by Jeroen De Dauw on DEV.to
Development
Install dependencies from root:
npm install
Build and deploy backend with CDK:
cd backend
…
I hope you found this post helpful. If you have any questions or comments, feel free to leave them below. If you'd like to connect with me, you can find me on LinkedIn or GitHub. Thanks for reading!
Top comments (5)
Sounds much like a project, that could run completely in the browser with some lines of code only. Why did you use a server for this?
I had the same assumption, but simply calling fetch() on a URL from a script isn't enough for most pages. Most pages nowadays consist of so much JavaScript, especially SPAs, that you must render them in a real browser to get the actual content. Take this link from Quora as an example: qr.ae/pK8Pk7
If you fetch the content, it won't work - the returned HTML contains a JavaScript redirect and must be executed by a browser to get to the actual page.
It might be possible to get this to work with iframes or by using JSDOM in the browser. However, I need this functionality as an API anyway, so I decided to start there and not go down the iframe rabbit hole.
You are right! You can run a page in an iframe, but access is prohibited by CORS limitations.
Great write up as always!
Thank you!