This post was updated 20 Sept 2022 to improve reliability with large numbers of files.
- Update the stream handling so streams are only opened to S3 when the file is ready to be processed by the Zip Archiver. This fixes timeouts that could be seen when processing a large number of files.
- Use keep alive with S3 and limit connected sockets.
It's a common requirement: packaging files on S3 into a Zip file so a user can download multiple files in a single package. Maybe it's common enough for AWS to offer this functionality themselves one day. Until then, you can write a short script to do it.
If you want to provide this service in a serverless environment such as AWS Lambda you have two main constraints that define the approach you can take.
1 - /tmp is only 512 MB. Your first idea might be to download the files from S3, zip them up, and upload the result. This works fine until you fill /tmp with the temporary files!
2 - Memory is constrained to 3 GB. You could hold the temporary files on the heap, but again you are constrained to 3 GB. Even in a regular server environment you don't want a simple zip function taking 3 GB of RAM!
So what can you do? The answer is to stream the data from S3, through an archiver and back onto S3.
Fortunately, this Stack Overflow post and its comments pointed the way, and this post is basically a rehash of it!
The code below is TypeScript, but the JavaScript is just the same with the types removed.
Start with the imports you need:
import * as Archiver from 'archiver';
import * as AWS from 'aws-sdk';
import * as https from 'https';
import * as lazystream from 'lazystream';
import { Readable, Stream } from 'stream';
First, configure the aws-sdk so that it uses keep-alives when communicating with S3, and also limit the maximum number of connected sockets. This improves efficiency and helps avoid hitting an unexpected connection limit. Instead of this section you could set AWS_NODEJS_CONNECTION_REUSE_ENABLED=1 in your Lambda environment.
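The environment-variable alternative needs no code change; a minimal sketch (shown as a shell export for illustration - in practice you would set it on the Lambda function's configuration):

```shell
# Tell the Node.js AWS SDK to reuse TCP connections
# (equivalent to configuring a keep-alive agent in code)
export AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
```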
// Set the S3 config to use keep-alives
const agent = new https.Agent({ keepAlive: true, maxSockets: 16 });
AWS.config.update({ httpOptions: { agent } });
Let's start by creating the streams to fetch the data from S3. To prevent timeouts to S3, the streams are wrapped with 'lazystream'; this delays actually opening the stream until the archiver is ready to read the data.
Let's assume you have a list of keys in keys. For each key we need to create a ReadStream. To track the keys and streams, let's create an S3DownloadStreamDetails type. The filename will ultimately be the filename in the Zip, so you can do any transformation you need for that at this stage.
type S3DownloadStreamDetails = { stream: Readable; filename: string };
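As an illustration of such a filename transformation, here is a small hedged sketch; the uploads/ prefix is an assumption for the example, not something from the original post:

```typescript
// Hypothetical helper: map an S3 key to a friendlier entry name inside the
// zip. Here we strip an assumed "uploads/" prefix and any remaining
// directory path; adapt this to your own key layout.
function keyToZipName(key: string): string {
  const withoutPrefix = key.startsWith('uploads/') ? key.slice('uploads/'.length) : key;
  // Keep only the final path segment as the filename inside the archive
  return withoutPrefix.split('/').pop() ?? withoutPrefix;
}
```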
Now, for our array of keys, we can iterate over it to create the S3DownloadStreamDetails objects:
const s3DownloadStreams: S3DownloadStreamDetails[] = keys.map((key: string) => {
return {
stream: new lazystream.Readable(() => {
console.log(`Creating read stream for ${key}`);
return s3.getObject({ Bucket: 'Bucket Name', Key: key }).createReadStream();
}),
filename: key,
};
});
Now prepare the upload side by creating a Stream.PassThrough object and assigning it as the Body of the params for an S3.PutObjectRequest.
const streamPassThrough = new Stream.PassThrough();
const params: AWS.S3.PutObjectRequest = {
ACL: 'private',
Body: streamPassThrough,
Bucket: 'Bucket Name',
ContentType: 'application/zip',
Key: 'The Key on S3',
StorageClass: 'STANDARD_IA', // Or as appropriate
};
Now we can start the upload process.
const s3Upload = s3.upload(params, (error: Error): void => {
if (error) {
console.error(`Got error creating stream to s3 ${error.name} ${error.message} ${error.stack}`);
throw error;
}
});
If you want to monitor the upload process, for example to give feedback to users, you can attach a handler to httpUploadProgress like this:
s3Upload.on('httpUploadProgress', (progress: { loaded: number; total: number; part: number; key: string }): void => {
console.log(progress); // { loaded: 4915, total: 192915, part: 1, key: 'foo.jpg' }
});
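If you want to surface that progress to users as a percentage, a small helper like this sketch works; the field names match the progress event above, and treating total as possibly undefined is a defensive assumption:

```typescript
// Hypothetical helper: turn an httpUploadProgress event into a whole-number
// percentage for user feedback. Returns undefined until the total is known.
function uploadPercent(loaded: number, total?: number): number | undefined {
  return total ? Math.round((loaded / total) * 100) : undefined;
}
```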
Now create the archiver
const archive = Archiver('zip');
archive.on('error', (error: Archiver.ArchiverError) => { throw new Error(`${error.name} ${error.code} ${error.message} ${error.path} ${error.stack}`); });
Now we can connect the archiver to pipe data to the upload stream and append all the download streams to it
await new Promise((resolve, reject) => {
console.log('Starting upload');
streamPassThrough.on('close', resolve);
streamPassThrough.on('end', resolve);
streamPassThrough.on('error', reject);
archive.pipe(streamPassThrough);
s3DownloadStreams.forEach((streamDetails: S3DownloadStreamDetails) => archive.append(streamDetails.stream, { name: streamDetails.filename }));
archive.finalize();
}).catch((error: { code: string; message: string; data: string }) => { throw new Error(`${error.code} ${error.message} ${error.data}`); });
Finally wait for the uploader to finish
await s3Upload.promise();
and you're done.
I've tested this with 10+ GB archives and it works like a charm. I hope this has helped you out.
Top comments (36)
Thanks for the post!
Here are the typos:
- "Now for our array of keys, we can iterate ofter it" - "ofter" should be "after".
- "Now we can connect the archiver to pipe date to the upload stream" - "date" should be "data".
- With the s3StreamUpload variable, you mean s3Upload?
Hi Samet,
I am getting an error which says "The request signature we calculated does not match the signature you provided. Check your key and signing method." when I execute await s3Upload.promise() in the end. Any help will be highly appreciated.
My code is below
var aws = require("aws-sdk");
const s3 = new aws.S3();
aws.config.update({
accessKeyId: 'my-access-key',
secretAccessKey: 'my-secret'
});
const _archiver = require('archiver');
var stream = require('stream');
const bucketName = 'myBucket';
const zipFileName = 'zipper.zip';
const streamPassThrough = new stream.PassThrough();
var params = {
ACL: 'private',
Body: streamPassThrough,
Bucket: bucketName,
ContentType: 'application/zip',
Key: zipFileName
};
//This returns us a stream.. consider it as a real pipe sending fluid to S3 bucket.. Don't forget it
const s3Upload = s3.upload(params,
(err, resp) => {
if (err) {
console.error(`Got error creating stream to s3 ${ err.name } ${ err.message } ${ err.stack }`);
throw err;
}
console.log(resp);
});
exports.handler = async (_req, _ctx, _cb) => {
var _keys = ['PDF/00CRO030.pdf', 'PDF/MM07200231.pdf'];
};
Did you double check your accessKeyId and secretAccessKey? There might be a space before or after your keys (access key or secret key).
Hi Samet, yes I checked and the credentials are correct. However, I am able to download the byte arrays for the keys in the Keys array using s3.getObject.
Is there something I am missing?
Could you compare your implementation with this?
github.com/rokumatsumoto/aws-node-...
Also please share your implementation with gist link (gist.github.com/)
The difference is the one I posted here, dev.to/prosonf/comment/18d6d: the handlers on close, end and error.
I've just finished implementing with the help of Samet <3, thank you both!
I had to change:
s3Upload.on('close', resolve);
s3Upload.on('end', resolve);
s3Upload.on('error', reject);
to:
s3Upload.on('close', resolve());
s3Upload.on('end', resolve());
s3Upload.on('error', reject());
if not, the promise doesn't resolve and the code after
await s3Upload.promise();
doesn't execute.
s3StreamUpload should be replaced by streamPassThrough.
Many thanks. Post updated!
Great post. I was just trying to do the same thing a couple of days back (ATM I'm keeping things in memory before uploading) and failed. I don't need more than ~500 MB, but I believe streams are more efficient anyway - better safe than sorry.
BTW, when you operate on a lot of files, using keepAlive can help a lot - theburningmonk.com/2019/02/lambda-...
Also, importing only the S3 client from the SDK and bundling the Lambda with webpack makes its cold start much faster - theburningmonk.com/2019/03/just-ho...
Ah, I just noticed that I have the opposite case. I have a zip file that I want to extract, which seems to be a little different, because you have to stream the zip file, but to upload a file to S3 you need a key, and that's a problem ;-)
That case is detailed at medium.com/@johnpaulhayes/how-extr....
Yeah, buffer.
That's what I've got; I wanted streams so I could support big files, not just files that fit into memory.
This guy calls 500 MB huge because that's the max temp size on Lambda (which would be ok, but realistically, saving extracted files to tmp just to upload them to S3 is wasteful and nobody should do that anyway). For me that's not huge at all; I was aiming at a couple of GBs for good measure.
Also, when he writes "This method does not use up disk space and therefore is not limited by size." he is wrong. The limit exists, and it is far less than 3 GB (the max possible memory on Lambda), depending on the file types (binary/text) and their number.

Hello, thanks for sharing the solution. For those who want to understand what is going on under the hood, or if you are facing issues (timeout errors, memory issues, etc.), please take some time to READ THIS excellent issue on GitHub.
There you can find a client/server example reproducing the whole operation, plus different directions for tackling this kind of problem (zip generation/download on S3).
Hi All,
I know people need to work it out on their own, but I'm still hoping this will save some time for others. Here's plain Node.js code in JavaScript. I have tested this for around 1 GB so far for my requirement; it worked like a charm. :-)
// create readstreams for all the output files and store them
lReqFiles.forEach(function(tFileKey){
s3FileReadStreams.push({
"stream": UTIL_S3.S3.getObject({
Bucket: CFG.aws.s3.bucket,
Key: tFileKey
}).createReadStream(),
"filename": tFileKey
});
});
//
const Stream = STREAM.Stream;
const streamPassThrough = new Stream.PassThrough();
// Create a zip archive using streamPassThrough style for the linking request in s3bucket
outputFile = `${CFG.aws.s3.outputDir}/archives/${lReqId}.zip`;
const params = {
ACL: 'private',
Body: streamPassThrough,
Bucket: CFG.aws.s3.bucket,
ContentType: 'application/zip',
Key: outputFile
};
const s3Upload = UTIL_S3.S3.upload(params, (err, resp) => {
if (err) {
console.error(`Got error creating stream to s3 ${err.name} ${err.message} ${err.stack}`);
throw err;
}
console.log(resp);
}).on('httpUploadProgress', (progress) => {
console.log(progress); // { loaded: 4915, total: 192915, part: 1, key: 'foo.jpg' }
});
// create the archiver
const archive = Archiver('zip');
archive.on('error', (error) => {
throw new Error(`${error.name} ${error.code} ${error.message} ${error.path} ${error.stack}`);
});
// connect the archiver to upload streamPassThrough and pipe all the download streams to it
await new Promise((resolve, reject) => {
console.log("Starting upload of the output Files Zip Archive");
//
s3Upload.on('close', resolve());
s3Upload.on('end', resolve());
s3Upload.on('error', reject());
//
archive.pipe(streamPassThrough);
s3FileReadStreams.forEach((s3FileDwnldStream) => {
archive.append(s3FileDwnldStream.stream, { name: s3FileDwnldStream.filename })
});
archive.finalize();
//
}).catch((error) => {
throw new Error(`${error.code} ${error.message} ${error.data}`);
});
//
// Finally wait for the uploader to finish
await s3Upload.promise();
//
Adding this helped for a stable connection, as @pawal-kowalski suggested:
// setup sslAgent KeepAlive to true in AWS-SDK config for stable results
const AWS = require('aws-sdk');
const https = require('https');
const sslAgent = new https.Agent({
keepAlive: true,
rejectUnauthorized: true
});
sslAgent.setMaxListeners(0);
AWS.config.update({
httpOptions: {
agent: sslAgent,
}
});
//
Hi Gordon,
do you have your code somewhere? Neither your steps nor the code from stackoverflow.com/questions/386335... work for me, and I'm afraid it's beyond my knowledge to make it work in a Lambda.
Thanks!
Hi Yasen,
github.com/rokumatsumoto/aws-node-...
I can help, if you have any questions.
This doesn't work if I have more than 2K small files (each file <100 KB) on S3. I tried increasing the Lambda RAM to the maximum, but the upload doesn't start and the Lambda times out after 15 minutes. I'm guessing it's due to so many readable streams being created, or some issue with the archiver package. Can you suggest something for this case?
please check this issue to understand what is going on and how to solve it
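One pragmatic workaround for very large key lists - sketched here as an assumption, not a fix from this thread - is to split the keys into batches and produce one archive per batch, so only a bounded number of lazy streams is registered per archiver run:

```typescript
// Hypothetical batching helper: split `items` into arrays of at most `size`
// elements; each batch would then be zipped and uploaded independently.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch could then go through the same stream/archive/upload pipeline from the post, producing e.g. part-1.zip, part-2.zip, and so on.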
These handlers are wrong:
They have to be on streamPassThrough:
Hello, it worked for me:
s3Upload.on('close', resolve());
s3Upload.on('end', resolve());
s3Upload.on('error', reject());
Hi everyone,
I have some problems with Lambda. First, compilation is blocked by:
s3Upload.on('close', resolve());
s3Upload.on('end', resolve());
s3Upload.on('error', reject());
'"close"' is not assignable to parameter of type '"httpUploadProgress"'
'"end"' is not assignable to parameter of type '"httpUploadProgress"'
'"error"' is not assignable to parameter of type '"httpUploadProgress"'
And when I try to call the lambda CloudWatch throw this error :
{
"errorType": "Runtime.UnhandledPromiseRejection",
"errorMessage": "TypeError: archiver_1.Archiver is not a function",
"reason": {
"errorType": "TypeError",
"errorMessage": "archiver_1.Archiver is not a function",
"stack": [
"TypeError: archiver_1.Archiver is not a function",
" at Function.generatePackage (/var/task/dist/src/service/package.service.js:77:36)",
" at /var/task/dist/src/service/package.service.js:40:34",
" at Array.forEach ()",
" at /var/task/dist/src/service/package.service.js:39:17",
" at processTicksAndRejections (internal/process/task_queues.js:94:5)"
]
}
}
Do you have any advice for resolving these 2 errors?
Thanks a lot.
Import Archiver this way:
import Archiver from "archiver"
With that, you will not get "Archiver is not a function" again.
Hello,
I recently came across your blog post on using lambda to zip S3 files, and I wanted to thank you for sharing such a helpful resource! While testing out the example code, I noticed a few typos, so I took the liberty of fixing them and adapting the code to my needs. I'm happy to report that the lambda function works perfectly now and has saved me a lot of time and effort.
If anyone is interested, I've created a GitHub repository with my updated code that you can check out here: github.com/yufeikang/serverless-zi.... I hope this will be helpful to others who may be looking for a more reliable solution.
Thank you again for your excellent work!
I was really hoping to find a PHP version of this solution! I'm happy it's possible in node at least!