Multipart S3 uploads
Uploading files to S3 is a common task for developers. For most use cases, a simple PUT request to S3 gets the job done.
However, if you want to upload large files or want more flexibility with your uploads, multipart uploads are worth a look.
Note: A normal S3 upload can send at most 5GB in a single request. To upload anything larger, you must use a multipart upload.
(In case you are only interested in the code you can find it here)
What’s a multipart upload?
A multipart upload is similar to any other normal upload request except you upload the file in parts or chunks. So you send multiple upload requests instead of a single request.
What’s the advantage of using multipart uploads?
From the official AWS docs:
- Improved throughput — You can upload parts in parallel to improve throughput.
- Quick recovery from any network issues — Smaller part size minimizes the impact of restarting a failed upload due to a network error.
- Pause and resume object uploads — You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.
- Begin an upload before you know the final object size — You can upload an object as you are creating it.
Whenever you are uploading files larger than 100MB you might want to consider using a multipart upload instead of a normal upload.
Let us see how we can actually implement a multipart upload using JavaScript and the AWS SDK for JavaScript S3 Client.
A multipart upload operation to S3 consists of three distinct steps:
1. Multipart upload initiation
The first step is to send a request to S3 to initiate the multipart upload. S3 returns an upload ID in the response, which uniquely identifies our multipart upload. In later steps, we use this ID to upload the individual parts and finalize the upload.
For this, we are using the createMultipartUpload command from the AWS S3 JS SDK.
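A minimal sketch of the initiation step. It assumes `s3` is an `S3` client from the v3 SDK package `@aws-sdk/client-s3`, and the helper name `initiateMultipartUpload` is made up for illustration:

```javascript
// Sketch only: `s3` is assumed to be an `S3` client from "@aws-sdk/client-s3" (v3),
// created elsewhere, e.g. `new S3({ region: "us-east-1" })`.
// `initiateMultipartUpload` is a hypothetical helper name.
async function initiateMultipartUpload(s3, { bucket, key }) {
  const response = await s3.createMultipartUpload({ Bucket: bucket, Key: key });
  // UploadId identifies this multipart upload in every later request.
  return response.UploadId;
}
```

The returned upload ID has to be kept around, since both the part uploads and the final completion request need it.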
2. Uploading individual parts
Now that we have initiated the multipart upload request and obtained the upload ID we can start uploading individual parts to S3.
When uploading each part we must specify a Part number in addition to the upload ID.
Some points to remember:
- Part numbers need not be consecutive, and you can upload parts in any order.
- A part number can range from 1 to 10000 (S3 allows a maximum of 10000 parts in a multipart upload).
- On successfully uploading a part, S3 responds with an ETag. You must save the ETags and their corresponding part numbers, since these are required later to complete the multipart upload.
Let's see this in action with some code…
First, we need to read the file one chunk (part) at a time. For this, we can create a function like…
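A minimal sketch of such a function, built on Node's callback-style `fs.read`. The 100 MiB value for `CHUNK_SIZE` and the optional `chunkSize` parameter are assumptions made for illustration:

```javascript
const fs = require("fs");

// Assumed part size (100 MiB); any value within S3's part-size limits works.
const CHUNK_SIZE = 100 * 1024 * 1024;

// Reads the next chunk from an open file descriptor.
// Resolves with the number of bytes read and the buffer holding them.
function readNextPart(fileDescriptor, chunkSize = CHUNK_SIZE) {
  return new Promise((resolve, reject) => {
    const buffer = Buffer.alloc(chunkSize);
    // position = null: read from the current file position and advance it.
    fs.read(fileDescriptor, buffer, 0, chunkSize, null, (err, bytesRead) => {
      if (err) return reject(err);
      resolve({ bytesRead, buffer });
    });
  });
}
```

The last chunk of a file is usually shorter than the chunk size, so only the first `bytesRead` bytes of the buffer hold real data.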
Here we are using the read method from the Node.js file system package.
First, we allocate an empty buffer into which the data will be loaded. We then read “CHUNK_SIZE” bytes of data from the file and save them in the buffer.
Notice how we only read a small part of the file instead of loading the entire file in memory. This helps us avoid out-of-memory issues.
When reading a file without an explicit position (passing null), the file position is updated automatically, so the next read operation reads the next “CHUNK_SIZE” bytes. We don’t need to keep track of the position from which to read.
The callback passed to fs.read gives us the number of bytes read, and we resolve the promise with the bytes read and the buffer that contains the actual data.
Now that we have the data chunk available in the buffer, we can upload it to S3. For this, we use a function like so…
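A sketch of what such a function could look like. It assumes `s3` is an `S3` client from `@aws-sdk/client-s3` (v3). The `retries` parameter and its default of 3 are illustrative choices, not values taken from the original code:

```javascript
// Sketch only: `s3` is assumed to be an `S3` client from "@aws-sdk/client-s3" (v3).
// `retries` is a hypothetical parameter: how many extra attempts to make.
async function uploadPart(s3, { data, bucket, key, PartNumber, UploadId }, retries = 3) {
  try {
    // The response carries the ETag that is needed to complete the upload.
    return await s3.uploadPart({
      Body: data,
      Bucket: bucket,
      Key: key,
      PartNumber,
      UploadId,
    });
  } catch (err) {
    if (retries <= 0) throw err; // out of attempts: surface the error
    console.log(`Retrying part ${PartNumber}, attempts left: ${retries}`);
    return uploadPart(s3, { data, bucket, key, PartNumber, UploadId }, retries - 1);
  }
}
```

This retries immediately. A real script might wait between attempts, for example with exponential backoff.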
The above function can be called like this:
const { buffer, bytesRead } = await readNextPart(fileDescriptor);
const data = buffer.subarray(0, bytesRead);
const response = await uploadPart(S3, { data, bucket, key, PartNumber, UploadId });
uploadPartResults.push({ PartNumber, ETag: response.ETag });
The code is pretty much self-explanatory: we upload the data chunk (part) to S3 using the uploadPart method from the SDK.
Also, notice how we save the returned ETag along with the part number in an array for later use.
The function also has a simple retry mechanism in case uploading a part fails for some reason.
3. Finalizing the upload
Once we have uploaded all the parts to S3 we need to finalize the upload so that S3 creates an object by concatenating all the parts in ascending order based on the part number.
The request to complete a multipart upload must include the upload ID and a list of both part numbers and corresponding ETag values.
Let's look at the corresponding code to achieve this…
Assuming uploadPartResults contains an array of objects of the shape { PartNumber, ETag }, we can complete the multipart upload by making the following request.
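A minimal sketch of that request, again assuming `s3` is an `S3` client from `@aws-sdk/client-s3` (v3). The helper name and its argument shape are made up for illustration:

```javascript
// Sketch only: `s3` is assumed to be an `S3` client from "@aws-sdk/client-s3" (v3).
// `completeMultipartUpload` here is a hypothetical wrapper around the SDK call.
async function completeMultipartUpload(s3, { bucket, key, UploadId, uploadPartResults }) {
  // S3 expects the parts listed in ascending PartNumber order.
  const Parts = [...uploadPartResults].sort((a, b) => a.PartNumber - b.PartNumber);
  return s3.completeMultipartUpload({
    Bucket: bucket,
    Key: key,
    UploadId,
    MultipartUpload: { Parts },
  });
}
```

Sorting a copy of the array keeps the request valid even if the parts were uploaded out of order.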
Here we made use of the CompleteMultipartUpload command from the SDK.
Aaaand that's all folks!
Congratulations, you have successfully uploaded your file to S3 via a multipart upload request.
The entire code in action:
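One way the three steps could fit together in a single function. This is a sketch under assumptions, not the original script: `s3` is assumed to be an `S3` client from `@aws-sdk/client-s3` (v3), the function name and the 100 MiB default part size are illustrative, retries are left out for brevity, and the file is assumed to be non-empty.

```javascript
const fsp = require("fs/promises");

// Sketch only: `s3` is assumed to be an `S3` client from "@aws-sdk/client-s3" (v3).
// `multipartUpload` is a hypothetical name; `chunkSize` defaults to an assumed 100 MiB.
async function multipartUpload(s3, { filePath, bucket, key, chunkSize = 100 * 1024 * 1024 }) {
  const file = await fsp.open(filePath, "r");
  let UploadId;
  try {
    const { size: fileSize } = await file.stat();
    const totalParts = Math.ceil(fileSize / chunkSize);

    // Step 1: initiate the upload and remember the upload ID.
    ({ UploadId } = await s3.createMultipartUpload({ Bucket: bucket, Key: key }));
    console.log(`Initiate multipart upload, uploadId: ${UploadId}, totalParts: ${totalParts}, fileSize: ${fileSize}`);

    // Step 2: upload the parts one at a time, saving each ETag.
    const uploadPartResults = [];
    for (let PartNumber = 1; PartNumber <= totalParts; PartNumber++) {
      // position = null: continue from where the previous read stopped.
      const { bytesRead, buffer } = await file.read(Buffer.alloc(chunkSize), 0, chunkSize, null);
      const data = buffer.subarray(0, bytesRead); // the last part is usually shorter
      const response = await s3.uploadPart({ Body: data, Bucket: bucket, Key: key, PartNumber, UploadId });
      uploadPartResults.push({ PartNumber, ETag: response.ETag });
      console.log(`Uploaded part ${PartNumber} of ${totalParts}`);
    }
    console.log(`Finish uploading all parts for multipart uploadId: ${UploadId}`);

    // Step 3: ask S3 to assemble the parts into one object.
    await s3.completeMultipartUpload({
      Bucket: bucket,
      Key: key,
      UploadId,
      MultipartUpload: { Parts: uploadPartResults },
    });
    console.log("Successfully completed multipart upload");
    return uploadPartResults;
  } catch (err) {
    // Abort so the parts uploaded so far do not linger in S3.
    if (UploadId) await s3.abortMultipartUpload({ Bucket: bucket, Key: key, UploadId });
    throw err;
  } finally {
    await file.close();
  }
}
```

It could be called as `await multipartUpload(s3, { filePath: "./big-file.bin", bucket: "my-bucket", key: "big-file.bin" })`, where the path, bucket, and key are placeholders.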
Feel free to edit the code as per your requirement.
A sample output of the above script, uploading a 5.16GB file to S3
Initiate multipart upload, uploadId: ju49_EQBW.Oa8mHGIN1....., totalParts: 53, fileSize: 5547798323
Uploaded part 1 of 53
Uploaded part 2 of 53
Uploaded part 3 of 53
....
Uploaded part 51 of 53
Uploaded part 52 of 53
Uploaded part 53 of 53
Finish uploading all parts for multipart uploadId: ju49_EQBW.Oa8mHGIN1.....
Successfully completed multipart upload
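The part count in the output above follows from a ceiling division of the file size by the part size. The sketch below assumes a part size of 100 MiB. That value is an inference: it is consistent with the 53 parts reported, but the script's actual setting is not shown. The helper name `countParts` is made up for illustration.

```javascript
const MAX_PARTS = 10000;               // S3 limit on parts per multipart upload
const MIN_PART_SIZE = 5 * 1024 * 1024; // 5 MiB minimum (the last part is exempt)

// Returns how many parts a file of `fileSize` bytes needs at `partSize` bytes per part.
function countParts(fileSize, partSize) {
  if (partSize < MIN_PART_SIZE) throw new RangeError("part size below 5 MiB");
  const totalParts = Math.ceil(fileSize / partSize);
  if (totalParts > MAX_PARTS) throw new RangeError("more than 10000 parts; use a larger part size");
  return totalParts;
}

console.log(countParts(5547798323, 100 * 1024 * 1024)); // 53
```

Checking the count before initiating the upload avoids discovering a limit violation halfway through a long transfer.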
Gotchas:
- S3 Limits: When uploading individual parts to S3, the minimum part size is 5MB (except for the last part, which may be smaller) and the maximum is 5GB. A single multipart upload can have at most 10000 parts. By part arithmetic alone that allows 10000 × 5GB = 50000GB, but S3 also enforces a separate maximum object size, so check the current limit in the AWS docs.
- Parallel uploads: Since the parts can be uploaded in any order, we can upload multiple parts in parallel to speed up the upload. The code given above uploads one part at a time, but with a little effort, using something like Promise.all, you can upload multiple parts at once. (While doing so, take care to limit the number of concurrent network requests. For example, you may choose to upload at most 3 parts at once; some form of pooling can help here.)
- Retrying failed uploads: If uploading a part fails for some reason, you only need to re-upload that part. This is one advantage of multipart uploads over normal uploads, where you would have to re-upload the entire file after any failure.
- Multipart uploads from the browser: In the code example given here, we upload the file from Node.js. If you want to perform multipart uploads from the browser, you can do so using pre-signed upload URLs. Basically, the front-end code requests a pre-signed URL from the server for every part it wants to upload; the rest of the process is almost the same. Here is some code to get you started on this.
- Resumable uploads: You can easily pause a multipart upload by not uploading any further parts and can resume it later without any problems.
- Tracking upload progress: You can always track the progress of the upload by checking the number of parts that have been successfully uploaded. This can be used to show upload progress to the user.
- Incomplete Multipart uploads: If you never complete or abort a multipart upload, the parts already uploaded remain in S3 indefinitely and continue to count as stored data. Click here to know more about this.
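For the parallel uploads point above, one way to cap the number of requests in flight is a small worker pool around `Promise.all`. This is a generic sketch: `mapWithLimit` is a hypothetical helper, not part of the AWS SDK, and the limit of 3 in the usage note is only an example value.

```javascript
// Runs `worker(item)` over `items` with at most `limit` calls in flight.
// Results are returned in the same order as `items`.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two runners share an index
      results[index] = await worker(items[index]);
    }
  }
  // Start up to `limit` runners; each pulls the next item when it becomes free.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

A call could look like `await mapWithLimit(partNumbers, 3, uploadOnePart)`, where the hypothetical `uploadOnePart` reads its own chunk and uploads it. With parallel reads, each worker should read from an explicit file position such as `(PartNumber - 1) * CHUNK_SIZE` instead of relying on the shared file position.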
Bonus: Debugging MalformedXML errors with AWS JS SDKs
While working with the AWS SDKs, you might sometimes face cryptic MalformedXML errors that are hard to debug. For instance, while writing this code I made a mistake in one of the property names in the request body and ended up wasting an hour trying to debug the MalformedXML error.
If you face a similar error an easy way to view the output request XML that is being sent to AWS is by adding middleware to the S3 client object.
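A sketch of such a middleware, assuming a v3 client from `@aws-sdk/client-s3`, which exposes a `middlewareStack`. The helper name `logRequestBodies` and the middleware name `logRequestBody` are arbitrary labels chosen for this example:

```javascript
// Sketch only: `client` is assumed to be an S3 client from "@aws-sdk/client-s3" (v3).
// `log` is injectable so the output can be redirected; it defaults to console.log.
function logRequestBodies(client, log = console.log) {
  client.middlewareStack.add(
    (next) => async (args) => {
      // By the "build" step the command input has already been serialized,
      // so args.request.body holds the outgoing body (XML for XML-bodied operations).
      log(args.request && args.request.body);
      return next(args);
    },
    { step: "build", name: "logRequestBody" }
  );
}
```

For part uploads the body is the raw chunk rather than XML, so it is probably best to enable this only while debugging requests such as the completion call.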
Now all outgoing request bodies will be printed, and you can analyze the XML to check for missing or incorrect properties.
References:
- AWS guide about multipart uploads
- S3 JS SDK reference
- Other code examples of multipart uploads.
And that's all I had for today folks, hope you liked the article and learned something, until next time.