Amazon S3 Adds Put-If-Match (Compare-and-Swap) (aws.amazon.com)
My other favorite pattern is implementing a pool of workers by querying EC2 instances with a certain tag in a stopped state and starting them. Starting the instance can succeed only once, which means I managed to snatch the machine. If it fails, I try again, grabbing another one.
This is one of those things that I never advertised out of professional shame, but it works, it's bulletproof and dead simple, and it does not require additional infra to work.
Cost-wise we're only paying for the EBS volumes for the stopped instances which are like 4GB each, so they cost practically nothing, we spend less than a dollar per month for the whole bunch.
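For anyone curious what the stopped-instance pool looks like in code, here is a minimal sketch. It assumes `ec2` behaves like a boto3 EC2 client (`describe_instances` and `start_instances` are real boto3 calls); the tag key `worker-pool` and the function name are made up for illustration. Whether concurrent starts are truly exclusive is the claim made above, not something this sketch proves, so it also checks the reported previous state as an extra guard.

```python
def claim_worker(ec2, pool_tag_value, tag_key="worker-pool"):
    """Try to claim one stopped instance from the pool; return its id or None.

    `ec2` is expected to behave like a boto3 EC2 client. The tag key and
    value are hypothetical names chosen for this sketch.
    """
    found = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [pool_tag_value]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in found.get("Reservations", [])
        for inst in reservation.get("Instances", [])
    ]
    for instance_id in candidates:
        try:
            result = ec2.start_instances(InstanceIds=[instance_id])
        except Exception:
            # Broad catch for brevity: treat any failure as "someone else
            # grabbed it first" and move on to the next candidate.
            continue
        started = result.get("StartingInstances", [])
        # Extra guard (an assumption of this sketch): only count the claim
        # if the instance was reported as stopped before our call.
        if started and started[0].get("PreviousState", {}).get("Name") == "stopped":
            return instance_id
    return None
```

In real use you would pass `boto3.client("ec2")` and probably paginate the describe call; both are left out to keep the sketch short.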
The first one is probably cleaner (though I don't like it: it means that I need the instance to be a Kubernetes node, and that comes with a bunch of baggage).
My biggest wishlist item for S3 is the ability to enforce that an object is named with a name that matches its hash. (With a modern hash considered secure, not MD5 or SHA1, though it isn't supported for those either.) That would make it much easier to build content-addressable storage.
The client sends the request headers (including the x-amz-content-sha256 header) to the signer, and the signer responds with a valid S3 PUT request (minus body). The client takes the signer's response, appends its chosen request payload, and uploads it to S3. With such a system, you can implement a signer in a lambda function, and the lambda function enforces the content-addressed invariant.
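A rough sketch of the check such a signer could run before signing, assuming the convention is that the object key must equal the lowercase hex SHA256 of the payload. The function names are hypothetical, and the actual SigV4 signing step is omitted entirely; this only shows the content-addressed invariant.

```python
import hashlib

HEX_DIGITS = set("0123456789abcdef")


def content_address(payload: bytes) -> str:
    """Client side: the key under which this payload would be stored."""
    return hashlib.sha256(payload).hexdigest()


def signer_allows(key: str, headers: dict) -> bool:
    """Signer side: only agree to sign if the key equals the declared hash.

    Rejects anything that is not a 64-character hex digest, which also
    rules out placeholder values such as UNSIGNED-PAYLOAD.
    """
    declared = headers.get("x-amz-content-sha256", "").lower()
    is_sha256_hex = len(declared) == 64 and set(declared) <= HEX_DIGITS
    return is_sha256_hex and key == declared
```

The enforcement itself still comes from S3 checking the uploaded body against the signed `x-amz-content-sha256` header, as described above; the signer only ties that header to the key.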
Unfortunately it doesn't work natively with multipart: while SigV4+S3 enables you to enforce the SHA256 of each individual part, you can't enforce the SHA256 of the entire object. If you really want, you can invent your own tree hashing format atop SHA256, and enforce content-addressability on that.
I have a blog post [1] that goes into more depth on signers in general.
[1] https://josnyder.com/blog/2024/patterns_in_s3_data_access.ht...
https://aws.amazon.com/blogs/aws/new-additional-checksum-alg...
And even if it was for the whole file, it isn't used for the ETag, so it can't be used for conditional PUTs.
I had a use case where this looked really promising, then I ran into the multipart upload limitations, and ended up using my own custom metadata for the sha256sum.
Individual objects are split into multiple blocks, each of which can be stored independently on different underlying servers. Each can see its own block, but not any other block.
Calculating a hash like SHA256 would require a sequential scan through all blocks. This could be done with a minimum of network traffic if instead of streaming the bytes to a central server to hash, the hash state is forwarded from block server to block server in sequence. Still though, it would be a very slow serial operation that could be fairly chatty too if there are many tiny blocks.
What could work would be to use a Merkle tree hash construction where some of the subdivision boundaries match the block sizes.
I'd like to set IAM permissions for a role, so that that role can add objects to the content-addressable store, but only if their name matches the hash of their content.
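To make the Merkle idea concrete, here is a toy sketch in which each block is hashed independently (so each block server could do it locally) and only the 32-byte digests get combined. The domain-separation prefixes and the carry-up rule for odd nodes are choices made for this sketch, not anything S3 does.

```python
import hashlib


def leaf_hash(block: bytes) -> bytes:
    """Hash one block; needs no knowledge of any other block."""
    return hashlib.sha256(b"\x00" + block).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Combine two child digests into a parent digest."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaf_hashes):
    """Fold a list of leaf digests into a single root digest."""
    if not leaf_hashes:
        raise ValueError("need at least one leaf")
    level = list(leaf_hashes)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(node_hash(level[i], level[i + 1]))
            else:
                # Odd node out: carried up unchanged (a sketch-level choice).
                next_level.append(level[i])
        level = next_level
    return level[0]
```

The leaf hashes can be computed in parallel and in any order; only the final combine step needs them in sequence, and it touches digests rather than data.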
> Or are you saying you want S3 to automatically set the name for you based on the hash?
I'm happy to name the files myself, if I can get S3 to enforce that. But sure, if it were easier, I'd be thrilled to have S3 name the files by hash, and/or support retrieving files by hash.
It's no longer top comment, which is fine.
Genuinely, we've wanted this for ages and we got half way there with strong consistency.
As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this:
- Download the current database copy
- Perform your write locally
- Upload it back using "Put-If-Match" and the pre-edit copy as the matched object.
- If you get success, consider the transaction successful.
- If you get failure, go back to step 1 and try again.
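The steps above can be sketched against an in-memory stand-in for a single object. `FakeObjectStore`, `put_if_match`, and `update` are invented names for illustration, and a version counter plays the role of the ETag; nothing here talks to S3.

```python
class PreconditionFailed(Exception):
    """Raised when the stored version no longer matches the expected one."""


class FakeObjectStore:
    """In-memory stand-in for one object with compare-and-swap writes."""

    def __init__(self, body: bytes):
        self._body = body
        self._version = 0

    def get(self):
        # Returns the body plus an opaque tag identifying this version.
        return self._body, str(self._version)

    def put_if_match(self, body: bytes, expected_tag: str):
        if str(self._version) != expected_tag:
            raise PreconditionFailed()
        self._body = body
        self._version += 1


def update(store, mutate, max_attempts=10):
    """Download, mutate locally, upload conditionally; retry on conflict."""
    for _ in range(max_attempts):
        body, tag = store.get()
        try:
            store.put_if_match(mutate(body), tag)
            return True
        except PreconditionFailed:
            continue  # somebody else won the race; start over
    return False
```

With a contended counter, the loser of a race re-reads and reapplies its change instead of silently overwriting the winner, which is the whole point of the conditional write.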
Without conditional writes, two instances of your application might both read "100", both subtract 1, and both write "99". If they checked the file afterward, both would think everything was fine. But things aren't fine because you've actually sold two.
The other cloud storage providers have had these sorts of conditional write features since basically forever, and it's always been really weird that S3 has lacked them.
[1]: https://learn.microsoft.com/en-us/azure/storage/blobs/concur...
So coordinating writes to multiple objects still requires… creativity.
I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.
Too bad performance would be terrible without a caching layer (ebs).
Can we have this Google?
…
Please?
2. glue jobs to partition by some columns reporting uses
3. query with athena
4. ???
5. profit (celebrate reduced cost)
This thing costs a couple of dollars a month for ~500 GB of data. Snowflake wanted crazy amounts of money for the same thing.
I wouldn't be surprised if they saw over 100 million requests per second globally by now. That's 100 million requests a second that need strong read-your-write consistency and atomicity at global scale. The number of pieces they had to move into place for this to happen is probably quite the engineering tale.
[1] https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-obje...
This is especially useful in scenarios where multiple users or processes are working on the same data, as it helps maintain consistency and avoids accidental overwrites.
This is using the same mechanism as HTTP's `If-None-Match` header, so it's easier to implement and learn.
Finally we can have this with s3 :)
If you want to follow along: https://github.com/slatedb/slatedb/issues/164
I suppose you could have some API to request a signed url for a certain hash, but that starts getting complicated, especially if you need support for multi-part uploads, which you probably do.
This does not change the point, I'm just being pedantic, but:
4GB of gp3 EBS costs $0.32 per month. Even assuming a 50% discount (not unusual), less than a dollar covers only... 6 instances.
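Spelling out that arithmetic, using only the figures quoted above and working in whole cents to avoid float rounding (the variable names are just for illustration):

```python
# Figures from the comment above, in cents per month.
list_price_cents = 32      # 4 GB gp3 volume at list price
discount_percent = 50      # assumed discount
budget_cents = 100         # "less than a dollar"

# Effective monthly cost of one stopped instance's volume.
cost_cents = list_price_cents * (100 - discount_percent) // 100

# How many volumes fit in the budget.
instances = budget_cents // cost_cents

print(cost_cents, instances)
```

That gives 16 cents per volume and 6 volumes per dollar.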
S3 also supports more complicated cases where the entire object may not be visible to any single component while it is being written, and in those cases, `ETag:` works differently.
> * Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
> * Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
> * If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption. If an object is larger than 16 MB, the AWS Management Console will upload or copy that object as a Multipart Upload, and therefore the ETag will not be an MD5 digest.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.h...
No, you need the hash of the previous block before you can start processing the next block.
The main point of it is: I have an object that I want to mutate. I think I have the latest version in memory. So I update in memory and upload it to S3 with the ETag of the version I have and tell it to only commit if that is the latest version. If it "fails", I re-download the object, re-apply the mutation, and try again.
That said, I'm not sure if common HTTP libraries look at response headers before they're done posting a request body, or if that's even allowed/possible in HTTP? It seems feasible at a first glance with chunked encoding, at least.
Edit: Upon looking a bit, it seems that informational response codes, e.g. 100 (Continue) in combination with `Expect: 100-continue` in the requests, could enable just that and avoid an extra GET with If-Match.
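In boto3 terms that loop might look roughly like the sketch below. It assumes a recent SDK in which `put_object` accepts an `IfMatch` parameter, and it assumes a failed condition surfaces as an exception whose `response` carries HTTP status 412 (or 409 for a concurrent conflict); `mutate_object` is a made-up helper name. The client is passed in so nothing here imports boto3 directly.

```python
def mutate_object(s3, bucket, key, mutate, max_attempts=5):
    """Read-modify-write an object, committing only if it has not changed.

    `s3` is expected to behave like a boto3 S3 client. Returns True once
    a conditional write succeeds, False if every attempt lost the race.
    """
    for _ in range(max_attempts):
        current = s3.get_object(Bucket=bucket, Key=key)
        body = current["Body"].read()
        etag = current["ETag"]
        try:
            # Assumption: the SDK version in use supports IfMatch here.
            s3.put_object(Bucket=bucket, Key=key, Body=mutate(body), IfMatch=etag)
            return True
        except Exception as exc:
            response = getattr(exc, "response", {})
            status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
            if status not in (409, 412):
                raise  # not a lost race; let the caller see it
    return False
```

Usage would be something like `mutate_object(boto3.client("s3"), "my-bucket", "state.json", apply_change)`, where the bucket, key, and `apply_change` are placeholders.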
1) default, load-compare-&-swap for small fast load/swaps.
2) optional, compare-load-&-swap to allow a large load to pass its compare, and cut in front of all the fast small swaps that would otherwise create an un-hittable moving target during its long loads for its own compare.
3) If the load itself was stable relative to the compare, then it could be pre-loaded and swapped into a holding location, followed by as many fast compare-&-swaps as needed to get it into the right location.
"Your Amazon EC2 usage is calculated by either the hour or the second based on the size of the instance, operating system, and the AWS Region where the instances are launched" - https://repost.aws/knowledge-center/ec2-instance-hour-billin...
> Our bet that S3 would get it in a reasonable time-frame worked out!
Scaling on things like the length of the queue doesn't work very well at all in practice. A queue length of 100 might be horribly long in some workloads and insignificant in others, so scaling on queue length requires a lot of tuning that must be adjusted over time as the workload changes. Scaling based on percent of concurrent capacity can work for most workloads, and tends to remain stable over time even as workloads change.
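As a toy illustration of scaling on percent of concurrent capacity rather than queue length: pick a target utilization and size the pool so that the currently busy workers sit at or below it. The function name, the 70% default, and the minimum of one worker are all assumptions of this sketch, not anything from the talk or from a particular autoscaler.

```python
def desired_workers(busy_workers, target_percent=70, min_workers=1):
    """Smallest pool size that keeps utilization at or below the target.

    Uses integer ceiling division so the result does not depend on
    floating-point rounding.
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    needed = (busy_workers * 100 + target_percent - 1) // target_percent
    return max(min_workers, needed)
```

The appeal is that the same target keeps meaning the same thing as the workload changes, whereas a queue-length threshold has to be re-tuned.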
MD5 should not be used for anything security related. Granting write access based on an MD5 hash would be a huge no-no.
Imagine a transaction log being a blob per-customer with many lines corresponding to price, sku, etc, that additionally have some “memo” field provided by the customer. A trusted distributed worker process is responsible for taking incoming requests by the user, pulling their blob down, appending the line based on the request, and CAS’ing it back in (retrying on failure). With enough effort, a particularly devious user could issue many requests with ‘memo’s engineered to not alter the MD5 of their log. This would cause some lines to be lost. An audit of their account transaction log would be unable to accurately reflect the requests they made to the service, and the failure would be invisible.
This is obviously a bit contrived – I’ll be the first to admit. But if the incentives were to exist for this to be worth someone’s time for some system, I think it would be likely to see it come up eventually.
Edit: This is actually already implemented in the Bao project, which exploits the BLAKE3 Merkle tree structure to offer cool features like streaming verification and verifying slices of a file as I described above: https://github.com/oconnor663/bao#verifying-slices
CopyObject writes a single part object and can read from a multipart object, as long as the parts total less than the 5 gibibyte limit for a single part.
For future writes, s3:ObjectCreated:CompleteMultipartUpload event can trigger CopyObject, else defrag to policy size parts. Boto copy() with multipart_chunksize configured is the most convenient implementation, other SDKs lack an equivalent.
For past writes, existing multipart objects can be selected from inventory filtering ETag column length greater than 32 characters. Dividing object size by part size might hint if part size is policy.
Correction: and also part quantity (parsed from etag) for comparison
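A small sketch of that inventory check, assuming multipart ETags have the commonly seen shape of 32 hex characters, a dash, and the part count. The helper names and the idea of comparing against a single policy part size are illustrative only.

```python
import math
import re

# Commonly observed multipart ETag shape: 32 hex chars, '-', part count.
_MULTIPART_ETAG = re.compile(r"^[0-9a-fA-F]{32}-(\d+)$")


def part_count(etag):
    """Return the part count parsed from a multipart-style ETag, else None."""
    match = _MULTIPART_ETAG.match(etag.strip('"'))
    return int(match.group(1)) if match else None


def matches_policy_part_size(object_size, etag, policy_part_size):
    """Hint whether the object was uploaded with the policy part size.

    Returns None for non-multipart ETags. This is only a hint: it checks
    that the part count is what the policy size would have produced.
    """
    parts = part_count(etag)
    if parts is None:
        return None
    return math.ceil(object_size / policy_part_size) == parts
```

Objects for which this returns False would be the candidates for rewriting with the policy part size.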
> To create a trailing checksum when using an AWS SDK, populate the ChecksumAlgorithm parameter with your preferred algorithm. The SDK uses that algorithm to calculate the checksum for your object (or object parts) and automatically appends it to the end of your upload request. This behavior saves you time because Amazon S3 performs both the verification and upload of your data in a single pass. https://docs.aws.amazon.com/AmazonS3/latest/userguide/checki...
It would be nice if this got updated for Additional Checksums.
But you're right, if you take a broad view of P, the choice is really between consistency and availability.
For example, running S3 locally or not.
I did read up on the 'proper' solution and it made my head spin.
You're supposed to use AWS Batch, create instances with autoscaling groups, pipe the logs to CloudWatch, serve them from there on the frontend, etc.
The number of new concepts I'd have to master is staggering, and if any of them went wrong I'd have no control over it, other than chasing after internet erudites and spending weeks talking to AWS support.
And there's the little things, like CloudWatch logs costing like $0.5/GB, while an EBS block volume costs like $0.08, with S3 being even cheaper than that.
If I go full AWS word salad, I'm pretty sure even the most wizened AWS sages would have no idea what my bills would look like.
Yeah, my solution is shit and I'm a filthy subhuman, but at least I know how every part of my code works, and the amount of code I've had to write is no more than double what I'd have written using AWS solutions, but I probably saved a lot of time debugging proprietary infra.
"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore https://www.youtube.com/watch?v=m64SWl9bfvk
Concurrent capacity might not be the best metric.