Idempotency is easy until the second request is different (blog.dochia.dev)
You want a rebuildable environment after testing blows it up? Idempotent build scripts.
You want to sell crap from a web interface? That's a transaction. If you do 'repeat a sale', that's a new transaction, with new goods and a newer date.
Forcing 1 paradigm on a different one always results in gnashing of teeth and sadness. But I guess it gets the blog hits for that dopamine rush.
The user wants something + the system might fail = the user must be able to try again.
If the system does not try again, but instead parrots the text of the previous failure, why bother? You didn't build reliability into the system, you built a deliberately stale cache.
It's not about trying again but about making sure you get consistent state.
Imagine a request for payment. You made one and it timed out. Why did it time out? Was it your network or a payment service error?
You don't know, so you can't decide between retry and not retry.
Thus the practice is: make a request - ack the request with a status request ID (idempotent: the same request gives the same status ID) - status checks might or might not be idempotent, but they usually are - each request needs a unique ID to validate whether the caller even tried to check (idempotency requires state registration).
If you want to try again you give new key and that's it.
There might of course be a bug in the implementation (naive example: the idempotency key is a uint8), but a proper implementation should scope keys so they don't clash. (Example implementation: idempotency keys are reusable after 48h.)
If same calls result in different responses (doesn't matter if you saw it or not) then API isn't idempotent.
I'm well aware that the first order went through, even though the dumb system fumbled the translation of the success message and gave me a 500 back.
I do retry because I wanted the outcome. I'm not giving it a new key (firstly because I'm a user clicking a form, not choosing UUIDs for my shopping cart) but more importantly, if I did supply a second key, it's now my fault for ordering two copies.
Take a good principle like 'modules should keep their inner workings secret so the caller can't use it wrong', run it through the best-practise-machine, and end up with 'I hand-write getters and setters on all my classes because encapsulation'.
No -- Idempotent means _no_ side-effects.
There are a lot of little things you need to think of. For example:
Client sends a request. The database is temporarily down. The server catches the exception and records the key status as FAILED. The client retries the request (as they should for a 500 error). The server sees the key exists with status FAILED and returns the error again, forever. This effectively "burns" the key on a transient error.
Others include:
- you may have namespace collisions between users (data leaks)
- when using only Redis locking instead of transactions, you have a different set of problems
- the client needs to be implemented correctly; for example, if the client sees a timeout and generates a new key, exactly-once processing is broken
- you may have race conditions with resource deletes
- using UUIDs vs. keys built from object attributes (a different set of issues)
I mean the list can get very long with little details..
This is the bug regardless of idempotency, right? It should be recording something like RESOURCE_UNAVAILABLE.
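To make the "burned key" failure mode concrete, here is a minimal in-memory sketch of one way to avoid it: only definite outcomes are recorded against the key, and a transient failure records nothing so the retry actually runs. `TransientError` and `IdempotencyStore` are hypothetical names made up for illustration, and a real server would need a persistent, concurrency-safe store.

```python
class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. database down)."""


class IdempotencyStore:
    """In-memory stand-in for a persistent idempotency table (assumption)."""

    def __init__(self):
        self._responses = {}  # idempotency key -> recorded response

    def execute(self, key, operation):
        if key in self._responses:
            # A definite outcome was recorded earlier: replay it.
            return self._responses[key]
        try:
            response = operation()
        except TransientError:
            # Transient failure: record nothing, so a retry with the same
            # key runs the operation again instead of replaying the error.
            raise
        self._responses[key] = response
        return response
```

With this shape, a client that retries after a 500 caused by a temporary outage gets a real second attempt, while a retry after a recorded success gets the stored response back.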
Yes.
The GET/POST split is the defence (even if it's only advisory).
GET-only means every time you hit the back button during an order flow, you might double-order.
One thing that's confusing, here, is that idempotency only applies for the same request, but the article implies that idempotency is about whether the request contains a specific "idempotency key".
Don't do that, and this problem evaporates.
Like, I thought the entire definition had to do with "the exact same thing twice."
This rubs me the wrong way. It's stated as fact without any trace of evidence, it is probably false, and it seems to serve no purpose but to make struggling students feel worse (and make the author feel superior).
In the real world you're faced with building five nines active-active systems that interface across various stakeholders, behaviour has to be eventually consistent, you've got a long list of requirements and deadlines, etc. It's practical, hands on, and people are there to build the thing with you at a scale that far exceeds the university undergraduate setting.
It's not a bad thing, it's just different.
Students shouldn't be afraid of it. Your job and coworkers, if it's a good workplace, are there to help you succeed as you succeed together. You learn and grow a lot.
You also learn how to deal with people, politics, changing requirements, etc., which I would imagine is difficult or impossible to teach without just throwing yourself into the fire.
I've been a CS teacher and I found that it's terribly easy to underestimate how much there is to learn and how much effort that learning takes, when you've internalized a skill yourself a long time ago.
Upon the initial request I give you "URPAY1". If you never check URPAY1 for status, we'll call you back and expect the result. If neither the check nor the callback succeeds, rollback actions are run (this is a contractual agreement at the partnership level).
You can verify your status with URPAY1. You need to provide your status check with the check ID (URPAY1) and a unique request ID. You will receive a timestamped response. You won't get different responses for the same CheckID + RequestID because it's an activity log and also a procedure check (e.g., grossly simplifying, success at 23:59:58 might be something different than success at 00:00:05 - these times can vary depending on partner and continent, so it's not only midnight, etc.). If at any point you didn't get a response you can retry and you will always get the same response.
Didn't get URPAY1 the first time? No problem, try again a second time. You'll get the same URPAY1. No new effects needed.
In this design you, as the requester, are in full control. You can make the same request 100 times and it will cause only 1 effect. If networking is lost or something crashes, you're still guaranteed to have the effect AT MOST once.
In case you're curious about the full flows and edge-case handling, Stripe has great documentation on what the process looks like from the merchant's customer's side (as this is their business and you can integrate with them).
Don't do that, and you solved nothing.
Either I'm missing what you mean, or half the comments here are missing the point of idempotency.
Let's say your server received this request twice within one minute:
{
items: [ { id: 123, amount: 1 } ],
creditCardInfo: { ... }
}
How can you tell from the server if that's a retry (think e.g. some reverse proxy crashed and the first request timed out, but the payment already went through to the user's CC)... or if the user is just trying to purchase another item 123 because they forgot they needed 2?
There is simply no way to make the requests idempotent without an idempotency key. The only way to tell both situations apart is to key the requests by some UID. The HTTP verb is irrelevant.
Did I misunderstand what you meant?
I mean: you still have the problem regardless of following HTTP verb semantics or not.
My original point, though, was that these semantics are well-understood for PUT, so just use PUT, or use POST (with the idempotency-key header) exactly as you would PUT.
I've built multiple ecommerce APIs with this approach and they work great. No heroic measures required. You can often satisfy this contract with a unique constraint; if not, a simple presence check in redis. No hashing or worrying about PII.
My rant about this: https://github.com/stickfigure/blog/wiki/How-to-%28and-how-n...
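As a rough illustration of the unique-constraint approach described above, here is a sketch using SQLite from the Python standard library. The `orders` table, its columns, and the `open_db` / `create_order` helpers are assumptions invented for this example; the idea is only that the client-chosen ID is the primary key, so a duplicate submission fails the insert and can be reported as a 409.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " order_id TEXT PRIMARY KEY,"  # client-supplied ID / idempotency key
        " item_id INTEGER NOT NULL,"
        " amount INTEGER NOT NULL)"
    )
    return conn


def create_order(conn, order_id, item_id, amount):
    """Insert the order and return an HTTP-style status code."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "INSERT INTO orders (order_id, item_id, amount) VALUES (?, ?, ?)",
                (order_id, item_id, amount),
            )
        return 201  # created
    except sqlite3.IntegrityError:
        return 409  # duplicate key: the order already exists
```

The client treats 409 on a retry as "my earlier attempt went through", which is the duplicate-notification contract rather than the replay-the-response contract.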
The whole point of the idempotence mechanism is so you can make a reliable distributed system. If the first try fails, the client doesn't know if it succeeded or not, so the client should try again later ("at-least-once"). The idempotence mechanism just ensures that we don't get duplicates in the case that the first try actually succeeded.
If you replayed failures there wouldn't be any point to the idempotency key.
Sure, a strict reading of "idempotence" might require that the response for subsequent requests be identical to the first, but for practical concerns, what matters is the API contract you define, document, and adhere to. The purpose of idempotence is to ensure that you don't end up with duplicate transactions. That's what actually matters. How that's represented in the protocol is an implementation detail.
I agree with you - I don't think Stripe has made the right choices here and it's unfortunate that it has inspired so many other people to make the same poor choices. I don't agree that their system is as sound as always returning 409s. I think having a short window where you return response bodies is fine, but after that they should still be sending 409s. If no one will ever actually resend a request after 24 hours how is it not fine to send 409s when they do? They've chosen to implement the more expensive choice and then not back it with the cheapest one.
However, it's good for e-commerce where there are a subset of "important" operations, but I'd argue the idempotency key is better for a financial API like Stripe where the majority of your operations need these semantics.
You also run into a problem if PUT/PATCH needs to be exactly-once instead of at least once. Not as common, but again, might be something you run into with a financial API.
The only difference between the approach I'm describing and Stripe's approach is the detail of how the client knows that it's done. "My" mechanism (notify the client that a request is a duplicate) was dominant in financial transaction processing systems before Stripe; I used probably a half dozen of them back in the day.
Stripe came along and from the beginning decided "we want to compete based on making a friendly API". They succeeded, and if you ever look at Paypal's original API, it's easy to see why. It's true that repeating successful API responses makes life slightly easier for clients; they never need to check for a 409 (or whatever) response. It comes at a cost of making life quite a bit more complex on servers. Personally I don't think the tradeoff is worth it, but YMMV. If your single competitive advantage is "easy API" maybe it makes sense. If you're normal B2B, almost certainly not.
I'm no expert but an "idempotency key" already has some major smell to it.
If the idempotency key was already seen, then send back the stored response.
The client's intention is out of scope. If the contract says "idempotent on key", then respond idempotently on the key. If the contract says "idempotent on body hash", then respond on the body hash (which might or might not include extra data).
APIs are contracts. Not the pinky promise of "I'll do my best guess"
I’ve seen two separate engineers implement a “generic idempotent operation” library which used separate transactions to store the idempotency details without realizing the issues it had. That was in an organization of less than 100 engineers less than 5 years apart.
One other thing I would augment this with is Antithesis’ Definite vs Indefinite error definition (https://antithesis.com/docs/resources/reliability_glossary/#...). It helps to classify your failures in this way when considering replay behavior.
[1] http://johnsalvatier.org/blog/2017/reality-has-a-surprising-...
I once wrote about inherent, irreducible complexity and how we try to deal with it. The draft has sections on how complexity can be hidden, spread out, localized, passed off, or recreated from scratch. Unfortunately, people are now using LLMs to pile complexity on the simplest of tasks, and my essay isn't really worth finishing.
Isn't the opposite true? The more people are messing with complexity, the more they could benefit from a model of complexity? And if they generate complexity with external tools, then maybe a theoretical take on that will be the only way for them to learn? I mean, we learn these things through struggle and pain, but if all of that becomes an LLM problem, then you just stop learning? But at some point complexity will strike back; at some point there will be so much of it that an LLM will be no help.
OTOH, if LLMs still win, and the skills of managing complexity are lost in future generations, if we are at the peak of our skills of dealing with complexity, then shouldn't we try our best to imprint our hard-won lessons into history? Maybe for some later generations the tide will turn and they will write textbooks on complexity, and with your article you'll get your portrait in a textbook, and each bored pupil will decorate it with mustaches? You have a chance to immortalize yourself. xD
Or maybe you can become someone like Ramanujan for math? Someone who honed obsolete math skills to an unimaginable level? Maybe a time will come when students will pore over Ramanujan's works, because his skills became useful again, and they try to find out how Ramanujan thought?
...
Sorry, I just couldn't resist. Seriously, it is hard to predict with LLMs, maybe we will not need intellect or any intellectual skills at all after AGI.
From a cursory read, only the part up to "what if the second request comes while the first is running" is an idempotency problem, in which case all subsequent responses need to wait until the first one is generated.
Everything else is an atomicity issue, which is fine, let's just call it what it is.
A user would generate the idempotency key by loading the front-end application, adding item(s) to their cart, submitting their order but timing out. The user would then navigate back to the front-end application and add another item and submit the order again. Since the user is submitting an identical idempotency key to the same transaction, our payment gateway would look up the request/transaction by idempotency key and see in its cache that there was a successful (200 OK) response to the previous request. The user now believes they purchased three items, however, our system only charged and shipped on two of the orders.
Consequently, the lesson we take away from the aforementioned incident is idempotency keys are really composite keys (Client_Provided_Key + Hash(Request_Payload)).
If a system receives an identical idempotency key (but with a different request payload) the idempotency key should be rejected with a 409 Conflict response with a message similar to "Idempotency key already used with different request payload". Alternatively, some teams argue it should be returned with a 400 Bad Request response. Systems should never return a failed cache response or replace old entries of data.
The article explains how to lock your flow. The final response for an idempotency key is not available until the first request completes, but the key itself must already exist while that request is in progress.
To safely accomplish your goal, you have to follow the following steps:
1. Acquire a distributed lock on the idempotent key.
2. Check for the existence of a key in your persistent store.
3. If an existing key is found, compare the hash of the incoming payload against the payload hash stored with the key. If the hashes do not match, return a 409 error.
4. If the hashes match, look up the status of the stored request. If the status shows COMPLETED in the persistent store, return the cached response. If the status shows PENDING in the persistent store, return a 429 Too Many Requests to the user or hold the connection open until the request reaches a COMPLETED state.
5. After processing the request, save the response to the persistent store before releasing the lock.
While this may look simple on paper, creating a distributed locking state machine for a single API endpoint is typically how developers have their first aha moments with idempotency. Becoming idempotent is often an enormous architectural shift and not just a middleware header check.
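The five steps above can be sketched in a single process as follows. This is only an illustration under loud assumptions: a per-key `threading.Lock` stands in for the distributed lock, a dict stands in for the persistent store, and `IdempotentEndpoint` and its handler are hypothetical names. A failed handler removes the PENDING record, which is one possible policy, not the only one.

```python
import hashlib
import json
import threading


class IdempotentEndpoint:
    def __init__(self, handler):
        self._handler = handler          # does the real work for a payload
        self._store = {}                 # key -> {"hash", "status", "response"}
        self._locks = {}                 # key -> lock (stand-in for a distributed lock)
        self._guard = threading.Lock()   # protects the _locks dict itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    @staticmethod
    def _hash(payload):
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def handle(self, key, payload):
        payload_hash = self._hash(payload)
        with self._lock_for(key):                         # step 1: acquire lock
            record = self._store.get(key)                 # step 2: look up key
            if record is not None:
                if record["hash"] != payload_hash:        # step 3: payload check
                    return 409, {"error": "Idempotency key already used "
                                          "with different request payload"}
                if record["status"] == "COMPLETED":       # step 4: replay
                    return record["response"]
                # PENDING is only visible here if an earlier attempt died
                # without cleaning up (e.g. a crash in a real deployment).
                return 429, {"error": "request still in progress"}
            self._store[key] = {"hash": payload_hash, "status": "PENDING",
                                "response": None}
            try:
                response = (200, self._handler(payload))
            except Exception:
                del self._store[key]                      # do not burn the key
                raise
            self._store[key] = {"hash": payload_hash,     # step 5: save, then
                                "status": "COMPLETED",    # release the lock
                                "response": response}
            return response
```

Holding the lock for the whole request keeps the sketch short; a production system would more likely persist the PENDING row, release the lock, and let concurrent duplicates see PENDING.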
Idempotency is about state, not communication. Send the same payment twice and one of them should respond "payment already exists".
If you like the article, upvote. If you don’t, don’t.
Not well organized, but not zero value.
It may improve efficiency where a protocol doesn’t assure exactly-once delivery of messages, but it cannot help you with problems other than deduplication of identical messages.
Creating a payment is not an idempotent operation. If the economics of the operation can differ when the “idempotency” key remains the same then you’ve just created a foot-gun in your API.
You can document that you’re going to ignore “duplicate” requests that share an idempotency key but that’s just user-hostile. The system as a whole is broken as designed.
I would argue sequencing is part of the hard part of idempotency, your business context would decide “when” to apply sequencing is good enough (recall monopoly “bank error collect $$$”).
Set and setting is also relevant, most places don’t deal with money or disastrous concurrency scenarios.
Now if you want to argue for a paradigm shift for why we shouldn’t be here to begin with and offer a way to get back to a scalable centralized db system, we’re all ears.
Is this the new normal? Assert something that is clearly broken as correct, then write a blog post fixing the broken logic?
You don’t replay it on retry. You signal success on the first try, and subsequent requests with the same key return 409.
Anything else and you are doing it wrong.
Here x is interpreted as state and f an action acting on the state.
State is in practice always subjected to side effects and concurrency. That's why if x is state then f can never be purely idempotent and the term has to be interpreted in a hand-wavy fashion which leads to confusions regarding attempts to handle that mismatch which again leads to rather meandering and confusing and way too long blog posts as the one we are seeing here.
*: I wonder how you can write such a lengthy text and not once even mention this. If you want to understand idempotency in a meaningful way then you have to reduce the scenario to a mathematical function. If you don't then you are left with a fuzzy concept and there isn't much point about philosophizing over just accepting how something is practically implemented; like this idempotency-key.
That is simply not true. f could be, for example, “set x.variable to 7”, which is definitely idempotent.
And yes, in real machines we can't ever have true same states between multiple calls as system time, heat and other effects will differ but we define the state over the abstracted system model of whatever we are modelling and we define idempotency as the same state over multiple calls in that system.
"delete record with id 123" is only idempotent if there is no chance that an operation like "create record with id 123" happened in between.
I wondered about this too. Also, why was it framed in the context of JSON-based RPC over HTTP?
In that mathematical notation there are typically no side effects, and those are meant to be pure functions.
This entire example is bad design. It's bad, bad design. I'm sorry, but if this is your example, you are doing it wrong in every way. There are ways to handle these sorts of things, well-known and well-established patterns. You are using none of these here.
I get it, it's an example, but it's a poor example. You should change it before someone assumes what you are talking about is sensible or reasonable in a production environment. Or at least put a warning.
Or you can completely forget this feature and make it really awkward for the client to reconcile their view of the world with yours and/or to check in the request later. cough Mercury cough.
It is, just barely, acceptable to generate the identifier server side and return it to the client.
I recently designed a system where this had to be taken into consideration. I find my solution very elegant: when the request arrives, I put the pending request into a map, keyed by the idempotenceId. This whole operation is executed in one step. Now the event loop may process other requests. If one of them has the same key, it will await the same response object from the store. And then, once I have the response, I resolve both promises with it.
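A rough Python analogue of that in-flight map, using `asyncio` tasks in place of promises. `InFlightCoalescer` is a hypothetical name, eviction of finished entries is deliberately left out, and this is a sketch of the idea rather than the commenter's actual code.

```python
import asyncio


class InFlightCoalescer:
    def __init__(self):
        self._pending = {}  # idempotence id -> asyncio.Task

    async def handle(self, idempotence_id, operation):
        task = self._pending.get(idempotence_id)
        if task is None:
            # There is no await between the lookup and the insert, so on a
            # single event loop no other request can slip in between them.
            task = asyncio.create_task(operation())
            self._pending[idempotence_id] = task
        # Every caller with the same id awaits the same task and therefore
        # receives the same response object (or the same exception).
        return await task
```

This only deduplicates within one process and one event loop; across several server instances the map would have to live in shared storage.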
And then some lazy birdbrain will come up with some new way to either jump to a random place in the code without guardrails on program state, or referencing data that other code or threads could have touched, and they'll call it a time saving feature.
And then we will all learn the hard way that those annoying restrictions were in place for a reason.
This is the great circle of life and death and rebirth
Idempotency or not, many points in the article are about atomic transactions.
Auth, logging, and atomicity are all isolated concerns that should not affect the domain specific user contract with your API.
How you handle unique keys is going to vary by domain and tolerance-- and it's probably not going to be the same in every table.
It's important to design a database schema that can work independently of your middleware layer.
(Though I do disagree with the original premise too. Putting on a 'stateless' boxing glove won't mean there's no difference between punching a guy once or twice)
A database on its own is enough for most business applications.
If you haven't seen this yet, you're just rent seeking.
I've been in this situation, a clientside bug meant that different requests arrived with the same idempotency key.
In my case, updating the client would have taken weeks, in the best case scenario. Updating the backend to check for a matching request body would have taken minutes, maybe hours.
It took me a surprising amount of arguing to convince people that, even if it was a clientside bug, we couldn't let users suffer for weeks in name of "correctness".
Ideally you already send client version in requests (or have an API version prefix). Add the workaround only for legacy clients.
Next client version must distinguish itself from predecessor and must not require the bodge to work.
Then at least admit you’re just hacking in quick fixes, creating technical debt, and not fixing the actual problem.
I agree with your point that business interest is most important, I disagree that it’s the technically most appropriate solution.
The whole article is proclaiming that this is a technical problem about idempotency being hard, while it’s not. The whole premise, that resolving client-side bugs on the backend is the correct solution, is incorrect.
You have never had to work with PHP backends, have you?
JSON in PHP is a flustercluck. Undefined, null, "" or "null", that is always the question.
If you use a typed Go/Rust client and schemas, you usually end up with "look ahead schemas" that try to detect the actual types behind the scenes, either with custom marshallers or with some v1/v2/v3 etc schema structs.
It's so painful to deal with ducktyped languages ... that's something I wouldn't wish on anyone.
The robustness principle has its times and places but the general consensus that it should be applied everywhere to everything was a big mistake. The default should be that you are very rigid and precise and only apply the robustness principle in those times and places it applies, and I'm perfectly comfortable waiting to deploy something precise and find out that this was one of them. The vast majority of APIs are not the time and place for the robustness principle. It's the time and place for careful precision on exactly what is provided, and detailed and descriptive error messages, logging, and metrics for when the boundaries are transgressed.
The user just needs to know what the trade-off is. And "best guess" can be hard to characterize, so you need to be extremely careful. But sometimes it's a big win for a low price.
"Best guess" can be bad if it is not well-defined, but you can still make error detection obvious rather than hidden.
An API should follow its documented behavior. This is both a specification and a contract. If the docs for the API say that a duplicate idempotency key will receive a 409, and do not mention message hashes, then they need to follow that spec because the client may specifically depend on it. For example if the order was processed and the cart is resent with the same key but an additional item, client does not want another order with the duplicate items in the first one. They want an error.
If the docs do not accurately describe the behavior of the idempotency key, the client should find another provider.
> While this may look simple on paper, creating a distributed locking state machine for a single API endpoint is typically how developers have their first aha moments with idempotency. Becoming idempotent is often an enormous architectural shift and not just a middleware header check.
Yes, when you expand the scope of your API implementation beyond its contract you take on a virtually unbounded amount of edge cases that not only must you solve, but that your customers must guess at how you are solving.
I'm guessing that your API required the idempotency key. I think that could be risky because it means developers will simply provide a value for it without understanding the purpose, or thinking through the implications. You really only want them using it if they understand the problem it is solving.
Hashing message content could be an alternative behavior that it makes sense to support by default for apps that don't supply an idempotency key. As long as you document it.
The idempotency key should have been viewed as the untrustworthy hint it really is. Then you can decide whether an untrustworthy hint is what you really need. At that point I'd hope someone on the team says "This is ordering - I think we need something trustworthy"
> Consequently, the lesson we take away from the aforementioned incident is idempotency keys are really composite keys (Client_Provided_Key + Hash(Request_Payload)).
Did the postmortem result in any other (wider) changes/actions, out of curiosity?
No idea if this was anything like what happened in your case, and probably going off on a tangent, but I've seen so many cases where teams are split into backend and frontend, and they stop thinking about the product as a single distributed system (or it exacerbates the lack of that thinking from before). Frontend often suggests "Oh we can just create an idempotency key" and any concerns from backend are dismissed. If they implement it incorrectly, backend are on the wrong 'team' to provide input.
Save only if the operation succeeds. It's meaningless to cache a failure, subsequent retries will result in failure from the cache.
Frankly you guys are overengineering the whole thing. We use the concept only for network outages, i.e. it is only on timeout that we want to guard against fulfilling a duplicate request for the same operation.
Congrats on destroying the purpose of Idempotency Keys.
Ask yourself, why not just `Hash(Request_Payload)`? That'll give you half of what you need to know about why the Idempotency Key header is useful in the first place.
The other half you already know? You just described your bug, it's a bug, on your front-end, this has nothing to do with idempotency; if anything, the system is performing as expected.
If your requests do something different, they should have different Idempotency Keys. <- this brings down TFA and most of the comments here. I guess those are the perils of vibecoding.
“Idempotency is about the effect
An operation is idempotent if applying it once or many times has the same intended effect.”
Edit: Perhaps it is my mental model that is different. I think it makes most sense to see the idempotency key as a transaction identifier, and each request as a modification of that transaction. From this perspective it is clearer that the API calls are only implying the expected state that you need to handle conflicts and make PUTs idempotent. Making it explicit clarifies things.
The article actually ends up creating the required table to make this explicit, but the API calls do not clarify their intent. As long as the transaction remains pending you're free to say "just set the details to X" and just let the last call win, but making the state final requires knowing the state and if you are wrong it should return an error.
If you split this in two calls there's no way to avoid an error if you set it from pending to final twice. So a call that does both at once should also crash on conflicts because one of the two calls incorrectly assumed the transaction was still pending.
What's being asked for here is eventual consistency. If you make the same request twice, the system must settle into the same state as if it was done only once. That's the realm of conflict-free replicated data types, which the article is trying to re-invent.
    x = 1
is idempotent.
    x = x + 1
over a link with delay and errors is a problem that requires the heavy machinery of CRDTs.
For idempotency you literally just want f(state) = f(f(state)). Whether you achieve this by just doing the same thing twice (no external effects) or doing the thing exactly once (if you do have side effects) is not important.
But if you have side effects and need something to happen exactly once it seems a lot more useful to communicate this, rather than pretending you did the thing.
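The f(state) = f(f(state)) definition and the x = 1 versus x = x + 1 contrast from this subthread can be checked mechanically. A small sketch, with state modelled as a plain dict and all function names invented for the example; note that it tests one starting state, which can refute idempotency but cannot prove it in general.

```python
def set_to_one(state):
    """x = 1: applying it again changes nothing."""
    return {**state, "x": 1}


def increment(state):
    """x = x + 1: every application moves the state."""
    return {**state, "x": state["x"] + 1}


def is_idempotent_on(f, state):
    """Check f(f(state)) == f(state) for one particular starting state."""
    once = f(state)
    return f(once) == once
```

Calling `is_idempotent_on(set_to_one, {"x": 5})` gives `True`, while `is_idempotent_on(increment, {"x": 5})` gives `False`, which is why the increment needs a key, a sequence number, or similar machinery to be retried safely.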
You are hiding the relevant complexity in the term "same". What is "the same" here? I mean, if I accidentally buy only 1 instead of 2 items of a product and then afterwards buy 1 more item, how is this then the same or not the same payment?
The idempotency key of the request
I mean:
> Maybe the first request created a local payment but crashed before publishing an event ...
I mean, yeah, sure. That's a problem. I can come up with another one:
"Maybe the ZFS disk array for the DB caught fire and died a horrible death and you now need to restore from backups".
But that's going to be a problem anyway.
I think there is also a risk of an "easy API" when it leads to magical thinking and sloppy development. If the naive client programmer starts to think the reliability is handled for them, they may also flub the handling of the idempotence key that remains the crux here. E.g. not persisting them well enough and returning to a situation where the user can accidentally make duplicate payments just like the naive system with no idempotence feature...
There are still side effects in the system, of course.
But what your database looks like afterwards is the important part.
Can you recover lost data, replay transactions, undo, etc etc?
I think it depends on whether the sender needs to know whether the thing was done during the request, or just needs to know that the thing was done at all. If the API is to make a purchase then maybe all the caller really needs to know is "the purchase has been done", no matter whether it was done this time or a previous time.
And in terms of a caller implementing retry logic, it's easier for the caller to just retry and accept the success response the second time (no matter if it was done the second time, or actually done the first time but the response got lost triggering the retry).
Some help for others to understand the history of this (which apparently Stripe, Paypal, Dwolla, and others use): https://github.com/mdn/content/issues/41497 There are links to the RFC and prior art.
That aside, my first impulse is to say that the server should specify that the key includes a hash of the important parts of the request, checked on receipt, so that only the key itself need be stored. However, FF's implementation apparently(?) adds the header automatically to POST and PATCH if it's not already present, which means that it's not able to comply with such a decision, and the RFC (currently expired) recommends using a UUID, so.
I'm guessing the original motivation of this is "Browser JS might not be able to send a PUT, or proxies may not handle a PUT correctly".
> State is in practice always subjected to side effects and concurrency.
There was never any claim or assumption regarding f. Maybe the way you interpreted it is what they meant, but it is not what was stated.
The issue with things that the client must not do is that it might still do them, and users don't care whose fault it is. It's important to have auxiliary mechanisms to mitigate these.
If it's truly intended, it needs to be part of the official spec, with a robust justification of why it's there at all. Neither server nor client ought to have unnecessary and undocumented things "just in case", because that breeds a culture of uncertainty.
If you fear client regressions, make it a mandatory part of the client's test suite. You control the client, right?
If the client sends the same key but a different payload that’s a 400 or 409 in my eyes.
2) Client's choice
I can choose to purchase a 2nd item, or I can choose to retry purchasing the 1st item. The server making that choice for me is not idempotency.
Idempotency is the server supporting my ability to retry purchasing the 1st item, safe in the knowledge that they won't send me a 2nd one.
You need to store the payment state at each relevant step and process it asynchronously. If requests time out, you check the status of it using the key you store (with the processor) to see if it was even received.
It’s not perfect, some processors will 500 while processing the payment (Braintree), so you still need reconciliation on the backend.
Regardless, I think your assumption about how the request/response cycle should be working is wrong. For this kind of API and transaction, the server should be returning a response immediately: 202 Accepted. The only thing the API server should be doing before returning is creating a row in a DB (with a "state" field with an initial value of "pending"), and pushing some work on a queue.
The server should not be sitting there with the HTTP request open, trying to complete the transaction, and only returning a response to the client when the transaction is finished or has encountered an error.
The client will have to learn about the progress of the state of the transaction outside of this initial request. There are many options here: polling, webhooks, a message queue like kinesis or kafka, etc.
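A rough sketch of that shape, assuming an in-memory SQLite table and an in-process `queue.Queue` as stand-ins for the real database and work queue. The table layout, function names, and status URL are hypothetical:

```python
import queue
import sqlite3


def open_requests_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests (idem_key TEXT PRIMARY KEY, payload TEXT, state TEXT)"
    )
    return conn


def accept(conn, work, idem_key, payload):
    """Record the request as pending, enqueue it, and answer immediately."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO requests VALUES (?, ?, 'pending')",
            (idem_key, payload),
        )
    if cur.rowcount == 1:  # only the first request for this key enqueues work
        work.put(idem_key)
    return 202, {"status_url": f"/requests/{idem_key}"}


def work_one(conn, work):
    """Worker side: take one queued key and mark it finished."""
    idem_key = work.get_nowait()
    with conn:
        conn.execute(
            "UPDATE requests SET state = 'done' WHERE idem_key = ?", (idem_key,)
        )


def status(conn, idem_key):
    """What a polling client would read back; None if the key is unknown."""
    row = conn.execute(
        "SELECT state FROM requests WHERE idem_key = ?", (idem_key,)
    ).fetchone()
    return row[0] if row else None
```

Note that this sketch can lose work if the process dies between the commit and the enqueue; a real system would need the worker to also pick up `pending` rows directly, or an equivalent recovery path.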
Idempotency-Key should not replay the response (it depends, actually). But it also should not return a 409 error. You need to be content-aware before adding Idempotency-Key header handling.
What happens when the request is received and handled, but the TCP connection drops unexpectedly while the response body is being written, and a second or two later the connection is reestablished? How do the two sides agree that the previous request was accepted and everything is good to go? That's what the Idempotency-Key header does.
An HTTP request comes in with a certain idempotency key. The server returns 202, as you say, and begins to process the database transaction.
While the server is still processing the database transaction, a second HTTP request comes in with the same idempotency key. What response does this second HTTP request get? The original transaction that the first HTTP request triggered hasn't succeeded and hasn't failed, so it doesn't fall into either of the categories in the post I responded to.
Your answer is that the second HTTP request gets a 409, which makes sense to me, although others are objecting to it.
No no no no no.
You have multiple clients submitting the same business operation simultaneously. One must succeed, the others must fail. If you're using the 409 approach ("notify client that request is redundant") you must not send a 409 code until the work is complete.
The client must interpret 200 and 409 as success cases. 200 means "it was done" and 409 means "it was already done". Clients looping (say, processing durable queue messages) can stop when they receive these responses.
If the work is not complete, you can't return 409, or clients will think the work is done. You will lose messages.
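On the client side, that contract could look roughly like the loop below. `send` is a hypothetical callable that performs the request with a fixed idempotency key and returns the HTTP status code, or raises `ConnectionError` when no response arrives; the retry limit and the handling of other status codes are assumptions:

```python
def deliver(send, max_attempts=5):
    """Call send() until the server confirms the work is done.

    Returns the attempt number on which confirmation arrived.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status = send()
        except ConnectionError:
            continue  # outcome unknown: retry with the same key
        if status in (200, 409):
            return attempt  # done now, or already done earlier
        if 400 <= status < 500:
            raise ValueError(f"request rejected: {status}")
        # anything else (e.g. a 5xx) is treated as retryable
    raise TimeoutError("gave up without confirmation")
```

This only works if the server never sends 409 before the work has actually committed, which is the point being argued above.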
But, rather than 409, I'd say that you should be using optimistic concurrency control if you adopt this perspective. There should be a resource context for the request, so the client can obtain an ETag and send If-None-Match headers, and get a 412 response if things are out of sync. That allows them to retry a failed/lost request and safely prevent a double action.
Under a 412, they have to step back and retry a larger loop where they GET some new state and prepare a new action. Just like in DB transaction programming, where your failed commit means you roll back, clean the slate, and start a whole new interrogation of transaction-protected state leading up to your new mutation request.
That doesn't mean that idempotency keys have to be used. You can certainly hash message content if that is documented behavior. That probably only makes sense when there is already some logical session or transaction identifier that makes dedupe semantics clear.
The system you propose might be sound and might be necessary in some systems, but I can't think of what they might be that wouldn't be better served by the simpler solution that is already widely used for this purpose.
If it processed 99% of the request and the final bookkeeping failed because of a duplicate, that's still a failed request.
Arguably this should be the primary way you check for idempotent requests - you shouldn't have a separate check for existence, you should have the insert/update fail atomically.
This is the same thing you see on filesystems for TOCTOU security holes - the right way is to atomically access and modify once, and you only know the request was already processed because that fails.
Even if you have a complex long-running multistep orchestration problem, you can break it down into simpler transactions. Eg you could start with a "lock the resources" txn.
But 99% of these conversations around idempotence are simple POST operations like "create order" that regular old database concurrency management handles just fine.
That doesn't answer my question. What response do you return to the client in the case I described?
But your follow-up responses here are making me rethink. Now you have to have all these special cases where the original request is still in process. I think your assertion of "99% are simple POST operations" is bullshit. For the times where idempotency is hard and really matters, oftentimes you're calling a third-party API, like a payment processing API.
I would think a better approach would be to always return a 409 on a subsequent request, regardless of whether it passed or failed, and then have a separate standard API that lets you get the result of any request by its idempotency key.
Devs are too scared to be nice (ie not return errors) to clients when they misbehave.
The pattern I describe was the dominant design pattern for financial transaction processing systems before Stripe. Stripe's API makes life for the clients slightly easier at the expense of making life for servers more complicated, but the two approaches are equivalent in function.
You seem very focused on long-running orchestration type systems. You build these on top of basic transactional primitives, but it's a mistake to try to make the whole process a single transaction. You can have a quick, transactional "start process" operation which must be idempotent. Other operations like "check status" need not be so complicated.
You don't necessarily share the idempotency key between the "start process" request and the "check status" request. You could for convenience, but it isn't necessary, and on balance most APIs don't. This is the "client picks ID" vs "server picks ID" design choice.
Fair enough. So basically your approach is to wait until the first request completes to decide how to respond to the second request that came in with the same idempotency key.
However, that would seem to me to imply that when the second request comes in, you check its idempotency key, realize you've already received a request with that key and you're processing it, and don't do anything else with the second request until the first one is completed. In particular, you don't have the second request trigger the start of another transaction.
But elsewhere in this thread, you've said you would start a second transaction based on the second request, and let your database's transaction mechanism tell you that it's a duplicate when you try to commit it. Why would you do that if you've checked the second request's idempotency key and you know it's a duplicate?
> You seem very focused on long-running orchestration type systems.
I'm not focused on anything except getting what I thought would be a simple answer to a simple question. The above seems to provide that (though it still leaves a question open, as above). That's all I wanted.
> You don't necessarily share the idempotency key between the "start process" request and the "check status" request.
I'm not talking about a "check status" request. The scenario I've been asking about all along is when a second "start process" request comes in with the same idempotency key as a previous "start process" request, while the process is still in progress.
Part of the problem here is that we're confusing how do you structure the API (replay? 409? something else?) with how we implement the API. The original article (and my original response) focused on API structure. We're wandering into the details of implementation, which is fine, but there are of course many ways to do the implementation. Some simpler than others.
Here's the simplest and most reliable way to implement idempotency for a trivial "create payment" operation, where the client submits an idempotency key. This pattern is incredibly common. Every request looks something like this:
* Start a transaction
* Lookup "does this idempotency key already exist"
* If it doesn't, insert the payment record with the idempotency key
* Commit the transaction
* Return the result. Successful insert is always 200 OK. "key already exists" results in either replay of the original result (Stripe model) or an explicit error like 409 (my favored approach, still ubiquitous in ecommerce, and very common in financial APIs that predate Stripe).
Does that help? If you're using your database to handle concurrency, you need every request to start inside the transaction. You can't check the idempotency key outside of the transaction or you can't guarantee once-and-only-once behavior.
[Before someone mentions it, yes you can use a unique constraint instead of an explicit transaction, and this is conceptually identical - the check-for-dup transaction is inside a single INSERT]
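For concreteness, here is a small sketch of the unique-constraint variant, using an in-memory SQLite database as a stand-in for the real one. The table layout and function names are made up for illustration, and it returns 409 on a duplicate (the non-replay model):

```python
import sqlite3


def open_payments_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        " idem_key TEXT PRIMARY KEY,"  # the uniqueness constraint does the dedupe
        " amount_cents INTEGER NOT NULL)"
    )
    return conn


def create_payment(conn, idem_key, amount_cents):
    """Return 200 for the first insert, 409 if the key already exists."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO payments VALUES (?, ?)", (idem_key, amount_cents)
            )
    except sqlite3.IntegrityError:
        return 409
    return 200
```

There is no separate "does this key exist" lookup here; the only way the code learns about a duplicate is that the insert fails.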
No, it shouldn't. The comment you're responding to is taking 200 to mean "success" and 409 to mean "it was done" so if it was not in fact done then you _must not_ return that.
That said, I thought one of the benefits of idempotency was nonblocking APIs so I'm not sure I like that scheme. It seems like 200 should mean "submitted, accepted, incomplete" and 409 should mean "previously completed". The client never knows which request succeeded but they're idempotent so that doesn't matter. You just poll until the 200 becomes a 409.
Of course that would provide zero diagnostics in the case of failure so I think it's not sufficient as described.
It doesn't have to be multiple clients. It could be the same client, not having received a response to its first request and deciding to re-send the request again.
It isn't complicated, though I can see how if your entire experience with financial APIs is Stripe, you might not be aware of how simple it is. Because Stripe's approach, while mildly more convenient for clients, is a PITA to implement properly.
You seem to think that it's important to make the specific bytes of an http request-response idempotent.
I think that it's important to make a business operation idempotent.
You're missing the forest for the trees.
The fact that payments have a settlement process is not relevant to this discussion.
Yes, I agree. You want to generate a token, persist it locally and use that to communicate with the payment gateway, so re-submissions use the same key and either error or return the transaction state.
> The fact that payments have a settlement process is not relevant to this discussion.
I wasn't talking about settlement, I was talking about the processing aspect. What I meant was: once you kickstart the process with the gateway, money is highly likely to change hands as a result. This means a process of:
1. POST /checkout
2. Create token
3. POST to payment gateway with token
4. Wait for gateway to return
5. Persist transaction/error
6. Return success/error
What is needed is to persist and return the token to the caller before contacting the payment gateway, to make a check + retry mechanism possible.
And yes, I've seen code that follows steps 1-6 exactly as I've described and, yes, all the problems you imagine would occur from those steps have occurred at one time or another.
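A sketch of the reordered flow, where the token is persisted before the gateway is contacted. `charge` is a hypothetical callable standing in for the payment gateway, assumed to deduplicate by token; the table and state names are also assumptions:

```python
import sqlite3
import uuid


def open_checkout_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE checkouts (order_id TEXT PRIMARY KEY, token TEXT, state TEXT)"
    )
    return conn


def checkout(conn, charge, order_id):
    """Persist a token for the order first, then contact the gateway with it."""
    with conn:
        # A retry for the same order keeps the token stored the first time.
        conn.execute(
            "INSERT OR IGNORE INTO checkouts VALUES (?, ?, 'pending')",
            (order_id, str(uuid.uuid4())),
        )
    token, state = conn.execute(
        "SELECT token, state FROM checkouts WHERE order_id = ?", (order_id,)
    ).fetchone()
    if state == "pending":
        try:
            charge(token)  # hypothetical gateway call, deduplicated by token
            state = "paid"
        except TimeoutError:
            state = "pending"  # outcome unknown; the token survives for a retry
        with conn:
            conn.execute(
                "UPDATE checkouts SET state = ? WHERE order_id = ?",
                (state, order_id),
            )
    return token, state
```

Because the token is committed before the gateway call, a timeout leaves something to check or retry against, instead of an unknown charge with no identifier.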
So one will complete with 200, one will complete with 409. It doesn't matter which.
That said, there's something odd about the way you phrased this question. If the original request hasn't gotten a response yet, why is it sending a retry? What you're asking is more general: What happens when two conflicting requests come in? This is something we've been solving with RDBMSes since the 1970s.
Because it hasn't gotten a response yet. That's got to be far and away the most common reason any request gets retried in any context.
> why is it sending a retry?
Maybe two clients try to do it? Or there's a bug in how the client does it?
Isn't the point of idempotency to enable clients to retry without fear of a 2nd request somehow breaking things?
You absolutely must wait for one request to finish before any other request can return a 409. 409 is a signal to the client that they can stop retrying, the job is done. If some request returns 409 early and the "original" request fails, you will not get further retries and the message will be lost.
Stripe's approach requires serialization as well. Only one request can succeed. If you send multiple conflicting requests in simultaneously, some of those have to block.
The good news is that we have been solving this problem for decades and we have incredibly well refined tools - database transactions and isolation levels - for solving this problem.
Not necessarily - there are different transaction isolation and conflict resolution methods provided by every database built for this purpose. You just have to ensure that only one request actually commits to the database, and that one sends a success response while the other sends a 409. The database or another lock provider can either help enforce serialization up-front - or the app can use optimistic locks based on data in the request that will only block if there is actually a conflict, and this won't delay the first transaction at all.
Solving these kinds of issues is exactly the purpose of idempotency keys and database transactions, and using them in the intended way is really the only sound way to build a distributed system. Making things more complicated to "improve DevX" is just going to make them unsound. That is what Stripe chose to do. Their 24-hour replay idea is fine but why not send 409s after that rather than accept those transactions? If "that will never happen" then the 409s will never happen. It would have cost approximately nothing (if designed that way upfront) and inconvenienced their clients not at all.
Um, because connections over the Internet aren't 100% always on? Because packets can get lost? Because computers sometimes have to reboot?
You're assuming that the client will always receive whatever response your server finally sends, and that the client will wait indefinitely to receive a response. Neither of those things are true. So the client can be in a state where it sends a retry because it got no response and doesn't know why. And that means a retry request could come in while the first one is still being resolved--because the client had a timeout or it rebooted or something else happened that made it lose the connection state it previously had. That's the case I'm asking about.
The case of "client sends a retry with the same idempotency key" generalizes to "multiple requests come in for the same idempotency key". These can come in spread out over time (like a traditional loop), or they could come in at once. The solution is the same either way.
The problem of "how do we deal with multiple conflicting requests coming in at once" is something we have been dealing with for decades. We have databases with transactions and isolation levels. If I said in an interview "make an endpoint that inserts a value in a database and returns an error if the value is a duplicate", any competent backend web developer should be able to write it without Claude's help. Concurrency is part of our life.
Whether you want to return 409 or replay the success is irrelevant to this question. You must serialize the idempotent operation on the server, because you can have multiple requests coming in simultaneously. If you put the operation in a database transaction with an appropriate isolation level, you are most of the way there.
Idempotency keys are themselves the solution you're looking for. If they don't work concurrently, they aren't idempotency keys. Your response in races or duplicates doesn't inherently matter in that sense, pick whatever semantics make sense for your system.
I.e. idempotent DELETE with proper protocol behavior requires that one request see the 200 OK or 204 No Content and the other sees 404 Not Found, because the delete has already happened. It would be misleading to say 200 OK to both, because that answer means the resource was there when the request arrived.
Honestly, the whole HTTP resource model has a different conceptual backing for state management than the independently developed "idempotence" concepts in distributed systems. Those non-HTTP concepts came from more message-based rather than resource-based architectural assumptions.
The cleanest mapping in the spirit of HTTP would be that you do multiple round trips. A POST creates a new idempotence context, a bit like "start a transaction". The new URI is the key for coordinating state change and allowing restart/recovery.
As I remember it, the idea of idempotence keys in headers really came from the SOAP RPC mindset. It's kind of funny to see it persisting in some hybrid SOAP + REST mental model.
I think that gave me "Enterprise Java Beans PTSD". I.e. an over-engineered solution that adds complexity for both the client and server in the name of some sort of "protocol purity".
People bolted on idempotent semantics onto HTTP because it wasn't provided natively by the protocol, so I don't think it makes sense to go through some hoop-jumping gymnastics for the sake of conforming to a spec that doesn't describe the necessary semantics in the first place.
When I let myself ruminate, it irks me that we all let HTTP become the de facto "internet protocol" just because of firewalls. Because there was a cargo cult idea that HTTP is benign and so one of few ports allowed almost everywhere, we do stupid contortions to squeeze every protocol through an HTTP tunnel.
These short-sighted acts of laziness accumulate into HTTP everywhere. And of course, the firewall is nearly pointless when "everything" is going through that one hole anyway.
This isn't a special case, and it's the same problem if you want to replay the original response on conflict. If the original request isn't complete, what are you going to replay?
Who says you have to replay? If you get a second request with the same idempotency key, and the original request is still in process, why not just send the client a response that says so?
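One way to picture that option is a registry that tracks three states per key and tells a duplicate request which one it hit. This is a single-process sketch with made-up names; with multiple servers the same state would have to live in a shared store. It assumes a failed original lets the next request with that key run the work again, as described elsewhere in the thread:

```python
import threading


class InFlightRegistry:
    """Tracks each idempotency key as 'in_progress', 'succeeded', or 'failed'."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def begin(self, key):
        """Return 'start' to the caller that should do the work, else the known state."""
        with self._lock:
            state = self._state.get(key)
            if state is None or state == "failed":
                self._state[key] = "in_progress"
                return "start"
            return state  # 'in_progress' or 'succeeded'

    def finish(self, key, ok):
        """Record the outcome once the work for this key has completed."""
        with self._lock:
            self._state[key] = "succeeded" if ok else "failed"
```

A handler could map `in_progress` to whatever "still working, check back" response the API documents, and `succeeded` to either a replay or a 409.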
Long running transactions create all sorts of problems, so transactions are generally expected to be short. The actual work behind "create payment" or "create order" is generally fairly trivial - more or less insert a row in a table. There's no good reason to make the API complicated... you either "win" at concurrency or you lose, and the difference is generally sub-millisecond. The only meaningful thing you need to communicate to the client is "you're done" (for both the win and lose cases) or "you need to try again" (for the "something unexpected went wrong" case).
Complicated workflows can certainly have multiple steps, with "fetch the current status" calls in between. But somewhere near the beginning of every complicated workflow there will be a call to "create workflow" and it will need to have some sort of mechanism which allows clients to call it idempotently. Otherwise you end up with multiple starts.
I've literally received duplicate products in the mail because of this kind of problem. I've also sent multiple products in the mail because services I relied on didn't offer the necessary idempotency mechanisms.
Even then, it’s not fine because those requests might time out, or your request times out waiting for theirs. Just because your provider abstracts behind one API doesn’t mean you necessarily can!
500 errors, network timeouts, etc all happen. We can't run 2PC transactions with Stripe, so you need durable retries. People run billions of dollars through these APIs every day. It's fine.
My point is that you can’t rely on this specific mechanism because request failure does not mean the payment is not going through!
I’m not sure where we disagree so I must ask: do you disagree with what I’ve written and, if so, what and why?
What you said up to that point didn't really. But then you said this:
> If you're using your database to handle concurrency, you need every request to start inside the transaction. You can't check the idempotency key outside of the transaction or you can't guarantee once-and-only-once behavior.
Which answers the question that what you said earlier in your post raised. If I'm understanding you right, "lookup the idempotency key" is also relying on the same database, so you need the whole operation to be inside a single transaction in that database.
It would seem to me that you would want "what happens if a second request comes in with the same idempotency key while the first is still in progress" to be part of the API, so clients would know what your server is going to do in that scenario.
You could invent your own more sophisticated idempotency API but good luck finding someone that wants to implement it or use it. What real-world problem are you trying to solve?
Meaning, clients don't care about the thing I asked about?
> What real-world problem are you trying to solve?
I'm trying to understand your answers to my questions. When there seems to me to be something missing, I ask about it.
It's generally insert a row in someone else's table, over the wire, 50ms+ away. They might not even be using an RDBMS.
No, I'm asking one question, which doesn't seem to be summarized by your summary.
The situation is that your server has received two requests with the same idempotency key. For the first request, one of three things could be true: it could have succeeded, it could have failed, or it could still be in process.
The original post I responded to said what response the second request gets if the first request succeeded and if it failed. But it didn't say what response the second request gets if the first request is still in process on the server--so it hasn't succeeded and it hasn't failed. I do not see an answer to that anywhere in this thread.
So yeah, you can do basically anything that isn't inconsistent. Success, fail, delay, don't respond until timeout, all are valid as long as you don't double-apply. Most concurrent systems are like this in some way, because all successes can become errors, and all responses might never arrive. It has nothing really to do with idempotency.
Sure you are. You said:
"Retries will only receive 409 if the original request was successful. If the original request failed, the server performs the operation as normal on the second request. It doesn't replay failures."
I understand all that just fine; you don't need to keep trying to "reframe" it. But what you said that I just quoted above assumes, implicitly, that if you get a second request with the same idempotency key, the original request has either failed or succeeded--because you don't even address the case where neither of those things are true. I'm asking you to address that case.
If your answer is "that will never happen", I disagree, and I explained why in response to your question about why the client would send a retry if it hasn't received a response to the original request. You could answer, I guess, that you still think that would never happen--and I would still disagree. But at least that would be an answer. So far all you've done is "reframe" something that I already understand and wasn't asking about.
Whether or not a prior request exists in the system in processed or unprocessed state should not matter in a properly implemented idempotent system, the whole point is that one and only one is processed, and all replicas indicate that they are such.
What you do inside of your boundary to implement that idempotent contract need not be part of the contract and the decision of what primitives to use (locking, content-based addressing etc) are mainly just a question of implementation constraints.
I'm not sure what you mean by "in flight". The case I'm asking about is where the original request was received by the server and is being processed--and then a second request comes in with the same idempotency key. The original request has not succeeded, and has not failed--it's still in process. What response does the second request get? I do not see an answer to that question anywhere in this thread.
And you haven't considered multiple servers in your scenario - what if two requests meant to be idempotent with each other arrived at different servers?
At the risk of repeating the above commenter: you solve the multiple-server case by serializing somewhere, because you ultimately need a lock on something. You can also perform the operation in both places and reconcile the state later, but that’s a lot more complex.
When you are using TCP, and you send the same data twice because of a delayed ack, you likewise don't care if the ACK is for the first time or the second time you sent the data. You just know the other side got the data, and that's all you care about.
By sending a third request and getting a response that reveals the state of the system.
Here's a typical example, assuming serializable isolation in a database that uses optimistic concurrency.
* Two simultaneous requests come in to create a payment.
* The requests provide an idempotency key that is expected to be unique (possibly scoped to a tenant).
* The first request starts a transaction and starts processing, everything looks good - no dups.
* The second request starts a transaction and starts processing, everything looks good - no dups.
* The first one commits and returns success.
* The second tries to commit, but a conflict is detected (the first txn committed first). Typically this causes the second transaction to retry.
* On retry, the second transaction detects the duplicate.
The only question here is what happens when the second transaction fails? The Stripe model is "look up the original response and hand that back to the client". An equally valid and much easier to implement solution is "return a response that tells the client that there was a conflict".
Both solutions offer "create payment" as an idempotent operation.
So when the second request comes in, even though it has the same idempotency key as the first request, the server doesn't check to see if there's already a request received with that idempotency key?
That would seem to defeat the whole purpose of idempotency keys.
> On retry, the second transaction detects the duplicate.
So at this point, the second request would return a 409 code (or something like that) to the client?
With optimistic concurrency models, collisions are only detected at commit time. Two transactions can simultaneously update the same data; each update will "succeed"; when they try to commit, only the first one will succeed. The second one will fail with a code that indicates a collision. Standard practice is to just retry the transaction.
In serializable isolation, every transaction sees the state of the database frozen in time at the start of the transaction. They don't see each other's writes (that would be "read committed"). So if you have two transactions simultaneously which do "check if value XYZ exists; if it doesn't exist, insert it" they will both run the insert. The collision will only be detected when the second transaction tries to commit.
There are many other ways to implement this, but this is a pretty common approach.
>> On retry, the second transaction detects the duplicate.
> So at this point, the second request would return a 409 code (or something like that) to the client?
Yes. Stripe's approach is not fundamentally different; they just lookup the original request and return that response body instead of returning an error. It's more work for the server side engineers (and has a bunch of complex but obscure failure modes) but all the underlying database behavior is the same.
Sure, I get that. What I don't get is why you would be using idempotency keys as part of the implementation if you're going to go ahead and start a second transaction when you get a duplicate request, and not even check the idempotency key, and let your database tell you you've got a duplicate when you try to commit the second transaction. This subthread is specifically about implementations that use idempotency keys, since that's what the article is about.
It's not the default (read committed is) and I never saw serializable being set in actual production systems. You can do it, but then you have to be able to retry all of your transactions, including read.
What if the task you do take 5 minutes? 30 minutes? 10 hours? Do you create long transaction, blocking all reads?
It's not the common mode of deployment, but it's definitely in prod use.
> You can do it, but then you have to be able to retry all of your transactions, including read.
Pure read transactions shouldn't need to be retried in postgres due to serialization errors. You need to have read-write dependencies for that.
That's not to say that effectively read only transactions aren't affected by serializable, you do need to record the necessary metadata for the serialization logic to work.
FWIW, if you know your transaction is read only and long running, you can start a transaction with START TRANSACTION READ ONLY DEFERRABLE, which makes the start transaction slower, but then does not need to do any work related to serializable while the transaction is running.
Every major prod system I've worked on in the last 15 years ran in serializable, including my current charge which processes tens of billions of dollars annually. YMMV but this is quite common in serious production systems. Google's Spanner only runs in serializable.
It doesn't matter though. I could write the sequence out with a SELECT FOR UPDATE and the second request will block instead of retry. The client experience is the same; the "second" request blocks. @pdonis wanted an example so I picked one.