I killed a worker mid-payment to test "exactly-once" execution https://github.com/abokhalill/pulse
Distributed systems often claim “exactly-once” execution. In practice, this is usually implemented as at-least-once delivery + retries + idempotency keys. This works for deterministic code. It breaks for irreversible side effects (AI agents, LLM calls, physical infrastructure). I wanted to see what actually happens if a worker crashes after a payment is made but before it acknowledges completion. So I built a minimal execution kernel with one rule: User code is never replayed by the infrastructure. The kernel uses: Leases (Fencing Tokens / Epochs) A reconciler that recovers crashed tasks Strict state transitions (No silent retries) I ran this experiment: A worker claims a task to process a $99.99 payment The worker records the payment (irreversible side effect) I kill -9 the worker before it sends completion to the DB The lease expires, the reconciler detects the zombie task A new worker claims the task with a new fencing token The new worker sees the previous attempt in the ledger (via app logic) and aborts The task fails safely Result: Exactly one payment was recorded. The money did not duplicate. Most workflow engines (Temporal, Airflow, Celery) default to retrying the task logic on crash. This assumes your code is idempotent. AI agents are not. LLM generation is not. Payment APIs (without keys) are not. I open-sourced the kernel and the chaos demo here. The point isn’t adoption. The point is to make replay unsafe again. https://github.com/abokhalill/pulse |