I killed a worker mid-payment to test "exactly-once" execution

1 points by yousef06 31 days ago | 0 comments

https://github.com/abokhalill/pulse

Distributed systems often claim “exactly-once” execution. In practice, this is usually implemented as at-least-once delivery + retries + idempotency keys.
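For readers who have not seen the pattern, here is a minimal Python sketch of at-least-once delivery with an idempotency key. `PaymentProvider`, `charge`, and `deliver_at_least_once` are hypothetical names invented for illustration. They are not from the pulse repo or from any real payment API.

```python
import uuid


class PaymentProvider:
    """Hypothetical in-memory stand-in for a payment API that honors idempotency keys."""

    def __init__(self):
        self.charges = {}  # idempotency key -> charge record

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the original charge instead of creating a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "id": str(uuid.uuid4()),
                "amount_cents": amount_cents,
            }
        return self.charges[idempotency_key]


def deliver_at_least_once(handler, attempts):
    """Simulate redelivery: the same message reaches the handler several times."""
    return [handler() for _ in range(attempts)]


provider = PaymentProvider()
key = "task-42"  # assumed to be derived from the task, so every retry reuses it
results = deliver_at_least_once(lambda: provider.charge(key, 9999), attempts=3)
```

The handler runs three times, but the provider stores one charge. The "exactly-once" effect comes entirely from the downstream service deduplicating on the key, not from the delivery layer.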

This works for deterministic, idempotent code. It breaks when the work has irreversible side effects (AI agent actions, LLM calls, physical infrastructure).

I wanted to see what actually happens if a worker crashes after a payment is made but before it acknowledges completion. So I built a minimal execution kernel with one rule: User code is never replayed by the infrastructure.

The kernel uses:

Leases (Fencing Tokens / Epochs)

A reconciler that recovers crashed tasks

Strict state transitions (No silent retries)
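A rough Python sketch of how these three pieces can fit together, assuming an in-memory store. The state names, the `Store` class, and its methods are my own placeholders for illustration, not the actual pulse API.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"


class StaleLease(Exception):
    """Raised when a worker presents an outdated fencing token."""


@dataclass
class Task:
    state: State = State.PENDING
    epoch: int = 0              # fencing token, bumped on every claim
    lease_expires: float = 0.0


class Store:
    """Hypothetical in-memory task store; a real one would be a database."""

    def __init__(self, lease_seconds=30.0):
        self.tasks = {}
        self.lease_seconds = lease_seconds

    def claim(self, task_id, now):
        task = self.tasks[task_id]
        if task.state is not State.PENDING:
            return None  # strict transition: only PENDING -> RUNNING
        task.state = State.RUNNING
        task.epoch += 1
        task.lease_expires = now + self.lease_seconds
        return task.epoch

    def complete(self, task_id, epoch):
        task = self.tasks[task_id]
        # Reject writes from any worker whose token is not the current one.
        if task.state is not State.RUNNING or task.epoch != epoch:
            raise StaleLease(task_id)
        task.state = State.COMPLETED

    def reconcile(self, now):
        """Release RUNNING tasks whose lease expired so they can be reclaimed."""
        recovered = []
        for task_id, task in self.tasks.items():
            if task.state is State.RUNNING and now >= task.lease_expires:
                task.state = State.PENDING
                recovered.append(task_id)
        return recovered


store = Store(lease_seconds=30.0)
store.tasks["pay-1"] = Task()
old = store.claim("pay-1", now=0.0)      # first worker gets epoch 1
recovered = store.reconcile(now=31.0)    # lease expired, task released
new = store.claim("pay-1", now=31.0)     # second worker gets epoch 2
try:
    store.complete("pay-1", old)         # late write from the first worker
    stale_rejected = False
except StaleLease:
    stale_rejected = True
```

The late completion from the first worker is rejected because its epoch no longer matches. That is the job of the fencing token: a worker that lost its lease cannot overwrite the state of whoever holds it now.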

I ran this experiment:

1. A worker claims a task to process a $99.99 payment

2. The worker records the payment (irreversible side effect)

3. I kill -9 the worker before it sends completion to the DB

4. The lease expires, and the reconciler detects the zombie task

5. A new worker claims the task with a new fencing token

6. The new worker sees the previous attempt in the ledger (via app logic) and aborts

7. The task fails safely
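The sequence above can be approximated in a few lines of Python. This is a single-process simulation under my own assumptions, not the chaos demo from the repo: the "crash" is simply the first worker never writing completion, and `ledger`, `claim`, `pay`, and `reconcile` are hypothetical names.

```python
class DuplicateAttempt(Exception):
    """Raised by app logic when the ledger already holds an attempt for the task."""


ledger = []  # append-only record of attempts: (task_id, epoch, amount_cents)
task = {"id": "pay-1", "state": "PENDING", "epoch": 0}


def claim(task):
    # Each claim bumps the epoch, which serves as the fencing token.
    task["epoch"] += 1
    task["state"] = "RUNNING"
    return task["epoch"]


def pay(task_id, epoch, amount_cents):
    # App-level check: refuse if any earlier attempt already touched this task.
    if any(entry[0] == task_id for entry in ledger):
        raise DuplicateAttempt(task_id)
    ledger.append((task_id, epoch, amount_cents))


def reconcile(task):
    # Stand-in for lease expiry: release a task stuck in RUNNING.
    if task["state"] == "RUNNING":
        task["state"] = "PENDING"


# Worker 1: claims, records the payment, then "dies" before completing.
epoch1 = claim(task)
pay(task["id"], epoch1, 9999)  # $99.99 in cents
# kill -9 happens here: no completion is ever written.

reconcile(task)

# Worker 2: claims with a new epoch, finds the earlier attempt, and aborts.
epoch2 = claim(task)
try:
    pay(task["id"], epoch2, 9999)
    task["state"] = "COMPLETED"
except DuplicateAttempt:
    task["state"] = "FAILED"
```

The task ends in FAILED with one ledger entry. The infrastructure hands the task to a new worker but leaves the decision about the side effect to application logic, which here chooses to stop rather than risk paying twice.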

Result: Exactly one payment was recorded. The money did not duplicate.

Workflow engines and task queues such as Temporal, Airflow, and Celery either default to re-running the task logic after a crash or are commonly configured to. This assumes your code is idempotent.

AI agents are not.

LLM generation is not.

Payment APIs (without idempotency keys) are not.

I open-sourced the kernel and the chaos demo here. The point isn’t adoption. The point is to make replay unsafe again.

https://github.com/abokhalill/pulse
