Failsafe – failure handling with retries, circuit breakers and fallbacks (github.com)

116 points by jodah 9 years ago | 41 comments

A note on the name: "fail-safe" in engineering doesn't mean that a system cannot fail, but rather, that when it does, it does so in the safest manner possible.

The term originated with (or is strongly associated with) the Westinghouse railroad brake system. These are the pressurised air brakes on trains, in which air pressure holds the brake shoes open against spring pressure. Should integrity of the brakeline be lost, the brakes will fail in the activated position, slowing and stopping the train (or keeping a stopped train stopped).

https://en.m.wikipedia.org/wiki/Railway_air_brake

Fail-safe designs and practices can lead to some counterintuitive concepts. Aircraft landing on carrier decks, in which they are arrested by cables, apply full engine power and afterburner on landing. The idea is that should the arresting cable or hook fail, the aircraft can safely take off again.

https://en.m.wikipedia.org/wiki/Fail-safe

Upshot: "fail safe" doesn't mean "test all your failure conditions exhaustively". It may well mean to abort on any failure mode (see djb's software for examples). The most important criterion is that whatever the failure mode be, it be as safe as possible, and almost always, based on a very simple and robust design, mechanism, logic, or system.

From the description of this project, it strikes me that it may well be failing (unsafely?) to implement these concepts. Charles Perrow, scholar of accidents and risks, notes that it's often safety and monitoring systems themselves which play a key role in accidents and failures.

Animats 9 years ago | |

"These are the pressurized air brakes on trains, in which air pressure holds the brake shoes open against spring pressure." Air brakes don't really work that way.[1] There's an air tank on each car to provide the pressure to apply the brakes if the brake line loses pressure.

Fail-safe design comes from railroad signaling. It is a principle of classic railroad signaling that any broken wire, or any relay that fails to pull in, must result in an indication no less safe than the correct one. "Vital" relays in classic signaling systems fall open by gravity, and use silver-to-silver contacts so as to avoid welding together on overloads. (Lightning strikes on rails and on signal lines are considered a normal part of railroad operation.)

[1] https://en.wikipedia.org/wiki/Railway_air_brake#Straight_air...

dredmorbius 9 years ago | | |

From your linked source:

"Under the Westinghouse system, therefore, brakes are applied by reducing train line pressure and released by increasing train line pressure. The Westinghouse system is thus fail safe—any failure in the train line, including a separation ("break-in-two") of the train, will cause a loss of train line pressure, causing the brakes to be applied and bringing the train to a stop, thus preventing a runaway train."

Without air pressure, whether from the line or a canister, the brakes fail in the activated mode.

I'm trying to find a source, but my understanding is that red/green for lit signals as "stop/go" came about after an earlier mode, in which a steady white light meant "go", proved problematic: the red disks fronting stop lamps could fall out (or perhaps be broken), leaving ambiguity as to what "white" meant.

Switching to red and green lamps meant that the failed-disk mode now clearly indicated a signalling problem, where the signal could not be trusted.

dredmorbius 9 years ago | | |

NB: it's really annoying to see HNers downvoting factually accurate and well-intentioned comments.

Particularly when they're correcting errors or omissions in other comments. Such as those in mine above to which Animats is replying.

superzamp 9 years ago | |

Interesting comment. I've been looking for a term for "system that cannot fail", but have not been able to find one.

An example of such a system could be a ball check valve, which can inherently only work.

https://en.wikipedia.org/wiki/Check_valve

Can you think of a word to describe such systems?

dredmorbius 9 years ago | | |

There are two terms that come to mind.

The first is "impossible".

The second is "pre-failed".

As the drunk has observed, you can't fall off the floor.

If you're looking for a term for a system which is highly immune to failure, "resilient" comes to mind.

Take Tesla's solid-state, no-moving-parts one-way fluid valve. It has no moving parts to break (though it could conceivably be fouled by dust, dirt, sediment, or debris).

http://makezine.com/2012/01/05/the-tesla-valve-one-way-flow-...

"Overengineered" is another possibility.

im4w1l 9 years ago | | |

A titanic system.

pm90 9 years ago | |

This is a great comment. Why do you think the project fails to implement the concepts that you mention?

daenney 9 years ago | | |

I think that what they're trying to get at is that having libraries that (for example) wrap failures in retry modes isn't necessarily failing safely. It can very well obscure problems in your implementation or other parts of the systems you're talking to. Having it fail safely can just as well be "abort execution" and visibly log it so as to raise the problems with those that might be able to solve the root cause.

There's certainly something to be said for retry strategies in places that involve a lot of network chatter, but please don't forget to add some kind of backoff so you don't end up retry-overloading a system that's trying to recover.
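To make the two points above concrete (back off between attempts, and fail visibly once you give up), here is a minimal sketch in Python. It is not Failsafe's API; `retry_with_backoff` and its parameters are hypothetical names chosen for illustration, and catching every `Exception` is a simplifying assumption.

```python
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(); after each failure wait base_delay * 2**n, capped at max_delay.

    Hypothetical helper for illustration only. The last failure is re-raised
    so the caller (and the logs) still see the underlying problem.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of hiding it
            # Exponential backoff: 1x, 2x, 4x ... the base delay, up to the cap.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The `sleep` argument is injectable only so the waiting can be replaced in tests; in real code you would likely also restrict the `except` clause to the errors that are actually worth retrying.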

nitrogen 9 years ago |

Very cool. Consistent and clear retry, backoff, and failure behaviors are an important part of designing robust systems, so it's disappointing how uncommon they are. If I were starting a new Java project today I would almost certainly want to use this library instead of the various threads and timers I had to hack together years ago.

heisenbit 9 years ago | |

Indeed, this is conceptually hard stuff. The reason, I believe, is that the problems one is solving are system-level problems and not local ones. Another way to look at this: it is the other guy's problem. A lot of naive retry strategies sort of work until one has a larger number of clients to deal with. I still remember the time I spent trying to get through to a base-station designer who refused to acknowledge the need for exponential back-off and other mitigation steps. We ran into interesting times shortly afterwards in the field on the management system side. Personally, I would also put in a bit of randomness to spread out requests when all clients were initially impacted at the same time and were thus synchronized.
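The "bit of randomness" idea can be sketched as follows: instead of every client sleeping for exactly the same exponential delay, each one picks a random delay below that ceiling, so clients that failed together do not retry together. This is an illustrative Python sketch, not code from Failsafe; `jittered_delay` and its defaults are assumptions.

```python
import random


def jittered_delay(attempt, base_delay=0.5, max_delay=30.0, rng=random):
    """Return a random delay in [0, min(max_delay, base_delay * 2**attempt)].

    Hypothetical helper: the ceiling grows exponentially with the attempt
    number, and the random draw below it de-synchronizes clients.
    """
    ceiling = min(max_delay, base_delay * 2 ** attempt)
    return rng.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests; the default uses the shared module-level generator.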

jodah 9 years ago | | |

Good example of where random retry delays would be valuable. I filed this as a feature to add for the next release:

https://github.com/jhalterman/failsafe/issues/39

SwellJoe 9 years ago |

This title would be 100% better with "for Java" on the end.

_Codemonkeyism 9 years ago | |

... for JVM languages.

ckugblenu 9 years ago |

Quite interesting. It shows potential to be used in numerous use cases. Anyone know of similar projects in other languages like Python and Javascript?

rdli 9 years ago | |

(Full disclosure: co-founder of Datawire)

We released a microservices development kit (MDK) last week that implements similar semantics (e.g., circuit breakers, failover) in Python, JavaScript, Java, and Ruby. The implementation is actually written in a DSL which we transpile into language-native implementations. We do this to ensure interop between different languages. We're working on updating our compiler to support Go and C#, adding richer semantics, and making the service discovery piece pluggable (currently there's a dependency on our own service discovery).

https://github.com/datawire/mdk

sync 9 years ago | |

We use re for Javascript, it works well: https://www.npmjs.com/package/re

rekwah 9 years ago | |

Although it doesn't have feature parity with this project, there's Pybreaker[0] for the circuit breaker pattern in Python.

[0] - https://github.com/danielfm/pybreaker
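For readers unfamiliar with the pattern itself, here is a minimal sketch of a circuit breaker in plain Python. It does not use or reproduce Pybreaker's API; the class, its parameters, and `CircuitOpenError` are hypothetical names, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: refuse immediately rather than hammer a failing dependency.
                raise CircuitOpenError("circuit open; call refused")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like `breaker.call(lambda: fetch_something())`, where `fetch_something` stands in for whatever remote call you are protecting.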

Rapzid 9 years ago | |

.NET has Polly: https://github.com/App-vNext/Polly

Rauchg 9 years ago | |

We use `async-retry` which implements `node-retry` in a way that's friendly to usage with `Promise` and `async/await`.

https://github.com/zeit/async-retry

garthk 9 years ago | |

See also: Twitter's Finagle [1] for the JVM, and Buoyant [2] providing Finagle-as-a-microservice on localhost for language independence.

1: https://twitter.github.io/finagle/ 2: https://buoyant.io

cpitman 9 years ago |

How is this distinct from Hystrix (https://github.com/Netflix/Hystrix)? Why should I use one over the other?

jodah 9 years ago | |

Good question. Someone asked that recently on Github - here's a quick comparison:

https://github.com/jhalterman/failsafe/wiki/Comparisons#fail...

vikiomega9 9 years ago | | |

Is there a more detailed comparison?

For example,

>Executable logic can be passed through Failsafe as simple lambda expressions or method references. In Hystrix, your executable logic needs to be placed in a HystrixCommand implementation

It's not apparent to me what the advantage of either interface is. In both situations I have to define a "lambda" and hold state somewhere (either as an object field or passed into the lambda). Unless I'm missing something here, either seems acceptable.

ap22213 9 years ago |

It seems like a well-thought-out, fluent interface to what lots of Java developers (especially Java 8 ones) inevitably have to write themselves.

mandeepj 9 years ago |

You can find some of these patterns for the .NET/Azure/C# stack here: https://msdn.microsoft.com/en-us/library/dn568099.aspx

fdsaaf 9 years ago |

Beware of runaway retries: https://blogs.msdn.microsoft.com/oldnewthing/20051107-20/?p=...

Personally, I'd rather systems fail quickly, with retries only at the highest (application) and lowest (TCP) levels.
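One way to see why retries at every layer can run away: if each layer independently retries a failing call, the attempts multiply. The Python sketch below simulates this under the assumption that the backend is fully down and every layer retries every failure; `count_backend_calls` is a hypothetical name, and each layer is assumed to make at least one attempt.

```python
def count_backend_calls(attempts_per_layer):
    """Count how often a dead backend is hit when each layer retries on its own.

    attempts_per_layer lists the attempts made by each layer, outermost first.
    Illustrative simulation only: the backend always fails.
    """
    calls = 0

    def invoke(layer):
        nonlocal calls
        if layer == len(attempts_per_layer):
            calls += 1  # the request finally reaches the (dead) backend
            raise ConnectionError("backend down")
        last_error = None
        for _ in range(attempts_per_layer[layer]):
            try:
                return invoke(layer + 1)
            except ConnectionError as error:
                last_error = error  # this layer retries on its own
        raise last_error

    try:
        invoke(0)
    except ConnectionError:
        pass
    return calls
```

With three layers of three attempts each, `count_backend_calls([3, 3, 3])` gives 27 backend hits for a single user request, while retrying only at the outermost layer, `count_backend_calls([3, 1, 1])`, gives 3.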