The term originated with (or is strongly associated with) the Westinghouse railroad brake system. These are the pressurised air brakes on trains, in which air pressure holds the brake shoes open against spring pressure. Should integrity of the brakeline be lost, the brakes will fail in the activated position, slowing and stopping the train (or keeping a stopped train stopped).
https://en.m.wikipedia.org/wiki/Railway_air_brake
Fail-safe designs and practices can lead to some counterintuitive concepts. Aircraft landing on carrier decks, in which they are arrested by cables, apply full engine power and afterburner on landing. The idea is that should the arresting cable or hook fail, the aircraft can safely take off again.
https://en.m.wikipedia.org/wiki/Fail-safe
Upshot: "fail safe" doesn't mean "test all your failure conditions exhaustively". It may well mean to abort on any failure mode (see djb's software for examples). The most important criterion is that whatever the failure mode be, it be as safe as possible, and almost always, based on a very simple and robust design, mechanism, logic, or system.
From the description of this project, it strikes me that it may well be failing (unsafely?) to implement these concepts. Charles Perrow, scholar of accidents and risks, notes that it's often safety and monitoring systems themselves which play a key role in accidents and failures.
Fail-safe design comes from railroad signaling. It is a principle of classic railroad signaling that any broken wire, or any relay that fails to pull in, must result in an indication not less safe than the correct one. "Vital" relays in classic signaling systems fall open by gravity, and use silver-to-silver contacts so as to avoid welding together on overloads. (Lightning strikes on rails and on signal lines are considered a normal part of railroad operation.)
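For a software analogue of that signaling rule, here is a minimal sketch. The class, the enum, and the two inputs are hypothetical names made up for illustration, not anything from a real signaling system; a `null` input stands in for a broken wire or a relay that failed to pull in.

```java
final class FailSafeSignal {
    enum Aspect { STOP, CAUTION, CLEAR }

    /**
     * Hypothetical inputs: null models a missing reading (broken wire, dead relay).
     * Any missing input yields the most restrictive aspect, so a fault can never
     * produce an indication less safe than the correct one.
     */
    static Aspect aspectFor(Boolean blockOccupied, Boolean nextBlockOccupied) {
        if (blockOccupied == null || nextBlockOccupied == null) {
            return Aspect.STOP; // fault: degrade to the safe state
        }
        if (blockOccupied) {
            return Aspect.STOP;
        }
        if (nextBlockOccupied) {
            return Aspect.CAUTION;
        }
        return Aspect.CLEAR;
    }
}
```

The point of the sketch is only that the "unknown" branch comes first and maps to the safe state, rather than being handled as an afterthought.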
[1] https://en.wikipedia.org/wiki/Railway_air_brake#Straight_air...
"Under the Westinghouse system, therefore, brakes are applied by reducing train line pressure and released by increasing train line pressure. The Westinghouse system is thus fail safe—any failure in the train line, including a separation ("break-in-two") of the train, will cause a loss of train line pressure, causing the brakes to be applied and bringing the train to a stop, thus preventing a runaway train."
Without air pressure -- from line or canister -- the brakes fail in the activated mode.
I'm trying to find a source, but my understanding is that red/green for lit signals as "stop/go" came about after an earlier mode, in which a steady white light meant "go", proved problematic: the red disks fronting stop lamps could fall out (or perhaps be broken), leaving ambiguity as to what "white" meant.
Switching to red and green lamps meant that the failed-disk mode now clearly indicated a signalling problem, where the signal could not be trusted.
Particularly when they're correcting errors or omissions in other comments. Such as those in mine above to which Animats is replying.
An example of such a system could be a ball check valve, which can inherently only work.
https://en.wikipedia.org/wiki/Check_valve
Can you think of a word to describe such systems?
The first is "impossible".
The second is "pre-failed".
As the drunk has observed, you can't fall off the floor.
If you're looking for a term for a system which is highly immune to failure, "resilient" comes to mind.
Take Tesla's solid-state, no-moving-parts one-way fluid valve. It has no moving parts to break (though it could conceivably be fouled by dust, dirt, sediment, or debris).
http://makezine.com/2012/01/05/the-tesla-valve-one-way-flow-...
"Overengineered" is another possibility.
There's certainly something to be said for retry strategies in places that involve a lot of network chatter, but please don't forget to add some kind of backoff so you don't end up retry-overloading a system that's trying to recover.
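To make the backoff point concrete, here is a rough sketch in plain Java. It does not use any particular library; `BackoffRetry`, `delayMillis`, and `call` are hypothetical names. It doubles the delay cap on each failed attempt, bounds it by a maximum, and sleeps a random amount up to that cap so that many clients don't retry in lockstep. It assumes non-negative delays small enough that the doubling stays within a `long`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

final class BackoffRetry {
    /** Delay cap before retrying after failed attempt number `attempt` (0-based). */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // Clamp the shift so the factor itself cannot overflow.
        long factor = 1L << Math.min(attempt, 30);
        return Math.min(maxMillis, baseMillis * factor);
    }

    /** Runs the task, retrying on any exception with randomized exponential backoff. */
    static <T> T call(Callable<T> task, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts - 1) {
                    break; // out of attempts: give up and rethrow below
                }
                long cap = delayMillis(attempt, baseMillis, maxMillis);
                // Sleep a random time in [0, cap] so clients spread out their retries.
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
        throw last;
    }
}
```

Something like `BackoffRetry.call(() -> fetch(url), 5, 1000, 30000)` would then try up to five times with caps of 1s, 2s, 4s, and 8s between attempts (`fetch` and `url` being whatever your code already has).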
We released a microservices development kit (MDK) last week that implements similar semantics (e.g., circuit breakers, failover) in Python, JavaScript, Java, and Ruby. The implementation is actually written in a DSL which we transpile into language-native impls. We do this to ensure interop between different languages. We're working on updating our compiler to support Go and C#, adding richer semantics, and making the service discovery piece pluggable (currently there's a dependency on our own service discovery).
https://github.com/jhalterman/failsafe/wiki/Comparisons#fail...
For example,
>Executable logic can be passed through Failsafe as simple lambda expressions or method references. In Hystrix, your executable logic needs to be placed in a HystrixCommand implementation
It's not apparent to me what the advantage of either interface is. In both situations I have to define a "lambda" and hold state somewhere (either as an object field or passed into the lambda). Unless I'm missing something here, either seems acceptable.
Personally, I'd rather systems fail quickly, with retries only at the highest (application) and lowest (TCP) levels.
There's nothing more detailed that I know of. Is there a particular feature area/comparison you're curious about? I can add a bit more detail.
> It's not apparent to me what the advantage of either interface is. In both situations I have to define a "lambda"
What I meant by this bit is that the user experience is different. Failsafe can be used with method references or lambda expressions [1], which are a nice, concise way of wrapping executable logic with some failure handling strategy. You cannot do this with Hystrix since all logic must be wrapped in a HystrixCommand impl, which cannot be implemented as a lambda.
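To illustrate the difference in shape only, here is a sketch with hypothetical stand-ins. Neither `Command` nor `execute` below is the Hystrix or Failsafe API; they just mimic "must subclass an abstract class" versus "accepts any functional interface".

```java
import java.util.concurrent.Callable;

final class StyleComparison {
    /** Hypothetical command-style API: an abstract class, so a lambda cannot implement it. */
    abstract static class Command<T> {
        protected abstract T run() throws Exception;

        T execute() throws Exception {
            return run(); // a real library would add its failure handling here
        }
    }

    /** Hypothetical lambda-friendly API: takes a functional interface. */
    static <T> T execute(Callable<T> task) throws Exception {
        return task.call(); // a real library would add its failure handling here
    }

    /** The logic being wrapped; a placeholder for a real remote call. */
    static String fetch() {
        return "payload";
    }

    static String viaCommand() throws Exception {
        // Command style: the logic lives in a subclass (anonymous here).
        return new Command<String>() {
            @Override
            protected String run() {
                return fetch();
            }
        }.execute();
    }

    static String viaMethodReference() throws Exception {
        // Lambda style: the same logic passed as a method reference.
        return execute(StyleComparison::fetch);
    }
}
```

Both do the same thing; the second is just less ceremony at the call site.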
> either seems acceptable.
Like anything, it just depends on what you want. If you want retries and general-purpose failure handling, consider Failsafe. If you want request collapsing, thread pool management and monitoring, consider Hystrix.
[1]: https://github.com/jhalterman/failsafe#synchronous-retries
Semitrailer parking brakes really are spring-loaded and released by air pressure.
This diagram in particular (from your URL) shows that though there is a spring in the brake-shoe application mechanism, its action is to release the brake.
I hadn't known this (and had never found a good diagram of railroad brake design). This isn't what my understanding had been.
NB: this isn't my area of expertise, and my understanding had been the incorrect idea that spring-pressure held brake shoes in place.
Which makes me wonder why this design was chosen over a spring-driven shoe.
Thanks for sharing that. And brickbats to the hive-minders who've (at this point) downvoted your earlier comment in this thread.
The real answer is that the Westinghouse air brake system won the 1887 Burlington brake trials. Other entries included vacuum brakes, buffer brakes (bumping into the car ahead applied the brakes), a competing air brake system, and electropneumatic brakes (by Herman Hollerith, the punch-card guy). Nobody entered a spring-loaded system.
https://en.wikipedia.org/wiki/Lac-M%C3%A9gantic_rail_disaste...
If you hit an error condition in your code that you aren't explicitly handling, break that mofo.
The faster and more explicitly you break, the better, as this gives you the signal to fix the problem.
Wrapping and retrying attempt to heal the damage, meaning, effectively, your code is walking wounded -- it's encountered an untrapped error, has ignored it, and is attempting to continue.
The faster and more definitively an error breaks, the better the likelihood of fixing it, and the more obvious the error and fix are.
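A sketch of what that could look like when you do want some retrying: handle only the failure you explicitly expect, and let everything else break immediately. `FailFast` and `callRetryingIo` are hypothetical names, and treating `IOException` as the one "expected" failure is an assumption for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class FailFast {
    /**
     * Retries only IOException, the one failure this code explicitly handles.
     * Any other exception is not caught and propagates on its first occurrence.
     */
    static <T> T callRetryingIo(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e; // expected and possibly transient: try again
            }
            // Deliberately no catch-all here: an unexpected error is a bug to
            // surface, not damage to paper over.
        }
        throw last; // every attempt failed with IOException
    }
}
```

The sketch omits any delay between attempts to keep it short; in practice you'd want some backoff there.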
I haven't looked in detail at the library, and probably don't have the chops to identify good or bad features. But the mechanisms described and my understanding of the origins of the concept of "fail safe" seemed at odds, and I wanted to raise the point.
retryPolicy.withBackoff(1, 30, TimeUnit.SECONDS);
and if you want to specify the exponent [2]: retryPolicy.withBackoff(1, 30, TimeUnit.SECONDS, 1.5);
As for which failure handling strategy is safer or what it means to fail safely, in my experience it not only depends on the use case but the type of failure. Certain exceptions, even in a networked application, can and should be retried or recovered from while others cannot. Sometimes retrying is good, sometimes preventing subsequent executions (via circuit breakers), sometimes falling back to an alternative resource. It's all based on the scenario.
[1]: http://jodah.net/failsafe/javadoc/net/jodah/failsafe/RetryPo...
[2]: http://jodah.net/failsafe/javadoc/net/jodah/failsafe/RetryPo...
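For the "preventing subsequent executions" case mentioned above, here is a bare-bones circuit breaker sketch in plain Java. This is not Failsafe's implementation; `SimpleCircuitBreaker` and its methods are hypothetical, and the half-open handling is simplified (after the cool-down all calls are allowed again, and a single failure re-opens the circuit).

```java
import java.util.concurrent.Callable;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** True if a call may proceed: circuit closed, or the cool-down has elapsed. */
    synchronized boolean allowsCall(long nowMillis) {
        return !open || nowMillis - openedAt >= openMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;          // stop sending traffic to the struggling system
            openedAt = nowMillis; // (re)start the cool-down
        }
    }

    /** Runs the task unless the circuit is open, recording the outcome. */
    <T> T call(Callable<T> task) throws Exception {
        if (!allowsCall(System.currentTimeMillis())) {
            throw new IllegalStateException("circuit open");
        }
        try {
            T result = task.call();
            recordSuccess();
            return result;
        } catch (Exception e) {
            recordFailure(System.currentTimeMillis());
            throw e;
        }
    }
}
```

The timestamps are passed into `allowsCall` and `recordFailure` explicitly so the state machine can be exercised without sleeping.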
An advantage of standardisation is you get, well, standardisation. Herbert Hoover, as US Secretary of Commerce before his presidency, pushed this through the National Bureau of Standards (now the National Institute of Standards and Technology, NIST), which specified standards for screws and nuts and bolts. I'm not sure if Bendix transmissions were included, but come WWII, it was possible for the US War Department to order something like five million Jeep transmissions from several dozen suppliers, any of which could (at least in theory) be interchanged or have parts swapped between them.
The disadvantage is that you may find yourself very effectively stuck at a local optimum that's far from a global optimum, with murderous path dependencies.
I've been grousing over a set of TV propaganda videos created by the Mont Pelerin Society / Cato Institute through Johan Norberg and his "Free to Choose Media" production company (at least the propaganda slant is fairly obvious). The 2nd installment of his series on Adam Smith spends much of its time aboard a supersized cargo carrier, waxing rhapsodic about the wonders of the market in coming up with such a marvelously efficient system.
Except that it took the US Navy to standardise container sizes. After some 20 years of dickering over container sizes, materiel transport needs of the Vietnam War finally forced standardisation.
(Another US regulatory body, the Interstate Commerce Commission, meanwhile, had been happily impeding progress thanks to its regulatory capture by the railroad industry, and I won't even begin to mention the Texas Railroad Commission, which has little to do with railroads and was exceptionally significant well beyond Texas, at least for a time).