Systems that defy detailed understanding

Systems that defy detailed understanding (blog.nelhage.com)

278 points by a7b3fa 6 years ago | 60 comments

I often wonder if things would be better if systems were less forgiving. I bet people would pay more attention if the browser stopped rendering on JavaScript errors or misformed HTML/CSS. This forgiveness seems to encourage a culture of sloppiness which tends to spread out. I have the displeasure of looking at quite a bit of PHP code. When I point out that they should fix the hundreds of warnings the usual answer is “why? It works.” My answer usually is “are you sure? “.

On the other hand maybe this forgiveness allowed us to build complex systems.

yuliyp 6 years ago | |

This often devolves into extremely fragile systems instead. For instance, let's say you failed to load an image on your web site. Would you rather the web site still work with the image broken or just completely fail? What if that image is a tracking pixel? What if you failed to load some experimental module?

Being able to still do something useful in the face of something not going according to plan is essential to being reliable enough to trust.
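A minimal Python sketch of "still doing something useful" when one part fails. The names (`load_image`, `render_page`, the placeholder text) are hypothetical and chosen only for illustration; the point is that a failure in an optional component is contained and reported instead of taking the whole page down.

```python
import logging

logger = logging.getLogger("page")


def load_image(url: str) -> str:
    """Hypothetical loader: pretend any URL containing 'broken' fails."""
    if "broken" in url:
        raise IOError(f"could not load {url}")
    return f"<img src='{url}'>"


def render_page(title: str, image_urls: list[str]) -> str:
    """Render every part that works; replace failed images with a placeholder."""
    parts = [f"<h1>{title}</h1>"]
    for url in image_urls:
        try:
            parts.append(load_image(url))
        except IOError as exc:
            # The failure is contained, but it is still reported.
            logger.warning("image failed: %s", exc)
            parts.append("<span>[image unavailable]</span>")
    return "\n".join(parts)
```

The page still renders with one image missing, and the log keeps a record of what went wrong.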

twic 6 years ago | | |

Systems need to be robust against uncontrollable failures, like a cosmic ray destroying an image as it travels over the internet, because we can never prevent those.

But systems should quickly and reliably surface bugs, which are controllable failures.

A layer of suffering on top of that simple story is that it's not always clear what is and what is not a controllable failure. Is a logic error in a dependency of some infrastructure tooling somewhere in your stack controllable or not? Somebody somewhere could have avoided making that mistake, but it's not clear that you could.
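As a rough Python sketch of the distinction above: retry failures that are outside our control, and let everything else propagate so the bug is noticed. `TransientError` and `call_with_retry` are hypothetical names, and the sketch assumes someone has already decided which failures count as uncontrollable, which is exactly the hard part.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures outside our control (e.g. network)."""


def call_with_retry(operation, attempts: int = 3, delay: float = 0.0):
    """Retry only TransientError; any other exception is treated as a bug
    and propagates immediately so that it is noticed."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)
```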

An additional layer of suffering is that we have a habit of allowing this complexity to creep or flood into our work and telling ourselves that it's inevitable. The author writes:

> Once your system is spread across multiple nodes, we face the possibility of one node failing but not another, or the network itself dropping, reordering, and delaying messages between nodes. The vast majority of complexity in distributed systems arises from this simple possibility.

But somehow, the conclusion isn't "so we shouldn't spread the system across multiple nodes". Yo Martin, can we get the First Law of Distributed Object Design a bit louder for the people at the back?

https://www.drdobbs.com/errant-architectures/184414966

And let us never forget to ask ourselves this question:

https://www.whoownsmyavailability.com/

andai 6 years ago | | |

That's an interesting distinction. I think each resource should be self contained. Malformed HTML? HTML error. Malformed or missing image? Browser displays an image error.

The key here is that the web wasn't designed for engineers but for amateurs to slap something together sloppily in the first place.

As an aside it's curious how ridiculously forgiving HTML and JS are while CSS craps itself on a single missing semicolon. As though it were okay for the thing to be semantically and functionally malformed and malfunctioning... as long as it looks good!

nikofeyn 6 years ago | | |

> This often devolves into extremely fragile systems instead.

as if the systems we have today aren't fragile? instead, they're fragile but their fragility is hidden and obfuscated.

being robust and reliable is different than just letting systems do whatever they think is best.

dasyatidprime 6 years ago | |

XHTML tried to do this after the webmasters (loosely defined) had fled the barn, and then it fell out of the world and the WHATWG ate everything.

AgentOrange1234 6 years ago | | |

I remember being very fond of xhtml. It seemed much more logical and sensible, every beginning having an end, all things in balance. I don’t really know what the argument against it is/was?

INTPnerd 6 years ago | |

It is true that the underlying technology used to write the code to begin with should be less forgiving. If you use a strictly typed, compiled language instead of PHP, you would have no choice but to fix a lot more of the errors because it would not compile otherwise.

Once it is running in production, though, things are quite different. You need the right combination of errors being well reported and gracefully handled, without aborting or unnecessarily breaking the rest of the functionality. At that point people are relying on it to get their jobs done, and they will usually find ways to work around the errors, and even the corrupt data this might result in, so they can keep meeting their deadlines while the programmers work on fixing the problem. This is much better than those same employees not being able to do their jobs, or getting paid to stand around and do nothing. I guess this attitude is largely driven by the practicalities of where I work. If the employees who rely on the code get behind or can't complete their work on time, our company is hit with thousands of dollars in fines under the contracts we have to sign to win the business in the first place, and then our customers can't bill their customers, so they are not happy.

Koshkin 6 years ago | |

Indeed, less rigidity and higher tolerances lead to reliability - similar to what we do in construction of buildings: a skyscraper would fall one day if it wasn't for its flexibility under effects of elements such as wind.

naasking 6 years ago | | |

That's not an apt analogy. A system with tight tolerances can still be flexible, we just know more precisely how it can flex and when it will break.

A better analogy would be if your construction workers didn't have standard or prescribed bolts in their design, so they just took what was lying around and hammered and welded bits together until it seemed sturdy enough. Suffice it to say, this is not a recipe that would work to build today's skyscrapers. There is considerable design and sanity checking that goes into this stuff, which the web completely lacked at every point.

XHTML was a promising start in the right direction, but they unfortunately bungled it.

9wzYQbTYsAIc 6 years ago | | |

Interesting related trivia: engineers build safeguards around that flexibility. In the same way that a poorly built bridge will shake itself apart in the wind, a building without adaptive damping or the right properties of flexibility could shake itself apart in the wind.

lkrubner 6 years ago | |

"I bet people would pay more attention if the browser stopped rendering on JavaScript errors or misformed HTML/CSS."

This was strongly suggested by those who fought for strict XHTML, but then Sam Ruby, who was leading the HTML5 effort, asked the question, "I find an image that I know my daughter will like. I send it to her. It is SVG. She wants to upload it to her Myspace page. However, the image won't render, because SVG is a form of XML, and Myspace is non-compliant. And yet, if I send her a JPEG or GIF image, she can upload that to Myspace."

The point was we typically embed content from one page into another page, and no one believed there would ever come a day when every page on the Web would be strict compliant. So HTML5 went in the other direction, dropping most requirements and allowing pretty much anything.

As I've written elsewhere, the fundamental problem we face is that a markup language, such as HTML, is completely unsuitable to the apps we now like to build and run over the Web. We rely on HTML to function as the GUI of TCP/IP, but it was not actually designed for that, as it was descended from SGML, and it carries with it a publishing history. What would make more sense would be use of a data format, such as JSON or EDN, which can then be given visual characteristics, without ever having to participate in one hierarchy or any one understanding of a DOM. Developers understandably complain that Java/Swing had 9 different layout options, the product of much experimentation, but having a variety of layout options does allow more flexibility of styles of building a GUI, with some approaches being simpler than what we get with the React/JS translation into HTML.

k__ 6 years ago | |

If the web worked like that, it probably wouldn't be so popular today.

mech422 6 years ago | |

Personally, as a coder AND as a user, I want the program to flat out fail. As a user, a system that aborts on error may be a PITA to use, but I have confidence in the output it provides.

As a programmer, I like that same confidence in output AND it requires me to address the failures in some way...

uk_programmer 6 years ago | |

Even a language like C# will let you get away with lots of horrendous things. Unless you turn on options like "Treat Warnings as Errors", most programmers will just ignore warnings, or wrap some statements in 'pragma' and disable them. I've seen people wrap an exception handler around the entire application, or put in a giant exception filter, instead of actually fixing the problem.

Poor/Lazy developers will find ways around more stringent checks.
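For comparison, a small Python analogue of "Treat Warnings as Errors" (not C#; `legacy_parse` and `strict_parse` are hypothetical names). It uses the standard `warnings` module to promote warnings to exceptions inside one block, so sloppy input can no longer be ignored there.

```python
import warnings


def legacy_parse(value: str) -> int:
    """Hypothetical function that 'works' but warns on sloppy input."""
    if value != value.strip():
        warnings.warn("input has surrounding whitespace", UserWarning)
    return int(value)


def strict_parse(value: str) -> int:
    """Same call, with warnings promoted to exceptions for this block only."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        return legacy_parse(value)
```

The same effect can be applied to a whole program by running it with `python -W error`. As the comment says, nothing stops someone from adding a filter that silences the warning again.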

m463 6 years ago | |

Well, there is the Robustness Principle [1]:

Be conservative in what you do, be liberal in what you accept from others (often reworded as "Be conservative in what you send, be liberal in what you accept").

The wisdom of that policy just accrues over time.

[1] https://en.wikipedia.org/wiki/Robustness_principle
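A toy Python illustration of the principle, assuming a made-up boolean flag format (`accept_flag`, `send_flag`, and the accepted spellings are all hypothetical): the reader tolerates several spellings, while the writer emits exactly one.

```python
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def accept_flag(raw: str) -> bool:
    """Liberal in what we accept: tolerate case, whitespace, and synonyms."""
    token = raw.strip().lower()
    if token in _TRUE:
        return True
    if token in _FALSE:
        return False
    raise ValueError(f"unrecognised flag: {raw!r}")


def send_flag(value: bool) -> str:
    """Conservative in what we send: exactly one canonical spelling."""
    return "true" if value else "false"
```

Note that every extra spelling the reader accepts is something a sender may come to depend on, so the leniency is not free.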

lonelappde 6 years ago | | |

Hyrum's Law challenges that supposed wisdom.

https://www.hyrumslaw.com/

Postel's Law creates debt that we pay interest on forever.

Since Hyrum's law is a naturalist observation and Postel's Law is a suggestion, Hyrum's law is a more definitive truth.

logicallee 6 years ago | |

Are you sure? Your comment contains a minor syntax error.

Should you have been unable to submit it, or should people not be able to view it, until you correct it?

  >My answer usually is “are you sure? “.                         
                                       ^

   Line 1:
   Syntax error: "“" not allowed here.

JavaScript is quite forgiving, but that's usually okay. If something doesn't work it's usually not the end of the world.

In this case everyone correctly read your second opening quotation mark as a closing quotation mark.

This allows us to focus on what you're saying (functionality.)

If we couldn't figure out why you included some typos, we would just ignore that part and focus on the rest of your comment.

When someone replies with the nitpicking style it doesn't help anyone. (In fact my first version of this comment was downvoted, before I wrote out the rest of my explanation.)

I think all the leniency in front end JS is pretty good for the same reason. It lets us communicate, and the sandboxed client environment (browser security is built assuming web sites could be malicious) means that the stakes are quite low.

asfarley 6 years ago | |

This is what strict languages like Rust are for, right?

pdr2020 6 years ago | |

On a side note, I intensely dislike statements like this that hedge their point.

qppo 6 years ago | |

To me this kind of thing is the difference between furniture you buy from a decent store or from IKEA. It's craftsmanship, not complexity.

smitty1e 6 years ago |

Great article.

Recalls Gall's Law[1]. "A complex system that works is invariably found to have evolved from a simple system that worked."

Also, TFA invites a question: if handed a big ball of mud, is it riskier to start from scratch and go for something more triumphant, or try to evolve the mud gradually?

I favor the former, but am quite often wrong.

[1] https://en.m.wikiquote.org/wiki/John_Gall

mannykannot 6 years ago |

Big balls of mud result from a process that resembles reinforcement learning, in that modifications are made with a goal in mind and with testing to weed out changes that are not satisfactory, but without any correct, detailed theory about how the changes will achieve the goal without breaking anything.

bitwize 6 years ago | |

Sounds like all of Agile, really. One can characterize Agile as a ball-of-mud maintenance process that scales desirably with the amount of mud.

smitty1e 6 years ago | |

So, lack of a larger test suite that can detect ripple effects across the overall system, and not just a component?

Or are test suites just a nice fantasy for a real distributed system?

mannykannot 6 years ago | | |

In the situation I am thinking of, the tests that select successful modifications are, almost by definition, integration tests, because with a big ball of mud, you don't know what the proper specifications for the components are, and they don't have clear interfaces.

By 'tests' I am including live failures, which are also a feature of mudballs.

A distributed system is always much more difficult to test than a functionally-equivalent localized version. That's not, of course, a reason to give up on testing, but one must be realistic about how much faith one can put in it to make up for an inadequate use of abstraction and separation of concerns.

carapace 6 years ago |

"Introduction to Cybernetics" W. Ross Ashby

http://pespmc1.vub.ac.be/ASHBBOOK.html

> ... still the only real textbook on cybernetics (and, one might add, system theory). It explains the basic principles with concrete examples, elementary mathematics and exercises for the reader. It does not require any mathematics beyond the basic high school level. Although simple, the book formulates principles at a high level of abstraction.

gentleman11 6 years ago | |

Not sure what cybernetics formally means, but apparently it has to do with complexity management

> W. Ross Ashby is one of the founding fathers of both cybernetics and systems theory. He developed such fundamental ideas as the homeostat, the law of requisite variety, the principle of self-organization, and the principle of regulatory models. Many of these insights were already proposed in the 1940's and 1950's, long before the presently popular "complex adaptive systems" approach arrived at very similar conclusions. Whereas the concepts surrounding the complexity movement are often complicated and confused, Ashby's ideas are surprisingly clear and simple, yet deep and universal.

Good link

AndrewKemendo 6 years ago | |

I find it really sad that cybernetics completely evaporated as a field with the closest remnant being cognitive science. I think there is a huge need for more interdisciplinary fields

carapace 6 years ago | | |

A lot of it was incorporated or duplicated in feedback control theory, but mostly in the context of industry, so it didn't really feed back (heh, sorry) into other, more academic, areas. And, on the other hand, it spun off into (IMO) fluffy "second-order" cybernetics and became a kind of toy philosophy.

I find it sad too. PID controllers are great but from my POV they're barely the first step.

However, another way to look at it is, you can study and apply "Intro to Cyb" and leapfrog into the future.

xyzzy2020 6 years ago |

I think this is useful even for systems (SW stacks) that are much smaller and "knowable": you start by observing, trying small things, observing more, trying different things, observing more, and slowly building a mental model of what is likely happening and where.

His defining characteristic is where you can permanently work around a bug (not know it, but know _of_ it) vs find it, know it, fix it.

Very interesting.

jborichevskiy 6 years ago |

> If you run an even-moderately-sophisticated web application and install client-side error reporting for Javascript errors, it’s a well-known phenomenon that you will receive a deluge of weird and incomprehensible errors from your application, many of which appear to you to be utterly nonsensical or impossible.

...

> These failures are, individually, mostly comprehensible! You can figure out which browser the report comes from, triage which extensions might be implicated, understand the interactions and identify the failure and a specific workaround. Much of the time.

> However, doing that work is, in most cases, just a colossal waste of effort; you’ll often see any individual error once or twice, and by the time you track it down and understand it, you’ll see three new ones from users in different weird predicaments. The ecosystem is just too heterogenous and fast-changing for deep understanding of individual issues to be worth it as a primary strategy.

Sadly far too accurate.
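One common way to cope with such a deluge is to group reports and only dig into the ones that recur. A minimal Python sketch follows; the report fields (`message`, `browser`), the function names, and the threshold are assumptions for illustration, not anything the article prescribes.

```python
from collections import Counter


def fingerprint(report: dict) -> tuple:
    """Hypothetical grouping key: error message plus browser family."""
    return (report.get("message", "?"), report.get("browser", "?"))


def worth_investigating(reports: list[dict], min_count: int = 3) -> list[tuple]:
    """Return fingerprints seen at least min_count times, most frequent first."""
    counts = Counter(fingerprint(r) for r in reports)
    return [key for key, n in counts.most_common() if n >= min_count]
```

Errors seen once or twice stay below the threshold and are left alone, which matches the quoted point that chasing each one individually is rarely worth it.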

naringas 6 years ago |

I firmly believe that in theory all computer systems can be understood.

But I agree when he says, it has become impractical to do so. But I just don't like it personally, I got into computing because it was supposed to be the most explainable thing of all (until I worked with the cloud and it wasn't).

I highly doubt that the original engineers who designed the first microchips and wrote the first compilers, etc... relied on 'empirical' tests to understand their systems.

Yet, he is absolutely correct, it can no longer be understood, and when I wonder why I think the economic incentives of the industry might be one of the reasons?

For example, the fact that chasing crashes down the rabbit hole is "always a slow and inconsistent process" will make any managerial decision maker feel rather uneasy. This makes sense.

Imagine if the first microprocessors were made by incrementally and empirically throwing together different logic gates until it just sort of worked??

woodandsteel 6 years ago |

From a philosophical perspective, I would say this is an example of the inherent finitudes of human understanding. And I would add that such finitudes are deeply intertwined with many other basic finitudes of human existence.

lucas_membrane 6 years ago |

I suspect that systems that defy understanding demonstrate something that ought to be a corollary of the halting problem, i.e. just as you can't figure out for sure how long an arbitrary system will take to halt, or even figure out for sure whether or not it will, neither can you figure out how long it will take to figure out what's going on when an arbitrary system reaches an erroneous state, or even figure out for sure whether or not you can figure it out.

nil-sec 6 years ago | |

I’m not sure about this. Define your “erroneous” state as “halt”. Now the question becomes, for a system that halts, find out how it reached this state. The mathematical answer to this is simply the description of the Turing machine that produced this state. Whether you can understand this description or not isn’t relevant.

natmaka 6 years ago |

Postel's Robustness principle seems pertinent, along with "The Harmful Consequences of the Robustness Principle". https://tools.ietf.org/id/draft-thomson-postel-was-wrong-03....

INTPnerd 6 years ago |

Even if you can reason about the code enough to come to a conclusion that seems like it must be true, that doesn't prove your conclusion is correct. When you figure something out about the code, whether through reason and research, or tinkering and logging/monitoring, you should embed that knowledge into the code, and use releases to production as a way to test whether you were right.

For example, in PHP I often find myself wondering if perhaps a class I am looking at might have subclasses that inherit from it. Since this is PHP and we have a certain amount of technical debt in the code, I cannot 100% rely on a tool to give me the answer. Instead I have to manually search through the code for subclasses and the like. If after such a search I am reasonably sure nothing is extending that class, I will change it to a "final" class in the code itself. Then I will rerun our tests and lints. If I am wrong, eventually an error or exception will be thrown, and this will be noticed. But if that doesn't happen, the next programmer who comes along and wonders if anything extends that class (probably me) will immediately find the answer in the code: the class is final. This drastically narrows down what can possibly happen, which makes it much easier to examine the code and refactor or make necessary changes.
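The same move can be sketched in Python rather than PHP. `Final` below is a hypothetical mixin, not a standard library class: any class that inherits from it directly refuses to be subclassed, so a wrong guess about "nothing extends this" fails loudly at import time.

```python
class Final:
    """Hypothetical mixin: direct subclasses become 'final' at runtime."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for base in cls.__bases__:
            # A base that itself derives from Final has been declared final.
            if base is not Final and issubclass(base, Final):
                raise TypeError(
                    f"{base.__name__} is final; {cls.__name__} cannot extend it"
                )


class Invoice(Final):
    """Example class we believe nothing extends."""
```

Python's `typing.final` decorator expresses the same intent for static type checkers, but it is not enforced when the program runs.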

Another example is often you come across some legacy code that seems like it no longer can run (dead code). But you are not sure, so you leave the code in there for now. In harmony with this article, you might log or in some way monitor if that path in the code ever gets executed. If after trying out different scenarios to get it to run down that path, and after leaving the monitoring in place on production for a healthy amount of time, you come to the conclusion the code really is dead code, don't just add this to your mental model or some documentation, embed it in the code as an absolute fact by deleting the code. If this manifests as a bug, it will eventually be noticed and you can fix it then.
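A minimal Python sketch of the monitoring step, before the deletion. The decorator name `suspected_dead`, the `hits` counter, and `legacy_half_price` are all hypothetical; the idea is only to record every execution of a path believed to be unreachable.

```python
import functools
import logging

logger = logging.getLogger("dead_code")
hits: dict[str, int] = {}


def suspected_dead(func):
    """Record and log every call to a function believed to be unreachable."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        hits[func.__name__] = hits.get(func.__name__, 0) + 1
        logger.warning("suspected-dead code executed: %s", func.__name__)
        return func(*args, **kwargs)
    return wrapper


@suspected_dead
def legacy_half_price(price: float) -> float:
    """Hypothetical legacy path we think nothing calls any more."""
    return price / 2
```

If `hits` stays empty after a healthy amount of time in production, that is the evidence for deleting the function.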

By taking this approach you are slowly narrowing down what is possible and simplifying the code in a way that makes it an absolute fact, not just a theory or a model or a document. As you slowly remove this technical debt, you will naturally adopt rules like: all new classes must start out final, and only be changed to non-final when you actually need to extend them. Eventually you will be in a position to adopt new tools, frameworks, and languages that narrow down the possibilities even more, and further embed the mental model of what is possible directly into the code.

jerzyt 6 years ago |

Great read. A lot of hard-earned wisdom!

drvortex 6 years ago |

What a long-winded article on what has been known to scientists for decades as "emergence". Emergent properties are system-level properties that are not obvious/predictable from properties of individual components. Looking at and observing one ant is unlikely to tell you that several of these creatures can build an anthill.

svat 6 years ago | |

Your comment was very puzzling to me, as I couldn't figure out what kind of misunderstanding about this article would prompt a comment such as this. But finally a possibility occurred to me: perhaps you think the point of this article was simply to say that there exist "systems that defy detailed understanding". It is possible that one could think that, if one went in with preconceived expectations based only on the title of the post. (But this is a very dangerous habit in general, as outside of personal blogs like this one, headlines in publications are almost never chosen by the author.)

But we all know such systems already: for instance, people! No, this post is a supplement/subsidiary to the previous one ("Computers can be understood" — BTW here's another recent blog post making the same point: https://jvns.ca/blog/debugging-attitude-matters/), carving out exceptions to the general rule, and illustrating concretely why these are exceptions (and what works instead). It is useful to the practitioner as a rule-of-thumb for having a narrow set of criteria for when to avoid aiming to understand fully (and alternative strategies for such cases). Otherwise, it's very easy to throw up one's hands and say "computers are magic; I can't possibly understand this".

(The point of the article here is obvious from even just the first or last paragraphs of the article IMO.)

woodandsteel 6 years ago | |

Yes, but to a lot of people that sounds like a lot of woo-woo. What this article does is explain it in a clear and persuasive way to the people in a particular field.

The fact that you didn't pick this up leads me to think you are more interested in being smart than helpful, but perhaps I am wrong about that.