Type in the exact number of machines to proceed (rachelbythebay.com)

554 points by vii 5 years ago | 332 comments

I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.

I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.

[1] https://en.wikipedia.org/wiki/Pointing_and_calling

brundolf 5 years ago | |

I've only been in the job field for six years, and yet:

My first boss accidentally deleted our QA database, meaning to delete a local copy

A later boss accidentally deleted our production database, thinking it was the clone that he had just made (which luckily we still had)

Both of them were very experienced developers in their 40s. Nobody is beyond this kind of mistake.

jlmorton 5 years ago | | |

War story time. Long ago, I worked for an interesting company that insisted on running its entire business on Linux desktops, all the way back between 1999-2002. Imagine running StarOffice/OpenOffice, Thunderbird, Netscape Navigator, etc, for your entire business back in 2000, including your executive team, marketing teams, everyone, most of whom had never even heard of Linux before.

Anyway, this being Linux, everyone's home directory was mounted on NFS. All our builds were standardized with a tool called SystemImager, which we could use to push out updates to everyone's desktop whenever we wanted. If there was a new version of KDE, we could pretty easily push that change out.

Sometimes it was convenient for me to work on updates to these images by chrooting into a directory containing the "image," which was really just an rsync tree. And sometimes, when updating these images, it was convenient to mount our NFS home directories in this chroot environment, so I could access things like an archive I had just downloaded on my own desktop.

And eventually we had lots of different images, and the old ones were using up a lot of disk space, so I decided to clean up some space by removing the old images. These were fairly large images, with lots of small files, and this was before SSDs were a thing, so it made sense that deleting them was taking a while, and I stepped out to grab something to eat.

As I was eating lunch, I started getting the tech support escalations. But this wasn't that unusual, our users routinely had problems with the environment we had provided. They hated it, because it was in many ways terrible, and they made sure we knew it. So I wasn't terribly alarmed. I didn't think any major changes had been made, so I didn't hurry back.

By the time I leisurely returned from lunch, half the NFS home directories for our users were gone, along with all their documents, emails, bookmarks, or whatever else. Suddenly it hit me what had happened: at some point, perhaps months earlier, I had left our NFS home directories mounted within one of these image chroots. And now I had sudo rm -rf'd it.

We had backups, but they were on tape, and it took several days to restore, with about a day of data loss.

brlewis 5 years ago | | |

>very experienced developers in their 40s

I'd say they were experienced developers. Only after accidentally deleting databases were they very experienced developers.

mehrdadn 5 years ago | | |

Reminds me of when I accidentally deleted a virtual hard disk I had a few years ago, because I'd copied it earlier and I thought I still had the other copy left. Only afterward did I remember I'd done the exact same thing to the other copy earlier... thankfully the information on it wasn't critical, but it was kind of terrifying to realize it very well could have been.

sgustard 5 years ago | | |

I have been that boss. Is that you, Wendel? In any case: the deletion even had a "type your app name to confirm" prompt, but I knew I wanted to act on production; the issue was deleting the wrong one of multiple production databases. The takeaway was to grab a second pair of eyes to review any dangerous operations.

aidenn0 5 years ago | | |

I deleted our production CRM database meaning to delete the test database. While my boss was running queries on the database for setting my quarterly bonus.

Good news is that I was deleting the test database to ensure that the recovery from backups was properly automated, so it wasn't down too long.

davedx 5 years ago | | |

Yup. Senior dev here, my own devops config screw up wiped out all production sales order data earlier this year. Had to restore from multiple backups, took a while. Stressful experience.

Consider network partitioning so dev/test/accept just has 0 contact with prod.

contravariant 5 years ago | | |

Ironically there seems to be no time more prone to these kinds of mistakes than when you're trying to prevent or fix them.

myself248 5 years ago | |

Ever since hearing about point-and-call, I've started using it in the kitchen when turning on the stove. I used to destroy one or two pans a year by turning on the wrong burner, but it's now been about a year and a half and I haven't screwed it up yet.

The knobs are labeled with a terrible little glyph meant to indicate which is which, and I've supplemented this with plain-english Brady labels "front left", "front right", etc. Now I speak the words above the knob, and point to the burner. It felt goofy at first, but now it feels normal, and like I'm tempting fate if I skip it.

giantDinosaur 5 years ago | | |

I'm curious how exactly you managed to destroy pans. I've never destroyed a pan in my life, and take no particular precautions - is this a common thing? Is this more common with non-stick stuff or something?

dirkt 5 years ago | | |

Not sure how it is in other countries, but don't the knobs when going left-to-right always correspond clockwise to the burners, starting at the lower left? And the oven knob is to the right?

I've never seen a different arrangement.

MaxBarraclough 5 years ago | |

Worth mentioning that, assuming the single study on the matter can be believed, the pointing and calling method is extremely effective in reducing the incidence of silly mistakes (that is, mistakes made in simple routine tasks, by competent individuals).

Unfortunately, it strikes many as looking rather silly, so it hasn't been widely adopted.

js2 5 years ago | | |

I learned a technique from a gray beard[0] when I worked as a student sys admin for the CS dept over two decades ago. Whenever typing a destructive command, he'd take his hands off the keyboard and drop them to his side, re-read the command, then put his hands back to press enter.

I do this whenever I'm on a production server (which is rare anyway). I use different colored prompts for local and remote shells.

[0] Technically he had no beard and if he had, it wouldn't have been gray.

morelisp 5 years ago | | |

I've done this for several years (also after seeing a video about Japanese railway operations). It doesn't seem to catch on.

It's also not perfect; it does not catch mistakes concerning "non-local" state, e.g. configuration files in /etc merging with one in . merging with some command line options. (Personally I try to avoid writing tools with defaults of this sort, but especially Java developers seem to have different opinions.)

Unfortunately if you do P&C and still make the mistake due to the aforementioned tooling, you look even stupider.

blantonl 5 years ago | | |

Watch and listen to pilots as they complete checklists. They point and call out each item, switch setting, etc.

acdha 5 years ago | |

Back when I shelled into servers more, I really liked having my deployment put the environment in the prompt and set a red background on production for similar reasons. It only takes a small change to jar you out of habit.
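
A minimal sketch of that idea in Python, assuming a hypothetical `build_prompt` helper and made-up environment names (neither is from the comment above). It wraps the environment label in ANSI escape codes so that only production gets the red background:

```python
# Sketch: put the environment in the shell prompt and make production red.
# build_prompt and the environment names are hypothetical, for illustration.

RED_BACKGROUND = "\033[41;97m"  # ANSI SGR: red background, bright white text
RESET = "\033[0m"               # ANSI SGR: reset all attributes


def build_prompt(environment: str, host: str) -> str:
    """Return a prompt string that names the environment up front."""
    label = f"[{environment.upper()}]"
    if environment == "production":
        # Only production gets the jarring color, so it never becomes routine.
        label = f"{RED_BACKGROUND}{label}{RESET}"
    return f"{label} {host}$ "


if __name__ == "__main__":
    print(build_prompt("production", "db1"))
    print(build_prompt("dev", "laptop"))
```

In a real bash prompt the non-printing sequences would usually also need to be wrapped in `\[` and `\]` so that line editing stays aligned.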

YeGoblynQueenne 5 years ago | |

>> I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.

Yeah, ouch. More ouch if it's the other way around- you delete the test database and it's not the test database.

(long story)

kbenson 5 years ago | | |

> you delete the test database and it's not the test database.

> (long story)

I think you can skip the long story, as most of us can tell a story similar in theme if not specifics (and sometimes, probably some similar specifics too). ;)

With great power comes great responsibility (to not completely screw stuff up because you were on autopilot for a second...)

throwaway894345 5 years ago | | |

I worked at a company where someone deleted the production database by accident and the snapshot mechanism hadn't been working AND the alerting for the snapshot mechanism was also broken. Fortunately someone had taken a snapshot manually some weeks prior and they were able to restore from that and lose relatively little data (it was a startup, so one database was a big deal, but weeks worth of data was not such a big deal).

dheera 5 years ago | |

> I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.

The concept makes sense, though I don't quite fully get how to translate it to other contexts besides train driving where unexpected and unpredictable events come up all the time. Let's say you're driving a car and the traffic light turns red. Do you point at the traffic light, say "red", point at your brake pedal, say "brakes", and then hit the brakes?

apozem 5 years ago | | |

In high school, I drove a 1993 Toyota Tercel. It was a functional, reliable car, but it had no keyfob to lock the doors remotely.

Getting out of your car, pressing the lock button on the inside of the driver's side door, and shutting the door are all routine, boring actions that make it easy to forget your keys inside the car. The keys can go in all kinds of places as you climb out of the car - jacket pocket, pants pocket, center console. It is very easy to lock your keys in your car.

I quickly learned to hold my keys in one hand, say out loud, "Keys in hand," and then lock the door with the other hand.

This technique is perfect for any repetitive action that could go wrong with non-trivial consequences, and there's lots of that in everyday life.

kube-system 5 years ago | | |

Repetitive tasks are exactly what pointing and calling helps with. The intent is to prevent the brain from going on autopilot for a task that happens exactly the same way 99.9% of the time, in order to prevent disasters that last 0.1% of the time.

Traffic lights are a lot more random (and therefore mentally engaging) than the types of things train conductors are pointing and calling.

An automotive equivalent of a situation that would benefit from pointing and calling is something like this: https://www.consumerreports.org/car-safety/guide-to-rear-sea...

eg.: "Car parked, ignition off, get child"

Timpy 5 years ago | | |

Whenever I have something in my hand that I'm about to put down for a second in the exact absent minded kind of way that would leave me searching all over the house for it 5 minutes later, I say it out loud. "Headphones on the table by front door."

uranusjr 5 years ago | | |

I believe the trick is to anticipate failure, and call out the normal thing instead. So you’d always slow down at every light, and only speed back up after calling out green. This is what all drivers are actually supposed to do, although I fully realise nobody practically does that, which is why we get so many automobile accidents all the time.

nemetroid 5 years ago | | |

The pointing and calling performed by Japanese train drivers is very much about expected events. "Green signal" would be one of the most common call-outs. For example:

https://www.youtube.com/watch?v=afjPmN0GT04

Green signals are pointed at at 2:58 and 3:29.

bo1024 5 years ago | | |

Your example is a reactive event. Something happened in your environment.

This idea is more useful for situations that you are initiating, and where feedback is not immediately obvious.

An example could be turning your car’s lights on at night. Before starting the car, you force yourself to point to the switch, say “lights on”, and do it.

I use this with keys. When leaving my office, house, or car, I hold up the key in my hand and establish sight (I don’t say anything out loud). Then I lock the door.

notJim 5 years ago | | |

I'm a photographer, and I used to get annoyed that I'd have little distractions on the edges and corners of the frame, because I was focussed on the subject and overall composition. I trained myself to sort of bounce my eyes around the sides of the viewfinder when pressing the shutter (think like the DVD player menu). Now I almost never forget to check.

leetcrew 5 years ago | | |

I don't think it really applies to stuff like driving, which almost has to be muscle memory to work at all. even with something routine and non-urgent like switching gears in a manual, the steps have to happen faster than you can say what you're doing.

a good example from normal life is (physical) key management. I used to always forget my keys when walking out the front door, which was a big problem since it locks automatically. to solve the problem, I made my back right pocket be the designated "key pocket". I now slap my right butt cheek whenever I leave a building. it might look weird to observers, but I have not once forgotten my keys since I implemented this system.

SkyBelow 5 years ago | | |

Invert it and I think it works. Always prepare to stop at an intersection. Then point out it is green and call out you do not need to engage in stopping.

It may seem silly, but if we asked people who drive 30+ minutes every day if they have ever accidentally run a stop sign or red light, I suspect the numbers would be quite high (though they likely happen at times/places where the chance of accidents is the smallest, such as empty roads late at night).

shezi 5 years ago | | |

I teach my children to point in the direction of where cars can come from before crossing the road. My son used to just swing his head around before; now he has to search the directions and point there to direct his attention, and it works excellently.

As others have pointed out, this is for repetitive tasks that your brain wants to automate away, but you really want to keep in attention.

hrktb 5 years ago | | |

It can be used for exactly the same purpose: checking the environment before doing the action.

E.g. force yourself to read the “production” part of your prompt before running the command. Point at the user name before deleting its record. Read aloud the version name before sending it to deploy.

It really makes a difference between just glancing at the info, and having to parse it as part of an action.

jrumbut 5 years ago | | |

Let's say you get a request to delete users #s 1, 17, 152, and 43.

Now you can have the request and database administration tool open and point and call at the numbers and any queries and make sure you are deleting the right users.
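
The same cross-check can be done mechanically as a backstop. This is only a sketch: `check_delete_targets` is a hypothetical helper, and it assumes you can list the IDs your query is about to touch before running it.

```python
# Sketch: refuse to delete unless the IDs you are about to delete are exactly
# the IDs in the request. check_delete_targets is a hypothetical helper.


def check_delete_targets(requested_ids, selected_ids):
    """Raise ValueError if the two sets of user IDs differ; return sorted IDs."""
    requested = set(requested_ids)
    selected = set(selected_ids)
    missing = requested - selected      # asked for, but not selected
    unexpected = selected - requested   # selected, but never asked for
    if missing or unexpected:
        raise ValueError(
            f"mismatch: missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
    return sorted(selected)


if __name__ == "__main__":
    # IDs from the request in the comment above.
    print(check_delete_targets([1, 17, 152, 43], [43, 1, 152, 17]))
```

A transposed digit (34 instead of 43) shows up as one missing and one unexpected ID rather than as a quiet wrong delete.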

saberdancer 5 years ago | |

OpenShift does this by forcing you to write the name of the project you are about to delete. It was something that used to annoy me but reading this I understand it is a good call from their side.

rachelbythebay 5 years ago | |

I do that when I drive around. Car on the side street. Kid over there... with a ball. Hidden left turner in 3...2...1... yep.

I love finding out that this stuff works.

nailer 5 years ago | |

I do things like

  const HARD_CODE_TEST_DATABASE_FOR_SAFETY = 'unit-testing'

  destroyDatabase(HARD_CODE_TEST_DATABASE_FOR_SAFETY)

1. Avoid silly terms our industry should have ditched years ago, like 'drop'

2. Make sure that nobody will ever change HARD_CODE_TEST_DATABASE_FOR_SAFETY because they thought it should 'always be the active database' or whatever.

justinlloyd 5 years ago | |

I have had many disasters in my software career because I just wantonly hit "Y" without thinking about it.

I have noticed, since learning to cook at a professional level in the kitchen, that I point and call out a lot more in my other activities too. "From hot behind" and "knife" and "oven is over temp" to "Saw blade is live" and "circuit is live" in the workshop to "production server" and "erasing records" in database maintenance. Some days I feel like Sigourney "I have one job damnit" Weaver in Galaxy Quest. It's a useful stop-think-go sanity check.

uyt 5 years ago | |

This is true for NYC subways too! https://www.youtube.com/watch?v=i9jIsxQNz0M

greenyoda 5 years ago | | |

The video doesn't really explain why conductors point at the signs - it just says "to prove they're paying attention". Paying attention to what? The answer is that they are verifying that the train is correctly positioned in the station so that all of the doors will open on the platform.

Explained here: https://www.nydailynews.com/new-york/mta-conductors-point-st...

viraptor 5 years ago | |

I try to do that during incidents. I'm not 100% there since it's not a company rule, but it helps me at the time and later when writing up details: "I see <behaviour X>", "<Y> should fix it because <Z>", "I'm starting to do <Z> now and seeing ...", etc.

It also helps when Z results in a total meltdown and you need to pull in more people to help out, so they have context of what happened.

Qu3tzal 5 years ago | | |

French firefighters do this when arriving at a scene. The first messages sent over the radio will say:

- I am... (who you are and where you are)

- I see... (describe what you see in simple non-ambiguous terms)

- I do... (what action you are taking now)

- I ask... (ask for reinforcements if necessary, you may be asked to justify yourself more)

xvf22 5 years ago | |

Killed just under 1k access points when they all upgraded in one go. They had no problem erasing the firmware, but when they all tried to download the new one at once it killed the service and we ended up with a lot of blank APs. The confirmation message for 1 or 1000 APs is unhelpfully "This will overwrite all existing system images. Are you sure Y/N"

m463 5 years ago | |

> forcing a cache miss in the brain

That is an interesting way of looking at it.

I think a router analogy might be more precise - more like fast path / slow path - where when most packets come in they hit the fast path in hardware, and slow path exception packets get handled by the cpu.

ekanes 5 years ago | |

I do this with my kids, gesturing (not pointing) as it helps my mind remain focused on truly listening to them amid everything else going on. I probably look ridiculous, but I'm a better father for it so ¯\_(ツ)_/¯

stjohnswarts 5 years ago | |

I always called it a "that can't be right" interrogative.

xamuel 5 years ago |

I wish it were possible for similar prompts to appear before all sorts of policy-makers and bureaucrats. "It appears you are about to institute a policy which will require 400 million patients to sign an additional waiver every time they visit a clinic, this will waste a total of 354,921 human hours within the next year alone. Please type 354,921 to proceed."

harikb 5 years ago |

I have a habit of creating cli tools, which potentially do dangerous things, to default to dry-run mode. For example, instead of the typical `--dry-run` or `-n` option, my scripts instead had a cheesy `--do-it` to be non-dry-run. It is annoying as hell to my colleagues, but saved the day many times.
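
A sketch of that convention with Python's `argparse`. The tool, its `paths` argument and the `delete_path` callback are hypothetical; only the `--do-it` flag name comes from the comment.

```python
# Sketch: a CLI that is dry-run unless --do-it is passed.
# The tool, its arguments, and delete_path are hypothetical.
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Delete old images.")
    parser.add_argument("paths", nargs="+", help="paths to delete")
    # store_true defaults to False, so the safe mode needs no flag at all.
    parser.add_argument("--do-it", action="store_true",
                        help="actually perform the deletion")
    return parser.parse_args(argv)


def run(argv, delete_path):
    """Return the list of actions taken or, in dry-run mode, merely planned."""
    args = parse_args(argv)
    log = []
    for path in args.paths:
        if args.do_it:
            delete_path(path)
            log.append(f"deleted {path}")
        else:
            log.append(f"would delete {path}")
    return log


if __name__ == "__main__":
    # Fixed demo arguments and a stub deleter, so nothing is really removed.
    for line in run(["old-image-1"], delete_path=lambda path: None):
        print(line)
```

Forgetting the flag costs a re-run; with a conventional `--dry-run` flag, forgetting it costs the data.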

roydivision 5 years ago |

Reminds me of the proposal to keep the nuclear launch codes inside the body of an innocent volunteer, so the President would have to kill the person to get the codes.

https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-...

dgritsko 5 years ago |

Similar idea as GitHub's "type the exact name of this repository if you want to delete it" confirmation dialog. Maybe that's really what you want to do, but in case that's not actually what you meant to do, having a few extra hoops to jump through seems like a good idea.
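
Roughly, the pattern looks like this. It is an illustrative sketch, not GitHub's implementation; `confirm_by_name` and the repository name are made up.

```python
# Sketch: destructive action gated on typing the resource's exact name.
# confirm_by_name and the repository name are hypothetical.


def confirm_by_name(resource_name, read_line=input):
    """Return True only if the user types resource_name exactly."""
    typed = read_line(f"Type '{resource_name}' to delete it: ")
    # Exact comparison: no case folding, no stripping, no "y" shortcut.
    return typed == resource_name


if __name__ == "__main__":
    # A canned answer stands in for the keyboard so the demo is non-interactive.
    print(confirm_by_name("acme/billing", read_line=lambda prompt: "acme/billing"))
```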

Hokusai 5 years ago | |

> having a few extra hoops to jump through seems like a good idea.

I think that there is more to that. You need to consciously type the name of the repo that you want to remove. Windows used to add a lot of jumps to get something done, and the result was mindless clicking the "yes" button and realizing 1 second later that you deleted important information.

Those extra hoops need to be cognitively meaningful.

Cthulhu_ 5 years ago | | |

Yes, and infrequent; the main issue with Windows (Vista mainly) was that it appeared far too often. Even with 7, when you're setting it up for the first time for example, I think it shows up too often.

Same with Terms & Conditions. If you want your customers to truly have read and understood them, you have to show them a short quiz at the end of it. You're required to do a quiz in Europe nowadays if you want to engage in stock trading.

segfaultbuserr 5 years ago | |

Some disk management software also has "type the exact label of this partition to reformat it" to prevent accidental data loss.

wjdp 5 years ago | |

Do you type the repo name, or just copy/paste or select/middle click it?

Half of me would want them to put `user-select: none` on that text. The other half has to archive 10+ repos and would hate that!

edanm 5 years ago | |

That's what I thought of immediately as well! I've seen that pattern in a few other places too, and I always think it's a really good UX choice.

luhn 5 years ago |

One of the largest AWS outages to date was caused by a scenario like this. [1] A mistyped command removed too many servers from an S3 subsystem, overloading the remaining servers and crashing the subsystem. The failure snowballed until the entire S3 region was down, which then caused issues with dependent services like EBS, ALB, and Lambda. They couldn't even update the status page because that also depended on S3.

[1] https://aws.amazon.com/message/41926/

HenryKissinger 5 years ago | |

I remember that. The AWS dashboard was all green checkmarks... because the red status icons the dashboard was supposed to display were stored on the crashed servers.

jodrellblank 5 years ago | |

>"overloading the remaining servers and crashing the subsystem. The failure snowballed until"

the entire Eastern Seaboard was without power?

https://youtu.be/XetplHcM7aQ?t=693 (James Burke's Connections, ref. cascading power cut 1965)

jasonpeacock 5 years ago |

Raskin talks about the futility of this in his book The Humane Interface.

Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.

Then you switch back to the original task that had been interrupted by the confirmation box and then you realize you made a mistake.

It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
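
The "delay the command" flavor of undo can be sketched with a timer. `DelayedAction` is a hypothetical class and the delay is arbitrary; this is not how GMail implements it.

```python
# Sketch: "undo" implemented by delaying the real action.
# DelayedAction is a hypothetical class; the delay values are arbitrary.
import threading


class DelayedAction:
    """Run `action` after `delay_seconds` unless undo() is called first."""

    def __init__(self, action, delay_seconds):
        self._timer = threading.Timer(delay_seconds, action)
        self._timer.start()

    def undo(self):
        """Cancel the pending action. Has no effect once it has started."""
        self._timer.cancel()

    def wait(self):
        """Block until the action has run or been cancelled."""
        self._timer.join()


if __name__ == "__main__":
    sent = []
    pending = DelayedAction(lambda: sent.append("email"), delay_seconds=0.2)
    pending.undo()   # the user changed their mind inside the window
    pending.wait()
    print(sent)
```

The trade-off is that nothing really happens until the window closes, so this only suits actions that can tolerate the delay.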

andrewflnr 5 years ago | |

That's exactly why it's not a "confirmation box", but requires you to slow down and think for half a second. She even talked about mitigating copy-paste, which is the next obvious way people could habituate.

Also, while undo is great, it's not always technically feasible. The tools in question are basically for modifying the layer that implements undo for your end users, and are often themselves fundamentally irreversible. Undo for raw hard disks involves forensic analysis at best.

jasonpeacock 5 years ago | | |

The problem (I probably didn't paraphrase Raskin well) is when you slow down & think for a half a second, you context switch from "I need to do operation" to "I need to make this dialog box go away".

No matter what tasks are required to make the dialog box go away - doing math, retyping a message, clicking a randomly ordered box - that becomes the top task in your head and you "forget" about the original task until you finish this task.

Once you resolve the interruption, you switch context back to the original task and then you still have that "oh crap" moment.

Yes, sometimes undo is very difficult, and can require a system designed to support that ability as a first-class feature from the start. Many systems you can perform rollbacks, but there are definitely destructive actions - in which case you should have test stacks to validate your actions in advance, and peer review. (e.g. dual keys to launch the missiles)

robaato 5 years ago | | |

Or you have commands which randomly reverse the meaning of the confirmation prompt:

Continue: yes or no?

Don't continue: yes or no?

As long as operators know to expect this, they also know to wait and actually read the prompt before answering (as in, turn off the automatic reaction)...
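
A small sketch of such a prompt. The function name, the injectable `rng`, and the choice to treat unclear answers as "no" are all assumptions made for illustration.

```python
# Sketch: a confirmation prompt whose polarity is chosen at random, so a
# reflexive "yes" is wrong about half the time. Names are hypothetical.
import random


def ask_to_continue(read_line=input, rng=random):
    """Return True if the user's answer means 'continue'."""
    inverted = rng.random() < 0.5
    prompt = "Don't continue: yes or no? " if inverted else "Continue: yes or no? "
    answer = read_line(prompt).strip().lower()
    if answer not in ("yes", "no"):
        return False  # anything unclear is treated as "do not continue"
    said_yes = answer == "yes"
    # With the inverted wording, "yes" means stop and "no" means go ahead.
    return said_yes != inverted


if __name__ == "__main__":
    # Canned answer so the demo does not block on the keyboard.
    print(ask_to_continue(read_line=lambda prompt: "no"))
```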

bronco21016 5 years ago |

It amazes me that something like this can be done by a single person.

In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.

I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.

I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.

illumin8 5 years ago |

This is a great idea, and I'd like to point out that having such a system in place would have prevented one of the largest Internet outages in recent memory - the Amazon S3 outage in 2017: https://aws.amazon.com/message/41926/

> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

zedpm 5 years ago | |

It's kind of funny, since various operations performed in the AWS web console use this model (e.g. type the name of the resource you're trying to delete). As an organization, they're aware of this approach and think it's useful, but (presumably) didn't use it in their own internal tooling.

audience_mem 5 years ago | | |

Perhaps those were added after they learnt their lesson.

educationcto 5 years ago |

Terraform prints out the number of resources changed and at least requires a "yes" to proceed. Not quite as onerous as described but at least prevents some type of fat-fingering. Basically all changes with Terraform are risky as they usually involve bringing up and down infrastructure.

  Terraform will perform the following actions:

    # google_compute_instance.vm_instance will be created
    + resource "google_compute_instance" "vm_instance" {
        + ... <more>

  Plan: 2 to add, 0 to change, 0 to destroy.

  Do you want to perform these actions?
    Terraform will perform the actions described above.
    Only 'yes' will be accepted to approve.

    Enter a value: yes

caymanjim 5 years ago | |

This is exactly the problem the author is referring to. With Terraform, you always type "yes" to proceed, so it turns into muscle memory. You stop reading the output, and you're already typing "yes" before you even see the prompt. Terraform's output is also verbose, and many changes show up as "1 to add, 0 to change, 1 to destroy" because they don't separately list a "replace" category. It's pretty bad; you've got cognitive overload, confusing output summary, and a predetermined continue answer. And this is often an action you're performing under duress. I've been bitten by it plenty of times.

brodouevencode 5 years ago | |

IaC is a real time saver, but inherently dangerous.

remram 5 years ago |

A similar system is molly-guard [1], which replaces the reboot/halt/poweroff/... commands with scripts that make you type in the name of the machine before proceeding. Avoids shutting down the wrong machine because you forgot where you SSH'd.

[1]: https://manpages.debian.org/buster/molly-guard/molly-guard.8...

b6z 5 years ago | |

Many years ago, I made that mistake two or three times, rebooting the wrong machine. Since then, I use molly-guard on all my remote machines. Never happened again.

Darkphibre 5 years ago |

Reminds me of when the Fortune 50 company (150k employees) I worked for rolled out new firewall restrictions that blocked the DNS port.

To all machines. Employee and servers alike.

Yes. Including the DNS servers.

Took them a day or two to work out how to roll that one back.

zamadatix 5 years ago | |

The first use of a new security product my manager insisted we roll out (as a duplicate to an existing tool from another group) was to quarantine a change in a system file that seemed to be spreading through all of the PCs.

Except the change was to quarantine explorer.exe which was being changed with a patch that just got pushed out. The net result was about 6 hours of the desktop group wondering "why the hell are all of the PCs not logging in right after this patch" followed by about a month of rolling tickets from seldom used computers that had just been powered off.

His excuse was it only showed a file hash in the main screen and you had to view details to see the name plus he had a 3 day change open to roll out the system. Never understood how he got away with that one but such things did catch up to him about 2 years later.

tialaramex 5 years ago |

So, related obviously correct designs:

1. Git's Force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power but conditional on you telling git what exactly the state was that you're overriding.

This has two benefits, one is like Rachel's, it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it?

But the other is about race conditions which Rachel doesn't specifically address. If you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile changing and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match.

I believe Force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and some of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway but if anything else is blocking the change then reject it and let me know".

2. Prefer Undo to Confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it if you make that possible.

Not everything can be undone. Undo factory reset isn't a thing. But for lots of things you can't undo, it was just laziness; try to do better in your own software. Your users (which might include you) will be grateful.
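
As a minimal illustration of point 2, here is a Python sketch where delete moves a record aside instead of destroying it, so no confirmation is needed. `RecordStore` and its methods are hypothetical names, and a real system would have to decide when the trash is finally emptied.

```python
from typing import Any, Dict


class RecordStore:
    """Hypothetical store where delete is reversible instead of confirmed."""

    def __init__(self, records: Dict[str, Any]) -> None:
        self._live = dict(records)
        self._trash: Dict[str, Any] = {}

    def delete(self, key: str) -> None:
        # No "are you sure?" prompt: the record is only moved aside.
        self._trash[key] = self._live.pop(key)

    def undo_delete(self, key: str) -> None:
        # Restore a record that was deleted earlier.
        self._live[key] = self._trash.pop(key)

    def keys(self) -> list:
        return sorted(self._live)
```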

vondur 5 years ago |

That may have helped when Emory University's IT dept. accidentally sent a wipe and reformat command using Microsoft's SCCM to all of the Windows computers and servers on campus back in 2014. https://it.slashdot.org/story/14/05/17/051214/emory-universi...

kbenson 5 years ago |

This is a topic near and dear to my heart, as I'm often that person arguing to make something slightly less automated because the small trade-off in time is insurance against some of the worst mistakes you can have. Automation to the point of removing humans leads to stupid problems that a human wouldn't make if they looked at what was going on. So we automate to the point where we minimize human contact, presenting a summary of actions that as humans we can apply our wonderful brains to and prevent those problems. Except some percentage of the time we don't actually pay attention, and depending on how the human interaction was introduced instead of complete automation, some percentage (or multiple!) of errors still sneak through.

Automation to the point of minimal human contact where you assume the human will read the presented information and make an informed decision doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like she proposes is definitely a step in the right direction, IMO.

rossjudson 5 years ago |

This resonates with me. Years ago I took down a service in a cell accidentally (Googlers might empathize: never 'borg' when you meant to 'borgcfg'). If I had been asked to enter the exact number of tasks I was about to nuke, I might have thought twice ;)

scottlamb 5 years ago | |

I've certainly deliberately downed an enormous number of tasks, though, as part of a cluster turn-down. I love the technique of requiring the operator to echo a key fact, but in the case you're describing I think the key fact is not how many tasks but that they're serving live traffic. So:

* You could ask the operator to echo the qps figure...but really any number other than zero is likely to be an error, so it can just error out in that case without needing the confirmation.

* Even if it is serving zero qps now, if it's not explicitly drained at the load balancer, downing it is likely to be a mistake. So even better to check that.

Only once in my career have I taken down jobs serving live traffic. (They were serving 100% errors.) It was deliberate, but even so I wouldn't have minded having to supply a --yes-i-know-im-downing-live-jobs.

edit: and if for some reason my assumption is wrong and downing undrained things becomes routine...well, you'd want to fix that, but as a short term measure going back to confirming a number rather than the force option would be appropriate. It's certainly not good to have an override that's routinely used.
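
A hedged Python sketch of the checks described above. `JobStatus`, `may_turn_down` and the `yes_i_know_live` parameter are invented for this example (the last one mirrors the flag name in the comment); where the qps and drain state come from is left out entirely.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class JobStatus:
    name: str
    qps: float      # current queries per second
    drained: bool   # explicitly drained at the load balancer


def may_turn_down(job: JobStatus, yes_i_know_live: bool = False) -> Tuple[bool, str]:
    """Decide whether a turn-down may proceed, and say why."""
    if job.drained and job.qps == 0:
        return True, "drained and idle"
    if yes_i_know_live:
        # Explicit override, meant to be rare.
        return True, "override: operator acknowledged live jobs"
    if job.qps > 0:
        return False, f"{job.name} is serving {job.qps:g} qps"
    return False, f"{job.name} is idle but not drained at the load balancer"
```

The tool errors out instead of asking a question in the two risky cases, so there is nothing to click through; the override has to be supplied deliberately.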

jeffbee 5 years ago | | |

The way we approached this on my SRE team was semi-manual with improved ergonomics. We embedded the live traffic graph in the turndown tool, so it would be right in your face before you took the destructive action. Of course it was always possible to go one level down on the tooling and do everything manually, but it wasn't the usual way.

gabeio 5 years ago |

I do like this idea, this is I assume why github makes you type the repo name out in full. I wish AWS followed suit, when deleting any RDS (database) instance on AWS all you have to type is "delete me"... very easy to copy and paste as well as just know what you need to type and be on autopilot. I have even poked support about it and their response was underwhelming.

jaclaz 5 years ago |

Side question.

How many/which companies have more than one million Linux machines?

notacoward 5 years ago | |

At least Facebook (where OP worked), Amazon, Google, and Microsoft. Probably Netflix, maybe Apple. There might be a couple more, but no more than that because we've already accounted for a pretty high percentage of worldwide shipments for servers, disks, etc. Fun fact: when you're that big, your demand creates its own inflation and you have to consider that in projections.

kube-system 5 years ago | | |

If by "machine" we also mean things outside of a 19" rack, I would wager that large telecoms probably have way more devices running Linux than FAANG. Imagine the network of cable modems that Comcast alone must operate. What percentage of their 28+ million broadband customers rent Comcast owned/managed modems? Almost all of them except the tech-savvy crowd? And that's just one device type.

jaclaz 5 years ago | | |

Thanks, so a handful at most, and the "usual" ones, I always thought that those companies keep their machines connected in (redundant) "sets" and that a command affecting all of them was more a case for "never" rather than "once in a while".

abnry 5 years ago | |

The number blew me away. But does she mean in one location or VMs?

One million is a lot no matter how you slice it.

rachelbythebay 5 years ago | | |

How do you define one location? If it's like, a contiguous plat of land with a bunch of buildings, each containing suites, and each of those containing clusters... then these days, yeah, that's probably not too much of a stretch.

And yeah, physical machines, not VMs. Sometimes they're blades, sometimes they're sleds, but I mean real hardware made out of metal that you can pick up and use to defend the datacenter if you have to.

(Although, honestly, I was talking about global counts in the million+ range when I wrote it since it was referencing the past, but by now, a region with a million+ is not far-fetched.)

Ayesh 5 years ago |

I have an old laptop with a dead battery, and for a BIOS upgrade, it prevents me from updating without 50% battery.

I have to type "danger" to bypass this restriction, and I thought it was pretty cool.

Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.

duskwuff 5 years ago | |

Disabling the "run" button for a few seconds was actually done to mitigate another risk -- sites cueing the user to click in a particular location, then triggering the confirmation dialog with the "run" button right where the user was about to click.

ineedasername 5 years ago |

Oh god this would have saved me so much stress once. It was early in my career, and part of my duties was to run a merge/purge process on dupe records.

I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.

I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."

aqme28 5 years ago |

Nitpicking

> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "

It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.

Kerrick 5 years ago | |

Stripping the non-digit characters would allow "123,456" to validate instead of only accepting "123456" -- which defeats the whole purpose of printing the number with numerical separators (to prevent copy/paste).
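
A small Python sketch of the behaviour being argued for here: show the count with separators, accept only the bare digits. The function names are made up, and the comma separator is hard-coded instead of being taken from the locale.

```python
def format_count(count: int) -> str:
    """Render the count with thousands separators for display only."""
    return f"{count:,}"


def confirms(count: int, typed: str) -> bool:
    """Accept only the bare digits; the displayed form must not validate."""
    return typed.strip() == str(count)


def prompt_text(count: int) -> str:
    return (
        f"{format_count(count)} machines will be affected. "
        "Type the exact number, digits only, to proceed: "
    )
```

With this, `confirms(123456, "123,456")` is false while `confirms(123456, "123456")` is true, so a straight copy/paste of what was printed does not get through.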

aqme28 5 years ago | | |

If you're worried about copy-paste, make it a random code.

nemo1618 5 years ago |

Notably, Discord does something like this when you @everyone in a large channel: "You're about to push a notification to 12,000 people, are you sure you want to do that...?"

pwinnski 5 years ago | |

Sounds like a yes/no answer is expected? If so, that is exactly what Rachel is suggesting is not enough.

jerf 5 years ago | | |

In this case, usually the very fact that a popup unexpectedly popped up is enough. I use Konsole as my main shell, and like several other shells now it has a "You're about to paste 100KB, yes/no?", and I don't mindlessly click "yes" because it is already a "cache miss" to see that dialog at all.

raverbashing 5 years ago | |

Slack should take a note of this. Especially for rogue @here notifications

tigger0jk 5 years ago |

I've typically used pdsh https://github.com/chaos/pdsh for these types of commands, and I don't think they have any such safety options. The only protection is to be wracked with fear whenever you type pdsh. Obviously this fear wanes with use, and eventually you don't think about a command for long enough before you do it and hit enter on a regrettable one.

cle 5 years ago |

Even better than you confirming your own action, is someone else confirming it. If the stakes are high, require two people to turn the keys, instead of just one.

rcarmo 5 years ago |

This reminded me that a few years back I worked at a place where (notoriously) Puppet would occasionally go over some random box and remove access to people, just because.

Or to all the machines, on one occasion.

(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)

lqet 5 years ago |

Github has been doing this for quite a while now when you try to delete a repository - you have to type in the exact repository name to confirm.

bmaupin 5 years ago | |

Which I always mindlessly copy and paste...

jraph 5 years ago | | |

But maybe this is enough? I do this too, but this gives me time to actually read the repo name twice. It's way better than a confirm button for me.

I'm sure it would also wake me up from autopilot. But I don't do this often so I can't really know. It seems like this is good enough for many people, who don't perform this action too often.

coder543 5 years ago | | |

If you really think that’s an issue, pasting could be disabled for that input field. Would that make you happier?

It hasn’t been an issue for me, since repo names aren’t usually super long and onerous to type.

temporallobe 5 years ago |

This is similar to a UI solution a colleague and I came up with. The action the user could kick off was unstoppable and irreversible (a large batch job), and it seemed like even a confirmation prompt was too easy to simply click through. So we had the UI present a modal dialog asking the user to type in a specific word in all caps to confirm the action. Worked like a charm.

D-Coder 5 years ago | |

I did a similar thing with a Star Trek program many years ago. One of the commands (22? 23?) was to detonate the warp engines in the hope of taking the enemy with you.

After hitting the wrong number once, I added a confirmation that presented a random six-digit number that you had to enter before it accepted the command.
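
Something along these lines, sketched in Python (not the original program; the function names are invented):

```python
import secrets


def new_challenge() -> str:
    """Return a random six-digit number (100000-999999) as a string."""
    return str(100_000 + secrets.randbelow(900_000))


def confirmed(challenge: str, typed: str) -> bool:
    """The command proceeds only if the number was typed back exactly."""
    return typed.strip() == challenge
```

The caller prints the challenge, reads a line, and only runs the destructive command when `confirmed(...)` is true. Since the number changes every time, muscle memory does not help.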

TravHatesMe 5 years ago |

Reminds me of a study done where a test was given with questions that weren't difficult but likely to make a silly error. Around 85% of participants got at least one question wrong, but when they repeated the same test with a difficult-to-read font, that number dropped to ~25% or so. That's another way to make your brain work, use a terrible font.

apricot 5 years ago | |

> That's another way to make your brain work, use a terrible font.

And suddenly my complex analysis prof who wrote his exams in Comic Sans is vindicated!

willvarfar 5 years ago |

I am so adding this to a query api I have, where it's all too easy to leave off constraints and end up asking for massive data sets by mistake.

Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.

recursive 5 years ago | |

I don't think this is useful for an api. This is only useful when humans are the direct user of the component. Automated users, like those of an API will dutifully provide the required safety value.

mcintyre1994 5 years ago |

AWS sometimes does something similar to this like “enter the name of the thing you’re trying to delete to confirm”. I think it makes sense because you can have such a huge difference between how much you care about certain s3 buckets or CloudFormation deploys etc. In true AWS fashion it’s inconsistent between services though.

nucleardog 5 years ago | |

To their credit, even if it’s unintentional, every time one of those screens pop up I have to stop and think about what I’m doing because every screen wants something different from me!

heelix 5 years ago |

Back in the Spiderman 2 days, I worked for a content management company that was supporting a really, really big website. I believe they were playing host file games for Stage/Prod. Was in the room when they demo'ed something, did a restart of the system - and every pager in the room went off. Yah...

Cthulhu_ 5 years ago |

I for one can't fathom any organization managing a million devices / servers / VMs / whatnot. I'm having enough trouble with one, and my biggest employers had maybe a few dozen at best, and they already had a dedicated ops team that worked mainly with infrastructure-as-code.

woliveirajr 5 years ago |

Once I had to deal with some software-RAID in Linux (mdadm it is), around 2007. There was some -force option that would just print information explaining what it would do and, to perform the real action, you needed to type another flag (that should never be revealed).

Edit: added name of software

andrewfromx 5 years ago |

i've done this before by displaying unix epoch and asking the user to copy/paste that value WITHIN a 3 second window as an env var. i.e. if you up arrow and run same TIMESTAMP=1603827448 ./foo it won't work because 1603827448 is now way too old.
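
roughly like this in Python; `timestamp_is_fresh` is a made-up name and the `now` parameter only exists so the check can be exercised without waiting:

```python
import time
from typing import Optional


def timestamp_is_fresh(
    value: Optional[str],
    now: Optional[float] = None,
    window_seconds: int = 3,
) -> bool:
    """True if value is an epoch timestamp at most window_seconds old."""
    try:
        stamp = int(value)
    except (TypeError, ValueError):
        # Missing or non-numeric TIMESTAMP never passes.
        return False
    current = time.time() if now is None else now
    return 0 <= current - stamp <= window_seconds
```

the script would call it with `os.environ.get("TIMESTAMP")` and bail out when it returns false, so a command recalled from shell history fails because the value has gone stale.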

myroon5 5 years ago | |

One of the main benefits is explicitly acknowledging relevant context. Timestamps don't provide additional relevant context

sidpatil 5 years ago |

Hmm, it's conceptually like a combination of a CAPTCHA and a launch code.

vsnf 5 years ago |

I do this with a git pre-push hook to the main branch of my repositories. It displays a prompt in red and forces me to type in the name of the branch.

The result of one too many mindlessly accidental pushes.
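
A sketch of how such a hook could be written in Python (not my actual hook; `PROTECTED`, `protected_targets` and `confirmed` are names chosen for the example). It relies on git passing the refs being pushed on the hook's stdin, which is why the prompt reads from /dev/tty, so it assumes a Unix-like terminal.

```python
import sys
from typing import Iterable, List

PROTECTED = {"main"}                 # assumption: branches worth a speed bump
RED, RESET = "\033[31m", "\033[0m"   # ANSI colour codes for the prompt
PREFIX = "refs/heads/"


def protected_targets(ref_lines: Iterable[str]) -> List[str]:
    """Return protected branches that appear as remote refs in the input.

    Git feeds pre-push one line per ref being pushed:
    <local ref> <local sha> <remote ref> <remote sha>
    """
    hits = []
    for line in ref_lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        remote_ref = parts[2]
        if remote_ref.startswith(PREFIX) and remote_ref[len(PREFIX):] in PROTECTED:
            hits.append(remote_ref[len(PREFIX):])
    return hits


def confirmed(branch: str, typed: str) -> bool:
    return typed.strip() == branch


def main() -> int:
    for branch in protected_targets(sys.stdin):
        # stdin carries the ref list, so the answer is read from the terminal.
        with open("/dev/tty") as tty:
            print(
                f"{RED}Pushing to '{branch}'. Type the branch name to continue:{RESET}",
                file=sys.stderr,
                flush=True,
            )
            if not confirmed(branch, tty.readline()):
                return 1  # a non-zero exit makes git abort the push
    return 0
```

To use it, save it as an executable `.git/hooks/pre-push` with a Python shebang and end the file with `sys.exit(main())`.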

regularfry 5 years ago |

I've seen this implemented as "Please type: My username is $USERNAME and I will not cry over spilt milk" but that was more to guard against support tickets.

diebeforei485 5 years ago |

I'm thinking this could also be useful for cases where colleges mistakenly email all applicants saying they'd been accepted, when they in fact had not been.

gitgud 5 years ago |

> "I've worked at a few places that had a large number of Linux boxes. I'm talking about well over a million."

A few places!? What is an example of this?

throwawaygh 5 years ago | |

My guess: Rackspace, Google, and Facebook.

ComodoHacker 5 years ago |

In role-playing games, it's a common practice to confirm deletion of your character by typing in some word, like 'delete' or character name.

bnastic 5 years ago |

Promise Pegasus (thunderbolt storage) comes with a GUI that does the same thing - to shut it down you have to type “CONFIRM” before clicking the button

Animats 5 years ago |

Yes. Github does that when you delete a repository. You have to confirm by typing in the name of the repository you are deleting.

larrik 5 years ago |

I've seen this sort of thing in a few places, and I really do think it's a great idea.

RobRivera 5 years ago |

Having babysat my fair share of critical clusters, i support this advice

wotton 5 years ago |

Marketo, the marketing automation platform, does this when you try to do things to large data sets, very useful.

konjin 5 years ago |

Finally the Roman numeral converter I programmed in university will be useful.

eznzt 5 years ago |

Debian already does this, it asks you to type something like "yes do as I asked" if you want to remove a package that is considered to be part of the core.

jerf 5 years ago |

https://news.ycombinator.com/item?id=24907002

Looks like https vs http link.

dang 5 years ago | |

We've merged the threads now. Thanks!

jancsika 5 years ago |

It would be neat to print out an esoteric error that gets a single result in Google, where the "forum" in the result has a rando answer about using a certain esoteric flag.

Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.

JoeAltmaier 5 years ago |

Makes it harder to nest that command inside a script - you have to parse out the number and paste it back? Or do I misunderstand - should it still prompt the user in the middle of the process when that step arrives? That would be problematical if it were included in a web page or whatever.

ccakes 5 years ago | |

The very point of this is to make it difficult to do what you’re describing.

If the tool could potentially touch a large number of machines, even if you’re super sure you got it right you should still prompt the user.

JoeAltmaier 5 years ago | | |

Or write a script that carefully calculates the number of machines and gets it right. I guess you wouldn't use this prompting script then?

larrik 5 years ago | |

I believe this would be as part of the script you are writing, not the scripts you are calling.

rad_gruchalski 5 years ago | |

Hopefully there’s an API to fetch that count :)

outworlder 5 years ago |

> 1221425541 machines will be affected

"Do you care? (Y/N)"

Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead.

Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).

If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.

In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).

The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.

joshuamorton 5 years ago | |

Killing all of your cattle is still a concern.