Type in the exact number of machines to proceed (rachelbythebay.com)
Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.
If the tool could potentially touch a large number of machines, even if you’re super sure you got it right, you should still prompt the user
"Do you care? (Y/N)"
Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead.
Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).
If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.
In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).
The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.
I'm sure it would also wake me up from autopilot. But I don't do this often so I can't really know. It seems like this is good enough for many people, who don't perform this action too often.
It hasn’t been an issue for me, since repo names aren’t usually super long and onerous to type.
After hitting the wrong number once, I added a confirmation that presented a random six-digit number that you had to enter before it accepted the command.
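A minimal sketch of that kind of random-number confirmation, in Python. The function names and the prompt wording are made up for illustration, and the `read`/`write` parameters exist only so the prompt can be exercised without a terminal.

```python
import secrets


def make_challenge(digits: int = 6) -> str:
    """Return a random numeric code, zero-padded to `digits` characters."""
    return f"{secrets.randbelow(10 ** digits):0{digits}d}"


def confirm(challenge: str, read=input, write=print) -> bool:
    """Show the code and return True only if the operator retypes it exactly."""
    # Hypothetical prompt text; a real tool would describe the action here.
    write(f"This command is destructive. Type {challenge} to continue:")
    return read().strip() == challenge
```

Because the code changes on every run, a reflexive "y" or a number remembered from last time never passes. A tool would call `confirm(make_challenge())` and abort on `False`.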
And suddenly my complex analysis prof who wrote his exams in Comic Sans is vindicated!
Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.
Edit: added name of software
The result of one too many mindlessly accidental pushes.
A few places!? What is an example of this?
Looks like https vs http link.
Many (most?) HN users probably have that disabled, because too many sites abuse it to block password managers for "security reasons".
The point I was making is that copying and pasting seems like more effort than just typing the repo name. Do you commonly encounter long, inscrutable repo names? Do you delete repos frequently enough to have built up the habit of copying and pasting the repo name into the delete box?
If it is common enough, disabling paste would actually benefit the user based on the premise of the article.
I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.
My first boss accidentally deleted our QA database, meaning to delete a local copy
A later boss accidentally deleted our production database, thinking it was the clone that he had just made (which luckily we still had)
Both of them were very experienced developers in their 40s. Nobody is beyond this kind of mistake.
Anyway, this being Linux, everyone's home directory was mounted on NFS. All our builds were standardized with a tool called SystemImager, which we could use to push out updates to everyone's desktop whenever we wanted. If there was a new version of KDE, we could pretty easily push that change out.
Sometimes it was convenient for me to work on updates to these images by chrooting into a directory containing the "image," which was really just an rsync tree. And sometimes, when updating these images, it was convenient to mount our NFS home directories in this chroot environment, so I could access things like an archive I had just downloaded on my own desktop.
And eventually we had lots of different images, and the old ones were using up a lot of disk space, so I decided to free up some space by removing the old images. These were fairly large images, with lots of small files, and this was before SSDs were a thing, so it made sense that deleting them was taking a while, and I stepped out to grab something to eat.
As I was eating lunch, I started getting the tech support escalations. But this wasn't that unusual, our users routinely had problems with the environment we had provided. They hated it, because it was in many ways terrible, and they made sure we knew it. So I wasn't terribly alarmed. I didn't think any major changes had been made, so I didn't hurry back.
By the time I leisurely returned from lunch, half the NFS home directories for our users were gone, along with all their documents, emails, bookmarks, or whatever else. Suddenly it hit me what had happened: at some point, perhaps months earlier, I had left our NFS home directories mounted within one of these image chroots. And now I had sudo rm -rf'd it.
We had backups, but they were on tape, and it took several days to restore, with about a day of data loss.
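One possible guard against this failure mode is to look for nested mount points before a recursive delete. Below is a rough Python sketch; `find_mount_points` and `safe_to_delete` are hypothetical helper names, and `os.path.ismount` will not notice every case (for example, a bind mount from the same filesystem), so treat it as a pre-flight check rather than a guarantee. GNU `rm` also has a `--one-file-system` flag aimed at the same problem.

```python
import os


def find_mount_points(root: str) -> list[str]:
    """Return directories under `root` that are mount points of another filesystem."""
    found = []
    for dirpath, dirnames, _ in os.walk(root, followlinks=False):
        for name in list(dirnames):
            path = os.path.join(dirpath, name)
            if os.path.ismount(path):
                found.append(path)
                dirnames.remove(name)  # do not descend into the foreign filesystem
    return found


def safe_to_delete(root: str) -> bool:
    """True only if `root` is an absolute path with no nested mount points."""
    return os.path.isabs(root) and not find_mount_points(root)
```

A cleanup script could refuse to continue, and print the offending paths, whenever `find_mount_points` returns anything.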
I'd say they were experienced developers. Only after accidentally deleting databases were they very experienced developers.
Good news is that I was deleting the test database to ensure that the recovery from backups was properly automated, so it wasn't down too long.
Consider network partitioning so dev/test/accept just has 0 contact with prod.
The knobs are labeled with a terrible little glyph meant to indicate which is which, and I've supplemented this with plain-english Brady labels "front left", "front right", etc. Now I speak the words above the knob, and point to the burner. It felt goofy at first, but now it feels normal, and like I'm tempting fate if I skip it.
I've never seen a different arrangement.
Unfortunately, it strikes many as looking rather silly, so it hasn't been widely adopted.
I do this whenever I'm on a production server (which is rare anyway). I use different colored prompts for local and remote shells.
[0] Technically he had no beard and if he had, it wouldn't have been gray.
It's also not perfect; it does not catch mistakes concerning "non-local" state, e.g. configuration files in /etc merging with one in . merging with some command line options. (Personally I try to avoid writing tools with defaults of this sort, but especially Java developers seem to have different opinions.)
Unfortunately if you do P&C and still make the mistake due to the aforementioned tooling, you look even stupider.
Yeah, ouch. More ouch if it's the other way around- you delete the test database and it's not the test database.
(long story)
> (long story)
I think you can skip the long story, as most of us can tell a story similar in theme if not specifics (and sometimes, probably some similar specifics too). ;)
With great power comes great responsibility (to not completely screw stuff up because you were on autopilot for a second...)
The concept makes sense, though I don't quite fully get how to translate it to other contexts besides train driving where unexpected and unpredictable events come up all the time. Let's say you're driving a car and the traffic light turns red. Do you point at the traffic light, say "red", point at your brake pedal, say "brakes", and then hit the brakes?
Getting out of your car, pressing the lock button on the inside of the driver's side door, and shutting the door are all routine, boring actions that make it easy to forget your keys inside the car. The keys can go in all kinds of places as you climb out of the car - jacket pocket, pants pocket, center console. It is very easy to lock your keys in your car.
I quickly learned to hold my keys in one hand, say out loud, "Keys in hand," and then lock the door with the other hand.
This technique is perfect for any repetitive action that could go wrong with non-trivial consequences, and there's lots of that in everyday life.
Traffic lights are a lot more random (and therefore mentally engaging) than the types of things train conductors are pointing and calling.
An automotive equivalent of a situation that would benefit from pointing and calling is something like this: https://www.consumerreports.org/car-safety/guide-to-rear-sea...
eg.: "Car parked, ignition off, get child"
https://www.youtube.com/watch?v=afjPmN0GT04
Green signals are pointed at at 2:58 and 3:29.
This idea is more useful for situations that you are initiating, and where feedback is not immediately obvious.
An example could be turning your car’s lights on at night. Before starting the car, you force yourself to point to the switch, say “lights on”, and do it.
I use this with keys. When leaving my office, house, or car, I hold up the key in my hand and establish sight (I don’t say anything out loud). Then I lock the door.
A good example from normal life is (physical) key management. I used to always forget my keys when walking out the front door, which was a big problem since it locks automatically. To solve the problem, I made my back right pocket the designated "key pocket". I now slap my right butt cheek whenever I leave a building. It might look weird to observers, but I have not once forgotten my keys since I implemented this system.
It may seem silly, but if we asked people who drive 30+ minutes every day if they have ever accidentally run a stop sign or red light, I suspect the numbers would be quite high (though it likely happens at times and places where the chance of an accident is smallest, such as empty roads late at night).
As others have pointed out, this is for repetitive tasks that your brain wants to automate away, but you really want to keep in attention.
E.g. force yourself to read the “production” part of your prompt before running the command. Point at the user name before deleting its record. Read aloud the version name before sending it to deploy.
It really makes a difference between just glancing at the info and having to parse it as part of an action.
Now you can have the request and database administration tool open and point and call at the numbers and any queries and make sure you are deleting the right users.
I love finding out that this stuff works.
const HARD_CODE_TEST_DATABASE_FOR_SAFETY = 'unit-testing'
destroyDatabase(HARD_CODE_TEST_DATABASE_FOR_SAFETY)
1. Avoid silly terms our industry should have ditched years ago, like 'drop'
2. Making sure that nobody will ever change HARD_CODE_TEST_DATABASE_FOR_SAFETY because they thought it should 'always be the active database' or whatever.
I have noticed, since learning to cook at a professional level in the kitchen, that I point and call out a lot more in my other activities too. "From hot behind" and "knife" and "oven is over temp" to "Saw blade is live" and "circuit is live" in the workshop to "production server" and "erasing records" in database maintenance. Some days I feel like Sigourney "I have one job damnit" Weaver in Galaxy Quest. It's a useful stop-think-go sanity check.
Explained here: https://www.nydailynews.com/new-york/mta-conductors-point-st...
It also helps when Z results in a total meltdown and you need to pull in more people to help out, so they have context of what happened.
- I am... (who you are and where you are)
- I see... (describe what you see in simple non-ambiguous terms)
- I do... (what action you are taking now)
- I ask... (ask for reinforcements if necessary, you may be asked to justify yourself more)
That is an interesting way of looking at it.
I think a router analogy might be more precise - more like fast path / slow path - where when most packets come in they hit the fast path in hardware, and slow path exception packets get handled by the cpu.
:)
https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-...
I think that there is more to that. You need to consciously type the name of the repo that you want to remove. Windows used to add a lot of jumps to get something done, and the result was mindless clicking the "yes" button and realizing 1 second later that you deleted important information.
Those extra hoops need to be cognitively meaningful.
Same with Terms & Conditions. If you want your customers to truly have read and understood them, you have to show them a short quiz at the end of it. You're required to do a quiz in Europe nowadays if you want to engage in stock trading.
Half of me would want them to put `user-select: none` on that text. The other half has to archive 10+ repos and would hate that!
the entire Eastern Seaboard was without power?
https://youtu.be/XetplHcM7aQ?t=693 (James Burke's Connections, ref. cascading power cut 1965)
Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.
Then you switch back to the original task that had been interrupted by the confirmation box and then you realize you made a mistake.
It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
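To make the "delay the command" idea concrete, here is a small Python sketch of an action that only fires after a grace period and can be undone until then. `DelayedAction` is an invented name and this says nothing about how GMail actually implements "Undo Send"; it just illustrates the pattern with a `threading.Timer`.

```python
import threading


class DelayedAction:
    """Run `action` after `delay` seconds unless undo() is called first."""

    def __init__(self, action, delay: float):
        self._action = action
        self._lock = threading.Lock()
        self._state = "pending"  # becomes "started" or "cancelled"
        self._timer = threading.Timer(delay, self._run)
        self._timer.start()

    def _run(self):
        with self._lock:
            if self._state != "pending":
                return  # undo() won the race
            self._state = "started"
        self._action()

    def undo(self) -> bool:
        """Return True if this call cancelled the still-pending action."""
        with self._lock:
            if self._state != "pending":
                return False  # already started, or already cancelled
            self._state = "cancelled"
        self._timer.cancel()
        return True
```

The lock is there so that an undo arriving at the last moment either cleanly wins or cleanly loses; it never reports success for an action that has already started.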
Also, while undo is great, it's not always technically feasible. The tools in question are basically for modifying the layer that implements undo for your end users, and are often themselves fundamentally irreversible. Undo for raw hard disks involves forensic analysis at best.
No matter what tasks are required to make the dialog box go away - doing math, retyping a message, clicking a randomly ordered box - that becomes the top task in your head and you "forget" about the original task until you finish this task.
Once you resolve the interruption, you switch context back to the original task and then you still have that "oh crap" moment.
Yes, sometimes undo is very difficult, and can require a system designed to support that ability as a first-class feature from the start. In many systems you can perform rollbacks, but there are definitely destructive actions - in which case you should have test stacks to validate your actions in advance, and peer review. (e.g. dual keys to launch the missiles)
Continue: yes or no?
Don't continue: yes or no?
As long as operators know to expect this, they also know to wait and actually read the prompt before answering (as in, turn off the automatic reaction)...
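A rough Python sketch of a prompt whose polarity flips at random, so the answer that means "go ahead" is not always "yes". The function name and the injected `read`, `write` and `rng` parameters are assumptions made for illustration and testing; the two prompt strings are the ones quoted above.

```python
import random


def ask_to_continue(read=input, write=print, rng=random) -> bool:
    """Ask with randomly flipped polarity; True only if the operator chose to continue."""
    negated = rng.random() < 0.5
    write("Don't continue: yes or no?" if negated else "Continue: yes or no?")
    answer = read().strip().lower()
    if answer not in ("yes", "no"):
        return False  # anything ambiguous is treated as "stop"
    said_yes = answer == "yes"
    # "yes" means continue only when the question was not negated.
    return said_yes != negated
```

Requiring the full words "yes" or "no" and treating everything else as a refusal keeps a stray keypress from being read as consent.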
In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.
I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.
I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
Terraform will perform the following actions:
# google_compute_instance.vm_instance will be created
+ resource "google_compute_instance" "vm_instance" {
+ ... <more>
Plan: 2 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
[1]: https://manpages.debian.org/buster/molly-guard/molly-guard.8...
To all machines. Employee and servers alike.
Yes. Including the DNS servers.
Took them a day or two to work out how to roll that one back.
Except the change was to quarantine explorer.exe which was being changed with a patch that just got pushed out. The net result was about 6 hours of the desktop group wondering "why the hell are all of the PCs not logging in right after this patch" followed by about a month of rolling tickets from seldom used computers that had just been powered off.
His excuse was it only showed a file hash in the main screen and you had to view details to see the name plus he had a 3 day change open to roll out the system. Never understood how he got away with that one but such things did catch up to him about 2 years later.
1. Git's Force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power but conditional on you telling git what exactly the state was that you're overriding.
This has two benefits, one is like Rachel's, it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it?
But the other is about race conditions which Rachel doesn't specifically address. If you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile changing and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match.
I believe Force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and some of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway but if anything else is blocking the change then reject it and let me know".
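For what it's worth, the pattern boils down to compare-and-swap: the override names the state it believes it is overriding, and is rejected if the actual state differs. In git this is `git push --force-with-lease=<refname>:<expect>`. Below is a toy Python sketch of the same idea applied to a configuration hold. `Host`, `apply_with_lease` and `LeaseMismatch` are invented for illustration and do not correspond to any real configuration management tool, and a real system would need the check and the write to be atomic.

```python
from dataclasses import dataclass
from typing import Optional


class LeaseMismatch(Exception):
    """The state changed since the caller last looked at it."""


@dataclass
class Host:
    name: str
    hold_reason: Optional[str] = None  # e.g. "James is rebuilding the RAID arrays"
    config: str = ""


def apply_with_lease(host: Host, new_config: str, expected_hold: Optional[str]) -> None:
    """Apply new_config only if the hold on the host is exactly what the caller expects."""
    if host.hold_reason != expected_hold:
        raise LeaseMismatch(
            f"{host.name}: expected hold {expected_hold!r}, found {host.hold_reason!r}"
        )
    host.config = new_config
```

If somebody replaces the hold with "emergency fix. Call Jerry" between your check and your push, the call fails instead of silently overwriting a state you never saw.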
2. Prefer Undo to Confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it if you make that possible.
Not everything can be undone. Undo factory reset isn't a thing. But for lots of things you can't undo, it was just laziness; try to do better in your own software. Your users (which might include you) will be grateful.
Automation to the point of minimal human contact where you assume the human will read the presented information and make an informed decision doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like the ones she proposes is definitely a step in the right direction, IMO.
* You could ask the operator to echo the qps figure...but really any number other than zero is likely to be an error, so it can just error out in that case without needing the confirmation.
* Even if it is serving zero qps now, if it's not explicitly drained at the load balancer, downing it is likely to be a mistake. So even better to check that.
Only once in my career have I taken down jobs serving live traffic. (They were serving 100% errors.) It was deliberate, but even so I wouldn't have minded having to supply a --yes-i-know-im-downing-live-jobs.
edit: and if for some reason my assumption is wrong and downing undrained things becomes routine...well, you'd want to fix that, but as a short term measure going back to confirming a number rather than the force option would be appropriate. It's certainly not good to have an override that's routinely used.
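The checks described in this comment might look roughly like the following Python sketch. `JobStatus`, `may_take_down` and the `force_live` override (standing in for a flag like --yes-i-know-im-downing-live-jobs) are hypothetical names; where the qps and drain status come from is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class JobStatus:
    name: str
    qps: float     # current queries per second
    drained: bool  # explicitly drained at the load balancer


def may_take_down(job: JobStatus, force_live: bool = False) -> tuple[bool, str]:
    """Return (allowed, reason) for a request to take `job` down."""
    if force_live:
        return True, "override: operator acknowledged downing a live job"
    if not job.drained:
        return False, f"{job.name} is not drained at the load balancer"
    if job.qps > 0:
        return False, f"{job.name} is still serving {job.qps:g} qps"
    return True, "drained and idle"
```

The drain check comes first because a job sitting at zero qps can still receive traffic a moment later if it has not been drained.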
How many/which companies have more than one million Linux machines?
One million is a lot no matter how you slice it.
And yeah, physical machines, not VMs. Sometimes they're blades, sometimes they're sleds, but I mean real hardware made out of metal that you can pick up and use to defend the datacenter if you have to.
(Although, honestly, I was talking about global counts in the million+ range when I wrote it since it was referencing the past, but by now, a region with a million+ is not far-fetched.)
I have to type "danger" to bypass this restriction, and I thought it was pretty cool.
Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.
I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.
I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."
> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "
It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.
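Both variants fit in a few lines of Python, sketched here with made-up function names: the count is displayed with separators, and the comparison either demands the bare digits (the article's version) or strips non-digits first (the version suggested in this comment).

```python
def format_count(n: int) -> str:
    """Render the machine count with thousands separators for display."""
    return f"{n:,}"


def strict_match(typed: str, expected: int) -> bool:
    """Article's approach: accept only the bare digits, e.g. '123456'."""
    return typed.strip() == str(expected)


def lenient_match(typed: str, expected: int) -> bool:
    """Comment's approach: drop anything that is not an ASCII digit, then compare."""
    digits = "".join(ch for ch in typed if ch in "0123456789")
    return digits == str(expected)
```

With the strict version, pasting "123,456" back from the prompt fails, which is the point of the article's trick. The lenient version accepts it but still rejects a wrong number.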
Or to all the machines, on one occasion.
(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)
This is not a criticism of bureaucracy or regulation BTW (I'm a fan of both, in general). It's simply a recognition that there's a misalignment of objectives.
Not sure how to analyze the calculus in the case of rachelbythebay's observation. Certainly there is one misalignment, which is that if the tool has sharp unprotected edges (e.g. can take the company's whole site down), the person who ran the program will be blamed, not the person who wrote it. Unless they are the same person, it's hard to get a proper feedback loop in place. The only tools we have are coding standards and code reviews: bureaucracy!
https://digital.gov/resources/paperwork-reduction-act-44-u-s...
It requires the Office of Management and Budget to calculate the impact of record-keeping requirements on time and privacy, among other things.
I do not believe it has resulted in a reduced record-keeping burden. For the most part I simply see an estimate of how long it will take to complete my tax forms and permits, on the form itself. Perhaps others have different views.
Something like: ./dangerous-script.sh $args | bash
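The same "print the plan, let a shell execute it" idea can be sketched in Python: the function only builds command strings, and nothing runs unless the operator pipes the output to bash. `plan_deletions` and the example path are made up for illustration.

```python
import shlex


def plan_deletions(paths: list[str]) -> list[str]:
    """Build, but do not run, one quoted `rm -rf` command per path."""
    # shlex.quote keeps paths with spaces or shell metacharacters intact.
    return [f"rm -rf -- {shlex.quote(p)}" for p in paths]


for line in plan_deletions(["/srv/images/old image"]):
    print(line)  # review this output first, then pipe it to bash
```

Running it on its own is a dry run by construction; execution needs the extra, deliberate step of adding `| bash`.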
The following prefix in a ps1 script enables the -WhatIf and -Confirm parameters:
[CmdletBinding(SupportsShouldProcess=$true)]
To enable -Confirm by default for scary scripts, just use: [CmdletBinding(SupportsShouldProcess=$true,ConfirmImpact='High')]
The nice thing is that in PowerShell, unlike bash, this flows through to the vast majority of other commands. If the script has the snippet above, then you don't have to litter it with "if ( $userSaidYes ) { ... }" blocks all over the place.
Similarly, PowerShell automatically wires up logic to produce all of the useful modes you might want:
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend
This is very fiddly to implement manually, and "Suspend" is likely impossible for most shells.
See: https://docs.microsoft.com/en-us/powershell/scripting/learn/...
# rm -rf some_dir
Then if you accidentally press return before completing it hasn't happened.
When you have reviewed and are sure it is correct, you recall and delete the hash to execute - simples!
I mean, I use the # hack sometimes too, but when I don't, I often find myself afraid of accidentally hitting the Enter key.
"Run it as a query first" gets 90% of the way until you drop a constraint by accident whilst rewriting it as an update :o
alias harikb_script='harikb_script --do-it'
in their .bashrc to eliminate this annoying step.
$ run-script.sh --dry-run
`--dry-run` parameter not recognized
Executing ...
If you believe we should never use nuclear weapons, then don't have them at all.
If you believe there is a case where it may be moral and rational to use nuclear weapons, why would you want to put a potential barrier in the way of their use? You could have a situation where everyone was agreed to use them but the president was physically unable to harm the aide to use them.
You can know that something is the right thing to do but not have the courage to physically harm someone to do it.
An interlock that you may not be able to unlock for reasons unrelated to the task at hand is a bad interlock.
In this specific case the "thing to do" is literally to harm hundreds of thousands of people.
The reasoning behind this proposed interlock is that any logic which concludes that it is moral and rational to harm hundreds of thousands of people must also conclude that it is moral and rational to harm the "interlock" individual. Otherwise, it is likely that dropping the bomb would be a mistake.
Everybody agrees that this is a nuke-them-all situation, but the president, given himself part of the task of ripping apart human bodies, thinks more about the subject and decides another diplomatic round is a better option.
Because you think the point where they become moral and rational to use is way way way further than commonly discussed, and you want to put many barriers of many kinds (physical, emotional, logistical) to delay their point of use without completely blocking them.
You could also say that if a person is incapable of doing the hard parts of the job, don't vote them into the position. (The downside is that you'll end up voting in someone who doesn't mind killing someone in cold blood, while expecting that to be a filter that brings more empathy to the position.)
Tell that to Russia. In the short amount of time only the USA had the bomb the USA bossed them all over with threats of using it.
It's an attempt to make an abstraction concrete. Think of it as the trolley problem in real life.
Stalin is famously supposed to have said, "one death is a tragedy, 100,000 is a statistic". Cynical or not it is how humans think.
> If you believe we should never use nuclear weapons, then don't have them at all.
Strategic game theory and Mutual Assured Destruction depend on the possibility that the other guy will use them if you do, and may be the only way to prevent their use. Interestingly this is one reason why you want the other guy to know your procedures, capabilities, deployments etc. Secret weapons have no deterrent value.
The Soviet General Secretary soon receives a report about what the new policy means tactically. Americans will take several extra minutes, possibly more, to authorize retaliation. (The exact delay is subject to disagreement. Secret experiments are conducted to get the timing down. They are inconclusive.) Amid the decade's mounting tensions, a preemptive nuclear strike looks more tempting than before.
Time is also of the essence for MAD; known delay only makes MAD less effective if e.g. sub-launched cruise missiles are faster than dissection. And do all the fallback commanders need their own willing victim to mount a response?
I guess that’s why they consider the idea here and not there.
<me> team: hey, sanity check this please: hsh -A "dumb_thing && other_thing --foo --bar"
<teammate> shipit
[ I type the command ]
<me> ok, running as job 1234
The last part was a courtesy done so that they could watch the progress of it too without having to dig to find my request. It also meant they could kill it easily if something went wrong and they couldn't raise me for some reason.
Tools like this are best used outside the solo realm.
In many dysfunctional orgs, having someone to blame is desirable. They will use all kinds of words for it like "accountability".
But at the end of the day, heroes who take stupid risks that succeed get rewarded, cautious people who ask questions and try to understand before acting are smugly dismissed, and would-be heroes who burn the house down through recklessness get blamed and make everyone else look good. It's all too common.
(I realize there is a possible error message case if the remote has changed... but I don’t feel like this command is the best one to use to discover whether the remote has changed, if you have no changes you actually intend to force push.)
[1] https://www.blomberguk.com/appliances/integrated-appliances/...
[2] https://www.ikea.com/gb/en/p/smakoka-gas-hob-stainless-steel...
When something happens despite all that, just step back and realize how much worse it could've been, and how successful your safeguards have been up 'til that point.
Then look carefully at the procedure. Is there something about the naming or structure that could be more clear? Can you think of near-misses that resemble the failure you just experienced? Are you using boobytraps in production? Symlinks and overlay filesystems seem clever in the moment but they're bound to subvert our intuition someday. Perhaps you should get in the habit of always using full absolute paths, for instance.
There's always another gotcha, but if your workflow doesn't look as over-the-top safety-silly as aerospace, you're not doing as much as you could be. (Hint: It's not silly.)
First, calm down.
I’m still amazed that he could be so calm when I’d just deleted a bunch of stuff on a client’s production environment.
May not have been the most lucrative company I’ve ever worked for, but it was definitely the best one.
A colleague of mine accidentally ran rm -rf on this filesystem.
It was taking a loooong time, so he realised and killed it, but not before it had removed a heap of stuff. Because this was something that could be rebuilt, it wasn't backed up, so we had to go through the process of downloading the tarballs, and recompiling everything for all the different platforms. It took a few days to recover most of it, and weeks to completely restore things.
The day after the incident, when he arrived at work, he found his keyboard was missing a few keycaps. It took him a while to realise that there were four gone: 'R', 'M', '-', and 'F' ...
Good times.
I don't necessarily always do that, and don't make audible calls, but when driving at night or in inclement weather, I try to make extra effort to check for unexpected cross traffic.
A particular outage that I will never forget took out Gmail delivery worldwide in an instant, because the change was not expected to be disruptive and therefore did not integrate with SRS. As it turned out the change disabled the machines where it was applied, and the process of selecting a subset of machines to canary the change was not independent of the way in which Gmail assigns services to machines, so in the space of a few seconds they created a global outage.
I remember someone hit a bug with docker exec --rm years ago where it started deleting some NFS files that it shouldn't...
Once on the box, we wanted to create a container with utilities in the fs but didn't want to download an image tarball or look through the rootfs layer directories for one to use, so we just bind mounted host root onto another directory, beside the config file we were using.
This worked like a charm. Until we rm -rf'd the config directory and deleted host root in the process.
In our case, fortunately the consequences were minimal as all workloads were stateless. The container scheduler moved all the workloads to other hosts and the host scheduler noticed this VM wasn't responding any more and rolled a new one. The whole thing resolved itself in about 5 minutes with no interaction from us - so that was pretty neat.
From the perspective of an advocate I'd say: If they can't come to terms with killing one, who are they to execute hundreds of thousands?
That way, if any of them are missing, I know they must be in the room I just left.
London Underground hasn't had guards for decades at this point, and the Docklands Light Railway hasn't even had drivers (there is a member of staff who is trained to be able to drive it on every train, but they are usually doing other things) since its creation. If they're misaligning often enough for it to be possible for New York to be statistically better I haven't seen anything about it after repeatedly asking.
In the Netherlands, the NS has two types of trains that go between towns. Intercity and Sprinter. Sprinters have someone who will walk onto the platform at every stop, or failing that, lean out of the carriage, verify that no one is getting in, and then step into the train again to put the key into the receptacle and then turn it. Following that, the doors close. In contrast, there is no such person on Intercity trains; they do fine without. There may be a conductor who checks tickets. In comparison to the DLR, both Sprinter and Intercity trains have drivers.
Is there some requirement or function that I am missing that requires a dedicated member of staff to perform this key-turning ritual at every stop on the DLR and Sprinter, or is this simply to appease the unions?
It could be that Sprinters are meant to be more lenient towards people running to get on than Intercities, which might have a stricter schedule.
That triangular key opens a panel by the front left seats of the train, which reveals a complete set of controls for manually driving the train which that member of staff is trained to use. If the GoA 3 system has given up when the train is just out somewhere random, then "just get out", while technically possible since there's a walking route along the side at all times, is clearly not ideal even for able-bodied passengers, so in fact the member of staff will drive the train manually to a station unless that's obviously impossible somehow (e.g. terrorists blew up sections of track either side like a Hollywood movie).
Because humans are bad at driving trains, they aren't allowed to move at full speed. They can either let the GoA 3 automation oversee everything (e.g. it won't let them go anywhere it wouldn't be willing to go) at a reduced speed or, when that's not useful, switch off all automation and move at a crawl with no oversight.
Every morning the first train of the day on each route is driven in the first of those two modes, because overnight human maintenance teams sometimes manage to leave tools and equipment on the line and the automation doesn't know not to drive the train into a welding kit left on the track by some idiot who just discovered his wife is leaving him or whatever. So the human staff member's job is to drive the train (with the AI preventing them smashing it into other trains) while looking out the front window for problems.
[1] https://blogs.transparent.com/polish/okulary-by-julian-tuwim... (scroll down for english version)
(For the benefit of non-Googlers/Xooglers: borg is a lower-level tool mostly used when everything else has gone wrong and borgcfg is a higher-level, more routine tool. These days people often layer things on top of that as well, because we love piling up abstraction layers. This approach is completely successful because abstraction layers never leak and solve every problem without making anything hard to debug at all. /s)
In my ideal world, even the lowest layer a human ever uses would do safety checks by default. Eg, imagine if the job specification included "query this safety check service on change" and the borg tool (as part of querying the existing job on a cancel/rm command) discovered that and honored it. Most people/jobs would use a safety check that fails taking down a job unless the load balancer reports all relevant services have that job drained. The safety check service could also specify a confirmation prompt (similar to what Rachel is advocating) that could be customizable (like qps or percent of global capacity rather than just number of tasks). The safety check would be effective no matter what layer you use, and there'd be no good reason to use one that would cause prompt fatigue. The outage rossjudson described (and I know he's not the only one who has done exactly this!) would have been avoided.
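To make that idea concrete, here is a minimal Python sketch of a safety check that runs before a destructive action. It is not how borg or any real tool works; `run_with_safety_check`, `is_drained`, and `confirm_exact_count` are hypothetical names made up for illustration. It refuses outright if any target is not drained, and otherwise asks the operator to type the exact number of targets.

```python
def confirm_exact_count(affected, read_answer=input):
    """Return True only if the operator types the exact number of targets."""
    answer = read_answer(
        f"This will affect {affected} machines. Type that number to proceed: "
    )
    return answer.strip() == str(affected)


def run_with_safety_check(targets, action, is_drained, read_answer=input):
    """Apply `action` to every target only if all safety checks pass.

    `is_drained` is a hypothetical callback standing in for "ask the load
    balancer whether this target still serves traffic".
    """
    # Hard failure: something is still serving, so no prompt can override it.
    not_drained = [t for t in targets if not is_drained(t)]
    if not_drained:
        raise RuntimeError(f"refusing: not drained: {not_drained}")

    # Soft check: the operator must show they know the size of the change.
    if not confirm_exact_count(len(targets), read_answer):
        return []

    return [action(t) for t in targets]
```

Passing `read_answer` as a parameter keeps the prompt testable; a real tool would probably also log who confirmed what.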
If it is Postgres (don't know about other dbs), you can go a long way using "savepoints" and "rollbacks" to truly have trial-and-error safe surgery on a db. Still dangerous, but quite helpful. I hate working on any other db without those features. Postgres also allows schema changes to be within a txn envelope.
You can jam a select in the end of the transaction to check what happens.
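A rough sketch of that workflow, using Python's built-in `sqlite3` as a stand-in so it runs without a server (the comment above is about Postgres, which uses the same `SAVEPOINT` / `ROLLBACK TO SAVEPOINT` statements). The `accounts` table and its values are made up for illustration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / SAVEPOINT / ROLLBACK statements below are entirely under our control.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])


def balances():
    # The "select jammed in at the end" to see what the transaction did.
    return cur.execute("SELECT balance FROM accounts ORDER BY id").fetchall()


cur.execute("BEGIN")
cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 1")

cur.execute("SAVEPOINT risky_step")
cur.execute("UPDATE accounts SET balance = 0")  # oops: forgot the WHERE
after_mistake = balances()

# Undo only the risky step; the earlier update inside the txn survives.
cur.execute("ROLLBACK TO SAVEPOINT risky_step")
after_undo = balances()

# Nothing is permanent until COMMIT; here we throw the whole trial away.
cur.execute("ROLLBACK")
final = balances()
```

The three snapshots (`after_mistake`, `after_undo`, `final`) show the mistake, the partial undo, and the untouched original data in turn.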
https://dev.mysql.com/doc/refman/8.0/en/mysql-command-option...
Typing the confirmation and requesting to delete the snapshots.
He had two browsers open, one for development (of CloudFormation, etc.)... but someone asked him to change something in prod.
Both browsers looked identical. Only the account name in the top right corner differed.
Both CloudFormation stacks were identical (instance names, etc.).
He had spent the whole morning launching and deleting the dev environment.
Teammates were joking loudly around his table right before it happened.
Sadly, he got fired (the company was proud of its cost-savvy choices and had no backups other than a few days of snapshots, probably the CTO's choice).
Everybody has off days, or just instances where circumstances misalign in just the wrong way. To pretend otherwise is silly; instead, it's the leader's/team's responsibility to ensure that those sort of off days don't lead to massive losses via redundancy & the sort of measures we're talking about here & in the OP. Firing somebody in these circumstances just acts to severely reduce morale, since we all secretly know in our hearts that it very easily could have been us.
Firing in this case just seems retributive. It's not going to bring the lost data back, and you've just eliminated the very person who could have told you most about the chain of events leading to the incident in question to help you guard against it in the future. These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues. A lack of team focus on reliability/quality, a lack of communication or trust about decisions made (or not made) by higher ups, or so on.
And they are probably the single least likely person to cause a similar incident again -- that person will now likely be double and triple checking their commands for eternity.
If your CTO scattered those landmines all over then "not stepping right" is not an error. It just sucks.
We had an admin in charge of our storage. He had worked with our old vendor's SAN for years, then we got a new SAN. Trained him/certified him etc. He "accidentally" shut down the entire SAN. That brought down the entire company for over 9 hours.
Fast forward two years later, he screwed up again and caused a storage outage affecting about 1100 VMs. Luckily not much data loss, but a painful outage.
Then a month ago, he offlines part of the SAN.
Some people never learn, and recognizing this early is usually better than letting someone continue to risk things.
These words reminded me of a story about similar-looking "flaps" and "landing gear" controls on a plane, where crashes were also blamed on pilots first, before a trivial engineering/UI solution was implemented: https://www.endsight.net/blog/what-the-wwii-b17-bomber-can-t...
This is why it's a good practice to include the environment name in the resource names when it makes sense. Even better, don't append the env name, but use it as a prefix, like ProdCustomerDb instead of CustomerDbProd. I also like to change the theme to dark mode in the production environments as most management UIs support this. One other neat trick is to color code PS1 in your Linux instances, like red for prod, green for dev.
This is definitely a nice one to add. Though I did work with someone once who believed that all servers should be 100% vanilla and reverted my environment colors.
In container-only shops with no ssh, this is less of an issue, and instead you rely on having different permissions and automations for different environments.
Basically, I had a habit of starting a new SQL Server Management Studio instance in its own window for each database I was working on. At some point this struck me as wasteful, for some reason, so I closed all my windows and opened all the databases in one window. Then sometime after that I went to delete the test database as a routine maintenance task, but of course I was used to clicking the database at the top of the left pane in SSMS, which was the test database when it was the only database in a window... but now happened to be the production database. Then five minutes later I got a call from the client company that used our system, to ask me if there was any maintenance going on because everyone's client had just crashed.
The horror when I realised.
It was educational, though. I don't think I'll make that particular mistake ever again. And my bosses were ace to be fair, probably because I worked my ass off to correct the mess that ensued.
Though they're not perfect. They said that one pilot is supposed to read the item, the other pilot say the answer, and the first pilot visually confirm it; but at 1:42, I noticed the first pilot say "emergency exit lights", hear the confirmation, and move to the next item without her eyes moving away from the list.
I'm not sure which of several possible conclusions to draw from that. ("Humans suck", "it is indeed staged", "the procedure has enough redundancy that the chance they're both careless on a given step is small", "the pilots feel that the emergency exit lights aren't particularly important", ...)
A: “Passing control”
B: “Taking control”
A: “You have control”
B: “I have control”
This is how I remember it (6174, UH-1Y).
A: "Belay on?"
B: "Belay on"
A: "Climbing"
B: "Climb on"
Then the climber begins.
It's interesting to me that highly regulated and totally unregulated activities have evolved extremely similar processes. I suppose having your life on the line is a good motivator to follow best practices.
"You have the control."
"I have the control."
IDK if it changes between aircraft types, commercial/private/military cultures, or if it's just coincidence.
"Flaps up selected"
"Flaps are indicating up"
There's a lot to learn from the way airplanes are engineered and operated.
For certain procedures we had a second party (“reader”) observing and acknowledging each part of each step.
(Operator gestures anti-clockwise while pointing at valve XYZ)
Operator: Opening valve XYZ.
Reader: Opening valve XYZ, aye.
Operator: Valve XYZ is open.
Reader: Valve XYZ is open, aye.
Operator: Indications of flow.
Reader: Indications of flow, aye.
People can still get complacent, and things can still get missed but the deliberate mentality goes a long way. Now when GitHub makes me type out the repository name before I can delete it, I sometimes copy/paste... YOLO.
Like when clicking on a file in a directory you just entered and looking for the file, the observer can literally locate and point to the file for the mouse user 5-10x faster than the mouse operator.
The observer seems to interpret the information that results from the directory listing faster than the person who just did the double-click to enter the directory because they don't have the muscle coordination context switch and can immediately move to interpreting the results.
It's probably because mouse manipulation uses brain infrastructure that is more recently evolved, but observe-react is a lot earlier in the brain processing pipeline evolutionarily, and a lot more refined/involved.
https://gitlab.com/brlewis/brlewis-config/-/blob/master/bash...
Well exactly... doesn't that show you that it's a bad idea? People don't know if they could bring themselves to throw the switch even if everyone thinks it makes rational sense.
You're taking a rational, well-considered, strategic decision... and making the interlock a messy personal emotional one unrelated to the actual issue at hand. That sounds like the wrong way around to be doing things?
I don't think so, no. Sometimes we think too abstractly and make what turn out to be poor decisions. Emotions are really valuable heuristics and should be harnessed at a time like this.
If the answer to launch-nukes-by-cutting-a-human-aide is "well, I need more time to think" then maybe that's a good outcome?
Yes, but you can know it's the right thing to do, but not be able to physically do it.
The president's ability to physically cut someone open is not relevant to whether it's a good idea to use nuclear weapons or not. Him being unable to do it tells you nothing about whether they should be launching the weapons.
If the president fails the test that tells you nothing about whether the launch is the right thing to do. Doesn't that fundamentally make the test bad?
Right, but can you understand that 'the President being able to look somebody in the eye before killing them' is not a requisite for 'the employment of nuclear weapons being justified'?
We require the president to be able to do B before they can do A. But what if A is the right thing to do but the President is not able to do B? Being not able to do B does not mean A is wrong.
See the logical disconnect?
Our emotional systems are the product of millions of years of evolution and often (not always, but often) show better judgement than our "higher" faculties. Bringing that part of our capabilities into the decision-making loop is a very good idea.
I'm sure if the president was physically incapable of wielding a knife, she would have someone on hand to do that for her.
But really I meant being able to 'bring yourself' to cut someone.
But maybe you're a Satanist, in which case the reverse order probably makes sense.
Not really. You would need to be absolutely certain that the other party won’t carry out a retaliatory strike before they’re destroyed.
The only thing that matters is that the other party is capable of indiscriminate destruction, not the certainty they’ll actually do it.
It’s like punching someone holding a gun in the face.
If particular systems or people are seeing a high frequency of mistakes, maybe the system design is at fault, not just the person. Obviously it's hard to do in practice, but the ideal is to design systems that are mistake proof.
Great way to invalidate years of experience. Presumably from your telling of the story, he didn't cause problems with the old vendor's SAN?
> "He "accidentally" shut down the entire SAN."
So, was it an accident, or was it an "accident"? You can't have it being a mistake if you're also hinting it was deliberate and malicious.
It was a real accident when he shut down the SAN the first time. I don't know why I put it in scare quotes.
Mine started turning grey in my mid 20s.
Could be related to me doing the electrician's equivalent of deleting production DBs. I've drilled through the comms cable to payment terminals during opening hours. I've run over a copper gas line with a scissor lift. And yes, I've cut live 230V cables with hand tools.
That sinking feeling in your stomach you get immediately after doing something bad - it's universal across professions.
Thankfully, I've never fucked anything major up, and I've had my hands in hospitals, power plants, ISP fiber backbones, police stations and whatnot.
> You're a survivor.
> But I've nearly died, dozens of times.
> Exactly.
A friend of mine who does fire alarm systems was tasked to install one at a bank branch. He found out the hard way that one of the cables for the safe’s safety system wasn’t in the place where it should have been according to the plans. Safe’s safety system hosed, bank branch closed for repair.
Doesn't hurt to use an image that's related to the server's purpose, and to put the name of the server right there in the wallpaper somewhere.
I also like adding redundant conditions to the WHERE so a typo in any single one of them won't sink me.
Finally, change ROLLBACK to COMMIT only when you are positive all is well.
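Here is one way that habit could look when wrapped in a helper, again sketched with Python's `sqlite3` so it runs anywhere. `guarded_update`, the `users` table, and the `expected_rows` check are all hypothetical additions for illustration: the transaction is rolled back by default, and is committed only when explicitly asked for and when the number of changed rows matches what was expected.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows, commit=False):
    """Run one UPDATE/DELETE in a transaction that rolls back by default.

    Returns (rows_changed, committed). The connection must be in autocommit
    mode (isolation_level=None) so BEGIN/COMMIT/ROLLBACK are explicit.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        cur.execute(sql, params)
        changed = cur.rowcount
        # COMMIT only on an explicit request AND a matching row count.
        if commit and changed == expected_rows:
            cur.execute("COMMIT")
            return changed, True
        cur.execute("ROLLBACK")
        return changed, False
    except Exception:
        cur.execute("ROLLBACK")
        raise


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "alice@example.com", "active"), (2, "bob@example.com", "active")],
)

# Redundant conditions: id AND email must both match, so a typo in either
# one changes zero rows instead of hitting the wrong user.
DISABLE = "UPDATE users SET status = 'disabled' WHERE id = ? AND email = ?"

dry_run = guarded_update(conn, DISABLE, (2, "bob@example.com"), expected_rows=1)
typo = guarded_update(conn, DISABLE, (1, "bob@example.com"), expected_rows=1, commit=True)
real = guarded_update(conn, DISABLE, (2, "bob@example.com"), expected_rows=1, commit=True)
```

The dry run reports how many rows would change without keeping anything, which is the "ROLLBACK first" step; flipping `commit=True` is the deliberate second step.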
That way, if you accidentally send it, the command fails and nothing happens.
Regardless of the exact details, I think the point of this thought experiment is that for a head of state, the decision to launch a massive attack that will cause hundreds of thousands of casualties can feel a little abstract. "Bombing a city" can seem abstract, even if the president understands this means killing children. Understanding is quite different from feeling. However, if the act of ordering a bombing raid on a city involved physically murdering a child, it would definitely feel more immediate and less abstract.
Your point stands, of course. But the part about removing the abstractness of the act seems relevant when ordering people killed.
- First you have to lift one hand up off the keyboard and put it down on the mouse. This may or may not mean taking your eyes off the screen.
- Then you need to find the mouse pointer on the screen
- Then you need to aim for what is usually a relatively small target and move the pointer there.
- If you're right-clicking, the right-click menu usually presents more small targets you need to aim for.
- If you need to use the keyboard, again you have to move your hand over to the keyboard from the mouse.
For finding the pointer, I developed this unconscious habit of slamming the mouse pointer to the very top-left of the screen. It's difficult though when on someone else's machine, where your brain isn't used to the pointer velocity or where multi-monitor means that slamming the mouse to the top-left actually puts the pointer on another monitor.
People look at me in awe when I'm using a two-pane file manager but honestly not having to take your hands off the keyboard and not having to move your eyes off the screen gives so much better flow. It's also why I like the UI of Blender - one hand on the keyboard and one hand on the mouse at most times.
I accidentally learned, when teaching a course at a site with too many people for the available machines, that pair exercises were very effective: I got lots more questions and overall learning went way up. If the pair discussed it and couldn't find an answer, they would have the confidence to ask. On their own, neither would probably bother; they would just wait for me to go through things.
ib production && ssh production-machine
ib demo && ssh demo-machine
It's definitely helped me when testing a fix on a demo or staging instance, and has helped me avoid doing it on production accidentally.
If you cannot kill your friend to kill a few hundreds of thousands more, how can it possibly be justified? I just struggle to come up with a scenario where that is the case.
Of course I’m of the school that thinks firing nuclear weapons is never a good idea.
I kind of feel like we’re going in circles though, so maybe better to just stop here :)