Why should I do production support? (devenbhooshan.wordpress.com)
Production support alone is not that much of a problem. What the author skipped (conveniently? or forgot to mention?) is - it's really the "on call" phenomenon that's the problem.
The "typical" on-call - where when you are on-call you are magically on-call 24x7. Yes, during your sleeping hours as well; as if that's less important and the company can avoid spending money to hire dedicated support for those hours and instead make you suffer (yes, it's just that - there's no other name for it like "satisfaction", "learning", "growing" or any of those buzzwords).
You want engineers to do production support? Well, let them do it during normal office hours and only a few times a month. Or heck, let them do it for weeks but let them punch in and punch out at normal office hours. Let them choose to do only one half of the day and have someone else willing to do the other half.
There's no excuse for burning out engineers (esp. unsuspecting youngsters) by pushing them into ungodly hours of work ruining their health among other things while trying to constantly tell them - "do you even realise what a service to humanity you are doing!".
It's just exploitation.
If a company thinks an application is important enough to run 24x7 then it should staff for 24x7 support. Stealing wages from workers by expecting them to be available 24x7 (on-call) is an absolute abuse.
It also leads to burnout, poor performance during the day (how is a dev's development ability when they were up at 2:30am on an incident call the night before?), and clouded thinking that causes mistakes or slows recovery during incidents.
And where it really matters, they do. My team and I build and manage a large Emergency Services telecommunication network. We have Tier 1/2 operators on shift work 24/7. Tier 3 staff (programmers, system integrators and administrators) are their escalation point for critical issues outside of business hours.
> Stealing wages from workers by expecting them to be available 24x7 (on-call)
The Tier 3s that are on-call in our environment are on a rotating roster and are compensated nicely for being prepared to answer the phone outside business hours. Frequently they don't get called during their week at all and it's free money.
> how is a dev's development ability when they were up at 2:30am on an incident call the night before?
Easy, as well as the financial compensation, we give them time in lieu. Two hours callout in the middle of the night, two (paid) hours given back on their next working day, or whenever they prefer, subject to availability of other staff.
There are simple solutions to these problems, and where they matter, they are applied. Granted things are very black and white for us as lives are potentially at stake, but any company that wants to have 24/7 engineers available needs to pay for that kind of support.
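For anyone who wants to see how little machinery this needs, here is a rough sketch of a weekly rotating roster plus the "two hours callout, two paid hours back" rule. The names (`Callout`, `weekly_roster`, `time_in_lieu`) and the one-for-one ratio are my own assumptions for illustration, not how any particular team implements it.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import cycle


@dataclass
class Callout:
    engineer: str
    hours: float  # time spent on the out-of-hours call


def weekly_roster(engineers, first_monday, weeks):
    """Assign one on-call engineer per week, rotating through the list."""
    rotation = cycle(engineers)
    return [
        (first_monday + timedelta(weeks=i), next(rotation))
        for i in range(weeks)
    ]


def time_in_lieu(callouts):
    """Paid hours owed back per engineer, assuming one hour back per callout hour."""
    owed = {}
    for c in callouts:
        owed[c.engineer] = owed.get(c.engineer, 0.0) + c.hours
    return owed
```

A more generous policy (say, a minimum half day off per night callout) would only change the body of `time_in_lieu`.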
I would not trust someone who I just woke up at 2am to do something. He/she is mid-sleep. They will be prone to errors, they will be super tired, and I just ruined their next 1.5 days that it will take them to recover from that.
This is not a job where you lift boxes, where intellect is not needed as much (though strength and stamina will also be affected by a midnight alarm). You want your folks to be at 100%, otherwise they may make things worse.
"We need [insert thing manager asks for here] immediately" has consequences.
You are less of a person and more of a means to an end. A tool to achieve something, and some tools are disposable. It can be of career advantage to a manager to burn out engineers. Maybe instead of spreading 24x7 on-call across 3 teams in three timezones, you put it on 1 team in 1 timezone. By doing so a manager can achieve a lot with fewer resources, and hopefully secure their own elevation up the corporate ladder before the cost of their strategy becomes evident.
The cost of burnout, I think, remains hidden. In technology there's a constant flux of staff anyway, with teams being created and dissolved; in all the noise, a few people being exhausted and bailing from the company is hardly noticed. Perhaps they said something before they left, but it's best for everyone in middle management if the burnt out individual is labeled the problem: they were a bad culture fit, you see, a grumbler who didn't have what it took.
I'm happy to hold the pager if I've also got the right to block/rollback deploys until the system is stable - my current job has had two out-of-hours pages in the past year, and we're in the alexa top 10k so it's not like there's no traffic.
I've been at this for 35 years at many companies and working with many teams and it's always the same: if you want good software then make the team creating it also support it. In every case I've experienced it leads to software requiring little to no support, easy to maintain, and easy to extend. Why? Because nobody wants to get up in the middle of the night or work weekends and moreover, they'd rather be adding features than limping along with existing features.
I've worked at too many places that had no SWEs on-call for the on-call alerts that I get, which the vast majority of the time involves throwing a bandaid (as in redirecting traffic, etc) in front of an internal bug that I hope eventually gets fixed once the RFO/etc has been submitted before it hits my NEXT on-call rotation or my poor coworkers.
Without SWEs on my rotation they don't understand the immediacy. They aren't the ones getting their Christmas week interrupted every 4 hours while ops keeps the house running. In Ops having your entire day ruined by various on-call alerts usually feels like you're working without any breaks and nobody even cares.
Anyone want a bad golang developer, wannabe ex-ops person who knows a lot about platform reliability and o11y and wants to focus on the golang end finally? I'll make your team's automation and o11y purr no matter where it is (bm, cloud, global pops, serverless...).
It creates an actual market for on call work where engineers can simply say no to the extra cash if they don't like work taking up their nights and weekends. If the company is having trouble with no engineers wanting to be on call the pay is simply too low and needs to be increased. It's a job like any other and should be compensated as such.
In the end I honestly believe it will be beneficial for the company not having engineers burn out so quickly. Compensation also clearly sets the expectations — if you're being paid to do it you'll take it more seriously.
Just my 2 cents
Where I work we are not on-call. Nevertheless, I try to help the ops team when they encounter issues. This does make you improve logging and error handling since you know it takes a lot more time when it's difficult to filter logs for the interesting events.
Engineers not exposed to production issues and customers will never understand why you need these extra measures.
I don’t know who needs to hear this besides me twenty years ago, but if you want to do charity then go home at 6pm and volunteer at a real charity. Don’t do it for a wannabe robber baron who will not share with you. Don’t do it for someone where even an emotional payoff is years away or may never come.
Find something else you care about and help some people just because. Not because you’re getting under-paid and over-guilted to do it.
On the other hand I had a friend at a very large and well known company. He got a job offer and was hired into one department, but he wanted to take a little time off between jobs before he started. They somehow convinced him to start saying there was a holiday coming up and he could take the time off then.
And as soon as he came in he started getting calls at 1am, 2am, 3am, etc.
So he left.
And they cajoled him back saying things were different, and he finally bought it and went back.
The same thing happened again, and he quit for a final time.
One part of the problem was that he was a US citizen working with a bunch of H1B visa folks and the company could get away with that sort of stuff. H1B folks will say yes sir, no sir because their dreams of living in the US are tied to keeping their job at all costs. And then the bad work culture festers.
I ran the Engineering org for a startup and we had a small, 3 person Ops team that handled initial triage of events. About 75% of these issues were Engineering-related. My solution was to a) create an on-call rotation for Engineering and b) allow the Engineers to prioritize reliability work.
It sounds like a no-brainer, but I had to fight with the rest of the exec team to allow b) to happen, since it came at the expense of the product roadmap. I eventually won the fight and our nightly on-call volume went from 1-2 incidents per day to 1-2 every few months.
Vindication came about a year later when we were acquired by a large company. As part of the due diligence process (including 18 hours with me going over technical details in front of 30 senior folks from the acquiring company) we got major kudos for having a level of reliability that far exceeded what they typically saw for a company our size.
When they ask it’s usually after a giant hole has been dug, and patterns have been set. If you knew from the outset that you would be the support team then you’d have prioritized some other tickets. You’d have increased the estimates on others. You would have refused to work on these three, you would have argued vigorously about these four decisions, and you would have insisted your boss fire “That Guy” months ago because his code is garbage and his only real skill is articulate deflection.
This group of folks wants several somethings for nothing. One of them is labor, another is somewhere to assign blame. They are grooming you for failure and we all deserve better.
In the US, this is happening across the board and not just in tech. The expectation to always be available [often without monetary compensation] is sadly the new normal. Without strong labor laws in place, this implicit form of exploitation will never cease.
seriously
I get that you like unions but just because you have a hammer doesn't make every problem a nail.
Seriously
If it's a serious issue they can't handle they might wake up one of us programmers, but usually they can find some temporary fix or workaround until the next morning.
They had to quit to get out of support.
However, production support teams don’t have a real understanding of our application and how it’s built. So most of the time you have engineers on call with production support, telling them how to debug the problem and come up with relevant logs.
It’s incredibly infuriating and time consuming, and I absolutely hate doing it this way.
90% of the time you also get incredibly vague bug reports with irrelevant logs, and a description of what they think the problem is. Most of the time you need to spend another day finding the correct logs and somehow debugging it. Most teams log every single request with all parameters and payloads so that they can just replicate the problem locally instead of relying on production support.
We’ve long advocated for either having dedicated support or having engineers on some sort of schedule who can do support.
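To make the "log every request so we can replay it locally" idea concrete, here is a minimal sketch using only the Python standard library. The function names and the record fields (`request_id`, `method`, `path`, `params`, `payload`) are hypothetical, chosen for illustration. In a regulated environment the payload would also need sensitive fields redacted before logging, which this sketch does not attempt.

```python
import json
import logging

logger = logging.getLogger("requests")


def request_record(request_id, method, path, params, payload):
    """Build one JSON line holding everything needed to replay the request."""
    return json.dumps(
        {
            "request_id": request_id,
            "method": method,
            "path": path,
            "params": params,
            "payload": payload,
        },
        sort_keys=True,
    )


def log_request(request_id, method, path, params, payload):
    """Write the request record to the log and return the logged line."""
    line = request_record(request_id, method, path, params, payload)
    logger.info(line)
    return line


def replay_args(line):
    """Recover the request arguments from a logged line for local replay."""
    record = json.loads(line)
    return record["method"], record["path"], record["params"], record["payload"]
```

With one self-describing line per request, a developer only needs the request ID from the bug report to pull the exact input, rather than another round trip through the ticket queue.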
My code will crash sooner or later. I already know that. I don't write 100% bug-free code. But I cannot accept to give 100% of my time one week per month or so to a company in exchange for money. I just don't understand why people can't understand that I can be a professional only during 8 hours per day, but not more.
I would argue all developers should be required to do some support work.
I'm not saying you wouldn't learn from working on production, but whether it's worth the stress is another question. In terms of software development, it's hard to think of a worse feeling than when you do a production deploy, you hit refresh on the website or whatever it is, and it shows a fatal error, then there's a mad scramble to roll back the change and figure out quickly what went wrong before the consequences grow too great. Most of the time bosses + coworkers aren't that understanding about it either and get into finger-pointing.
They're never offered extra work though. Companies are always willing to wait for Monday when they are asked to put money on the table.
Production support is customer support: responding to chat messages or communications from users.
An on-call rotation, on the other hand, involves responding to production incidents and mounting a proper incident response.
The Google SRE workbook has a great chapter on the subject: https://landing.google.com/sre/workbook/chapters/on-call/
Or to stagnate, depending on how you look at it
I have always had mixed feelings about "on call". I dread my turn on the rotation because the imminent threat of a prod issue has a psychological impact on my entire week, even off hours, and usually for a day or two after.
If everybody on the team feels that way, maybe it can act as a forcing function for product quality. I've seen this work on teams that already cultivate a strong sense of ownership.
On the flip side, it really stresses me out, and I sometimes resent that I'm not getting paid overtime for 24hr on call days. Maybe that's just baked into an engineer's salary these days, though...
What I want is to run an engineering organization as if you should never have to call us. And if you do you either get chewed out for making a frivolous call, or we’re falling all over ourselves because that thing that is happening should definitely not be happening and we’ll be looking at how to keep that from ever happening again, again.
I've also seen folks spotted extra time off for really gnarly oncall shifts. Folks should push to have such accommodations standardized.
A good structure is to have first line support be relatively generic ops people. They can handle problems related to infrastructure, e.g. hardware failures, network problems, or issues that can be handled by adding resources. The deployment process should be consistent enough across applications that they can e.g. roll back to a previous release.
This covers the majority of production problems. After that, it's time to bring in someone who understands the details of how the application works. If the dev team is geographically distributed, then someone is available during working hours. Otherwise, we have to get someone out of bed.
If the dev team has done their job right, this should be a rare occasion. Making the dev team fully responsible for the reliability of the application means that they are motivated to make it reliable. Otherwise there is a tendency to have an underclass of ops people who get abused.
A fundamental mindset here is taking responsibility for the user experience, including reliability. If this is not owned by the product development team, then who?
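As a rough sketch of that structure: first-line ops take infrastructure problems and rollbacks, application problems go to a developer who is currently in working hours, and only if nobody is awake does the on-call developer get paged. The incident kinds, the `Developer` type, the 9-to-17 window and the whole-hour UTC offsets are simplifying assumptions of mine.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    INFRASTRUCTURE = "infrastructure"  # hardware, network, capacity
    BAD_RELEASE = "bad_release"        # fixable by rolling back
    APPLICATION = "application"        # needs knowledge of the code


@dataclass
class Developer:
    name: str
    utc_offset: int  # whole hours from UTC (a simplification)


def is_working_hours(dev, utc_hour, start=9, end=17):
    """True if the developer's local hour falls inside the working day."""
    local_hour = (utc_hour + dev.utc_offset) % 24
    return start <= local_hour < end


def route(kind, utc_hour, developers, on_call):
    """Return who should handle the incident."""
    if kind in (Kind.INFRASTRUCTURE, Kind.BAD_RELEASE):
        return "first-line ops"
    for dev in developers:
        if is_working_hours(dev, utc_hour):
            return dev.name
    # Nobody is at work anywhere: someone has to get out of bed.
    return on_call.name
```

For example, with developers at UTC+0 and UTC+10, an application incident at 02:00 UTC goes to the UTC+10 developer, while one at 20:00 UTC falls through to whoever is on call.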
> I no longer work at Gojek
This also seems very telling
Twenty years ago, for many systems, devs could do whatever they wanted in production. There were insider trading scandals and, combined with SOX, regulators cracked down on it, so now devs have lost at least write access. If you have an old system that relied on knowledgeable devs to fix stuff, it's a terrible situation where people just quit and no one can support it.
You're right that production support has access to those systems, and could potentially make changes and install different binaries, but the number of people who can do that is extremely limited. Every change also requires a change request that needs several approvals; to request data you need another data request.
They can, they just can’t have direct access to live systems due to separation of duties. But there are methods for dealing with this, like centralised logging so a developer never needs to see the original log file on the problematic box.
"We don't trust" a dev. The change management processes demand the existence of 1) Dev, 2) Librarian (we used to call them that)(that would review and transfer the code, or review and compile the code), 3) the prod sys admin.
Some orgs may have a slightly different setup, but in some form or another these general rules apply.
Today, with tools like CyberArk, it is easier to grant temporary privileged access to a dev for production support. We also have the tools to trace/monitor/record access, so it makes the process auditor-friendly.
To be fair, being great wasn't enough; their job was only possible because the company had unified tooling. A single deployment solution that was deploying near 1M tasks a day in the company, allowing all employees to look up what is running where and see logs.
This made me appreciate just how useful it is to have both dedicated support AND unified tooling. The average company couldn't benefit from having folks on rota because it's impossible to figure out where anything is running.
The thing is, this all is pushed down from management. In my previous project, we tried to automate as much as possible, but at the end of the day, our production support still wanted to deploy manually. Our business still wanted to see manual end-to-end tests with screenshots.
Then there's also different regulations in certain countries where you need to host your application and database in the country itself, so that's another solution.
Working in finance can be a real eye opener sometimes.
This reeks of bad documentation to me (which finance is notorious for). If a dev has to be on to support normal prod ops, that's largely due to errors in documentation and often to poor tooling. Sometimes those errors aren't so much the devs' fault because of management decisions, usually related to understaffing, but I hate how prod support gets shit on so often for failing to fix an issue when it's not really their fault.
> This reeks of bad documentation to me
Not necessarily, you can document your entire application, but production support only looks at the logs, and does a data extract based on what they see. It would be far more beneficial if you had someone who has a clear understanding of the application so that they can help with debugging and actually solving the problem.
At the end of the day, production support are teams who help with 10-20 applications; it's impossible for them to truly understand specific applications. They receive a bug report from the business, investigate and extract logs, then pass it to the relevant development teams. If you need extra info, well, tough luck: you can reply to the ticket and wait for it to be picked up again. It's no surprise companies like this move so slowly.
In the off chance that a dev has the unique knowledge to solve a problem, they may get the firefighter/temporary elevated access needed, but will have to document the reason and the dev's actions very very well, because both internal and external auditors will zero in on that.
On-call: "Hey devs, I'm being woken up at 3AM because your app sucks. Please fix it." Devs: "Sure, no problem."
4 months go by
On-call: "These alerts are still coming in at 3AM. Did you fix the issue?" Dev: "We have a lot of work, we can't dedicate all our time to some minor problems, we have a deadline."
Next week, Devs are put on-call.
The alerts are fixed in two weeks. Site reliability goes up. Apps suddenly become more resilient to failure.
Honestly, the whole attitude of not wanting to work more than 8 hours is privilege. Most of the rest of the world works long hours. As a dev, you get a good salary and a job you don't have to break your body to do. The least you can do is be completely responsible for your own code.
And it helps you as an engineer. Like the article points out, it creates empathy for the users and product support engineers, it helps you improve architecture and app design, and it helps you understand different failure domains. You won't learn all that on your own time, especially without the scale of production.
> Honestly, the whole attitude of not wanting to work more than 8 hours is privilege. Most of the rest of the world works long hours. As a dev, you get a good salary and a job you don't have to break your body to do. The least you can do is be completely responsible for your own code.
Unless I signed a contract that states I will do on-call, I'm not gonna do on-call. It doesn't matter how long the rest of the world works.
Well, that's another problem (the dev not being able to solve a bug that reappears at 3AM).
> The alerts are fixed in two weeks. Site reliability goes up. Apps suddenly become more resilient to failure.
I always wondered why DevOps has the "Dev" in its title. At least, in most of the companies I have worked at, it was the DevOps people who were on call (paid), but they were very picky regarding what they could touch/work on (they almost never touched application code... we should call them "Ops" then, no?).
> Honestly, the whole attitude of not wanting to work more than 8 hours is privilege.
And it's a privilege I'm thankful for. What's wrong with that?
> As a dev, you get a good salary and a job you don't have to break your body to do.
We do break our body to do software engineering (our brains, to be more specific). If you think physical work >>> brain work, well, that's relative. Every person is different, and for me, brain work is as taxing as physical work.
> And it helps you as an engineer. Like the article points out, it creates empathy for the users and product support engineers, it helps you improve architecture and app design, and it helps you understand different failure domains
I know I can become better by working harder and smarter (it's obvious), but I just want to be the best version of myself by putting at most 40h/week. Isn't that something honourable in itself? Or does that make me a "bad engineer"?
Most of the world works labor jobs, and studies have shown that the body can work longer than the mind without burnout.
Too often I see BigCorp development teams seeming blatantly oblivious to where their pain points are, and it's because they aren't forcing their developers to do support. They're pushing code, but they aren't pushing code that solves real problems for people.
No one expects customer support people to write code. Why? Because they don't have the skillset.
Yet people who make this argument seem to think any moron can do support.
The skillset for an engineer is not a superset of a customer support person.
Have your engineers sit in on support, by all means, but actually making them DO support will result in unhappy engineers and sub-par support.
Do not undervalue a good support person. They have a whole suite of skills engineers often don't have.
You aren't looking for the hardest problems, you're looking for the problems your users hit the most that an engineer could reduce in the product.
Are the support folks considered engineers?
Is someone covering the full 24 hours, or is it just out of hours?
Are they under a one week in X rota, or expected to be permanently available?
Are they expected to cover anything/everything, or are they just the escalation point for their specialist area?
What other support is available? (i.e. if the shit really hits the fan, are you left to deal with it alone?)
On average, how many times are callouts expected? There's a big difference between half a dozen times a week and half a dozen times a year.
How are the extra duties remunerated/compensated? Is there time off in lieu?
There's a massive spectrum there ranging from hugely unpleasant and not worth the money to not particularly onerous and helpful extra cash/time off.
Indeed, most of our newer services have been doing that, so developers have direct access to logs, which makes our lives a bit easier.
But sadly a lot of systems are outdated, and nobody wants to invest time and money into implementing things there.
Now let's say the plumber works for a contractor. The contractor tells the plumber not to tell the customer about the mistake, because it would make the contractor look bad. The plumber can choose to own up to their mistake to the customer and right the wrong, or they can do what the contractor wants and charge the customer.
On the one hand, the plumber might decide to charge the customer. They keep their job, the contractor makes money. On the other hand, maybe the customer is poor and can't really afford the repair. If it doesn't get fixed, the customer'll have to deal with the brokenness themselves, even though the plumber knows they caused it. But then again, maybe the plumber is broke and really needs this subcontracting gig.
There is no simple answer there. But I think that in the context of software development, in most cases, the answer is simple. Most of us are fortunate enough to have the extra time and money to spend fixing our own bugs, regardless of what we're told to do during the regular 9-to-5. When we have the opportunity to take responsibility for our actions, we oughta.
This is the hard part of software.
But yes, they are often Ops, and in general they shouldn't be troubleshooting the application. They can isolate system issues by looking at and correlating metrics and events in the high level systems, but only a developer of the code in question can efficiently and effectively diagnose a specific bug in code in a reasonable amount of time.
You've seriously not heard the phrase 'check your privilege'? In general it means not to take it for granted, but specifically to not take advantage of it at the expense of others. Just because you aren't forced to care for others doesn't mean you shouldn't. If you have privilege, your moral imperative is to ensure it's used for good, not evil (indifference is often the latter when affected by those with privilege).
How is software dev breaking your body? Are you typing with your face?
If you want to be the best version of yourself within 9-5, please fix the bugs that alert on-call the first morning you hear about them. Most developers never do this, which is why they are put on call.
Last week I was on-call. The vast majority of the time, when it's the application at fault, I don't know who the hell wrote it or where to begin with troubleshooting it, so I need a dev to look at it, because we don't have time to waste if the product is down. This doesn't seem like a controversial idea to me: you are hired by a company to make a product that works, so if your product isn't working, you need to fix it. I can't always fix it. I need help sometimes. It's literally your responsibility as a professional adult to help.
If the product you work on has to work 24 hours a day, you have implicitly agreed to support it during that time. Otherwise you can get a different job where if the product breaks, it can wait until morning. I've had those too, and not being on-call was great! But with my current role I knew it would require some on-call time, because those are the products I'm helping to build and run.
So I spend part of my time improving the product support team so that everything is as resilient as possible, but also so that devs can understand how their code affects the products. This means being as involved in architecture and design decisions as with deploying infrastructure. And I stop at 40hr/week too, but once every 8 weeks, I get a few calls about broken shit, and I put in the time to help prevent those from re-occurring, because most people I work with don't.
I wrote about it. I consider my brain just like any other muscle of my body, so after 8 hours of work, my brain is quite tired. Perhaps my wording was wrong.
> If you want to be the best version of yourself within 9-5, please fix the bugs that alert on-call the first morning you hear about them. Most developers never do this, which is why they are put on call.
Agree.
> you are hired by a company to make a product that works, so if your product isn't working, you need to fix it
Agree as well, but my contract states "40 hours per week". I'm being plain straightforward here, I'm not willing to give anything for free to any company (does that happen the other way around? Never). Not sure what's "wrong" with this.
> And I stop at 40hr/week too, but once every 8 weeks, I get a few calls about broken shit, and I put in the time to help prevent those from re-occurring, because most people I work with don't.
I have nothing against that, and I respect it. I guess the other way around should work as well, right? Like, if someone is not willing to give more than what's stated in their contract, that should be fine for everyone. What you call "help" sure is help, but companies are taking it as free labor. I have nothing against companies making money, but I do not support companies making money without having to pay employees for that. But the main point of my first comment was: even when companies are paying for on-call, one should have the option to say 'No, thanks. I don't want to give you my free time in exchange for more money. I already have enough with my 40h/week schedule', and that should be fine for everybody.
Desk work comes with significant bodily stresses which ergonomics can only partially ameliorate.
I'll take it over digging ditches, sure. But I won't do more than six hours of it a day, that's as long as I can healthily stand, and several hours longer than I can sit without pain.
CompanyA is using ITS OWN assets, funds, IP, etc. You own it, you can burn it to the ground.
BankB is holding other people's money. A bank can't lose 100m of OUR money and then say "oops my dev made a mistake".
Edit: similar expectations are in publicly traded companies (aka companies where they use OUR money - we give them our cash and they give us stocks). This is why external auditors (e.g. Big4) do not like when they see "poor change management processes", such as inconsistent SoD.
Not only that, but once that happens, regulators will come in, and everybody involved can be held liable. Not only will the bank be fined, but depending on how bad your fuck up was, you'll probably end up losing your job and might face further penalties.
So in the interest of everyone, it's best to just avoid it altogether.
2. Not every bug is directly related to the product owner/team. Bugs can be introduced from different teams/processes.
3. Requiring senior engineers to do L1 and L2 support is likely a misallocation of resources when less senior people can handle those issues.
That, I think, is the wrong perspective. When people are on call, they have to be somewhere near their computer/internet connection and be ready to work (so it's not as if you can just go to a party and, if a call happens, do some quick fix in a toilet).
People on call cannot do what they want with their time, so they don't get money for free.
This still sounds cheap to me. I have never worked on-call (and I never plan to), but the exhaustion cost of working two hours in the middle of the night is not equal to two hours of uninterrupted sleep. I would expect to get at least a half day off (paid) for any amount of middle-of-the-night work.
Every company should do this but none I've worked at do. To be honest, I just take the makeup time myself.
I didn't note this in my post above, but I always gave time-in-lieu for any late night activity. However, the thing that REALLY worked best was allowing the Engineers to prioritize reliability. I had to fight to make it happen, but going from nightly incidents to one every couple of months was worth it.
But yeah, some legacy systems could be 5 years old, and that’s a long time in tech.
You’re right on the visibility part, but sadly that’s an organisational issue, you need higher ups to change this.
Where I come from, unions are mainly a way for lazy employees to get immunity while doing nothing all day long.
I would still maintain that if you live in a country where it's legal to be called at any time, any day, then you have a third world class labor law - go downvote US :>
If other things are considered more valuable than proper labor law (like, say, building a wall), then I guess voters get what they deserved.