Graceful server restart with Go (blog.appsdeck.eu)
That's why I wrote socketmaster[1]: it's simple enough that it doesn't need to change, and it's the one handling the socket and passing it to your program. I haven't had to touch it for years now.
For my current work I wrote crank[2], a refinement of that idea. It's a bit more complex but allows restarts to be coordinated. It implements a subset of the systemd socket activation protocol conventions. All your program has to do is look for a LISTEN_FDS environment variable to find the bound file descriptor and send a "READY" message on the NOTIFY_FD file descriptor when it's ready to accept new connections. Only then will crank shut down the old process.
* [1]: https://github.com/zimbatm/socketmaster
* [2]: https://github.com/pusher/crank
Edit: more concise explanation, in the context of SO_REUSEPORT: http://lwn.net/Articles/542718/
I think the article also misses an important step: you need to let the new process initialize itself (e.g. read its config files, connect to the db, etc.) and then signal the parent that it is ready to accept connections. Only at that point does the parent stop accepting. The important point here is that the child may fail to init, in which case the parent should carry on as if nothing happened.
If you really can't afford someone getting a "connection refused" what happens when the machine's network connection dies?
First step of a deployment: shift traffic away from the machine, while allowing outstanding requests to complete gracefully. Next you can install new software or undertake any upgrade actions in isolation. This way any costs involved in the deployment don't impair the performance of real traffic. Bring the new version up (and prewarm if necessary). Finally, direct the load balancer to resume traffic. We call the general idea "bounce deployments", as a feature of the deployment engine.
Two advantages of having a general-purpose LB solution:
(1) You can apply it to any application or protocol, regardless of whether the server supports this type of socket handoff. Though to be fair, some protocols are more difficult to load balance than others - but most can be done, with some elbow grease (even SSH).
(2) It's possible to run smoke tests and sanity tests against the new app instance, such that you can abort bad deployments with no impact. Our deployment system has a hook for sanity tests to be run against a service after it comes up. These can verify its function before the instance is put back into the LB, and are sometimes used to warm up caches. If you view defects and bad deployments as inevitable, then the ability to "reject" a new app version in production with no outage impact is a great safety net. With the socket handover, your new server must function perfectly, immediately, or else the service is impaired. (Unless you keep the old version running and can hand the socket back?)
(By LB I don't necessarily mean a hardware LB. A software load balancer suffices as well - or any layer acting as a reverse proxy with the ability to route traffic away from a server automatically.)
A technique like this would also be useful for implementing single-points like load balancers or databases, so that they can upgrade without outage. Though failover or DNS flip is usually also an option.
1. Won't this leave the parent process running until the child completes? And if you do this again and again, won't that stack up a bunch of basically dead parent processes? Maybe I'm misunderstanding how parent/child relationships work with ForkExec.
2. What if you want the command-line arguments to change for the new process?
3. In addressing (2), in general would it be simpler to omit the parent-child relationship with a wrapper program? The running (old) process can write its listener file descriptor to a file, similar to how it is done here, and the wrapper reads that file & sets an environment variable (or cmd-line argument) telling the new process?
The wrapper could be used for any server process which adheres to a simple convention:
on startup, re-use a listener FD if provided (via env or cmd line ... or ./.listener)
once listening, write your listener FD to well-known file (./.listener)
on SIGTERM, stop processing new connections but don't close the listener (& exit after waiting for current connections to close, obvi)
4. Am I the only one who finds "Add(1)/Done()" to be an odd naming convention? I might go with "Add(1)/Add(-1)" instead just for readability.
1. When the parent process has finished handling its connections, it just exits. The children are then considered 'orphans' and are automatically attached to the init process. When you run your service as a daemon, that's exactly what you want, so you don't have a huge stack of processes.
2. I used syscall.ForkExec(os.Args[0], os.Args […]), but I could replace the string array os.Args with anything I want in order to change the arguments.
3. That could be a way to do it, and it would also work, but it is not the choice we made.
4. It may look a bit weird, but it's part of the language, and you get used to it really quickly ;-)
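On point 4, a small Go sketch of the Add(1)/Done() pairing may help. Done() decrements the WaitGroup counter by one, so it has the same effect as Add(-1). The serveAll function and its fake "connections" are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// serveAll runs n simulated connection handlers and returns how many
// had finished by the time wg.Wait() unblocked.
func serveAll(n int) int64 {
	var wg sync.WaitGroup
	var finished int64
	for i := 0; i < n; i++ {
		wg.Add(1) // one more in-flight connection
		go func() {
			defer wg.Done() // same effect as wg.Add(-1)
			atomic.AddInt64(&finished, 1)
		}()
	}
	wg.Wait() // returns only once the counter is back to zero
	return atomic.LoadInt64(&finished)
}

func main() {
	fmt.Println(serveAll(10))
}
```

In a graceful shutdown the same Wait() call is what holds the old process open until its last connection has been handled.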
syscall.Wait4(-1, &wait, syscall.WNOHANG, nil)
I would also recommend using the higher-level os.StartProcess instead.
2. You can pass any arguments when starting a child, or even execute a completely different binary:
p, err := os.StartProcess(path, os.Args, &os.ProcAttr{
    Dir:   wd,
    Env:   os.Environ(),
    Files: files,
    Sys:   &syscall.SysProcAttr{},
})
https://github.com/gwatts/manners
And Mailgun's fork that supports passing file descriptors between processes:
The 'manners' package only enables graceful shutdown in an HTTP server. There is still work to be done to restart it gracefully, and that's what I'm trying to show in the article.
That's why I've added missing methods here:
https://github.com/mailgun/manners
Getting files from listener:
https://github.com/mailgun/manners/blob/master/listener.go#L...
Starting server with external listener:
https://github.com/mailgun/manners/blob/master/server.go#L87
It's used to restart Vulcand without downtime:
https://github.com/mailgun/vulcand/blob/master/service/servi...
Let's collaborate on this as a library if you are interested
EDIT: added author
https://github.com/rcrowley/goagain/blob/master/goagain.go#L...
Not sure why it happens though, but it led to all sorts of strange intermittent issues with broken connections.
Once I replaced this logic with passing files using GetFile().Fd() instead it started working fine, so beware of this. I still wonder why it happens though.
Were you able to publish your changes either on a fork or in a PR?
Rerun my program on a different port, point nginx at the new port, reload nginx, kill the old one.
Curious what is so bad about this approach? I admit it's hacky, but it works. Are there just too many things to do?
Edit: typo
> file := os.NewFile(3, "/tmp/sock-go-graceful-restart")
What's with that filesystem path, which isn't referenced anywhere else, and which should be unnecessary because the file descriptor 3 is inherited when the process starts?
[1] http://golang.org/pkg/os/#File [2] http://golang.org/pkg/os/#File.Name
I've also written an implementation of a very similar pattern in Node (wait for a set of asynchronous things to complete) and I've used Add() and Signal(), never Add(-1)
https://github.com/gwatts/manners
But the code that extracts the fd without using reflection and access to the private properties is here:
https://github.com/mailgun/manners/blob/master/listener.go#L...
I think it should be fairly easy to port it to Richard's implementation
Can you point to a single unhandled edge case mentioned so far in this discussion? The only possible one I see is finnh's complaint that it doesn't support changing arguments, but that seems more like a missing feature (which doesn't seem that important to me) than an edge case.
The logic needed to correctly implement this is quite minimal, and implementing it yourself both spares you a rather heavy dependency and gives you more flexibility.
https://news.ycombinator.com/item?id=8773176
So far, I haven't encountered any annoying edge case I can't handle, so if you have examples, I'll be glad to discuss them with you
I'm fairly certain that the connection stays alive and that SSHd doesn't need to care about this, all handled on OS-level. That's why if you change IP or something like that, it doesn't "reattach" as you call it.
"A keep-alive is a small piece of data transmitted between a client and a server to ensure that the connection is still open or to keep the connection open. Many protocols implement this as a way of cleaning up dead connections to the server. If a client does not respond, the connection is closed.
SSH does not enable this by default. There are pros and cons to this. A major pro is that under a lot of conditions if you disconnect from the Internet, your connection will be usable when you reconnect. For those who drop out of WiFi a lot, this is a major plus when you discover you don't need to login again."
Source: http://www.symkat.com/ssh-tips-and-tricks-you-need
There are probably better sources out there; that was just one of the top results on Google. If I'm honest, I'm not an expert on this either.
This happens because your router or firewall is trying to clean up dead connections. It's seeing that no data has been transmitted in N seconds and falsely assumes that the connection is no longer in use.
To rectify this you can add a Keep-Alive. This will ensure that your connection stays open to the server and the firewall doesn't close it.
In other words: what keep-alive does is prevent routers and middleboxes from forgetting that the connection exists in the first place. This is not needed on a clean internet connection where everything is treated as stateless and simple routing is all that is done.
I'm open to being proved wrong here, but as I've already said, I've been doing this for several years now, so I'd need a counter argument to explain the mechanics of what's allowing the connection to reattach rather than "it's not possible" :)
edit: hmmm, re-reading the latter part of the keep-alive article I posted, it does seem to imply what you're saying. So how come my SSH connections aren't nuked then? Is this just a property of TCP/IP? (I'm not a networking guy, so I'm ignorant of some of the lower-level stuff.)