Then we turn on confirmed writes on the master so that the non-participating standby (called the "seed") has to receive and confirm your write before the transaction can commit.
This has the bonus of preventing split brain... If the wrong instance thinks it's master, writes will block indefinitely because the seed isn't confirming them. If the seed is following the wrong machine, same thing. And if clients and the seed and the master are all "wrong", then that's ok because at least they all "consistently" disagree with etcd.
The seed instance can run anywhere, and is responsible for receiving WAL snapshots from the master and archiving them (to shared storage) so it can crash too and be brought up elsewhere and catch up fine. The writes just block until this converges.
It's worked quite well for us for a few months on a hundred or so Postgres clusters, we haven't seen an issue yet. I'd love for somebody knowledgeable about this stuff to point out any flaws.
Consensus algorithms are popular because they're supposed to solve the difficult problem of guaranteeing consistency while attempting to provide liveness, in the presence of arbitrary node or connection failures. Etcd itself can provide this for operations on its own datastore, but that doesn't mean it can be used as a perfect failure detector for another system (which is impossible in the general case). In particular, if the database master becomes partitioned from the etcd leader for more than 30 seconds but is still accessible to clients, boom -- split brain.
(You can attempt to mitigate this with timeouts, but that's not foolproof if your system can experience clock skew or swapping/GC delays. Exactly this kind of faulty assumption has caused critical bugs in e.g. HBase in the past, turning what would otherwise be a temporary period of unavailability into data loss.)
EDIT: If I'm reading the code correctly, compose.io doesn't make any attempt to mitigate this failure scenario. If the Postgresql master can't contact etcd, it continues acting as a master indefinitely, even after 30 seconds have expired and another server might have taken over. This appears to be what happens in the "no action. not healthy enough to do anything." case in ha.py. I'd be happy to be corrected if there's something I'm missing.
If a client sees a stale value of the leader key (which is possible, either through network hiccups or etcd's normal behavior of allowing reads from followers) then it could contact the old leader and perform updates which won't be visible on the new leader.
I've been doing database, and specifically PostgreSQL, administration and HA setups for a long time now. This stuff is a lot harder than people think it is. People who roll their own solutions, thinking "Oh, this will totes be good enough!" tend to find themselves very painfully surprised that it isn't.
If you can live with a few minutes of downtime, I would recommend to trigger your failover using human intervention once you have ascertained that the failover would actually help (you never, ever want to fail over if master doesn't respond in time due to high load - at that point, failing over will only make things worse due to cold caches).
See https://github.com/blog/1261-github-availability-this-week for a nice story of automated DB failover going wrong.
In our case, we're running keepalived to share the IP address of the postgres master, but we don't actually automatically act on PG availability changes.
In a situation that actually warrants the failover, a human will kill the master node by shutting it down and keepalived will select another master and trigger the failover (which is then automated using `trigger_file` in `recovery.conf`).
In this case we have only one additional piece of infrastructure (keepalived) and we can be sure that we don't accidentally make our lives miserable with automated failovers.
The cost is, of course, potential additional downtime while somebody checks the situation, does minimal emergency root cause analysis and then shuts down the failed master.
In the even rarer case of hardware failure, keepalived would of course fail over automatically, but let's be honest: Most failures are caused by application or devops issues and in these cases it pays off to be diligent instead of panicing.
I agree with pilif that you almost always want to failover the db manually.
I agree with teraflop that just because etcd gives strong guarantees, that doesn't mean your application logic built on top of etcd primitives shares them. So you have to be careful about your reasoning there.
I'm curious if you're doing anything to mitigate haproxy being a single point of failure?
One thing I've had to fix in other people's HA PG setups is ease of getting back to HA after a failover. You lose the master and promote the slave, and now you've just got a master. Ideally it should be easy to just launch another db instance and everyone keeps going. I think this setup achieves that, and that's great!
I'm curious about HAProxy being a single point of failure as well. What happens when it fails?
At Joyent, we built a similar system for automated postgresql failover called Manatee. I'm sure today we would have used a Raft-based system, but that was not available when we did this work, so we used ZooKeeper. We haven't spent much time polishing Manatee for general consumption, but there's a write-up on how it maintains consistency[1]. The actual component is available here[2], and it's also been ported to Go as part of Flynn[3].
Edit: Manatee uses synchronous replication, not async, so it does not lose data on failover.
[1] https://github.com/joyent/manatee-state-machine
I feel like HAProxy with PostgreSQL + Bucardo (multi-master + at least one slave) would achieve this, and net you fewer moving parts. Under what circumstances does this fail where the etcd-dependent solution succeeds?
I'm no expert at all on this stuff, but I do smell either a race condition (if other nodes comes alive and 'goes to see who owns the leader key in etcd' before the node 'takes over as leader') or a longer-than-needed time without a leader (where the new node knows it wants to become the leader, but is running health checks)
The functionality in the code is here: https://github.com/compose/governor/blob/master/helpers/etcd...
The documentation for etcd is here: https://coreos.com/etcd/docs/latest/api.html#atomic-compare-...
"If no one has the leader key it runs health checks and takes over as leader."
but
"If no one has the leader key it takes over as leader, runs health checks, and starts functioning as leader."
? If so, I would do the health checks and then try to become the leader. Or do the 'health checks' involve other nodes?
First: There's a functional resource agent available to handle PostgreSQL. It handles single instances or multiple ones.
Second: Zhe whole cluster-thing can be very complex. You can have a LOT of fail scenarios and I wouldn't recommend to anyone to try to catch them all.
If you're considering MySQL for HA, a project called Vitess jives well with Kubernetes + CoreOS, and has been in production use at YouTube for a while now:
In this case the author is stating that Postgres doesn't come with a high availability capability. He then goes on to explain the high availability setup he put together.
(disclaimer, I work for Compose)
For many applications, consistency includes not losing acknowledged data. If I PUT data into an application and fetch it back and it's not there, that's not consistent.
Better to do an assessment of each thing that can fail, how to isolate/detect it, how to recover from it, how to implement that with available tools, and implement it. Test it in a number of situations on same hardware, network, and apps you'll use in production. Once it's solid, put them into production. Then, never worry about that stuff again past monitoring and maintenance.
Btw, Netflix employs Monkeys to do this. Open-sources their tools with blog writeups on their use, too. I'm sure you Humans will be able to handle it. ;)
The manual failover is in case of something going horribly wrong (outside of hardware failure), in which case a human steps in, looks at the situation, determines the best solution... and if it's failover, they initiate the failover.
I've personally used this procedure in the past and it worked 100% of the time there was a failure in a production environment. The tricky part is then notifying the hell out of everyone who needs to be notified that something really bad has happened, a failover occurred, everything is OK, but it needs some attention ASAP.
In PGSQL world, there are even a handful of tools to help you turn the old (failed) master into a slave, and correctly escalate the old (promoted) slave into a master; all in a single command on each side (which can be kicked off through keepalived).
Compared to streaming replication, during high load, Bucardo sync is also quite expensive for a replication mechanism.
As a service, Bucardo's requirements did not scale for us. It created to many caveats. The limitations of Bucardo for our service became obvious quickly.
PGPool failed at basic failover. It worked fine while the leader remained leader. It would failover to the follower who became leader, but after the first failover, it would stall on connections. We worked through various settings and attempts at making it more stable, but in the end we were not happy with the stability.
PGBouncer requires a connection to a single database and requires a user store associated at the PGBouncer level. One of our internal requirements for our Postgres service is give customers full access to Postgres capabilities. PGBouncer would either limit customer functionality or require us to build more tools for customers to use Postgres's complete functionality. For instance, if a customer ran `CREATE USER foo WITH LOGIN …` from the Postgres connection, the customer would not be authenticate as foo user because PGBouncer would not have immediate knowledge of the new user.
In the end, HAProxy offered the stability and enabled the base functionality of Postgres we wanted. In tests, it failed over quickly and reliably. The only caveat with HAProxy + Postgres is that you have to rely on TCP passthrough with SSL termination at Postgres. We'd have preferred the SSL termination at HAProxy, but Postgres engineered it's own connecting procedure to listen for standard and SSL connections on the same port. SSL termination at the HAProxy was causing issues for drivers that were built to use that procedure and cannot use a standard SSL connection.
I think if you wanted to have the simplest possible solution, pgbouncer and postgresql-specific replication mechanism would be perfect. This is along the same lines as my question - I don't really see how these alternate solutions could be construed as lacking...
[1]: https://github.com/compose/governor/blob/master/haproxy_stat...
Manual failover is often a lot safer, automatic systems have a nasty habit of not doing what you expect them to and trashing your database / losing data.
Durability and consistency are two separate concepts.
A client shouldn't use a stale value. If a DB does not hold a valid key, it shouldn't accept new connections, or signal that data was committed.
But that's exactly the problem! If you were to run this whole system under a tool like Jepsen, this would show up as "acknowledged but lost writes". It's not generally considered acceptable to connect to your database, issue an UPDATE and a COMMIT, and have everything appear to work successfully, only for the data to disappear into the aether because it got sent to the wrong replica.
> If a DB does not hold a valid key, it shouldn't accept new connections, or signal that data was committed.
Exactly, and the problem with this implementation (again, unless I'm missing something) is that it may violate this contract.
If there is a need for multiple 9's of uptime, there should be an escalation process for these kinds of events, which will probably include 24/7 on-call rotations.
Even if the problem is entirely self-resolving, it should still be looked at by more than one system. It should be noted, observed, documented, and confirmed it's truly resolved. That system is usually a human, but it doesn't necessarily have to be.
Oh and here is the Compare and Swap (Atomic) functionality of etcd that he described: https://github.com/coreos/etcd/blob/master/Documentation/api...
The proxy can find the cluster manually or use SRV records. Autoscale the Postgres machines as much as you want after that while leaving etcd on stable machines.
In every case, at some point, there were implementation/software bug related issues that ultimately caused more unplanned outages than I've ever seen a single, well run server experience.
* After a complete, planned shutdown, neither server is happy to start until it sees the other one online. In the end, neither ends up booting. * A failover occurs, at which point you find out the hard way there is state being stored in a non-replicate file. I've seen this with several different Asterisk HA solutions in particular. * A failover occurs, and non-database aware storage snapshots leave the redundant server with a non-mountable mirror of the database.
Any consultant worth their day (let alone week) rate should be also able to refer you to previous clients, from whom you can hopefully get some sense of how satisfied people are with the candidate's work.
And, ultimately, there's an intuition factor at work here. It's been my consistent experience that if a candidate gives you some kind of hinky vibe, don't use them. When folks I've worked with haven't followed that, the results have pretty reliably been poor, at best.