High Availability for PostgreSQL, Batteries Not Included

High Availability for PostgreSQL, Batteries Not Included (compose.io)

132 points by rfks 10 years ago | 62 comments

ninkendo 10 years ago |

Where I am we have a similar setup for leader election and failover (using etcd and haproxy) but we add an additional step: a standby instance that does not participate in master election, and always follows the elected master.

Then we turn on confirmed writes on the master so that the non-participating standby (called the "seed") has to receive and confirm your write before the transaction can commit.

This has the bonus of preventing split brain... If the wrong instance thinks it's master, writes will block indefinitely because the seed isn't confirming them. If the seed is following the wrong machine, same thing. And if clients and the seed and the master are all "wrong", then that's ok because at least they all "consistently" disagree with etcd.

The seed instance can run anywhere, and is responsible for receiving WAL snapshots from the master and archiving them (to shared storage) so it can crash too and be brought up elsewhere and catch up fine. The writes just block until this converges.

It's worked quite well for us for a few months on a hundred or so Postgres clusters; we haven't seen an issue yet. I'd love for somebody knowledgeable about this stuff to point out any flaws.
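
As a rough illustration of the confirmed-write idea described above, here is a toy in-memory model, not the actual setup. The `Seed` and `Node` classes and the node names `pg1` and `pg2` are made up for the sketch, and where a real synchronous commit would block, the sketch simply returns `False`.

```python
class Seed:
    """Non-voting standby that follows whichever node it believes is master."""

    def __init__(self, following):
        self.following = following
        self.log = []

    def confirm(self, sender, record):
        # Only writes streamed from the node the seed follows get confirmed.
        if sender != self.following:
            return False
        self.log.append(record)
        return True


class Node:
    """A Postgres instance that may or may not be the real master."""

    def __init__(self, name, seed):
        self.name = name
        self.seed = seed
        self.committed = []

    def write(self, record):
        # A real synchronous commit would block here until confirmed;
        # the sketch reports False instead of blocking forever.
        if not self.seed.confirm(self.name, record):
            return False
        self.committed.append(record)
        return True


seed = Seed(following="pg1")
pg1, pg2 = Node("pg1", seed), Node("pg2", seed)
print(pg1.write("INSERT 1"))  # True: the seed follows pg1
print(pg2.write("INSERT 2"))  # False: pg2 only thinks it is master
```

In the sketch, a node that wrongly believes it is master can never commit, because the single seed confirms writes from only one node at a time.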

dap 10 years ago | |

That's interesting. We do something pretty similar in the Manatee component that I mentioned elsewhere in this thread, except that the designated synchronous standby can take over if the primary goes away. But it can only do so when another peer is around to become the new synchronous standby, so we maintain the write-blocking behavior that avoids split-brain.

teraflop 10 years ago |

Like a lot of designs that use Raft/Zookeeper/Paxos/whatever as a building block, the full system doesn't inherit all of the safety properties of the underlying consensus algorithm. I don't think that makes this code useless by any means, but I think it's important to be aware of the edge cases.

Consensus algorithms are popular because they're supposed to solve the difficult problem of guaranteeing consistency while attempting to provide liveness, in the presence of arbitrary node or connection failures. Etcd itself can provide this for operations on its own datastore, but that doesn't mean it can be used as a perfect failure detector for another system (which is impossible in the general case). In particular, if the database master becomes partitioned from the etcd leader for more than 30 seconds but is still accessible to clients, boom -- split brain.

(You can attempt to mitigate this with timeouts, but that's not foolproof if your system can experience clock skew or swapping/GC delays. Exactly this kind of faulty assumption has caused critical bugs in e.g. HBase in the past, turning what would otherwise be a temporary period of unavailability into data loss.)

EDIT: If I'm reading the code correctly, compose.io doesn't make any attempt to mitigate this failure scenario. If the Postgresql master can't contact etcd, it continues acting as a master indefinitely, even after 30 seconds have expired and another server might have taken over. This appears to be what happens in the "no action. not healthy enough to do anything." case in ha.py. I'd be happy to be corrected if there's something I'm missing.
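
One common mitigation for the scenario above is for the master to stop accepting writes on its own once it has gone too long without a confirmed refresh of the leader key. The sketch below is a hypothetical illustration, not code from governor: the `LeaderLease` class, its method names, and the 5-second margin are assumptions, and only the 30-second TTL comes from the discussion. As noted above, this narrows the window but is not foolproof, since a pause between the check and the write can still let a stale leader through.

```python
import time


class LeaderLease:
    """Tracks how long this node may keep believing it is the leader."""

    def __init__(self, ttl, margin, clock=time.monotonic):
        if margin >= ttl:
            raise ValueError("margin must be smaller than ttl")
        self.ttl = ttl
        self.margin = margin
        self.clock = clock  # monotonic, so wall-clock jumps do not matter
        self.valid_from = None

    def renewed(self, sent_at):
        # sent_at is the clock reading taken *before* the refresh request
        # was sent, which keeps the local view of the lease conservative.
        self.valid_from = sent_at

    def may_accept_writes(self):
        if self.valid_from is None:
            return False
        return self.clock() - self.valid_from < self.ttl - self.margin


# Demo with a fake clock so the example runs instantly.
now = [0.0]
lease = LeaderLease(ttl=30.0, margin=5.0, clock=lambda: now[0])
lease.renewed(sent_at=now[0])
now[0] = 10.0
print(lease.may_accept_writes())  # True: well inside the lease
now[0] = 26.0
print(lease.may_accept_writes())  # False: demote before the key can expire
```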

jsprogrammer 10 years ago | |

If the PostgreSQL leader doesn't reset the leader key, it's no longer leader.

teraflop 10 years ago | | |

The rest of the cluster doesn't think it's the leader, but the problem is that it still accepts database connections as if it were.

If a client sees a stale value of the leader key (which is possible, either through network hiccups or etcd's normal behavior of allowing reads from followers) then it could contact the old leader and perform updates which won't be visible on the new leader.

rosser 10 years ago | | |

Fencing isn't quite that simple, unfortunately.

I've been doing database, and specifically PostgreSQL, administration and HA setups for a long time now. This stuff is a lot harder than people think it is. People who roll their own solutions, thinking "Oh, this will totes be good enough!" tend to find themselves very painfully surprised that it isn't.

geertj 10 years ago | | |

I haven't looked at the code, but the failover should ensure that HAProxy isolates the failed master ("fencing" in HA terminology).

pilif 10 years ago |

Personally, I would try to go for a simpler solution. In case of a failover event which is already complicated in itself and happening at a point in time where stuff is already going wrong (there would be no failover otherwise), do you really want to have all this additional infrastructure with etcd and haproxy as a dependency?

If you can live with a few minutes of downtime, I would recommend triggering your failover through human intervention once you have ascertained that the failover would actually help (you never, ever want to fail over if the master doesn't respond in time due to high load - at that point, failing over will only make things worse due to cold caches).

See https://github.com/blog/1261-github-availability-this-week for a nice story of automated DB failover going wrong.

In our case, we're running keepalived to share the IP address of the postgres master, but we don't actually automatically act on PG availability changes.

In a situation that actually warrants the failover, a human will kill the master node by shutting it down and keepalived will select another master and trigger the failover (which is then automated using `trigger_file` in `recovery.conf`).

In this case we have only one additional piece of infrastructure (keepalived) and we can be sure that we don't accidentally make our lives miserable with automated failovers.

The cost is, of course, potential additional downtime while somebody checks the situation, does minimal emergency root cause analysis and then shuts down the failed master.

In the even rarer case of hardware failure, keepalived would of course fail over automatically, but let's be honest: Most failures are caused by application or devops issues and in these cases it pays off to be diligent instead of panicking.
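
For readers unfamiliar with the `trigger_file` mechanism mentioned above: the standby's `recovery.conf` names a file path, and the standby leaves recovery and becomes a master once that file appears. A minimal sketch of the promotion step follows, assuming it is called from whatever hook keepalived runs when the node takes over the IP. The function name and the path used in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def request_promotion(trigger_path):
    """Create the trigger file that the standby is configured to watch for.

    Returns True if this call created the file, False if it already existed.
    """
    path = Path(trigger_path)
    already_requested = path.exists()
    path.touch()
    return not already_requested


# Demo in a temporary directory; a real setup would use the exact path
# given as trigger_file in the standby's recovery.conf.
with tempfile.TemporaryDirectory() as tmp:
    trigger = Path(tmp) / "promote.trigger"
    print(request_promotion(trigger))  # True: file created
    print(request_promotion(trigger))  # False: promotion already requested
```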

pjungwir 10 years ago |

Since most of the comments are critical, I'll say: thank you for the awesome writeup! I agree this is more complex than HA PG setups I've done in the past, but I'm thrilled to have another perspective. Also doing a thorough writeup like this takes time, and a lot of people would rather jump back into building the next thing. It's a great contribution!

I agree with pilif that you almost always want to fail over the db manually.

I agree with teraflop that just because etcd gives strong guarantees, that doesn't mean your application logic built on top of etcd primitives shares them. So you have to be careful about your reasoning there.

I'm curious if you're doing anything to mitigate haproxy being a single point of failure?

One thing I've had to fix in other people's HA PG setups is ease of getting back to HA after a failover. You lose the master and promote the slave, and now you've just got a master. Ideally it should be easy to just launch another db instance and everyone keeps going. I think this setup achieves that, and that's great!

gshx 10 years ago | |

Agree and that's a great point about human failover. It can become a challenge for distributed databases running on a large number of instances (like bigtable) but if we're talking only about master HA, then yes, that can still do with human intervention though automation is still preferable. For smaller db setups, much easier to just let a human/dba intervene.

mrkurt 10 years ago | | |

Smaller DB setups rarely have the ops/DBA support required to do manual failover. I think having an as-consistent-as-feasible, automatic failover is something of a default expectation for databases these days, at any size.

winsletts 10 years ago | |

Take a look at the code we have open sourced: https://github.com/compose/governor

ozgune 10 years ago | |

Great write-up, thanks for sharing!

I'm curious about HAProxy being a single point of failure as well. What happens when it fails?

dap 10 years ago |

Thanks for writing this up!

At Joyent, we built a similar system for automated postgresql failover called Manatee. I'm sure today we would have used a Raft-based system, but that was not available when we did this work, so we used ZooKeeper. We haven't spent much time polishing Manatee for general consumption, but there's a write-up on how it maintains consistency[1]. The actual component is available here[2], and it's also been ported to Go as part of Flynn[3].

Edit: Manatee uses synchronous replication, not async, so it does not lose data on failover.

[1] https://github.com/joyent/manatee-state-machine

[2] https://github.com/joyent/manatee

[3] https://github.com/flynn/flynn

imperialWicket 10 years ago |

This seems robust, but feels like more moving parts than are necessary.

I feel like HAProxy with PostgreSQL + Bucardo (multi-master + at least one slave) would achieve this, and net you fewer moving parts. Under what circumstances does this fail where the etcd-dependent solution succeeds?

Someone 10 years ago |

"If no one has the leader key it runs health checks and takes over as leader."

I'm no expert at all on this stuff, but I do smell either a race condition (if another node comes alive and 'goes to see who owns the leader key in etcd' before the first node 'takes over as leader') or a longer-than-needed time without a leader (where the new node knows it wants to become the leader, but is still running health checks).

winsletts 10 years ago | |

The code relies on functionality in etcd to prevent a race condition. Using `prevExist=false` on acquiring the leader key, the set will fail if another node wins the race.

The functionality in the code is here: https://github.com/compose/governor/blob/master/helpers/etcd...

The documentation for etcd is here: https://coreos.com/etcd/docs/latest/api.html#atomic-compare-...
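
To make the race-free acquisition concrete, here is a small sketch that uses an in-memory stand-in rather than etcd itself. `TinyStore`, `KeyExists`, and `try_become_leader` are invented for the example; `create` plays the role that a set with `prevExist=false` plays in etcd, and the key's TTL is left out for brevity.

```python
import threading


class KeyExists(Exception):
    pass


class TinyStore:
    """In-memory stand-in for a store with an atomic create-if-absent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def create(self, key, value):
        # The existence check and the write happen under one lock,
        # so two callers can never both see the key as missing.
        with self._lock:
            if key in self._data:
                raise KeyExists(key)
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def try_become_leader(store, name):
    try:
        store.create("leader", name)
        return True
    except KeyExists:
        return False


store = TinyStore()
results = {}


def race(name):
    results[name] = try_become_leader(store, name)


threads = [threading.Thread(target=race, args=(f"pg{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [name for name, won in results.items() if won]
print(len(winners))  # 1: exactly one node wins, however the threads interleave
```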

Someone 10 years ago | | |

But then, isn't it not

"If no one has the leader key it runs health checks and takes over as leader."

but

"If no one has the leader key it takes over as leader, runs health checks, and starts functioning as leader."

? If so, I would do the health checks and then try to become the leader. Or do the 'health checks' involve other nodes?
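
The ordering proposed here (health checks first, then the attempt to take the key) could be sketched as follows. This is only an illustration of the suggestion, not a description of what governor does; `try_promote` and both callables are hypothetical.

```python
def try_promote(is_healthy, acquire_leader_key):
    """Run local health checks first; only a healthy node competes for the key."""
    if not is_healthy():
        return "unhealthy"
    if acquire_leader_key():
        return "leader"
    return "follower"


calls = []


def healthy():
    calls.append("health")
    return True


def acquire():
    calls.append("acquire")
    return True


print(try_promote(healthy, acquire))  # leader
print(calls)  # ['health', 'acquire']
```

An unhealthy node never touches the key in this ordering, so it cannot hold leadership while it is still finding out that it should not have it.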

matrixritter 10 years ago |

I really wonder why people go for implementing their own HA stack when there's Pacemaker or rgmanager available?

First: There's a functional resource agent available to handle PostgreSQL. It handles single instances or multiple ones.

Second: The whole cluster thing can be very complex. You can have a LOT of failure scenarios and I wouldn't recommend that anyone try to catch them all.

wyc 10 years ago |

Slightly off-topic:

If you're considering MySQL for HA, a project called Vitess jibes well with Kubernetes + CoreOS, and has been in production use at YouTube for a while now:

http://vitess.io/

dgreensp 10 years ago |

What does "Batteries Not Included" mean for a highly available database? Is it a good thing?

stonemetal 10 years ago | |

"Batteries Not Included" is a phrase found on the boxes of children's toys, so that parents know they will need to buy batteries before giving the toy to their kids. It has become an expression for something that is incomplete and will require effort before it works properly.

In this case the author is stating that Postgres doesn't come with a high availability capability. He then goes on to explain the high availability setup he put together.

rachbelaid 10 years ago | |

"Batteries Not Included" in this article refers to the fact that PostgreSQL doesn't yet offer any built-in solution for HA (automatic failover), so you have to depend on a third-party solution.

knite 10 years ago |

How does this compare to RDS - both in terms of HA, and generally speaking?

qaqy 10 years ago |

And with this amazing design you can easily lose committed data and have all sorts of other fun problems.

mrkurt 10 years ago | |

It's relatively easy to adjust this setup to use synchronous replication. In fact, it's something that I expect we'll be offering to customers in the future. Most people we talk to aren't willing to sacrifice the write availability or performance synchronous replication requires. We mostly try to give them the tools to do it right, and then educate them on the tradeoffs.

(disclaimer, I work for Compose)

chousuke 10 years ago | |

Is it even possible to guarantee that you won't lose commits with postgresql replication? For many applications, consistency is more important than not losing any data ever. For the other kind of application, you'll need something else.

dap 10 years ago | | |

With postgresql synchronous replication, in order to lose writes that have been acknowledged to the postgresql client, you'd have to lose filesystem data on both the primary and the synchronous standby. (I believe the way postgresql uses the term "committed", you can lose data that's "committed", but not once postgresql has acknowledged it to the client.)

For many applications, consistency includes not losing acknowledged data. If I PUT data into an application and fetch it back and it's not there, that's not consistent.