Joy and Pain of Using Google BigTable (syslog.ravelin.com)
I feel that something should be said on the plus side of the ledger here. I'm the solo founder of a company that indexes huge amounts of fine-grained information. Bigtable is the key technology that let me start my company on my own: it soaks up all the data we can throw at it, with almost zero maintenance. Even within the stable of GCP technologies it stands out as being particularly reliable.
My biggest "problem" with BigTable is the lack of public information on schema design - which in this context is mostly the art of designing key structures to solve specific problems. I've come up with sensible strategies, but much of it was far from obvious. I can't help but feel that there should be a body of prior art I could draw on.
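For readers unfamiliar with what "designing key structures" means in practice, here is a minimal sketch of one widely described pattern: a short hash prefix to spread writes across the key space, followed by an entity ID and a reversed timestamp so that the newest row for an entity sorts first. This is an illustration only, not the strategy the commenter above uses; `row_key`, `MAX_TS_MS`, the `#` separator, and the bucket count are all hypothetical choices.

```python
import hashlib

# Assumed upper bound on millisecond timestamps (13 digits); hypothetical.
MAX_TS_MS = 10**13 - 1


def row_key(entity_id: str, ts_ms: int, buckets: int = 16) -> bytes:
    """Build a row key of the form <salt>#<entity_id>#<reversed_ts>."""
    # Salt derived from the entity ID: all rows of one entity stay
    # contiguous, while different entities spread over `buckets` prefixes.
    digest = hashlib.md5(entity_id.encode()).hexdigest()
    salt = int(digest, 16) % buckets
    # Reversed, zero-padded timestamp: newer events sort lexicographically
    # before older ones within the same entity.
    reversed_ts = MAX_TS_MS - ts_ms
    return f"{salt:02d}#{entity_id}#{reversed_ts:013d}".encode()
```

With keys like these, a prefix scan on `<salt>#<entity_id>#` returns an entity's rows newest first. The trade-off is that a range scan across entities has to fan out over every salt bucket.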
You might find this talk from a recent Google Cloud event useful in this regard:
Visualizing Cloud Bigtable Access Patterns at Twitter for Optimizing Analytics (Cloud Next '18) https://www.youtube.com/watch?v=3QHGhnHx5HQ
"It's called BigTable, not FastTable or AvailableTable!"
...It's probably a bad idea to evaluate 2019's BigTable based on the joke, but my puerile mind still finds it amusing. :)
The Key Visualizer has been a huge help, but there still aren't enough metrics and tooling to understand what is happening behind the scenes or what went wrong when things fail. Cost has prevented us from doing any sort of replication, but luckily we have a cache sitting in front of Bigtable for reads, which lets us absorb most of the intermittent issues described.
Putting in that cache is a great move. Cache is challenging for us as we get hits over a very wide range of keys.
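As a rough sketch of how a read cache can absorb short backend hiccups: the class below is a read-through cache with a TTL that falls back to a stale entry when the backend call fails. It is an assumption about the general technique, not a description of the setup in the comment above; `ReadThroughCache`, `fetch`, and `ttl_s` are hypothetical names, and `fetch` stands in for whatever function actually reads from Bigtable.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """In-process read-through cache that serves stale data on backend errors."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch    # hypothetical backend read function
        self._ttl_s = ttl_s    # how long an entry counts as fresh
        self._clock = clock    # injectable clock, useful for testing
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]  # fresh hit, backend not touched
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]  # backend hiccup: serve the stale value
            raise  # nothing cached, so the error has to surface
        self._entries[key] = (now, value)
        return value
```

This only helps for keys that have been read before, which is why a workload with hits over a very wide range of keys gets much less out of it.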
I follow the "it is perfect when you don't need to remove anything else" rule in most systems/processes/functions/tasks in life (not only IT systems). I am happy to see in this cluttered space called IT there are many more like-minded people who see that too much is TOO much.
As to those hiccups, unless they last for minutes or hours, in which case you might have a case of data corruption (BT is paranoid and rereads data right after any kind of compaction), most of the time they might be explained by, in approximately increasing order of badness:
- an orderly tablet server restart, e.g. for a binary update or because a Borg machine is undergoing a kernel update
- a tablet server crash: a software crash or a hardware one (this is bad, because there's a timeout that needs to be hit before a new server can take over the shard. The BT paper has details about the recovery protocol.)
- heavy load on the master, while either of the previous two is happening
- I don't think any of the various types of compactions would normally block reads/writes, but with some abnormal traffic patterns you might be able to make the tablet server suffer
- slowness at the lower layer, GFS/Colossus (although BT mitigates this somewhat by having two separate log files it can write to)
- Chubby outage
- power outage affecting a good chunk of or the entire cluster
[1] https://aws.amazon.com/blogs/database/how-amazon-dynamodb-ad...
Definitely adaptive capacity targets the primary reason people had to overprovision DynamoDB. It changes the entire calculation and obsoletes all the advice you might have heard based on experiences prior to late 2017.
I feel like "strong consistency" is misused here. Strong consistency is relevant only in a distributed environment. It's usually solved by using Paxos/Raft between the replicas. Bigtable has only had best-effort replication, so I am not sure why it's being mentioned here. I think they are looking for the term "serial": their queries have to be executed in a specific order for a particular user request.
This would be a super frustrating situation for me, particularly when you're not given the tools you need to diagnose in the first place, and you loop in support but they still can't help you identify what's wrong.
Years ago, I worked on a .NET system that sometimes would respond super slowly and we didn't have a concrete explanation for why. As in TFA, we developed a kind of religion about it. "Oh, it must be JITting", that sort of stuff.