An Unlikely Database Migration(tailscale.com) |
An Unlikely Database Migration(tailscale.com) |
Is this actually easier than using SQLite?
I imagine it's like if you were trying to be an olympic marathon runner, you'd study things like humidity and shoes and arm motion deeply, and he kind of does that (function naming, nodes per block, comments, test coverage, et cetera). Even if you don't agree with him, as you suggest read it and play "devil's advocate", and you will be forced to think about details that will make you a better software engineer.
Probably not so, bc tailscale is a decent product, but this post did not change my view in a good way
> Through this process we would do major reorganizations of our SQL data model every week, which required an astonishing amount of typing. SQL is widely used, durable, effective, and requires an annoying amount of glue to bring into just about any programming language
and
> So we invested what probably amounts to two or three weeks of engineering time into designing in-memory indexes that are transactionally consistent
Sounds to me like someone has learned a lot on the job. Good for him, but it looks exactly like what I said before.
No. Not unless your data is relational. This is a common problem, relational databases have a lot of over head. They are worth it when dealing with relational data. Not so much with non relational data.
In particular you've designed a very complicated system: Operationally you need an etcd cluster and a tailetc cluster. Code-wise you now have to maintain your own transaction-aware caching layer on top of etcd (https://github.com/tailscale/tailetc/blob/main/tailetc.go). That's quite a brave task considering how many databases fail at Jepsen. Have you tried running Jepsen tests on tailetc yourself? You also mentioned a secondary index system which I assume is built on top of tailetc again? How does that interact with tailetc?
Considering that high-availability was not a requirement and that the main problem with the previous solution was performance ("writes went from nearly a second (sometimes worse!) to milliseconds") it looks like a simple server with SQLite + some indexes could have gotten you quite far.
We don't really get the full overview from a short blog post like this though so maybe it turns out to be a great solution for you. The code quality itself looks great and it seems that you have thought about all of the hard problems.
What do you mean by this part? tailetc is a library used by the client of etcd.
Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config. (I previously made LiveJournal and ran its massively sharded HA MySQL setup)
The typical workflow is do do all of your "reads" out of the keyspace, attempt to apply Etcd transactions, and (if needed) block until your keyspace has caught up such that you read your write -- or someone else's conflicting write.
Oh. Since they have a full cache of the database I thought it was intended to be used as a separate set of servers layered in front of etcd to lessen the read load. But you're actually using it directly? Interesting. What's the impact on memory usage and scalability? Are you not worried that this will not scale over time since all clients need to have all the data?
What if you used one of the managed RDBMS services offered by the big cloud providers? BTW, if you don't mind sharing, where are you hosting the control plane?
Plus maybe replacing the sync.Mutex with RWMutexes for optimum read performance in a seldom-write use case.
On the other hand again, I feel a bit weird criticizing Brad Fitzpatrick ;-) — so there might be other things at play I don‘t have a clue about...
If you are expecting to simply persist one instance of one application's state across different runs and failures, a database can be frustrating.
But if you want to manage your data across different versions of an app, different apps accessing the same data, or concurrent access, then a database will save you a lot of headaches.
The trick is knowing which one you want. Persistence is tempting, so a lot of people fool themselves into going that direction, and it can be pretty painful.
I like to say that rollback is the killer feature of SQL. A single request fails (e.g. unique violation), and the overall app keeps going, handling other requests. You application code can be pretty bad, and you can still have a good service. That's why PHP was awesome despite being bad -- SQL made it good (except for all the security pitfalls of PHP, which the DB couldn't help with).
In the OP they spent two weeks designing and implementing transaction-save indexes -- something that all major SQL RDBMS (and even many NoSQL solutions) have out of the box.
Loss of efficiency? Come on, you were using a file before! :-)
Article makes me glad I'm using Django. Just set up a managed Postgres instance in AWS and be done with it. Sqlite for testing locally. Just works and very little engineering time spent on persistent storage.
Note: I do realize Brad is a very, very good engineer.
Having no dedicated database server or even database instance, being able to persist data to disk with almost no additional memory required, marginal amount of CPU and no heavy application dependencies can be considered very efficient depending on context.
Of course, if you start doing this on every change, many times a second, then it stops being efficient. But then there are ways to fix it without invoking Oracle or MongoDB or other beast.
When I worked on algorithmic trading framework the persistence was just two pointers in memory pointing to end of persisted and end of published region. Occasionally those pointers would be sent over to a dedicated CPU core that would be actually the only core talking to the operating system, and it would just append that data to a file and publish completion so that the other core can update the pointers.
The application would never read the file (the latency even to SSD is such that it could just as well be on the Moon) and the file was used to be able to retrace trading session and to bring up the application from event log in case it failed mid session.
As the data was nicely placed in order in the file, the entire process of reading that "database" would take no more than 1,5s, after which the application would be ready to synchronize with trading session again.
This was my main thought throughout reading it. So many things to consider and difficult issues to solve it seems they face a self-made database hell. Makes me appreciate the simplicity and stable performance of django orm + postgre.
First they're conflating storage with transport. SQL databases are a storage and query system. They're intended to be slow, but efficient, like a bodybuilder. You don't ask a bodybuilder to run the 500m dash.
Second, they had a 150MB dataset, and they moved to... a distributed decentralized key-value store? They went from the simplest thing imaginable to the most complicated thing imaginable. I guess SQL is just complex in a direct way, and etcd is complex in an indirect way. But the end results of both are drastically different. And doesn't etcd have a whole lot of functional limitations SQL databases don't? Not to mention its dependence on gRPC makes it a PITA to work with REST APIs. Consul has a much better general-purpose design, imo.
And more of it doesn't make sense. Is this a backend component? Client side, server side? Why was it using JSON if resources mattered (you coulda saved like 20% of that 150MB with something less bloated). Why a single process? Why global locks? Like, I really don't understand the implementation at all. It seems like they threw away a common-sense solution to make a weird toy.
I think we're pretty well aware of the pros and cons of all the options and between the team members designing this we have pretty good experience with all of them. But it's entirely possible we didn't communicate the design constraints well enough. (More: https://news.ycombinator.com/item?id=25769320)
Our data's tiny. We don't want to do anything to access it. It's nice just having it in memory always.
Architecturally, see https://news.ycombinator.com/item?id=25768146
JSON vs compressed JSON isn't the point: see https://news.ycombinator.com/item?id=25768771 and my reply to it.
SQL - I'm not sure what the problems are with SQL. But it is like a second language to me so maybe I experienced these problems long ago and have forgotten about them.
ORM - I never use an ORM, so I have no idea what the problems might be.
testing latency - I don't know what this refers to.
dependency - ditto
https://twitter.com/_rchase_/status/1334619345935355905
It’s just 2 files.
Sometimes it’s better to focus on getting the product working, and handle tech debt later.
The real problem with that project was of course trying to set up a microservices architecture where it wasn't necessary yet and nobody had the right level of experience and critical thinking to determine where to separate the services.
This tool needs to insert data in the middle of (pretty short) lists, using a pretty complicated algorithm to calculate the position to insert at. If I had used an RDBMS, I'd probably have to implement fractional indexes, or at least change the IDs of all the entries following the newly inserted one, and that would be a lot of code to write. This way, I just copy part of the old slice, insert the new item, copy the other part (which are very easy operations in Go), and then write the whole thing out to JSON.
I kept it simple, stupid, and I'm very happy I went with that decision. Sometimes you don't need a database after all.
[0] Via mutex in your case. Have you thought about durability, though. That one's actually weirdly difficult to guarantee...
Strictly speaking, it's literally impossible to guarantee[0], so it's more a question of what kinds and degrees of problems are in- versus out-of-scope for being able to recover from.
0: What happens if I smash your hard drive with a hammer? Oh, you have multiple hard drives? That's fine, I have multiple hammers.
is this a typo? 150MB is such a minuscule amount of data that you could do pretty much anything and be OK.
But writing out 150MB many times per second isn't super nice when both 150MB and the number of times per second are both growing.
I am also going to go out on a limb and guess that this is all running in kubernetes. Running etcd there is dead simple compared to even running something like Postgress.
Congrats on a well engineered solution that you can easily test on a dev machine. Running a DB in a docker container isn't difficult but it is just one more dev environment nuance that needs to be maintained.
At that point I thought "hmm, i guess I know databases" and a few years later, attempted to store some slightly larger, more complicated data in MySQL and query it. My query was basically "join every record in this table against itself, returning only rows that satisfy some filter". It ran incredibly slowly, but it turned out our lab secretary was actually an ex-IBM Database Engineer, and she said "did you try sorting the data first?" One call to strace showed that MySQL was doing a very inefficient full table scan for each row, but by inserting the data in sorted order, the query ran much faster. Uh, OK. I can't repeat the result, so I expect MySQL fixed it at some point. She showed me the sorts of DBs "real professionals" designed- it was a third order normal form menu ordering system for an early meal delivery website (wayyyyy ahead of its time. food.com). At that point I realized that there was obviously something I didn't know about databases, in particular that there was an entire schema theory on how to structure knowledge to take advantage of the features that databases have.
My next real experience with databases came when I was hired to help run Google's MySQL databases. Google's Ads DB was implemented as a collection of mysql primaries with many local and remote replicas. It was a beast to run, required many trained engineers, and never used any truly clever techniques, since the database was sharded so nobody could really do any interesting joins.
I gained a ton of appreciation for MySQL's capabilities from that experience but I can't say I really enjoy MySQL as a system. I like PostgresQL much better; it feels like a grownup database.
What I can say is that after all this experience, and some recent work with ORMs, has led me to believe that while the SQL query model is very powerful, and RDBMS are very powerful, you basically have to fully buy into the mental model and retain some serious engineering talent- folks who understand database index disk structures, multithreading, etc, etc.
For everybody else, a simple single-machine on-disk key-value store with no schema is probably the best thing you can do.
* our data is tiny and fits in RAM
* our data changes often
* we want to eventually get to an HA setup (3-5 etcd nodes now, a handful of backend server instances later)
* we want to be able to do blue/green deploys of our backend control server
* we want tests to run incredibly quickly (our current 8 seconds for all tests is too slow)
* we don't want all engineers to have to install Docker or specific DB versions on their dev machines
Curiously though its a pattern I've seen twice in the last 12 months, there was that guide on the good bits of AWS that also recommended starting with a single host with everything running on it.
Maybe we should all move that host back under our desks and really be back to basics!
I had no idea companies of this size had engineers with that much free time on their hands.
I don't even mind senior devs putting together things like this at the cornerstone of the company provided there are always at any given point in time 2 people that know how it works and can work on it, and sufficient time was spent looking at existing solutions to make that call. It should be made with full expectations that the first paragraph is inevitable.
Specifically, in this case: Without any actual data (# of reads, # of writes, size of writes, size of data, read patterns, consistency requirements) it is not possible to judge whether going custom on such a system was merited or not. I would find it VERY difficult to come to the conclusion that this use case couldn't be solved with very common tooling such as spark and/or nats-streaming. "provided the entire dataset fits in memory" is a very large design liberty when designing such a solution and doesn't scream "scalability" or n+1 node write-consistency to me. I say this acknowledging full well that etcd is an unbelievably well written piece of software with durability and speed beyond it's years.
Keeping my eyes open for that post-series-a-post-mortem post.
Uh, you compressed it first, right? because CPUs can compress data faster than disk I/O.
But the bigger problem was the curve. Doing something O(N*M) where N (file size) and M (writes per second) were both growing was not a winning strategy, compression or not.
The only feature it misses is to compress the data that is not actively in use, that way there is really not much of a downside.
People seem to keep using poorly-designed ORMs or are stuck with some strange anti-ORM ideology.
Modern ORMs are fast, efficient, and very productive. If you're working with relational databases then you're using an ORM. It's a question of whether you use something prebuilt or write it yourself (since those objects have to be mapped to the database somehow). 99% of the time, ORMs generate perfectly fine SQL (if not exactly what you'd type anyway) while handling connections, security, mapping, batching, transactions, and other issues inherent in database calls.
The 1% of the time you need something more complex, you can always switch to manual SQL (and ORMs will even let you run that SQL while handling the rest as usual). The overall advantages massively outweigh any negatives, if they even apply to your project.
> Never underestimate how long your “temporary” hack will stay in production!
I get that SQLite wouldn't work, but it also doesn't make sense to have one completely independent database per process. So I imagine you're using a shared database, at which poitlnt etcd starts to make more sense. It's just not that widely understood in production as sql databases, and has limitations which you might reach in a few years.
The right tool is a wrapper / DSL over SQL, which allows to interact with the database in a predictable and efficient way, while not writing placeholder-ridden SQL by hand. Composable, typesafe, readable.
ORMs do fine in small applications without performance requirements. The further you go from that, the less adequate an ORM becomes, and the more you have to sidestep and even fight it, in my experience.
The only reason objects are the "wrong" abstraction is because they don't match relational models exactly. That impedance mismatch is the entire reason for the object-to-relational mapping, otherwise you can use things like document-stores and just serialize your objects directly.
A ORM that worked on the principle of insert query text of any complexity receive object as the primary usecase, not the "nonstandard and non-idiomatic usecase" would be the only way to ease the concerns of dba's who code like me.
Its the same pitfall of api clients. Why would I take the time to learn an api like its a sdk, along with the pains of trying to shunt openapi's libraries in to my application without requiring the creation of a composer build step, further complicating deployment, when I can make 5 methods in the time before lunch to do the bits i need as rest queries and deployment of my php app remains as simple as `git pull production` on the nfs share all the workers read from?
The benefit of compile validated symbols is moot in the days of test driven dev, so the benefits gained from that can still be realized without creating build complexity or making competent engineers re-learn something they already know only re-abstracted in a way that almost always makes it harder for somebody who understands the low level to learn the new way compared to a new dev.
Like I said, you're either using an existing ORM or just writing your own everytime, and the one you write probably won't be very good as seen by the numerous security and performance bugs that are constantly found. Also abstractions are useful. All of software development is built with abstractions and they don't suddenly become useless when it comes to databases.
I also don't see what this has to do with API clients or build steps, but there are good and bad examples of those too.
What "modern ORM" are you using?
Sure I don't write PHP, but the ORMs I've used in Java/Kotlin, Ruby, JavaScript and lately Rust and Crystal have all been more than fine with me writing raw SQL.
The author seems to be using Go, which honestly could use work in that area. gorm is the biggest / most popular ORM out there, but it looks like a one-person project, the author seems well worn-out already, and it kinda falls apart when you work with a data model more than one level deep.
Plus broadly speaking, there seems to be a bit of an aversion to using libraries in the Go community.
Which ones do you have experience with?
In our case, the control plane data store really should be as boring as possible. It was real stretch using anything other than MySQL. We tried to lay out the arguments in the post, but the most compelling was we had lots of unit tests that spun up a DB and shut it down quickly. Maybe a hundred tests whose total execution time was 1.5s. The "boring" options made that surprisingly difficult.
(Tailscaler and blog post co-author)
Of course the benefit of what you did, even if I wouldn't have done it, is that you're _not_ using a different system in dev vs prod.
Create database test_db template initial_db;They are the same thing, a way of taking text based queries for formatted data as std:maps and shunting that away behind classes and types and interfaces.
> Like I said, you're either using an existing ORM or just writing your own everytime
vectors of maps works pretty well.
I used to do this and stopped when I noticed that sqlite and postgres treat booleans differently; postgres accepts 't' as true, but SQLite stores a literal 't' in the boolean-typed field. This means you get different results when you read things back out. All in all, not a rabbit hole you want to go down.
Personally, I just create a new database for each test against a postgres server on localhost. The startup time is nearly zero, and the accuracy compared to production is nearly 100%.
Reaching limitations in a few years and biting that bullet makes the difference between a successful startup that knows when and where to spend time innovating or a startup that spends all their time optimizing for that 1 million simultaneous requests / sec.
It's the devil you know versus the devil you don't. SQL is a very well understood devil, so your plans around it will be reliable. I would argue that being able to accurately estimate future work is the most valuable business asset.
Our database has very small amounts of data but a very, very large number of parallel readers. etcd explicitly disclaims any ability to scale to large data sizes, and probably rightly so :)
> So we can get, say, 1000 updates, bundle them all, get it synced in say ~100ms, and then answer all 1000 requests at once, and still only take ~100ms.
I assume the same trick is applicable to RDBMS as well? So you batch the 1000 updates, and do one commit with a single fsync.
> Virtually every other database I've used is quite naive about how they flush blocks to disk, which dramatically reduces their effective transactions/sec. It's rare to see one that made all the right choices here.
Can you elaborate on this? Anyway RDBMS worth its salt should be able to saturate the disk IOPS, i.e. the act of flushing itself wouldn't be the bottleneck.
> Our database has very small amounts of data but a very, very large number of parallel readers.
So the control plane is the sole writer of this database, and there are maybe 100s/1000s of other readers, who each has a watcher on etcd? Who are these readers? If they are different processes on different machines, how did it work when the database was in the json file?
Sorry for the barrage of questions, but I have to ask out of curiosity.
EDIT: Negation is important
Usually people design their app with the expectation of strict serializable isolation, relax it because of some production emergency, and then deal with the business consequences of the database doing the wrong thing until the company goes out of business (usually not due to database isolation levels, to be fair).
We considered that option but didn't like it. It still wasn't fast enough, and placed onerous (or at least annoying) dependencies on future employees.
Did you configure the Postgres (or MySQL) database to be entirely in memory, e.g. by using a tmpfs Docker volume?
As for being onerous or annoying for new employees, which is worse: having to set up a Docker environment, or using a relatively obscure data store in a way that nobody else does?
We've since hired many employees who just learned about our database today from this blog post but had been happily testing against it on their laptops for months.
3-4 of us know about it, and that's sufficient.
Us _not_ using Kubernetes isn't because we don't know how to use it. It's because we _do_ know how to use it and when _not_ to use it. :)
When not to use it is a tough question. If I was ever in charge of a company, kubernetes would be the only way of running my product that I would consider. I am a fan of kubernetes as I use it every day but I have also been on the other side of the fence. I have run production systems on bare metal, VMs, EC2 instances, etc. The operational burden of anything non-kube is too much and takes time away from solving big problems such as stability, scaling, deploy, monitoring and more. The solutions to the problems become standard, boring and consistent.
I say the above as someone that spent over a year migrating an entire platform/product from ECS to GKE. It is not perfect but so many silly day to day interruptions have been eliminated. Retired and broken instances are a thing of a past. Scaling is easy. Stability is easier.
Side effects of the move are that our Ops team is 1/2 the size it was a year ago (attrition/covid), we are running 3 times the number of product stacks for 1/3rd the cost. I should really blog about that one!
We could (and likely would, despite the costs) but that doesn't address our testing requirements.
The control plane is on AWS.
We use 4 or 5 different cloud providers (Tailscale makes that much easier) but the most important bit is on AWS.
Curious: Was running DynamoDB with DAX (DynamoDB Accelerator) in front ever in contention? If not, is it due to vendor lock-in (for example, not being able to migrate out) or because tailscale doesn't feel the need to use managed offerings especially for core infrastructure?
> What if you used one of the managed RDBMS services offered by the big cloud providers?
Yeah, AWS RDS "multi-AZ" does a good job of taking care of HA for you. (Google Cloud SQL's HA setup is extremely similar.) But you still get 1-2 minutes of full unavailability when hardware fails.
I haven't operated etcd in production myself, but I assume it does better because it's designed specifically for HA. You can't even run less than three nodes. (The etcd docs talk about election timeouts on the order of 1s, which is encouraging.)
For many use cases, 1-2 minutes of downtime is tolerable. But I can imagine situations where availability is paramount and you're willing to give up scale/performance/features to gain another 9.
So architecturally it's:
3 or 5 etcd (forget what we last deployed) <--> 1 control process <--> every Tailscale client in the world
The "Future" section is about bumping "1 control process" to "N control processes" where N will be like 2 or max 5 perhaps.
The memory overhead isn't bad, as the "database" isn't big. Modern computers have tons of RAM.
Given that the goals seem to be improving performance over serializing a set of maps to disk as JSON on every change and keeping complexity down for fast and simple testing, a KV library would seem to accomplish both with less effort, without introducing dependence on an external service, and would enable the DB to grow out of memory if needed. Do you envision going to 2+ control processes that soon?
Any consideration given to running the KV store inside the control processes themselves (either by embedding something like an etcd or by integrating a raft library and a KV store to reinvent that wheel) since you are replicating the entire DB into the client anyway?
Meanwhile I'm working with application-sharded PG clusters with in-client caches with coherence maintained through Redis pubsub, so who am I to question the complexity of this setup haha.
I guess Go package discovery remains an unsolved problem.
Or did you just not really search, like most of us excited to DIY? :-D
Godoc is pretty good, the package shows up for the searches I'd probably do in a similar situation.
It also doesn't wrap transactions -- you use the etcd client directly for that.
The Nagel delay it implements helps quite a bit with scaling, though, while keeping the benefits of a tightly packed sorted keyspace. And you can directly access / walk decoded state without copies.
Not sure whether I agree or disagree, actually...
AFAICT Linearilazable is about the best we can expect in reality (at least for a distributed system), but as you point out: Very few people actually check their assumptions... and even fewer actually think about DB transactions correctly in the first place. It's actually really, really hard and people have these rules of thumb in their heads that aren't actually correct.
Which gets me to wondering if we could formalize some of this stuff... (in relevant "code scopes", dgmw!)
EDIT: If there is one thing I am certain about it is the fact that a lot of consistency can be relaxed around human interaction. It's lossy anyway, and people will call you (eventually, depending on anxiety/shyness) if you haven't fulfilled an order. The browser is the first order of that and that's already out of date once you show a page, so... Anyway, that's just to say it's amusing how much people worry about consistency on the front end
ORMs: consider yourself lucky. They try to make SQL easy by auto-generating terrible SQL.
Testing latency: we want to run many unit tests very quickly without high start-up cost. Launching MySQL/PostgreSQL docker containers and running tests against Real Databases is slower than we'd like.
Dependencies: Docker and those MySQL or PostgreSQL servers in containers.
So it's not really in our patience budget for adding a mysqld or postgres start up (possible docker pull, create its schema, etc).
Edit: 3s for global setup/teardown. Not per test function/suite.
There are many operational differences between Etcd and a traditional RDBMs, but the biggest ones are that broadcasting updates (so that actors may react) is a core operation, and the MVCC log is "exposed" (via ModRevision) so that actors can resolve state disagreements (am I out of date, or are you?).
With the amount of explaining and skepticism you're having to deal with in most of the threads here (plenty of reasonable questions, some seem to approach the question with the assumption that your approach is totally wrong) I feel compelled to comment on how nice such a fast feedback loop would be just so it's known that you're not listing these benefits into an ether that doesn't appreciate them.
Mind you, I build ML/statisical models, so my integration/e2e tests are definitely not going to get down to 8 seconds.
It would create a nice flow to get feedback from your test suite that quickly.
I mean, that's what my old company did pre-Docker. It works, but it's tedious.
(But of course the battery will drain a little faster if it is a laptop)
We've definitely done some whack-a-mole with allocations in the engine, and of course there's always things getting changed/added still.
Managing local DBs once new versions are out and your server isn't upgraded yet is irritating, but when I'm using a Mac I'd still rather use a native DB than Docker because of the VM overhead, since I've not yet run into a bug caused by something like "my local postgres was a different version than the server was." (Closest I've gotten was imagemagick for mac doing something a bit differently than for linux, about 10 years ago at this point.)
Ran into that at a recent place - the code was doing "= NULL" in a bunch of places (before my time) and PG12 treated that differently than PG11 did which broke a bunch of tests.