LiteFS a FUSE-based file system for replicating SQLite (github.com)

241 points by sysbot 3 years ago | 65 comments

ctur 3 years ago |

If a "FUSE to replicate SQLite" solution came from anywhere else, I'd be quite skeptical, but there is a lot of very interesting tech coming out of fly.io these days and Ben certainly knows this space well. It still feels a little like a hack and piercing of layers of abstraction (less so than, say, litestream).

I love it when at first glance it isn't clear if a project is a crazy idea from someone just goofing around vs a highly leveraged crazy idea that will be a foundational part of a major technology shift.

I suspect it's the latter, and that the strategy is to layer this on top of Litestream to create an easy way to use SQLite transparently in a widely distributed multi-node environment (Litestream providing the backups and/or read-only replication to remote sites, with LiteFS handling low-latency local access in a cluster, POP, or data center).

Cool stuff. It will be fun to see where fly takes this :)

benbjohnson 3 years ago | |

Thanks for the vote of confidence! I can understand the "hack" feel -- it's a trade-off. If I wrote it the "proper" way and integrated directly into the SQLite source or used a VFS then it'd be a lot harder to deploy for most folks. By making it a FUSE file system, someone can use it without really knowing much about it from the application's perspective.

As for strategy, it unfortunately doesn't work to layer with Litestream as backups need some strict control over who is the current primary. Instead, I'm adding S3 replication support [1] directly into LiteFS. LiteFS also uses a different transactional file format called LTX so it wouldn't be compatible with Litestream. The LTX format is optimized for compactions so point-in-time restores can be nearly instant.

The end goal isn't much of a secret. We want to let folks spin up nodes in regions across the world, automatically connect to one another, and have the whole thing have the ease of a single node app. We still have a ways to go on that vision but that's what we're going for.

[1] https://github.com/superfly/litefs/issues/18
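
To illustrate why a compaction-friendly transaction format can make point-in-time restores fast, here is a minimal sketch. It is not the actual LTX format: the `TxFile` type and the `compact` and `restore` functions are hypothetical, and the only assumption is that each transaction file records the pages it changed. Compacting a contiguous run of files keeps just the newest version of each page, so a restore applies one merged file instead of replaying every transaction.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TxFile:
    """Hypothetical per-transaction file: a TXID range plus the pages it changed."""
    min_txid: int
    max_txid: int
    pages: Dict[int, bytes]  # page number -> page contents after the transaction(s)


def compact(files: List[TxFile]) -> TxFile:
    """Merge contiguous transaction files, keeping the newest version of each page."""
    if not files:
        raise ValueError("nothing to compact")
    ordered = sorted(files, key=lambda f: f.min_txid)
    # Refuse to merge if a transaction is missing from the run.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.min_txid != prev.max_txid + 1:
            raise ValueError("transaction files are not contiguous")
    merged: Dict[int, bytes] = {}
    for f in ordered:
        merged.update(f.pages)  # later files overwrite earlier page versions
    return TxFile(ordered[0].min_txid, ordered[-1].max_txid, merged)


def restore(snapshot: Dict[int, bytes], tx: TxFile) -> Dict[int, bytes]:
    """Apply one (possibly compacted) transaction file on top of a snapshot."""
    db = dict(snapshot)
    db.update(tx.pages)
    return db
```

With this shape, restoring to a given transaction means picking the compacted file that ends at that TXID and applying it once, whatever the number of original transactions.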

ignoramous 3 years ago | | |

> I'm adding S3 replication support directly into LiteFS.

Nice! There's a lot of value one can get out of a blob store, despite it seeming at odds with block-device-dependent systems, like most SQL DBMSs.

When a database at BigCloud layered replication (real-time backups) atop S3, they did so by shipping both the WAL and the on-disk files. For write-heavy tables, the WAL was streamed every second, and on-disk files (snapshots) every 30 minutes (or at some apt size-based threshold).

WAL streaming also doubled as a key foundation for them to build materialized views, support real-time triggers, and act as an online data-verification layer, while S3 itself served as insurance against hardware errors (memory, CPU, network, disk) and data corruption.

https://web.archive.org/web/20220712155558/https://www.useni... (keyword search S3)

Elasticsearch / OpenSearch does something similar but it only implements snapshot-based replication to S3 (periodic backups).

https://web.archive.org/web/20190722153122/https://www.micro... / https://archive.is/Q5jUj (docs)
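
The shipping schedule described above (WAL every second, snapshots every 30 minutes or at a size threshold) could be sketched roughly as follows. The two intervals come from the comment; the size threshold value, the `BackupState` type, and the `next_action` function are assumptions for illustration only.

```python
from dataclasses import dataclass

WAL_INTERVAL_S = 1.0                    # from the comment: WAL streamed every second
SNAPSHOT_INTERVAL_S = 30 * 60           # from the comment: snapshots every 30 minutes
SNAPSHOT_WAL_BYTES = 64 * 1024 * 1024   # assumed threshold; the comment gives no number


@dataclass
class BackupState:
    last_wal_ship: float           # time the WAL was last shipped (seconds)
    last_snapshot: float           # time the last snapshot was taken (seconds)
    wal_bytes_since_snapshot: int  # WAL volume accumulated since that snapshot


def next_action(state: BackupState, now: float) -> str:
    """Decide what the backup loop should do at time `now`."""
    snapshot_due = (
        now - state.last_snapshot >= SNAPSHOT_INTERVAL_S
        or state.wal_bytes_since_snapshot >= SNAPSHOT_WAL_BYTES
    )
    if snapshot_due:
        return "snapshot"
    if now - state.last_wal_ship >= WAL_INTERVAL_S:
        return "ship_wal"
    return "wait"
```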

tmp_anon_22 3 years ago | |

> a highly leveraged crazy idea that will be a foundational part of a major technology shift

Has anything other than the Cloud presented a true foundational shift in how applications are built? Kubernetes, Serverless, Blockchain, React, Swift: these things are big, but not big enough.

I think we just like pretending every little thing is the next big thing.

benbjohnson 3 years ago |

LiteFS author here (also Litestream author). I'm happy to answer any questions folks have about how it works or what's on the roadmap.

hinkley 3 years ago |

I have an adjacent problem, and I haven't been able to find anyone who has a fix for me.

One perfectly reasonable use case for a read replica of a database is a bastion server. Database + web server on a machine that is firewalled both from the internet and from the business network. With read only access there is a much smaller blast radius if someone manages to compromise the machine.

The problem is that every single replication implementation I've seen expects the replicas to phone home to the master copy, not for the master copy to know of the replicas and stream updates to them. This means that your bastion machine needs to be able to reach into your LAN, which defeats half the point.

The most important question is, "what options exist to support this?" but I think the bigger question is why we treat replicas as if they are full peers of the system of record when so often they are not, mechanically or philosophically, and in some cases couldn't be even if we wanted them to (e.g., a database without multi-master support).

moderation 3 years ago |

> LiteFS is intended to provide easy, live, asynchronous replication across ephemeral nodes in a cluster. This approach makes trade-offs as compared with simpler disaster recovery tools such as Litestream and more complex but strongly-consistent tools such as rqlite.

I think rqlite having a single binary that handles Raft / consensus _and_ includes SQLite makes it simpler. Beyond 'hello world', Consul isn't trivial to run, and Fly have blogged about this [0]

0. https://fly.io/blog/a-foolish-consistency/

tptacek 3 years ago | |

rqlite is very cool, but it's also much more ambitious than LiteFS; it's Raft consistency for every instance of the database, where Litestream/LiteFS is replication (for single-writer multi-reader setups, where reads are answered quickly from edges or read caches, and writes are funneled to a central node --- with LiteFS, an elected central node). Raft is, of course, more powerful, but it's also its own whole thing to manage and monitor.
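
A minimal sketch of the single-writer, multi-reader pattern described above: reads are answered from a local replica and writes are funneled to one primary. This is not how LiteFS is implemented; the `Router` class and `replicate` helper are hypothetical, and copying the database with `Connection.backup` merely stands in for whatever replication actually ships changes to replicas.

```python
import sqlite3


class Router:
    """Send reads to the local replica and writes to the single primary."""

    def __init__(self, primary: sqlite3.Connection, replica: sqlite3.Connection):
        self.primary = primary
        self.replica = replica

    def execute(self, sql: str, params=()):
        # Crude classification: treat only SELECT statements as reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        conn = self.replica if is_read else self.primary
        cur = conn.execute(sql, params)
        rows = cur.fetchall()
        if not is_read:
            conn.commit()
        return rows


def replicate(primary: sqlite3.Connection, replica: sqlite3.Connection) -> None:
    """Stand-in for replication: copy the primary's committed state to the replica."""
    primary.backup(replica)
```

Usage under the same assumptions: create two connections, route every statement through `Router.execute`, and call `replicate` whenever the replica should catch up. Until then the replica serves slightly stale reads, which is the trade-off asynchronous replication makes.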

The advantage of LiteFS/Litestream is that, for the most part, the database is "just" SQLite. You can't really say that, to the same extent, about rqlite.

I hope rqlite takes off! It's a good project.

We've spent a lot of time at Fly.io wrestling with Consul, but that's because we abuse it. That's what the article is about: we shoehorned Consul into a part of our architecture where we're taxed for features it has that we don't actually use (the overwhelming majority of all the data we have in Consul is stuff for which there's a single, continually available source of truth for the data, and Consul was just a "convenient" way to replicate it). Consul is great for the stuff it's meant for.

I wouldn't hesitate to reach for Consul in a new design. I just wouldn't use it for the thing we used it for.

gmemstr 3 years ago |

And I /just/ got my infrastructure bits and pieces running Litestream! Guess I'll have to figure out if it's worth switching to this -- my gut reaction is no, since I only really run one pod at a time, so Litestream serves the purpose of not only saving the database offsite but also restoring it. But I will be keeping a very close eye on this thanks in part to my love of SQLite.

Hats off to Ben and Fly.io, you're doing some cool stuff.

benbjohnson 3 years ago | |

Thanks! Yeah, if you don't need multiple replicas then Litestream should work just fine. I'd say stick to that for now.

ThinkBeat 3 years ago |

Having been a happy user developing solutions around SQLite for a good amount of time, I find all these "enterprisey" hacks / extensions curious.

There are great solutions out there that handle these things and have for a long time.

I know SQLite has become the new hotness, but I really do not want SQLite to get good at all these things because then it would no longer be great at what it does marvelously already.

mrkurt 3 years ago | |

This is about as non-enterprisey as it gets. It's built to make sqlite work better for tiny little node.js apps running on very cheap hosting.

ThinkBeat 3 years ago | | |

From the first part of the project description on GitHub:

> """LiteFS is a FUSE-based file system for replicating SQLite databases across a cluster of machines."""

> """Leader election: currently implemented by Consul using sessions """

That sounds enterprisey to me.

mandeepj 3 years ago |

Db sharding and replication is a fascinating subject and a matter of deep interest to me.

Ben and Matt - appreciate your contributions in this area. I'm interested in making contributions along with you. Please let me know if you are looking for help. Much Thanks.

benbjohnson 3 years ago | |

Thanks! I think sharding is really interesting -- especially with a lightweight database like SQLite. I'm not looking for contributions right now but I would love to hear any feedback on the approach taken with LiteFS. I want to make it as easy to run as possible.

mandeepj 3 years ago | | |

I'll give LiteFS a run soon. SQLite is cross-platform, but it seems like LiteFS is not. True?

https://github.com/superfly/litefs

hobo_mark 3 years ago |

Tangentially related: I'd like to use litestream but my SQLite files are several gigabytes, is there a way to lazily download the db only once it's being accessed? (using something like userfaultfd maybe? just thinking out loud)

pkhuong 3 years ago | |

Verneuil (https://github.com/backtrace-labs/verneuil) offers that for read replicas, because Backtrace also has some multi-GB SQLite DBs. It's a VFS (loadable at runtime as a SQLite extension), instead of a filesystem.

I don't remember if that's in the docs; commit that adds the configuration field https://github.com/backtrace-labs/verneuil/commit/027318ba74... and commit for the enum https://github.com/backtrace-labs/verneuil/blob/e6697498f3ba...

(the actual implementation is nothing special https://github.com/backtrace-labs/verneuil/commit/b6bdfcf7bc...)

Issue for the feature: https://github.com/backtrace-labs/verneuil/issues/12

benbjohnson 3 years ago | |

I'm not sure how that could be implemented with Litestream since it runs as a separate process. It could be possible with LiteFS where it just pages in data on the fly. That's on the roadmap but it's probably still a ways out.

mandeepj 3 years ago | |

> is there a way to lazily download the db only once it's being accessed?

I'm not sure what your full scenario is, but since you mentioned "lazily download", I thought you might have some luck here: https://news.ycombinator.com/item?id=27016630

tl;dr: using http range and a smaller page size - might be the way to go
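
As a rough sketch of the HTTP range idea: SQLite stores a database as fixed-size pages numbered from 1, so page N occupies bytes (N-1)*page_size through N*page_size-1 of the file. A client could fetch a single page with a `Range` header, and a smaller page size means fewer bytes per request. The function names below are hypothetical, and 4096 is simply SQLite's current default page size.

```python
from typing import Dict, Tuple


def page_byte_range(page_no: int, page_size: int = 4096) -> Tuple[int, int]:
    """Return the inclusive (start, end) byte offsets of a 1-based SQLite page."""
    if page_no < 1:
        raise ValueError("SQLite pages are numbered starting at 1")
    start = (page_no - 1) * page_size
    return start, start + page_size - 1


def range_header(page_no: int, page_size: int = 4096) -> Dict[str, str]:
    """Build the HTTP Range header that requests exactly one page."""
    start, end = page_byte_range(page_no, page_size)
    return {"Range": f"bytes={start}-{end}"}
```

For example, `range_header(1, 1024)` yields `{"Range": "bytes=0-1023"}`.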

ok_dad 3 years ago |

Very cool, you should add raft so every node could be the primary if the primary fails. You just need to add election and a few minor state things on top of what’s there already, I think.

vaxman 3 years ago |

LiteFS, a tool for edge-case data loss, unintended service hangs and data corruption /s

This kind of scheme was explored and naturally selected away in the pre-Internet networking era (whose applied knowledge was mostly lost and unavailable to newer generations after the boomer purge that began in the 2000 market crash). This kind of scheme should always be isolated to tightly coupled machines operating over fault-tolerant and RELATIVELY high-speed links (like a cluster of boards interconnected via PCIe, fibre or Thunderbolt, each with [controller] ECC memory), not between WAN zones.

But in a tightly-coupled cluster-type environment, better solutions would exist like Redis Cluster (and RediSQL) that could be further upgraded with some kind of shared pagefile.

But for testing near-obsolete non-serverless cloud code against edge cases, consider building a testbed of 4 or 5 RPis talking to each other over (really) slow LoRa links, then amp up the RFI, thermal and vibration procedures and add all the code to maintain the data integrity as you monitor for the various failure scenarios.

vluft 3 years ago |

looks pretty neat, and I'm a big fan of litestream, but I can't help but feel that requiring a 400k loc election mechanism (2 mil if you include deps) for your 180k loc database is slightly excessive.

  ROLLBACK; -- cancel the tx, e.g. because a different dbconn thread detected updated data before the tx was to be COMMITted
  -- Replay the tx
  BEGIN;
  -- replay the same SQL statements
  COMMIT;
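
A minimal sketch of the rollback-and-replay pattern from the snippet above, using Python's `sqlite3` module. The `Conflict` exception and the `run_with_replay` helper are hypothetical, and how stale data is detected is left to the caller. The connection is assumed to be opened with `isolation_level=None` so that `BEGIN`, `COMMIT` and `ROLLBACK` are issued explicitly.

```python
import sqlite3
from typing import Callable


class Conflict(Exception):
    """Raised by the caller's statements when they detect stale data (hypothetical)."""


def run_with_replay(
    conn: sqlite3.Connection,
    statements: Callable[[sqlite3.Connection], None],
    max_attempts: int = 3,
) -> int:
    """Run `statements` in a transaction, replaying it after each Conflict.

    Returns the attempt number that committed.
    """
    for attempt in range(1, max_attempts + 1):
        conn.execute("BEGIN")
        try:
            statements(conn)  # the same SQL statements are replayed on each attempt
        except Conflict:
            conn.execute("ROLLBACK")  # cancel the tx, then loop to replay it
            continue
        conn.execute("COMMIT")
        return attempt
    raise RuntimeError("transaction still conflicted after replays")
```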