I love it when at first glance it isn't clear if a project is a crazy idea from someone just goofing around vs a highly leveraged crazy idea that will be a foundational part of a major technology shift.
I suspect it's the latter and the strategy though is to layer this on top of litestream to create an easy way to use sqlite transparently in a widely distributed multi-node environment (litestream providing the backups and/or readonly replication to remote sites, with LiteFS handling low latency local access in a cluster, POP, or data center).
Cool stuff. It will be fun to see where fly takes this :)
As for strategy, it unfortunately doesn't work to layer with Litestream as backups need some strict control over who is the current primary. Instead, I'm adding S3 replication support [1] directly into LiteFS. LiteFS also uses a different transactional file format called LTX so it wouldn't be compatible with Litestream. The LTX format is optimized for compactions so point-in-time restores can be nearly instant.
The end goal isn't much of a secret. We want to let folks spin up nodes in regions across the world, automatically connect to one another, and have the whole thing have the ease of a single node app. We still have a ways to go on that vision but that's what we're going for.
Nice! There's a lot of value one can get out of a blob store, despite it appearing seemingly at odds with block-device dependent systems, like most sql dbms.
When a database at BigCloud layered replication (real-time backups) atop S3, they did so by shipping both the WAL and the on-disk files. For write heavy tables, WAL was streamed every second, and on-disk files (snapshots) every 30mins (or at some apt size-based threshold).
While WAL streaming also doubled-up as a key foundation for them to build materialized views, support real-time triggers, and act as an online data-verification layer; S3 itself served as an insurance against hardware errors (memory, cpu, network, disk) and data corruption.
https://web.archive.org/web/20220712155558/https://www.useni... (keyword search S3)
Elasticsearch / OpenSearch does something similar but it only implements snapshot-based replication to S3 (periodic backups).
https://web.archive.org/web/20190722153122/https://www.micro... / https://archive.is/Q5jUj (docs)
Has anything other then the Cloud presented a true foundational shift in how applications are built? Kubernetes, Serverless, Blockchain, React, Swift, these things are big but not big enough.
I think we just like pretending every little thing is the next big thing.
One perfectly reasonable use case for a read replica of a database is a bastion server. Database + web server on a machine that is firewalled both from the internet and from the business network. With read only access there is a much smaller blast radius if someone manages to compromise the machine.
The problem is that every single replication implementation I've seen expects the replicas to phone home to the master copy, not for the master copy to know of the replicas and stream updates to them. This means that your bastion machine needs to be able to reach into your LAN, which defeats half the point.
The most important question is, "what options exist to support this?" but I think the bigger question is why do we treat replicas as if they are full peers of the system of record when so often not only are they not - mechanically or philosophically - and in some cases couldn't be even if we wanted to? (eg, a database without multi-master support).
I think rqlite having a single binary that handles Raft / consensus _and_ includes SQLITE makes it simpler. Beyond 'hello world', Consul isn't trivial to run and Fly have blogged about this [0]
The advantage of LiteFS/Litestream is that, for the most part, the database is "just" SQLite. You can't really say that, to the same extent, about rqlite.
I hope rqlite takes off! It's a good project.
We've spent a lot of time at Fly.io wrestling with Consul, but that's because we abuse it. That's what the article is about: we shoehorned Consul into a part of our architecture where we're taxed for features it has that we don't actually use (the overwhelming majority of all the data we have in Consul is stuff for which there's a single, continually available source of truth for the data, and Consul was just a "convenient" way to replicate it). Consul is great for the stuff it's meant for.
I wouldn't hesitate to reach for Consul in a new design. I just wouldn't use it for the thing we used it for.
Hats off to Ben and Fly.io, you're doing some cool stuff.
There are great solutions out there that handle these things and have for a long time.
I know SQLite has become the new hotness, but I really do not want SQLite to get good at all these things because then it would no longer be great at what it does marvelously already.
> """LiteFS is a FUSE-based file system for replicating SQLite databases across a cluster of machines."""
> """Leader election: currently implemented by Consul using sessions """
That sounds enterprisy to me.
Ben and Matt - appreciate your contributions in this area. I'm interested in making contributions along with you. Please let me know if you are looking for help. Much Thanks.
I don't remember if that's in the docs; commit that adds the configuration field https://github.com/backtrace-labs/verneuil/commit/027318ba74... and commit for the enum https://github.com/backtrace-labs/verneuil/blob/e6697498f3ba...
(the actual implementation is nothing special https://github.com/backtrace-labs/verneuil/commit/b6bdfcf7bc...)
Issue for the feature: https://github.com/backtrace-labs/verneuil/issues/12
Not sure what's your full scenario, because you mentioned "lazily download" so thought you might have a luck here https://news.ycombinator.com/item?id=27016630
tl;dr: using http range and a smaller page size - might be the way to go
this kind of scheme was explored and naturally selected away in the pre-Internet networking era (whose applied knowledge was mostly lost and unavailable to newer generations after the boomer purge that began in the 2000 market crash) .. this kind of scheme should always be isolated to tightly coupled machines operating over fault tolerant and RELATIVELY high-speed links (like a cluster of boards interconnected via PCIe, fibre or Thunderbolt, each with [controller] ECC memory) not between WAN zones.
But in a tightly-coupled cluster-type environment, better solutions would exist like Redis Cluster (and RediSQL) that could be further upgraded with some kind of shared pagefile.
But for testing near-obsolete non-serverless cloud code against edge-cases, consider building a testbed of 4 or 5 rPis talking to each other over (really) slow LoRA links ..then amp up the RFi, thermal and vibration procedures and add all the code to maintain the data integrity as you monitor for the various failure scenarios.
Feel free to ping me at benbjohnson@yahoo.com if you want to chat WAL & journaling. I think it'd be difficult to implement CRDT at the journal level since it's all physical pages. Merging data in pages is going to sometimes cause splits and that'll be hairy to deal with.
As someone who spends an awful amount of time using FUSE, my recommendation is to only use it in cases where the software that interacts with the file system isn't easily changeable. For example, for Buildbarn which I maintain (https://github.com/buildbarn/bb-remote-execution), I need to use it. It's infeasible to change arbitrary compilers and tests to all interact with a network distributed build cache. Designing the FUSE file system was a pretty heavy investment though, as you really need to be POSIXly correct to make it all work. The quality of implementations of FUSE also varies between OSes and their versions. macFUSE, for example, is quite different from Linux FUSE.
Given that SQLite already has all of the hooks in place, I would strongly recommend using those. In addition to increasing portability, it also makes it easier to set up/run. As an example, it's pretty hard to mount a FUSE file system inside of a container running on Kubernetes without risking locking up the underlying host. Doing the same thing with the SQLite VFS hooks is likely easy and also doesn't require your container to run with superuser privileges.
There's some interesting work I'm looking at with the VFS and WASM where you could deploy something like this across pure serverless functions like Vercel or Deno.
As WAL'd sqlite on networked filesystems is a no-go, once LiteFS gets wal support I think this will be a savior software for many operators with existing sqlite deployments.
Could we even host LiteFS with underlying dbs on a filesystem that otherwise wouldn't play well with WAL (say nfs, ceph or gluster)? An effectively single-node deployment that would achieve redundancy through the FS.
I was very excited for the experimental support for live replication in Litestream, but as I understand that didn't pan out. This looks to be the successor, with fanout replication an explicit feature.
Using a FUSE layer that's detecting changes is sure to have some performance tradeoffs. What benchmarks are under way? Do you need any help?
The FUSE layer does have performance trade-offs. Read queries aren't affected as much since the OS page cache is in the kernel and avoids the FUSE layer entirely. Most databases are under 10GB (or honestly probably less than 1GB) so you can fit most of that in the page cache for any moderately sized VM. The write side will incur a performance hit but I haven't taken benchmarks yet. There's a lot that can still be optimized around that.
The biggest help that I could use is just to get feedback on how it feels to use and any ways I can improve it. The end goal is to make it ridiculously simple to spin up and "just work" so any ideas towards that goal would be awesome.
Another goal is to simplify deployment. It's a lot easier and cost-effective to just deploy out 10 application nodes across a bunch of regions rather than having to also deploy Postgres replicas in each of those regions too.
Litestream interacts with the SQLite API for locking but it uses a regular file descriptor to read the WAL. It has to do similar parsing of pages but it's a bit more complicated since it needs to re-verify its position every time it reads from the WAL.
Another use case is moving data close to users. If you're only targeting users in a single country then it's not as big of a deal but RTT from the US to Europe is ~100ms and US to Asia is ~250ms. That's a big latency hit depending on what you're trying to do.
You can set up a PostgreSQL replica to be driven purely off of archive logs. It does not need direct access to the source database as it can pull the archive files via read only to a third location (e.g. file server or S3) that gets pushed by the source database server. The catch is that it will only be updated when an WAL file is pushed which can be driven either by size (automatically on an "active" database) or time (every N seconds or minutes). If you're fine with potentially being a minute behind the source, you can easily set this up.
To be fair, there are a number of ways a hostile endpoint can screw with another server even just by screwing around with TCP protocol behavior, so perhaps I'm putting too fine a point on it.
I believe Cassandra does not have a Master\follower architecture; it's following a ring based structure.
When a new node added, it takes responsibility for a segment of a ring. When node removed, it's segment get redistributed.
It's however in no way a drop-in replacement for RDBMS and requires a careful planning around application read and writes patterns that is very different from your typical RDBMS. Definitely, can't be used in this scenario - every node needs to be able to access every other node and client must be able to access at least one node.
> Isolation And Concurrency: SQLite implements isolation and concurrency control (and atomicity) using transient journal files that appear in the same directory as the database file. There are two major "journal modes". The older "rollback mode" corresponds to using the "DELETE", "PERSIST", or "TRUNCATE" options to the journal_mode pragma. In rollback mode, changes are written directly into the database file, while simultaneously a separate rollback journal file is constructed that is able to restore the database to its original state if the transaction rolls back. Rollback mode (specifically DELETE mode, meaning that the rollback journal is deleted from disk at the conclusion of each transaction) is the current default behavior.
> Since version 3.7.0 (2010-07-21), SQLite also supports "WAL mode". In WAL mode, changes are not written to the original database file. Instead, changes go into a separate "write-ahead log" or "WAL" file. Later, after the transaction commits, those changes will be moved from the WAL file back into the original database in an operation called "checkpoint". WAL mode is enabled by running "PRAGMA journal_mode=WAL".
> In rollback mode, SQLite implements isolation by locking the database file and preventing any reads by other database connections while each write transaction is underway. Readers can be active at the beginning of a write, before any content is flushed to disk and while all changes are still held in the writer's private memory space. But before any changes are made to the database file on disk, all readers must be (temporarily) expelled in order to give the writer exclusive access to the database file. Hence, readers are prohibited from seeing incomplete transactions by virtue of being locked out of the database while the transaction is being written to disk. Only after the transaction is completely written and synced to disk and committed are the readers allowed back into the database. Hence readers never get a chance to see partially written changes.
> WAL mode permits simultaneous readers and writers. It can do this because changes do not overwrite the original database file, but rather go into the separate write-ahead log file. That means that readers can continue to read the old, original, unaltered content from the original database file at the same time that the writer is appending to the write-ahead log. In WAL mode, SQLite exhibits "snapshot isolation". When a read transaction starts, that reader continues to see an unchanging "snapshot" of the database file as it existed at the moment in time when the read transaction started. Any write transactions that commit while the read transaction is active are still invisible to the read transaction, because the reader is seeing a snapshot of database file from a prior moment in time.
> An example: Suppose there are two database connections X and Y. X starts a read transaction using BEGIN followed by one or more SELECT statements. Then Y comes along and runs an UPDATE statement to modify the database. X can subsequently do a SELECT against the records that Y modified but X will see the older unmodified entries because Y's changes are all invisible to X while X is holding a read transaction. If X wants to see the changes that Y made, then X must end its read transaction and start a new one (by running COMMIT followed by another BEGIN.)
Or:
ROLLBACK; // cancel the tx e.g. because a different dbconn thread detected updated data before the tx was to be COMMITted.
// Replay the tx
BEGIN;
// replay the same SQL statements
COMMIT;> Usually, SQLite allows at most one writer to proceed concurrently. The BEGIN CONCURRENT enhancement allows multiple writers to process write transactions simultanously if the database is in "wal" or "wal2" mode, although the system still serializes COMMIT commands.
https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...
Having worked with FUSE in the past, there is always some performance penalty, but the advantage is that there is no change in the application.
The VFS will need change in the application but less overhead. Trade-offs!
Technically, an extension and VFS are separate things.
An extension when loads, allows you to register the VFS with the Sqlite. It can also register virtual tables as well.