Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user level process has accurate information about the total state of the machine, only the kernel (or hypervisor) does and it's an exercise in futility to try to manage paging etc at the user level.
As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. https://slideshot.epfl.ch/play/suri_stonebraker (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: perfect crash-proof reliability, and perfect linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits into a CPU's 32KB L1 instruction cache.
This is co-authored by Pavlo, Viktor Leiss, with feedback from Neumann. I'm sorry, but if someone on the internet claims to know better than those 3, you're going to need some monumental evidence of your credibility.
Additionally, what you link here:
> ... (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Is discussing "Main Memory" databases. These databases do no I/O outside of potential initial reads, because all of the data fits in-memory!These databases represent a small portion of contemporary DBMS usage when compared to traditional RDBMS.
All you have to do is look at the bandwidth and reads/sec from the paper when using O_DIRECT "pread()"s versus mmap'ed IO.
(My understanding is that the GP wrote LMDB, works on openLDAP, and was a maintainer for BerkelyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments).
There's nothing special about kernel programmers. In fact, if I had to compare, I'd go with storage people being the more experienced / knowledgeable ones. They have a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are, largely, in the same group as storage (often time literally the same people).
As for "single process" argument... well, if you run a database inside an OS, then, obviously, that will never happen as OS has its own processes to run. But, if you ignore that -- no DBA worth their salt would put database in the environment where it has to share resources with applications. People who do that are, probably, Web developers who don't have high expectations from their database anyways and would have no idea how to configure / tune it for high performance, so, it doesn't matter how they run it, they aren't the target audience -- they are light years behind on what's possible to achieve with their resources.
This has nothing to do with mmap though. mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.
Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.
> no DBA worth their salt would put database in the environment where it has to share resources with applications.
Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases - it's all about "edge computing". There's far more DBs in use in the world than there are DBAs managing them.
> mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.
Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
In my experience -- and in line with the article -- mmap works fine with small working sets. It seems that most benchmarks of lmdb have relatively small data sets.
Where did you look? This is a sample using DB 5x and 50x larger than RAM http://www.lmdb.tech/bench/hyperdex/
There are plenty of other larger-than-RAM benchmarks there.
The article is about DBMS developers. For DBMS developers, "in practice" (a) and (b) are usually true I think.
Those who do that don't know what they are doing (even if they outnumber the other side hundred to one, they "don't count" because they aren't aiming for good performance anyways).
Well, maybe not quite... of course it's possible that someone would want to deploy a database in a container because of the convenience of assembling all dependencies in a single "package", however, they would never run database on the same node as applications -- that's insanity.
But, even the idea of deploying a database alongside something like kubelet service is cringe... This service is very "fat" and can spike in memory / CPU usage. I would be very strongly opposed to an idea of running a database on the same VM that runs Kubernetes or any container runtime that requires a service to run it.
Obviously, it says nothing about the number of processes that will run on the database node. At the minimum, you'd want to run some stuff for monitoring, that's beside all the system services... but I don't think GP meant "one process" literally. Neither that is realistic nor is it necessary.
(But just in containers, not in Kubernetes. I'm not crazy.)
And we are running them at the scale that most people can’t even imagine.
Note that Varnish dates to 2006, in the days of hard disk drives, SCSI, and 2-core server CPUs. Mmap might well have been as good or even better than I/O back then - a lot of the issues discussed in this paper (TLB shootdown overhead, single flush thread) get much worse as the core count increases.
AFAIK the persistent backend was dropped pretty early on (eventually replaced with a more traditional read()/write()-based one as part of Varnish Plus), and the general recommendation became just to use malloc and hope you didn't swap.
What did you differently in your custom one that was faster then varnish?
reminds me of how industries typically start out dominated by vertically integrated companies, move to specialized horizontal companies, then generally move back to vertical integration due to efficiency. Car industry started this way with Ford, went away from it, and now Tesla is doing it again. Lots of other examples in other industries
You almost always want somewhere in the middle, but it’s often much easier to move back after a large jump in one direction than to push towards the middle.
And there was a very well known cartoon video discussion about it with “web scale” and “just write to dev null” and other classics that became memes :)
Are You Sure You Want to Use MMAP in Your Database Management System? [pdf] - https://news.ycombinator.com/item?id=31504052 - May 2022 (43 comments)
Are you sure you want to use MMAP in your database management system? [pdf] - https://news.ycombinator.com/item?id=29936104 - Jan 2022 (127 comments)
You notice it when web servers are doing kernel bypass to for zero-copy, low-latency networking, or database engines throw away the kernel's page cache to implement their own file buffer.
With mmap, you get to avoid thinking about how much data to buffer at once, caching data to speed up repeated access, or shedding that cache when memory pressure is high. The kernel does all that. It may not do it in the absolute ideal way for your program but the benefit is you don't have to think about these logistics.
But if you're already writing intense systems code then you can probably do a better job than the kernel by optimizing for your use case.
You'll find DPDK mentioned a lot in the networking/HPC/data center literature. An example of a backend framework that uses DPDK is the seastar framework [2]. Also, I recently stumbled upon a paper for efficient RPC networks in data centers [3].
If you want to learn more, the p99 conference has tons of speakers talking about some interesting challenges in that space.
Edit: Hm, it might not be possible to mmap files with huge-pages. This LWN article[1] from 5 years ago talks about the work that would be required, but I haven't seen any follow-ups.
Then there's the part with writes being delayed. Be prepared to deal with blocks not necessarily updating to disk in the order they were written to, and 10 seconds after the fact. This can make power failures cause inconsistencies.
If you have the resources to write and maintain the bespoke method great. The large database developers probably have this. For others please don't take this link and go around claiming mmap is bad though. That gets tiresome and is misguided. Mmap is a shortcut to access large files in a non linear fashion. It's good at that too. Just not as good as a bespoke function.
This is an appeal to core database engineers to stop using the wrong tool for the job.
This is not specific to mmap -- regular old write() calls have the same behavior. You need to fsync() (or, with mmap, msync()) to guarantee data is on disk.
This is not true. This depends on how the file was opened. You may request DIRECT | SYNC when opening and the writes are acknowledged when they are actually written. This is obviously a lot slower than writing to cache, but this is the way for "simple" user-space applications to implement their own cache.
In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs. Sometimes it's even configurable. This is why databases also offer various modes of synchronization -- you need to know how your appliance works and configure the database accordingly.
Here is an LWN article discussing the whole problem as the Postgres team found out about it.
For the second part of your comment, on Linux systems, there is the msync() system call that can be used to flush the page cache on demand.
for everyone, not just the file you mapped to memory. I.e. the guarantee is that your file will be written, but there's no way to do that w/o affecting others. This is not such a hot idea in an environment where multiple threads / processes are doing I/O.
Is there a performance benefit to be had by managing the memory and paging yourself? Yes. But eventually you will also consider running processes next to your database, for logging, auditing, ingesting data, running backups, etc. Virtual memory across the whole system helps with that, especially if other people will be using your database in ways you can't predict. As for the efficiency of MMUs and the OS, seems like for almost all cases it's "satisfactory" enough[1].
[0] http://denninginstitute.com/pjd/PUBS/bvm.pdf
[1] From 1969! https://dl.acm.org/doi/pdf/10.1145/363626.363629
The reality is there will always be a hierarchy for storage, and paging will always be the best mechanism to deal with it. Because primary memory will always be most expensive, no matter what technology it's based on. There will always be something slower, cheaper, and denser that will be used for secondary storage. There will always be cheaper storage. And its capacity will exceed primary, and it will always be most efficient to reference secondary storage in chunks - pages - and not at individual byte addresses.
From reading the paper most of the concerns are around the write side. LMDB is the primary implementation that I know which leans heavily into mmap but it also comes with a number of constraints there(single writer, read locks can lead to unbounded appending to the WAL, etc). As with any tech choice it's about knowing constraints/trade-offs and making appropriate choices for your domain.
The opposite with actual file io sucks in terms of complexity. I get that you can write bespoke code that performs better but mmap is a one liner to turn a file into an array.
As for why disk reads fail, yes that's a thing. Less common on internal storage (bad sectors), but more common on removable USB devices or Network drives (especially on wifi).
There's so much you get "for free" and the UX/DX of reads/writes to it, especially if you're primarily operating on structs instead of raw byte/string data.
(Example, reading a file and "reinterpret_cast<>"'ing it from bytes to in-memory struct representations)
It's just that for the _particular_ case of a DBMS that relies on optimal I/O and transactionality, the general-purpose kernel implementation of mmap falls short of what you can implement by hand.
The point was simply about other processes that could be competing for resources - CPU, memory, or I/O. It is expensive for a user-level process to perform accounting for all of these resources, and without such accounting you can't optimally allocate them.
If there are other apps that can suddenly spike memory usage then any careful buffer tuning you've done goes out the window. Likewise for any I/O scheduling you've done, etc.
It's also strange to me that there's no transition in performance when the data set size grows beyond cache.
It is a realistic concern, I’ve lived it for more than a decade across many orgs, though I shared your opinion at one point. Storage density is massively important for both workload scalability and economic efficiency. Low storage density means buying a ton of server hardware that sits idle under max load and vastly larger clusters than would otherwise be necessary, which have their own costs.
When your database is sufficiently large, backup and restore often isn’t even a technical possibility so that requirement is a red herring. The kinds of workloads that can be recovered from backup at that scale on a single server, and some can, benefit massively from the economics of running it on a single server. A solution that has 10x the AWS bill for the same workload performance doesn’t get chosen.
At scale, hardware footprint economics is one of the central business decision drivers. Data isn’t getting smaller. It is increasingly ordinary for innocuous organizations to have a single table with a trillion records in it.
For better or worse, the market increasingly drives my technical design decisions to optimize for hardware/cloud costs above all else, and dense storage is a huge win for that.
Another technique that can only be done with mmap is to map two contiguous regions of virtual memory to the same underlying buffer. This allows you to use a ring buffer but only read from/write to what looks like a contiguous region of memory.
Also, I've never tested this, but I believe mapped files will get flushed as long as the system stays running. So if you only need resilience against abnormal termination rather than system crashes, it seems like a good option?
Linux will not lose data written to a MAP_SHARED mapping when the process crashes.
But! Linux will synchronously update mtime when starting to write to a currently write protected mapping (e.g. one which was just written out). This means (a) POSIX is violated (IMO) and (b) what should be a minor fault to enable writes turns into an actual metadata write, which can cause actual synchronous IO.
I have an ancient patch set to fix this, but I never got it all the way into upstream Linux.
What you can do is mmap a file on a tmpfs as long as you trust yourself to have some other reliable process handle the data even if your application terminates abnormally. This is awkward with a container solution if you need to survive termination of the entire container.
However, Java can build a special library file of the core JRE classes that it can mmap into memory with the intent to speed up startup times, mostly for small Java programs.
Guile scheme will mmap files that have been compiled to byte code. You can visualize a contrived (especially today) scenario where Guile is used for CGI handlers, having the bulk of their code mapped, the overall memory impact of simultaneous handlers is much lower, as well as start up times.
The process model is less common today so the value of this goes down, but it can still have its place.
A very over-simplified and probably a bit incorrect description of what it did was to create a smaller version of the image - one that could fit in memory - by sub sampling every nth pixel, which was addressed via mmap.
It actually dealt with jpegs so I have no idea how that bit worked as they are not bitmaps.
Admittedly I live in a world where big distributed transactions are a given and work fine and sql speeds us up not slows us down. I’m guessing sql and acid scaled after all?
Yes and no. Distributed transactions and two-phase commit have been superseded by things like Paxos and Raft, with a variety of consistency models, so the implementation is drastically different.
There are some applications that require high throughput (usually write) but can be fine with read consistency.
Couple of examples - consumer facing comment systems where other users are OK to miss your comment by 30 seconds - timeseries logging where you are usually reading infrequently but writing very much in a denormalized format so joins aren't as critical
For general CRUD, ACID is important though.
I do wonder what trend is going to win: bypass the kernel or embrace the kernel for everything?
The way I see it, latency decreases either way (as long as you don't have to switch back and forth between kernel and user space), but userspace seems better from a security standpoint.
Then again, everyone is doing eBPF, so probably the "embrace the kernel" approach is going to win. Who knows.
That said, I'm not sure many people write webservers in DPDK, since the Kernel is pretty well suited to webservers (sendfile, etc.). Most applications that use kernel-bypass are more specialized.
That may be acceptable for your purposes, or it may not.
I'm not very familiar with that though.
Then I found out Apache supports it via the EnableSendfile directive. Nice.
>This directive controls whether httpd may use the sendfile support from the kernel to transmit file contents to the client. By default, when the handling of a request requires no access to the data within a file -- for example, when delivering a static file -- Apache httpd uses sendfile to deliver the file contents without ever reading the file if the OS supports it.
* nginx: [1] * Haskell webserver module: [2] * caddy: [3]
[1]: https://nginx.org/en/docs/http/ngx_http_core_module.html#sen... [2]: https://hackage.haskell.org/package/warp-3.3.28/docs/Network... [3]: https://github.com/caddyserver/caddy/pull/5022
That said, it's tricky to use if the server also does TLS termination... then you need kTLS, which is a much bigger can of worms.
> int msync(void addr[.length], size_t length, int flags);
> msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem
If you don't know why it's not possible, here's a simplified version of it: hardware protocols (s.a. SCSI) must have fixed size messages to fit them through the pipeline. I.e. you cannot have a message larger than the memory segment used for communication with the device, because that will cause fragmentation and will lead to a possibility of message being corrupted (the "tail" being lost or arriving out of order).
On the other hand, to "flush" a file to persistent storage you'd have to specify all blocks associated with the file that need to be written. If you try to do this, it will create a message of arbitrary size, possibly larger than the memory you can store it in. So, the only way to "flush" all blocks associated with a file is to "flush" everything on a particular disk / disks used by the filesystem. And this is what happens in reality when you do any of the sync family commands. The difference is only in what portion of the not-yet synced data the OS will send to the disk before requesting a sync, but the sync itself is for the entire disk, there aren't any other syncs.
Source: I'm a Linux kernel developer.
Well sure, but 99.9% of people don't do that (and shouldn't, unless they really know what they are doing).
> In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs.
What network-attached storage actually uses O_SYNC behavior without being asked? I'd be quite surprised if any did this as it would make typical workloads incredibly slow in order to provide a guarantee they didn't ask for.
Also, most of the network-attached storage we people use is in the form of things like EBS, which is very careful to imitate the behavior of a real disk, but with different performance and some different (albeit very rare) failure modes.
If you are developing an DBMS and haven't studied computer architecture, the best idea is probably to ask more experienced people to help out with your ideas.
From my limited knowledge, I don't think the article is old enough to be obsolete, just that there's a lot more to it.
Not to be gatekeeping or anything, but it is a pretty well studied field with lots of very knowledgeable people around, who are probably more than keep to help. There aren't too many qualified jobs around and you probably have a budget if you are developing a database commercially.
It's been a while since I've dealt with mmap(), but isn't this what msync() does? You can synchronously or asynchronously force dirty pages to be flushed to disk without waiting until munmap().
That's patently false. There are about 8 bn. people. Even if everyone has a smartphone or two, it's nothing compared to the total of all devices that can be called "computer". I think that "smart TV" alone will beat the number of smartphones. But even that is a drop in a bucket when it comes to the total of running programs on Earth / its orbit.
But, that's beside the point. Smartphones aren't designed to run database servers. Even if they indeed were the majority, they'd still be irrelevant for this conversation because they are a wrong platform for deploying databases. In other words, it doesn't matter how people deploy databases to smartphones -- they have no hopes of achieving good performance, and whether they use mmap or not is of no consequences -- they've lost the race before they even qualified for it.
> LMDB
Are we talking about this? https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa... If so, this is irrelevant for databases in general.
> LMDB databases may have only one writer at a time
(Taken from the page above) -- this isn't a serious contender for database server space. It's a toy database. You shouldn't give general advice based on whatever this system does or doesn't.
Also, it brings nothing of value to the table, but requires a lot of dance around it to keep it going. I.e. if you are a decent DBA, you don't have a problem setting up a node to run your database of choice, you would be probably opposed to using pre-packaged Docker images anyways.
Also, Kubernetes sucks at managing storage... basically, it doesn't offer anything that'd be useful to a DBA. Things that might be useful come as CSI... and, obviously, it's better / easier to not use a CSI, but to interface directly with the storage you want instead.
That's not to say that storage products don't offer these CSI... so, a legitimate question would be why would anyone do that? -- and the answer is -- not because it's useful, but because a lot of people think they need / want it. Instead of fighting stupidity, why not make an extra buck?
If I run a db workload in K8s, it’s a tiny fraction of the operational overhead, and not a massively noticeable performance loss.
I would absolutely love a way to deploy and manage db’s as easily as K8s with fewer of the quite significant issues that have mentioned, so if you know of something that is better behaved around singular workloads, but keeps the simple deploys, the resiliency, the ease of networking and config deployments, the ease of monitoring, etc, I am all ears.
1. fsync. You cannot "divide" it between containers. Whoever does it, stalls I/O for everyone else.
2. Context switches. Unless you do a lot of configurations outside of container runtime, you cannot ensure exclusive access to the number of CPU cores you need.
3. Networking has the same problem. You would either have to dedicate a whole NIC or SRI-OV-style virtual NIC to your database server. Otherwise just the amount of chatter that goes on through the control plane of something like Kubernetes will be a noticeable disadvantage. Again, containers don't help here, they only get in the way as to get that kind of exclusive network access you need more configuration on the host, and, possible an CNI to deal with it.
4. kubelet is not optimized to get out of your way. It needs a lot of resources and may spike, hindering or outright stalling database process.
5. Kubernetes sucks at managing memory-intensive processes. It doesn't work (well or at all) with swap (which, again, cannot be properly divided between containers). It doesn't integrate well with OOM killer (it cannot replace it, so any configurations you make inside Kubernetes are kind of irrelevant, because system's OOM killer will do how it pleases, ignoring Kubernetes).
---
Bottom line... Kubernetes is lame from infrastructure perspective. It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in an even of failure will come to bite you. You don't want that kind of program near your database.
- lots of containers running on a single host
- containers are each isolated in a VM (aka virtualized)
- workloads are not homogenous and change often (your neighbor today may not be your neighbor tomorrow)
I believe these are fair assumptions if you’re running on generic infrastructure with kubernetes.
In this setup, my concerns are pretty much noisy neighbors + throttling. You may get latency spikes out of nowhere and the cause could be any of:
- your neighbor is hogging IO (disk or network)
- your database spawned too many threads and got throttled by CFS
- CFS scheduled your DBs threads on a different CPU and you lost your cache lines
In short, the DB does not have stable, predictable performance, which are exactly the characteristics you want it to have. If you ran the DB on a dedicated host you avoid this whole suite of issues.
You can alleviate most of this if you make sure the DB’s container gets the entire host’s resources and doesn’t have neighbors.
Wrong. I wouldn't use kubelet at all. Kubernetes and good performance are not compatible. The goal of Kubernetes is to make it easier to deploy Web sites. Web is a very popular technology, so Kubernetes was adopted in many places where it's irrelevant / harmful because Web developers are plentiful and will help to power through the nonsense of this program. It's there because it makes trivial things even easier for less qualified personnel. It's not meant as a way to make things go faster, or to use less memory, or to use less persistent storage, or less network etc... it's the wheelchair of ops, not a highly-optimized professional-grade equipment.
It's simple, until you hit a problem. And then it becomes a lot worse than if you had never touched it. You are now in the stage of a person who'd never made backups and never had a failure that required them to restore from backups, and you are wondering why would anyone do it. Adverse events are rare, and you may go like this for years, or, perhaps the rest of your life... unfortunately, your experience will not translate into a general advice.
But, again, you just might be in the camp where performance doesn't matter. Nor does uptime matter, nor does your data have very high value... and in that case it's OK to use tools that don't offer any of that, and save you some time. But, you cannot advise others based on that perspective. Or, at least, not w/o mentioning the downsides.
I am not sure what complexity Kubernetes adds in this situation. Anything Kubernetes can do to you, your cloud provider (or a poorly aimed fire extinguisher) can do to you. You have to be ready for a disaster no matter the platform.
For fun, there is no guarantee in terms of writing a page in what order it is written. SQLite documents that they assume (but cannot verify) that _sector_ writes are linear, but not atomic. https://www.sqlite.org/atomiccommit.html
> If a power failure occurs in the middle of a sector write it might be that part of the sector was modified and another part was left unchanged. The key assumption by SQLite is that if any part of the sector gets changed, then either the first or the last bytes will be changed. So the hardware will never start writing a sector in the middle and work towards the ends. We do not know if this assumption is always true but it seems reasonable.
You are talking several levels higher than that, at the page level (composed of multiple sectors).
Assume that they reside in _different_ physical locations, and are written at different times. That's fun.
> Currently all hard drive/SSD manufacturers guarantee that 512 byte sector writes are atomic. As such, failure to write the 106 byte header is not something we account for in current LMDB releases. Also, failures of this type should result in ECC errors in the disk sector - it should be impossible to successfully read a sector that was written incorrectly in the ways you describe.
Even in extreme cases, the probability of failure to write the leading 128 out of 512 bytes of a sector is nearly nil - even on very old hard drives, before 512-byte sector write guarantees. We would have to go back nearly 30 years to find such a device, e.g.
https://archive.org/details/bitsavers_quantumQuaroductManual...
Page 23, Section 2.1 "No damage or loss of data will occur if power is applied or removed during drive operation, except that data may be lost in the sector being written at the time of power loss."
From the specs on page 15, the data transfer rate to/from the platters is
1.25MB/sec, so the time to write one full sector is 0.4096ms; the time to
write the leading 128 bytes of the sector is thus 1/4 of that: 0.10ms. You
would have to be very very unlucky to have a power failure hit the drive
within this .1ms window of time. Fast-forward to present day and it's simply
not an issue.
^ above quoted from https://lists.openldap.org/hyperkitty/list/openldap-devel@op...Assume 512 sectors ( I know those are rare ), but I don't think that there is any guarantees that 4KB page would be:
* Written atomically * Written in a particular order
Andy and I have had this debate going for a long time already.
Isn't LMBD closer to an embedded key-value store than an RDBMS, though? Also there's a section in the paper that mentions it's single-writer.
Interestingly, most of the reason for these problems has to do with theoretical limitations of cache replacement algorithms as drivers of I/O scheduling. There are alternative approaches to scheduling I/O that work much better in these cases but mmap() can’t express them, so in those cases bypassing mmap() offers large gains.
<<<There is one significant drawback that should not be understated. Algorithm design using topology manipulation can be enormously challenging to reason about. You are often taking a conceptually simple algorithm, like a nested loop or hash join, and replacing it with a much more efficient algorithm involving the non-trivial manipulation of complex high-dimensionality constraint spaces that effect the same result. Routinely reasoning about complex object relationships in greater than three dimensions, and constructing correct parallel algorithms that exploit them, becomes easier but never easy.>>>
- Queries can trigger blocking page faults when accessing (transparently) evicted pages, causing unexpected I/O stalls
- mmap() complicates transactionality and error-handling
- Page table contention, single-threaded page eviction, and TLB shootdowns become bottlenecks
2 - complexity? this is simply false. LMDB's ACID txns using MVCC are much simpler than any "traditional" approach.
3 - contention is a red herring since this approach is already single-writer, as is common for most embedded k/v stores these days. You lose more perf by trying to make the write path multi-threaded, in lock contention and cache thrashing.
Excuse me for a silly question, but whilst an I/O stall may be unavoidable, wouldn't a thread stall be avoidable if you're not using mmap?
Assuming that you're not swapping, you'll generally know if you've loaded something into memory or not, whilst mmap doesn't help you know if the relevant page is cached. If the data isn't in memory, you can send the I/O request to a thread to retrieve it, and the initiating thread can then move onto the next connection. I suspect this isn't doable under mmap based access?
We shouldn't apply a higher bar to the counterargument than we applied to the argument in the first place.
Why are you assuming containers are virtualized? Is there some container runtime that does that as an added security measure? I thought they all use namespaces on Linux.
Alway allocate whole cores, just mask them off
Dedicate physical IO devices for sensitive workloads
You can have per cgroup swap if you want, but imo swap is not useful
I think all of this is possible in k8s
Would using direct-Io API’s fix most of the fsync issues? If workloads pin their stuff to specific cores can we incite some of the overhead here? (Assuming we’re only running a single dedicated workload + kubelet on the node).
> You would either have to dedicate a whole NIC or SRI-OV-style virtual NIC to your database server
Tbh I’ve no idea we could do this with commodity cloud servers, nor do I know how, but I’m terribly interested in knowing how, do you know if there’s like a “dummy’s guide to better networking”? Haha
> kubelet is not optimized to get out of your way...Kubernetes sucks at managing memory-intensive processes
Definitely agree on both these issues, I’ve blown up the kubelet by overallocating memory before, which basically borked the node until some watchdog process kicked in. Sounds like the better solution here is a kubelet rebuilt to operate more efficiently and more predictably? Is the solution a db-optimised kubelet/K8s?
If you're not in control of the system, and thus kubelet, obviously your hands are tied. I'm not sure anyone is suggesting that for a serious workload.
Now to dispell your myths:
1. You can assign dedicated storage devices to your database. Outside of mount operations you're not going to see much alien fsync activity. This is paranoid.
2. You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones. There are a number of advanced techniques that are not at all necessary if you want to be a control freak, such as creating your own cgroups. This isn't "outside" of the runtime. Kubernetes is designed to conform to your managed cgroups. That's the whole point. RTFM.
3. The general theme of your complaint has nothing to do with kubernetes. There's no beating a dedicated NIC and even network fabric. Some cloud providers even allow you to multi-NIC out of the box so this is pretty solvable. Also, like, the dumbest QoS rules can drastically minimize this problem generally. Who cares.
4. Nah. RTFM. This is total FUD.
5.a. I don't understand. Are you sharing resources on the node or not? If you're not, then swap works fine. If you are, then this smells like cognitive dissonance and maybe listen to your own advice, but also swap is still very doable. It's just disk. swapon to your heart's content. But also swap is almost entirely dumb these days. Are you suggesting swapping to your primary IO device? Come on. More FUD.
5.b. OOM killer does what it wants. What's a better alternative that integrates "well" with the OOM killer? Do you even understand how resource limits work? The OOM killer is only ever a problem if you either do not configure your workload properly (true regardless of execution environment) or you run out of actual memory.
Bottom line: come down off your high horse and acknowledge that dedicated resources and kernel tuning is the secret to extreme high performance. I don't care how you're orchestrating your workloads, the best practices are essentially universal.
And to be clear, I'm not recommending using Kubernetes to run a high performance database but it's not really any worse (today) than alternatives.
> It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in an even of failure will come to bite you.
What planet are you currently on? This makes no sense. It's a set of abstractions and patterns, the intent isn't to hide the complexity but to make it manageable at scale. I'd argue it succeeds at that.
Seriously, what is the alternative runtime you'd prefer here? systemd? hand rolled bash scripts? puppet and ansible? All of the above??
This is word salad. Do you even know what fsync is for? I'm not even asking if you know how it works... What is "alien" fsync activity? Mount is perhaps the one system call that has nothing to do with fsync... so, I wouldn't expect any fsync activity when calling mount...
Finally, I didn't say that you cannot allocate a dedicated storage device -- what I said is that Kubernetes or Docker or Singularity or containerd or... well, none of container (management) runtimes that I've ever used know how to do it. You need external tools to do it. The point isn't that you cannot, the point is that a container runtime will only stand in your way when you try to do it.
> You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones.
No you cannot. Not through Kubernetes. You need to do this on the node that hosts kubelet.
And... I don't have the time or the patience necessary to answer to the rest of the nonsense. Bottom line: you don't understand what you are replying to, and arguing with something I either didn't say, or just stringing meaningless words together.
I do, though perhaps an ignorant life would be simpler. "Alien" is a word with a definition. Perhaps "foreign" is a better word. Forgive me for attempting to wield the English language.
No one well will use your fucking disk if you mount it exclusively in a pod. Does that make sense? You must be a joy to work with.
> The point isn't that you cannot, the point is that a container runtime will only stand in your way when you try to do it.
I have no idea what this means. How does kubernetes stand in your way?
> No you cannot. Not through Kubernetes. You need to do this on the node that hosts kubelet.
This is incorrect. You can absolutely configure the kubelet to reserve cores and offer exclusive cores to pods by setting a CPU management policy. I know because I was waiting for this for a very long time for all of the reason in the discussion here. It works fine.
You clearly have an axe to grind and it seems pretty obvious you're not willing to do the work to understand what you're complaining about. It might help to start by googling what a container runtime even is, but I'm not optimistic.
Our experience with OpenLDAP was that multi-writer concurrency cost too much overhead. Even though you may be writing primary records to independent regions of the DB, if you're indexing any of that data (which all real DBs do, for query perf) you wind up getting a lot of contention in the indices. That leads to row locking conflicts, txn rollbacks, and retries. With a single writer txn model, you never get conflicts, never need rollbacks.
This only works on systems with sufficiently slow storage. If your server has a bunch of NVMe, which is a pretty normal database config these days, you will be hard-pressed to get anywhere close to the theoretical throughput of the storage with a single writer. That requires 10+ GB/s sustained. It is a piece of cake with multiple writers and a good architecture.
Writes through indexing can be sustained at this rate (assuming appropriate data structures), most of the technical challenge is driving the network at the necessary rate in my experience.
http://www.lmdb.tech/bench/hyperdex/
RAM is relatively cheap too, there's no real reason to be running multi-TB databases at greater than a 50x ratio.
> The database size for one warehouse is approximately 100 MB (we experiment with five warehouses for a total size of 500MB).
It is not surprising that when your database basically fits in RAM, serializing on one writer is worth doing, because it just plainly reduces contention. You basically gain nothing in a DB engine from multi-writer transactions when this is the case. A large part of a write (the vast majority of write latency) in many systems with a large database comes from reading the index up to the point where you plan to write. If that tree is in RAM, there is no work here, and you instead incur overhead on consistency of that tree by having multiple writers.
I'm not suggesting that these results are useless. They are useful for people whose databases are small because they are meaningfully better than RocksDB/LevelDB which implicitly assume that your database is a *lot* bigger than RAM.
Where are you getting that assumption from? LevelDB was built to be used in Google Chrome, not for multi-TB DBs. RocksDB was optimized specifically for in-memory workloads.
Pretty much all storage libraries written in the past couple of decades are using single writer. Note that single writer doesn't mean single transaction. Merging transactions is easy and highly profitable, after all.
No "mainstream" database I'm aware of has a global single writer design.
RocksDB is Facebook's offshoot of LevelDB, basically keeping the core architecture of the storage engine (but multithreading it), and is used internally at Facebook as the backing store for many of their database systems. I have never heard from anyone that RocksDB was optimized for in-memory workloads at all, and I think most benchmarks can conclusively say the opposite: both of those DB engines are pretty bad for workloads that fit in memory.
Was a PITA to optimise though; tons of options and little insight into which ones work.
> Now that the total database is 50 times larger than RAM, around half of the key lookups will require a disk I/O.
That is an insanely high cache hit rate, which should have probably set off your "unrepresentative benchmark" detector. I am also a little surprised at the lack of a random writes benchmark. I get that this is marketing material, though.
Eh? This was 20% random writes, 80% random reads. LMDB is for read-heavy workloads.
> That is an insanely high cache hit rate, which should have probably set off your "unrepresentative benchmark" detector.
No, that is normal for a B+tree; the root page and most of the branch pages will always be in cache. This is why you can get excellent efficiency and performance from a DB without tuning to a specific workload.
The page says "updates," not "writes." Updates are a constrained form of write where you are writing to an existing key. Updates, importantly, do not affect your index structure, while writes do.
> No, that is normal for a B+tree; the root page and most of the branch pages will always be in cache. This is why you can get excellent efficiency and performance from a DB without tuning to a specific workload.
It is normal for a small B+tree relative to the memory size available on the machine. The "small" was the unrepresentative part of the benchmark, not the "B+tree."
OK, I see your point. It would only have made things even worse for LevelDB here to do an Add/Delete workload because its garbage compaction passes would have had to do a lot more work.
> It is normal for a small B+tree relative to the memory size available on the machine. The "small" was the unrepresentative part of the benchmark, not the "B+tree."
This was 100 million records, and a 5-level deep tree. To get to 6 levels deep it would be about 10 billion records. Most of the branch pages would still fit in RAM; most queries would require at most 1 more I/O than the 5-level case. The cost is still better than any other approach.