Are You Sure You Want to Use MMAP in Your Database Management System? (2022) (db.cs.cmu.edu)

192 points by nethunters 3 years ago | 177 comments

hyc_symas 3 years ago |

This is a pretty old argument and IMO it's far out of date/obsolete.

Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user level process has accurate information about the total state of the machine, only the kernel (or hypervisor) does and it's an exercise in futility to try to manage paging etc at the user level.

As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. https://slideshot.epfl.ch/play/suri_stonebraker (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.

Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: perfect crash-proof reliability, and perfect linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits into a CPU's 32KB L1 instruction cache.

gavinray 3 years ago | |

Out of curiosity, how many databases have you written?

This is co-authored by Pavlo and Viktor Leis, with feedback from Neumann. I'm sorry, but if someone on the internet claims to know better than those 3, you're going to need some monumental evidence of your credibility.

Additionally, what you link here:

  > ... (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.

Is discussing "Main Memory" databases. These databases do no I/O outside of potential initial reads, because all of the data fits in-memory!

These databases represent a small portion of contemporary DBMS usage when compared to traditional RDBMS.

All you have to do is look at the bandwidth and reads/sec from the paper when using O_DIRECT "pread()"s versus mmap'ed IO.

LAC-Tech 3 years ago | | |

This is a classic appeal to authority. Let's play the argument, not the man.

(My understanding is that the GP wrote LMDB, works on OpenLDAP, and was a maintainer for BerkeleyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments).

ilyt 3 years ago | | |

Out of curiosity, do you have anything actually useful to add, or are you just throwing around appeals to authority because you don't?

Mikhail_Edoshin 3 years ago | | |

Even though the data resides mostly in-memory, they still have to write transactions to disk to preserve them, don't they?

crabbone 3 years ago | |

> your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.

There's nothing special about kernel programmers. In fact, if I had to compare, I'd go with storage people being the more experienced / knowledgeable ones. They have a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are, largely, in the same group as storage (oftentimes literally the same people).

As for "single process" argument... well, if you run a database inside an OS, then, obviously, that will never happen as OS has its own processes to run. But, if you ignore that -- no DBA worth their salt would put database in the environment where it has to share resources with applications. People who do that are, probably, Web developers who don't have high expectations from their database anyways and would have no idea how to configure / tune it for high performance, so, it doesn't matter how they run it, they aren't the target audience -- they are light years behind on what's possible to achieve with their resources.

This has nothing to do with mmap though. mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.

hyc_symas 3 years ago | | |

> There's nothing special about kernel programmers.

Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.

> no DBA worth their salt would put database in the environment where it has to share resources with applications.

Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases - it's all about "edge computing". There's far more DBs in use in the world than there are DBAs managing them.

> mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.

Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
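
A minimal sketch of the split described above (a read-only mapping for reads, ordinary `pwrite` syscalls for writes). This is not LMDB's code; the function name and the one-page file are hypothetical, and it assumes a Linux-like unified page cache where a shared file mapping observes data written through the file descriptor.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def demo_readonly_map_with_pwrite() -> bytes:
    """Write through the fd, read through a read-only mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, PAGE)  # an empty file cannot be mapped
        # ACCESS_READ gives a shared, read-only view: stray stores
        # through the mapping cannot corrupt the file.
        view = mmap.mmap(fd, PAGE, access=mmap.ACCESS_READ)
        try:
            os.pwrite(fd, b"hello", 0)  # explicit write syscall
            os.fsync(fd)                # durability point chosen by the application
            return bytes(view[:5])      # read served from the mapping, no read() call
        finally:
            view.close()
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(demo_readonly_map_with_pwrite())
```

The point of the design is that durability is decided by the explicit `fsync`, while readers never issue a syscall per access.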

sakras 3 years ago | |

Can you comment on what the paper gets wrong? It says that scalability with mmap is poor due to page table contention and others. How does LMDB manage to scale well with mmap? Is page table contention just not an issue in practice?

tadfisher 3 years ago | |

Maybe someone should pull LMDB's mmap/paging system into a usable library. I'd love to use the k/v store part of course, but I keep hitting the default key size limitation and would prefer not to link statically.

hyc_symas 3 years ago | | |

It wouldn't be much use without the B+tree as well; it's the B+tree's cache friendliness that allows applications to run so efficiently without the OS knowing any specifics of the app's usage patterns.

ori_b 3 years ago | |

Do you have benchmarks of lmdb when the working set is much larger than memory? I couldn't find any.

In my experience -- and in line with the article -- mmap works fine with small working sets. It seems that most benchmarks of lmdb have relatively small data sets.

hyc_symas 3 years ago | | |

> Do you have benchmarks of lmdb when the working set is much larger than memory? I couldn't find any.

Where did you look? This is a sample using DB 5x and 50x larger than RAM http://www.lmdb.tech/bench/hyperdex/

There are plenty of other larger-than-RAM benchmarks there.

jerrygenser 3 years ago | |

> Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.

The article is about DBMS developers. For DBMS developers, "in practice" (a) and (b) are usually true I think.

danappelxx 3 years ago | |

Who is deploying databases in containers?

orbz 3 years ago | | |

A disturbingly large number of deployments I’ve seen using Kubernetes or docker compose have databases deployed as such.

crabbone 3 years ago | | |

Nobody who matters.

Those who do that don't know what they are doing (even if they outnumber the other side hundred to one, they "don't count" because they aren't aiming for good performance anyways).

Well, maybe not quite... of course it's possible that someone would want to deploy a database in a container because of the convenience of assembling all dependencies in a single "package", however, they would never run database on the same node as applications -- that's insanity.

But, even the idea of deploying a database alongside something like kubelet service is cringe... This service is very "fat" and can spike in memory / CPU usage. I would be very strongly opposed to an idea of running a database on the same VM that runs Kubernetes or any container runtime that requires a service to run it.

Obviously, it says nothing about the number of processes that will run on the database node. At the minimum, you'd want to run some stuff for monitoring, that's beside all the system services... but I don't think GP meant "one process" literally. That is neither realistic nor necessary.

morelisp 3 years ago | | |

I'm running prod databases in containers so the server infra team doesn't have to know anything about how that specific database works or how to upgrade it, they just need to know how to issue generic container start/stop commands if they want to do some maintenance.

(But just in containers, not in Kubernetes. I'm not crazy.)

didip 3 years ago | | |

My group and a bunch of my peer groups.

And we are running them at the scale that most people can’t even imagine.

huahaiy 3 years ago | | |

Embedded DB

jandrewrogers 3 years ago |

Another interesting limitation of mmap() is that real-world storage volumes can exceed the virtual address space a CPU can address. A 64-bit CPU may have 64-bit pointers but typically cannot address anywhere close to 64 bits of memory, virtually or physically. A normal buffer pool does not have this limitation. You can get EC2 instances on AWS with more direct-attached storage than addressable virtual address space on the local microarchitecture.

glandium 3 years ago | |

To put concrete numbers: x86-64 is limited to 48 bits for virtual addresses, which is "only" 256TiB (281TB).

hyc_symas 3 years ago | | |

All of that is true, but I don't think it's a realistic concern. You're going to be sharding your data across multiple nodes before it gets that large. Nobody wants to sit around backing up or restoring a monolithic 256 TiB database.

Svetlitski 3 years ago | | |

Starting with Ice Lake there’s support for 5-level paging, which increases this to 128 PiB. Can’t say that I’ve ever seen this used in the wild though.
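
The figures quoted in this subthread follow directly from the number of virtual address bits. A quick check, with hypothetical helper and constant names:

```python
TIB = 2 ** 40   # tebibyte
PIB = 2 ** 50   # pebibyte
TB = 10 ** 12   # decimal terabyte


def addressable_bytes(virtual_address_bits: int) -> int:
    """Size of a virtual address space with the given number of address bits."""
    return 2 ** virtual_address_bits


four_level = addressable_bytes(48)  # 4-level paging on x86-64
five_level = addressable_bytes(57)  # 5-level paging

print(four_level // TIB)  # 256 (TiB)
print(four_level // TB)   # 281 (TB)
print(five_level // PIB)  # 128 (PiB)
```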

stevefan1999 3 years ago | | |

Intel has now extended the page table to 5 levels, making this number not so valid. Granted, PL5 creates more TLB pressure and longer memory access time due to that.

pjdesno 3 years ago |

Not just databases - we ran into the same issues when we needed a high-performance caching HTTP reverse proxy for a research project. We were just going to drop in Varnish, which is mmap-based, but performance sucked and we had to write our own.

Note that Varnish dates to 2006, in the days of hard disk drives, SCSI, and 2-core server CPUs. Mmap might well have been as good as, or even better than, explicit I/O back then - a lot of the issues discussed in this paper (TLB shootdown overhead, single flush thread) get much worse as the core count increases.

Sesse__ 3 years ago | |

Varnish's design wasn't very fast even for 2006-era hardware. It _was_ fast compared to Squid, though (which was the only real competitor at the time), and most importantly, much more flexible for the origin server case. But it came from a culture of “the FreeBSD kernel is so awesome that the best thing userspace can do is to offload as many decisions as humanly possible to the kernel”, which caused, well, suboptimal performance.

AFAIK the persistent backend was dropped pretty early on (eventually replaced with a more traditional read()/write()-based one as part of Varnish Plus), and the general recommendation became just to use malloc and hope you didn't swap.

tayo42 3 years ago | |

Varnish has a file system backed cache that depends on the page cache to keep it fast.

What did you do differently in your custom one that was faster than Varnish?

pjdesno 3 years ago | | |

Simple multithreaded read/write. On a 20-core 40-thread machine with a couple of fast NVMe drives it was way faster.

wood_spirit 3 years ago |

Old timers will recall when using mmap was a prominently promoted selling point for the “no sql” dbms.

ren_engineer 3 years ago | |

seems like all databases are moving towards the middle. Postgres has JSON support, MongoDB has transactions and also a columnar extension for OLAP type data. NoSQL seems almost meaningless as a term now. Feels like a move towards a winner takes all multi-modal database that can work with most types of data fairly well. Postgres with all of its specialized extensions seems like it will be the most popular choice. The convenience of not having to manage multiple databases is hard to beat unless performance is exponentially better; Postgres with these extensions can probably be "good enough" for a lot of companies

reminds me of how industries typically start out dominated by vertically integrated companies, move to specialized horizontal companies, then generally move back to vertical integration due to efficiency. Car industry started this way with Ford, went away from it, and now Tesla is doing it again. Lots of other examples in other industries

TheGeminon 3 years ago | | |

The pendulum swing is common in any system, and is a really effective mechanism for evaluation.

You almost always want somewhere in the middle, but it’s often much easier to move back after a large jump in one direction than to push towards the middle.

nemo44x 3 years ago | |

For documents it made access fast since there’s no joins, etc. that require paging from all over. The problem ended up being updates and compaction issues.

wood_spirit 3 years ago | | |

My memory is that the problem was ACID. The document stores didn’t promise to be reliable because apparently that didn’t scale.

And there was a very well known cartoon video discussion about it with “web scale” and “just write to dev null” and other classics that became memes :)

dang 3 years ago |

Are You Sure You Want to Use MMAP in Your Database Management System? [pdf] - https://news.ycombinator.com/item?id=31504052 - May 2022 (43 comments)

Are you sure you want to use MMAP in your database management system? [pdf] - https://news.ycombinator.com/item?id=29936104 - Jan 2022 (127 comments)

dist1ll 3 years ago |

Many general-purpose OS abstractions start leaking when you're working on systems-like software.

You notice it when web servers are doing kernel bypass for zero-copy, low-latency networking, or database engines throw away the kernel's page cache to implement their own file buffer.

kentonv 3 years ago | |

Yes. I think mmap() is misunderstood as being an advanced tool for systems hackers, but it's actually the opposite: it's a tool to make application code simpler by leaving the systems stuff to the kernel.

With mmap, you get to avoid thinking about how much data to buffer at once, caching data to speed up repeated access, or shedding that cache when memory pressure is high. The kernel does all that. It may not do it in the absolute ideal way for your program but the benefit is you don't have to think about these logistics.
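
To illustrate the "simpler application code" point, here is a small sketch of random access into a file of fixed-size records through a mapping. The record layout and function names are hypothetical; there is no buffering or caching logic in the application because the kernel's page cache handles it.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: two little-endian 64-bit integers (key, value).
RECORD = struct.Struct("<QQ")


def write_records(path: str, count: int) -> None:
    """Create a file holding `count` records of the form (i, i*i)."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(RECORD.pack(i, i * i))


def lookup(path: str, index: int) -> tuple:
    """Fetch one record by position; only the touched page is faulted in."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return RECORD.unpack_from(m, index * RECORD.size)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_records(path, 1000)
        print(lookup(path, 7))
    finally:
        os.unlink(path)
```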

But if you're already writing intense systems code then you can probably do a better job than the kernel by optimizing for your use case.

arter4 3 years ago | |

Web servers doing kernel bypass for zero-copy networking? Do you have a specific example in mind? I'm curious.

dist1ll 3 years ago | | |

The most common example is DPDK [1]. It's a framework for building bespoke networking stacks that are usable from userspace, without involving the kernel.

You'll find DPDK mentioned a lot in the networking/HPC/data center literature. An example of a backend framework that uses DPDK is the seastar framework [2]. Also, I recently stumbled upon a paper for efficient RPC networks in data centers [3].

If you want to learn more, the p99 conference has tons of speakers talking about some interesting challenges in that space.

[1] https://www.dpdk.org/

[2] https://github.com/scylladb/seastar

[3] https://github.com/erpc-io/eRPC

kentonv 3 years ago | | |

Probably the most common example is sendfile() for writing file contents out to a socket without reading them into userspace:

https://man7.org/linux/man-pages/man2/sendfile.2.html
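
A rough sketch of that pattern using Python's `os.sendfile` wrapper, assuming Linux. The helper name is hypothetical, and a local socket pair stands in for a real client connection.

```python
import os
import socket
import tempfile


def send_file_over_socket(path: str, sock: socket.socket) -> int:
    """Copy a file to a socket in the kernel, without a userspace buffer."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # sendfile may transfer fewer bytes than requested, so loop.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break
            sent += n
    return sent


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.write(fd, b"hello world")
    os.close(fd)
    a, b = socket.socketpair()  # stand-in for a client connection
    try:
        print(send_file_over_socket(path, a), b.recv(1024))
    finally:
        a.close()
        b.close()
        os.unlink(path)
```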

kwohlfahrt 3 years ago |

It sounds like a lot of the performance issues are TLB-related. Am I right in thinking huge-pages would help here? If so, it's a bit unfortunate they didn't test this in the paper.

Edit: Hm, it might not be possible to mmap files with huge-pages. This LWN article[1] from 5 years ago talks about the work that would be required, but I haven't seen any follow-ups.

[1]: https://lwn.net/Articles/718102/

hyc_symas 3 years ago | |

Huge pages aren't pageable, so they wouldn't be particularly advantageous for a mmap DB anyway, you'd have to do traditional I/O & buffer management for everything.

ori_b 3 years ago | |

No, huge pages wouldn't help. They would change when the TLB gets flushed, but the flushes would still be there.

Dwedit 3 years ago |

Memory-Mapped Files = access violations when a disk read fails. If you're not prepared to handle those, don't use memory-mapped files. (Access violation exceptions are the same thing that happens when you attempt to read a null pointer)

Then there's the part with writes being delayed. Be prepared to deal with blocks not necessarily updating to disk in the order they were written to, and 10 seconds after the fact. This can make power failures cause inconsistencies.
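
One way to impose an order on a writable mapping is to sync the data before writing the marker that makes it valid. The sketch below assumes a hypothetical two-page layout (page 0 holds a commit marker, page 1 holds the data) and relies on `mmap.flush`, which wraps `msync` on Unix. It only guarantees that the data reaches the file before the marker does; it cannot stop the kernel from writing a dirty page back earlier than asked.

```python
import mmap
import os
import tempfile

# flush() offsets must be a multiple of the allocation granularity.
PAGE = mmap.ALLOCATIONGRANULARITY


def commit(m: mmap.mmap, payload: bytes) -> None:
    """Write payload to page 1, sync it, then publish its length in page 0."""
    assert len(payload) <= PAGE
    m[PAGE:PAGE + len(payload)] = payload
    m.flush(PAGE, PAGE)  # data page is synced first
    m[0:8] = len(payload).to_bytes(8, "little")
    m.flush(0, PAGE)     # marker is synced only after the data


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, 2 * PAGE)
        with mmap.mmap(fd, 2 * PAGE) as m:
            commit(m, b"record-1")
        length = int.from_bytes(os.pread(fd, 8, 0), "little")
        print(os.pread(fd, length, PAGE))
    finally:
        os.close(fd)
        os.unlink(path)
```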

mpweiher 3 years ago |

Yes, I definitely would want to use mmap() in my storage system. And would love to see the limitations that make this tricky addressed.

zffr 3 years ago |

The TLDR is that MMAP sorta does what you want, but DBMSes need more control over how/when data is paged in/out of memory. Without this extra control, there can be issues with transactional safety, and performance.
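
For contrast, here is a toy version of the alternative: a user-space buffer pool in which the application, not the kernel, decides what is cached and exactly when a dirty page is written back. The class, its LRU policy, and the page size are all assumptions for illustration; real engines add pinning, latching, and a write-ahead log.

```python
import os
import tempfile
from collections import OrderedDict

PAGE_SIZE = 4096


class BufferPool:
    """Tiny LRU page cache over a file descriptor (illustrative only)."""

    def __init__(self, fd: int, capacity: int) -> None:
        self.fd = fd
        self.capacity = capacity
        self.frames = OrderedDict()  # page_no -> [bytearray, dirty flag]

    def _evict(self) -> None:
        # Oldest entry is least recently used; write it back only if dirty.
        page_no, (buf, dirty) = self.frames.popitem(last=False)
        if dirty:
            os.pwrite(self.fd, buf, page_no * PAGE_SIZE)
            os.fsync(self.fd)

    def get(self, page_no: int) -> bytearray:
        if page_no in self.frames:
            self.frames.move_to_end(page_no)  # mark as most recently used
        else:
            if len(self.frames) >= self.capacity:
                self._evict()
            data = os.pread(self.fd, PAGE_SIZE, page_no * PAGE_SIZE)
            # Pages past end-of-file read as zeros.
            self.frames[page_no] = [bytearray(data.ljust(PAGE_SIZE, b"\0")), False]
        return self.frames[page_no][0]

    def mark_dirty(self, page_no: int) -> None:
        self.frames[page_no][1] = True


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        pool = BufferPool(fd, capacity=2)
        pool.get(0)[:5] = b"hello"
        pool.mark_dirty(0)
        pool.get(1)
        pool.get(2)  # pool is full: page 0 is evicted and written back
        print(os.pread(fd, 5, 0))
    finally:
        os.close(fd)
        os.unlink(path)
```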

benlivengood 3 years ago |

For all of its usefulness in the good old days of rusty disks I wonder if virtual memory is worth having for dedicated databases, caches, and storage heads. Avoiding TLB flushes entirely sounds like a huge win for massively multithreaded software and memory management in a large shared flat address space doesn't sound impossibly hard.

jasonhansel 3 years ago |

I've become convinced that there are very few, if any, reasons to MMAP a file on disk. It seems to simplify things in the common case, but in the end it adds a massive amount of unnecessary complexity.

AnotherGoodName 3 years ago |

A well written bespoke function can beat a generalized function at a specific task.

If you have the resources to write and maintain the bespoke method great. The large database developers probably have this. For others please don't take this link and go around claiming mmap is bad though. That gets tiresome and is misguided. Mmap is a shortcut to access large files in a non linear fashion. It's good at that too. Just not as good as a bespoke function.

dist1ll 3 years ago | |

This paper isn't aimed at random developers, and it's not a criticism of mmap in general.

This is an appeal to core database engineers to stop using the wrong tool for the job.

formerly_proven 3 years ago | |

mmap can be handy but usually is not a good idea when you care about ACID properties. So it tends to be most useful outside databases.

josephg 3 years ago | | |

Can you give some examples where mmap is useful?

SoftTalker 3 years ago |

This reads more like "don't write your own DBMS" than "don't use mmap."

jFriedensreich 3 years ago |

maybe a stupid question but what is wrong with coffee and spicy food?

orf 3 years ago | |

For the majority of the world, nothing. But if your diet consists of fairly bland food then it can result in unpleasant trips to the toilet.

mattnewton 3 years ago | | |

Acid reflux I thought

pizza 3 years ago | |

to put it crudely I think the punchline is the spicy food hurts on the way out, and the coffee makes that happen with greater velocity

toxik 3 years ago | |

Just doesn’t taste good together I think