Completely right. This sounds like a communication failure. Maybe Linux maintainers should pick a few applications that have "priority support" and problems with these applications are also problems with Linux itself. Breaking Postgres is a serious regression.
Reminds me of a situation where Fedora couldn't be updated if you had Wine installed and one side of the argument was "user applications are user problem" while the other was "it's Wine, like come on".
So it's not going to affect everybody both running PostgreSQL and upgrading to the latest kernel. Conditions seems to be: arm64, shitloads of core, kernel 7.0, current version of PostgreSQL.
That is not going to be 100% of the installed PostgreSQL DBs out there in the wild when 7.0 lands in a few weeks.
Yes, Macs going ARM has been a huge boon, but I've also seen crazy regressions on AWS Graviton (compared to how its supposed to perform), on .NET (and node as well), which frankly I have no expertise or time digging into.
Which was the main reason we ultimately cancelled our migration.
I'm sure this is the same reason why its important to AWS.
If someone is running postgres in a serious backend environment, i doubt they are using Ubuntu or even touching 7.x for months (or years). It’ll be some flavor of Debian or Red Hat still on 6.x (maybe even 5?). Those same users won’t touch 7.x until there has been months of testing by distros.
At worst it might become a permanent part of building a PG server and a FAQ... but if it affects one thing this badly, it will affect others.
Now, the kernel engineer who introduced the brand new mechanism (introduced in Linux 7.0) for handling pre-emption says the "fix" is for Postgres to start using this new mechanism (I think the sister comment below links to what one of the Postgres engineers thinks of that, and I'm inclined mostly to agree).
> Hah. I had reflexively used huge_pages=on - as that is the only sane thing to do with 10s to 100s of GB of shared memory and thus part of all my benchmarking infrastructure - during the benchmark runs mentioned above.
> Turns out, if I disable huge pages, I actually can reproduce the contention that Salvatore reported (didn't see whether it's a regression for me though). Not anywhere close to the same degree, because the bottleneck for me is the writes.
But, they can speak for themselves here [0].
He may simply be waiting until more is known on exactly what’s causing it.
Indeed! Especially if said regression happens to impact anything trade/market related...
Doubtless someone will have to do the yelling.
If a user wants to spin in an infinite loop all day every day, I don't see the problem with that. Even if the spinning will provably never do any useful work.
"Enhanced and smarter parallelisation; initial benchmarks indicate up to 40% faster analytical queries".
[1] PostgreSQL 18 released: Key features & upgrade tips:
https://www.baremon.eu/postgresql-18-released-key-features-u...
Postgres uses spinlocks to hold shared memory for very critical processes. Spinlocks are an infinite loop with no sleep to attempt to hold a lock, thus "spinning". Previous kernels allowed spinlocking processes to run with PREEMPT_NONE. This flag tells the kernel to let the locking process complete their work before doing anything. Now the latest kernel removed this functionality and is interrupting spinlocking processes. So if a process that is holding a lock gets interrupted, all other postgres spinlocks processes that need the same lock spin in place for way longer times, leading to performance degradation.
In postgres, connections are handled with a process fork, not a new thread. If such a fork first reads memory, even if it already exists, that causes a minor page fault, which goes back to the kernel so it can update memory mapping tables.
The operation under lock is only a few instructions, but if it takes longer than expected, then that causes lock contention. Regression in the kernel handling minor faults?
The whole thing is then made worse because it's a spinlock, causing all waiting processes to contend over the cpus which adds to kernel processing.
Mitigated by using huge pages, which dramatically reduces the number of mapping entries and faults. I reckon that it could also be mitigated in postgres by pre-faulting all shared memory early?
While using huge pages whenever possible is the right solution and this should be enough for PostgreSQL, perhaps there are applications that cannot use huge pages and which are affected by the regression.
So I do not think that it is right to just ignore what happened.
It will be more interesting to talk about those applications if and when they are found. And I wouldn't assume the solutions are limited to reverting this change, starting to use the new spinlock time-slice extension mechanism, and enabling huge pages.
It sounds like using 4K pages with 100G of buffer cache was just the thing that made this spinlock's critical section become longer than PostgreSQL's developers had seen before. So when trying to apply the solution to some hypothetical other software that is suddenly benchmarking poorly, I'd generalize from "enable huge pages" to "look for other differences between your benchmark configuration and what the software's authors tested on".
Someone should be testing these things and reporting regressions
On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake). Turns out that that increase in memory barriers causes regressions that are nontrivial to avoid.
Another difficulty is that most of the remaining spinlocks are just a single bit in a 8 larger byte atomic. Futexes still don't support anything but 4 bytes (we could probably get away with using it on a part of the 8 byte atomic with some reordering) and unfortunately postgres still supports platforms with no 8 byte atomics (which I think is supremely silly), and the support for a fallback implementation makes it harder to use futexes.
The spinlock triggering the contention in the report was just stupid and we only recently got around to removing it, because it isn't used during normal operation.
Edit: forgot to add that the spinlock contention is not measurable on much more extreme workloads when using huge pages. A 100GB buffer pool with 4KB pages doesn't make much sense.
A quick hack shows the contended performance to be nearly indistinguishable with a futex based lock. Which makes sense, non-PI futexes don't transfer the scheduler slice the lock owner, because they don't know who the lock owner is. Postgres' spinlock use randomized exponential backoff, so they don't prevent the lock owner from getting scheduled.
Thus the contention is worse with PREEMPT_LAZY, even with non-PI futexes (which is what typical lock implementations are based on), because the lock holder gets scheduled out more often.
Probably worth repeating: This contention is due to an absurd configuration that should never be used in practice.
Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) an release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.
I think that, in common cache coherence protocols, this is kind of straightforward -- the unlock is a store-release, and as long as the cache line ends up being written locally, the hardware or ucode or whatever simply [1] needs to check whether a needs-notification flag is set in the same cacheline. Or the futex-wait operation needs to do a super-heavyweight barrier to synchronize with the releasing thread even though the releasing thread does not otherwise have any barrier that would do the job.
One nasty approach that might work is to use something like membarrier, but I'm guessing that membarrier is so outrageously expensive that this would be a huge performance loss.
But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.
I do expect that it would be nearly impossible to convince an x86 CPU vendor to commit to an answer either way.
(Do other architectures, e.g. the most recent ARM variants, have an RMW release operation that naturally does this? I've tried, and entirely failed AFAICT, to convince x86 HW designers to add lighter weight atomics.)
[0] Visible to the remote thread, but the kernel can easily mediate this, effectively for free.
[1] Famous last words. At least in ossified microarchitectures, nothing is simple.
This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?
Somebody tried a long time ago, it got dropped but I didn't actually see any major objection: https://lore.kernel.org/lkml/20070327110757.GY355@devserv.de...
Yeah, exactly. "Doctor, help, somebody replaced my wooden hammer with a metal one, and now I can't hit myself in the face with it as many times."
If you use spinlocks in userspace, you're gonna have a bad time.
The expectation is that the kernel should somehow detect applications that are spinning, and avoid preempting them early.
From the article: "Linux 7.0 stable is due out in about two weeks. This is also the kernel version powering Ubuntu 26.04 LTS to be released later in April."
Unfortunately, lots of people will be running it in less than a month. At the moment, it'll take a kernel patch (not a sysctl) to undo this-- hopefully something changes.
As someone with a heavy QA/Dev Opps background I don't think we have enough details.
Is it only ARM64 ? How many ARM64 PG DBs are running 96 cores?
However...
This is the most popular database in the world. Odds are this will effect a bunch of other lesser known applications.
``` $ grep PREEMPT_DYNAMIC /boot/config-$(uname -r) CONFIG_PREEMPT_DYNAMIC=y CONFIG_HAVE_PREEMPT_DYNAMIC=y CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y ```
if your kernel has CONFIG_PREEMPT_DYNAMIC then you can go back to the pre 7.0 default by adding preempt=none to your grub config. I haven't seen any plans by Ubuntu to drop CONFIG_PREEMPT_DYNAMIC from the default kernel config.
But other software won't and may not even be noticed, except as a (I hate using the term) enshittification.
Better to introduce the "correct way" in 7.0 but not regress the old (translate the "correct" into the old if necessary) - and then in 8.0 or some future release implement the regression.
Whereas if emerge is taking 5-10 minutes, you have to remember to come back to it, or script it.
I do not know why using huge pages mitigates the regression, but it could be just because when the application uses huge pages it uses spinlocks much less frequently so the additional delays do not accumulate enough to cause a significant performance reduction.
If your pages are 1GB instead of 4kB, this happens much less often.
Eventually you'd expect that something has to give.
Redis recommend disabling hugepages: https://redis.io/docs/latest/operate/oss_and_stack/managemen...
---
Actually, looks like they changed the log warning to be more specific, as it's just the "always" setting which seems to cause Redis grief?
I'd be curious to know if there's still a regression with hugepages turned on in older kernels.
If you are benchmarking something and the only changed variable between benchmarks is the kernel, that is useful information. Even if your environment isn't correctly setup.
ie Redis:
https://redis.io/docs/latest/operate/oss_and_stack/managemen...
> Maybe we should, but requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great.
... so that leaves me confused. My understanding is that the regression is triggered with the 7.0+ kernel and can be mitigated with huge pages turned on.
My question therefore was how come this regression hasn't been visible with huge pages turned off with older kernel versions? You say that it was but I can't find this data point.
With what we know so far, I expect that there are just about no real world workloads that aren't already completely falling over that will be affected.
Turns out to be pretty crucial for performance though... Not manipulating them with a single atomic leads to way way worse performance.
For quite a while it was a 32bit atomic, but I recently made it a 64bit one, to allow the content lock (i.e. protecting the buffer contents, rather than the buffer header) to be in the same atomic var. That's for one nice for performance, it's e.g. very common to release a pin and a lock at the same time and there are more fun perf things we can do in the future. But the real motivation was work on adding support for async writes - an exclusive locker might need to consume an IO completion for a write that's in flight that is prevent it from acquiring the lock. And that was hard to do with a separate content lock and buffer state...
> And there are like ten open coded spin waits around the uses... you certainly have my empathy :)
Well, nearly all of those are all to avoid needing to hold a spinlock, which, as lamented a lot around this issue, don't perform that well when really contended :)
We're on our way to barely ever need the spinlock for the buffer header, which then should allow us to get rid of many of those loops.
> This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?
It'd be pretty nice to have. There are lot of cases where one needs more lock state than one can really encode into a 32bit lock state.
I'm quite keen to experiment with the rseq time slice extension stuff. Think it'll help with some important locks (which are not spinlocks...).
I don't doubt it. I just meant nasty with respect to using futex() to sleep instead of spin, I was having some "fun" trying.
I can certainly see how pushing that state into one atomic would simplify things, I didn't really mean to question that.
> We're on our way to barely ever need the spinlock for the buffer header, which then should allow us to get rid of many of those loops.
I'm cheering you on, I hadn't looked at this code before and its been fun looking through some of the recent work on it.
> It'd be pretty nice to have. There are lot of cases where one needs more lock state than one can really encode into a 32bit lock state.
I've seen too much open coded spinning around 64-bit CAS in proprietary code, where it was a real demonstrable problem, and similar to here it was often not straightforward to avoid. I confess to some bias because of this experience ("not all spinlocks...") :)
I remember a lot of cases where FUTEX_WAIT64/FUTEX_WAKE64 would have been a drop-in solution, that seems compelling to me.
> [...] used huge_pages=on - as that is the only sane thing to do with 10s to 100s of GB of shared memory [...] if I disable huge pages, I actually can reproduce the contention [...]
So it looks like an edge case, as usually you need huge pages at scale ??
Of course not a nice regression but you should not run PostgreSQL on large servers without huge pages enabled so thud regression will only hurt people who have a bad configuration. That said I think these bad configurations are common out there, especially in containerized environments where the one running PostgreSQL may not have the ability to enable huge pages.
Bad because as of Splunk 10.x, Splunk bundles postgres to integrate with their SOAR platform. Parenthetically, this practice of bundling stuff with Splunk is making vuln remediation a real pain. Splunk bundles its own python, mongod, and now postgres, instead of doing dependency checking. They're going to have to keep doing it as long as they release a .tgz and not just an RPM. The most recent postgres vuln is not fixed in Splunk.
For example, Redis officially advised people to disable it due to a latency impact:
https://redis.io/docs/latest/operate/oss_and_stack/managemen...
Pretty sure Redis even outputs a warning to the logs upon startup when it detects hugepages are enabled.
Note that I'm not a Redis expert, I just remember this from when I ran it as a dependency for other software I was using.
> Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) an release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.
Hah.
> ... > But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.
I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release just is a store, whereas a futex release has to be a RMW operation.
I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line. Due to being a somewhat contended simple spinlock, often embedded on the same line as the to-be-protected data, it's common for the line not not be in modified ownership anymore at release.
> I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line.
I don’t think so. The CPU is pretty good about hiding that kind of latency — reading a contended cache line and doing a correctly predicted branch shouldn’t stall anything after it.
But LOCK and MFENCE are quite expensive.
Implementing locks does not need this kind of loops, which may greatly increase the overhead, but only loops that do simple loads, for detecting changes, or the invocation of a FUTEX_WAIT, which is equivalent with that.
Besides loops that wait for changes, any kind of lock may be implemented with atomic read-modify-write instructions (e.g. on x86 XCHG, LOCK XADD, LOCK BTS and so on, and equivalent instructions on Armv8.1-A or later ISAs) that are not used in loops, so they have predictable overhead. For example, a futex may be used by a thread that waits for multiple events, if the other threads use a locked bit-test-and-set on the futex variable to signal the occurrence of an event, where each event is assigned to a distinct bit.
CMPXCHG and the equivalent load-and-lock/store conditional are really needed far less often than some people use them. The culprit is a widely-quoted research paper that has shown that these instructions are more universal than simple atomic fetch-and-operation instructions, allowing the implementation of lock-free algorithms, but the fact that they can do more does not mean that they should also be used when their extra power is not necessary, because that is paid dearly by introducing non-deterministic overhead.
A simple atomic instruction has an overhead much greater than an access to the L1 cache or the L2 cache, but typically the overhead is similar to that of a simple access to the L3 cache and significantly lower than the overhead of a simple access to the main memory, which remains the most expensive operation in modern CPUs.
Moreover, while mutual exclusion can be implemented reasonably efficiently with locks, it is also used far more often than necessary. It is possible to implement shared buffers or message queues that use neither mutual exclusion nor optimistic access that may need to be retried (a.k.a. lock-free access), but instead of those they use dynamic partitioning of the shared resource, allowing concurrent accesses without interference.
Organizing the cooperation between threads around shared buffers/message queues is frequently much better than using mutual exclusion, which stalls all contending threads, serializing their execution, and also much better than lock-free access, which may need an unpredictable number of retries when contention is high.
When unlocking a futex-backed mutex, one needs to do two things. First, one needs to actually unlock it: this is a store-release in modern lingo, and on x86 almost any store instruction has the correct ordering semantics. Second, one needs to determine whether to call futex_wake, which is conceptually just reading a flag “is someone waiting” and then branching on the result. The problem is that the load needs to be ordered after (or at least not before) the store.
x86 provides two main ways to do this, MFENCE and LOCK. For whatever reason, at least Intel has tried pretty hard to optimize LOCK, and it’s often the case that LOCKed operations on a hot cache line is faster than MFENCE. (I have benchmarked this, and Linux uses this trick.)
My point is that the specific algorithm of unlocking a futex-backed mutex does not require the full ordering semantics of MFENCE or LOCK. And my secondary observation is that x86 has some non-LOCKed RMW instructions, one of which is plain CMPXCHG. Unlocked CMPXCHG is much faster than LOCK anything or MFENCE — I’ve benchmarked it. There are also the flag outputs from operations like ADD. And I’m speculating that maybe some of these instructions are secretly actually ordered strongly enough for futex unlock.
It gets a bit worse with preempt_lazy - for me just 15% percent or so - because the lock holder is scheduled out a bit more often. But it was bad before.
> My question therefore was how come this regression hasn't been visible with huge pages turned off with older kernel versions? You say that it was but I can't find this data point.
I mean it wasn't a regression before, because this is how it has behaved for a long time.
This workload is not a realistic thing that anybody would encounter in this form in the real world. Even without the contention - which only happens the first time the buffer pool is filled - you lose so much by not using huge pages with a 100gb buffer pool that you will have many other issues.
We (postgres and me personally) were concerned enough about potential contention in this path that we did get rid of that lock half a year ago (buffer replacement selection has been lock free for close to a decade, just unused buffers were found via a list protected by this lock).
But the performance gains we saw were relatively small, we didn't measure large buffer pools without huge pages though.
And at least I didn't test with this many connections doing small random reads into a cold buffer pool, just because it doesn't seem that interesting.
https://matklad.github.io/2020/01/04/mutexes-are-faster-than...
If I make something 1% faster on average, but now a random 0.000001% of its users see a ten-second stall every day, I lose.
It is tempting to think about it as a latency/throughput tradeoff. But it isn't that simple, the unbounded thrashing can be more like a crash in terms of impact to the system.
That's what you'd do if manually scheduling. Ideally the dynamic scheduler would do that on its own.
The other problem with spin-wait is that it overshoots, especially with an increasing backoff. Part of the overhead of sleeping is paid back by being woken up immediately.
When it's made to work, the backoff is often "overfit" in that very slight random differences in kernel scheduler behavior can cause huge apparent regressions.
Surely they would be testing the configuration(s) that they use in production? They’re not running RDS without hugepages turned on, right?
I'd guess they have dozens of people across say a Linux kernel team, a Graviton hardware integration team, an EC2 team, and a Amazon RDS for PostgreSQL team who might at one point or another run a benchmark like this. They probably coordinate to an extent, but not so much that only one person would ever run this test. So yes it is duplicative. And they're likely intending to test the configurations they use in production, yes, but people just make mistakes.
That said, Ubuntu in large production fleets isn't too bad. Sure, other distros are better, but Ubuntu's perfectly serviceable in that role. It needs talented SRE staff making sure automation, release engineering, monitoring, and de/provisioning behave well, but that's true of any you-run-the-underlying-VM large cloud deployment.
But the reality:
a) may get irreversible upgrades (e.g. new underlying database structure)
b) permanent worse performance / regression (e.g. iOS 26)
c) added instability
d) new security issues (litellm)
e) time wasted migrating / debugging
f) may need rewrite of consumers / users of APIs / sys calls
g) potential new IP or licensing issues
etc.A couple of the few reasons to upgrade something is:
a) new features provide genuine comfort or performance upgrade (or... some revert)
b) there is an extremely critical security issue
c) you do not care about stability because reverting is uneventful and production impact is nil (e.g. Claude Code)
but 99% of the time, if ain't broke, don't fix it.https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_ou...
Even if the vulnerability itself is discovered through other means than by an LLM, it's trivial to ask a SOTA model to "monitor all new commits to project X and decide which ones are likely patching an exploitable vulnerability, and then write a PoC." That's a lot easier than finding the vulnerable itself.
I won't be surprised if update windows (for open source networked services) shrink to ~10 minutes within a year or two. It's going to be a brutal world.
The idea of advanced testing of new versions of software (that they’ll be forced to use eventually) never seems to occur, or they spend so much time fighting fires they never get around to it.
ymmv, but in my experience projects like postgresql which have been reliable, tend to continue to be so.
While that's true, for new deployments the story is often "deploy on the latest release of things available at the time".
So, there will probably be a substantial deployment of new projects / testing projects using the Linux 7.0 kernel along with the latest available software packages in a few weeks.
In the Linux world this is the worst possible scenario, distro with the largest adoption, LTS.
Here's how they announced 24.04.0. It says LTS and doesn't mention anything about LTS coming in the .1 release: https://canonical.com/blog/canonical-releases-ubuntu-24-04-n...
https://ubuntu.com/about/release-cycle
We're just now looking at moving production machines to 24.04.
If it aint broken, don't fix it.
* explicit huge pages
* transparent huge pages system-wide default
* app-specific or even mapping-specific toggles
* various memory allocator settings to raise its effectiveness
It would be really surprising to me to see a workload for which it's optimal to not use huge pages anywhere on the system.
When you deploy software all around the globe and not only on your servers that you fully control this becomes problematic. Even in the latter case it is frowned upon by admins/teams if you can't prove the benefit.
Yes, there are workloads where huge-pages do not bring any measurable benefit, I don't understand why would that be questionable? Even if they don't bring the runtime performance down, which they could, extra work and complexity they incur is in a sense not optimal when compared to the baseline of not using huge-pages.
I really doubt it, except of course workloads where you just use a trivial amount of memory to begin with. In systems I've seen, anywhere from 5% to 15% of the CPU time is spent waiting for TLB misses. It's obvious then that huge pages can be hugely beneficial if properly used; by definition they hugely relieve TLB pressure.
You can of course end up in situations where transparent TLB scanning is worse than nothing, but that's exactly why I pointed out there's a variety of ways to use huge pages.