AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy

(phoronix.com)

410 points by crcastle 89 days ago | 165 comments

https://lore.kernel.org/lkml/yr3inlzesdb45n6i6lpbimwr7b25kqk...

lfittl 89 days ago |

It's worth reading this follow-up LKML post by Andres Freund (who works on Postgres): https://lore.kernel.org/lkml/yr3inlzesdb45n6i6lpbimwr7b25kqk...

aftbit 89 days ago | |

>If this somehow does end up being a reproducible performance issue (I still suspect something more complicated is going on), I don't see how userspace could be expected to mitigate a substantial perf regression in 7.0 that can only be mitigated by a default-off non-trivial functionality also introduced in 7.0.

cr125rider 88 days ago | | |

They said the magic words to get Linus to start flipping tables. Never break userspace. Unusably slow is broken.

anal_reactor 88 days ago | |

> Maybe we should, but requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great.

Completely right. This sounds like a communication failure. Maybe Linux maintainers should pick a few applications that have "priority support" and problems with these applications are also problems with Linux itself. Breaking Postgres is a serious regression.

Reminds me of a situation where Fedora couldn't be updated if you had Wine installed and one side of the argument was "user applications are user problem" while the other was "it's Wine, like come on".

falcor84 88 days ago | | |

I for one liked the old and simple WE DO NOT BREAK USERSPACE attitude.

https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE

jeffbee 89 days ago | |

Funny how "use hugepages" is right there on the table and 99% of users ignore it.

bombcar 89 days ago | | |

I’m absolutely flabbergasted by the performance left on the table; even by myself - just yesterday I learned Gentoo’s emerge can use git and be a billion times faster.

justinclift 89 days ago | |

Note that it's not just a single post; there's further information if you follow the full thread. :)

adrian_b 88 days ago | | |

Yes, and in the following messages the conclusion was that the regression is mitigated when using huge pages.

TacticalCoder 89 days ago | |

AIUI in that thread they're saying "0.51x" the perf on a 96-core arm64 machine and they're also saying they cannot reproduce it on a 96-core amd64 machine.

So it's not going to affect everybody who is both running PostgreSQL and upgrading to the latest kernel. Conditions seem to be: arm64, shitloads of cores, kernel 7.0, current version of PostgreSQL.

That is not going to be 100% of the installed PostgreSQL DBs out there in the wild when 7.0 lands in a few weeks.

torginus 88 days ago | | |

It's a huge issue with ARM-based systems that hardly anyone uses or tests things on them (in production).

Yes, Macs going ARM has been a huge boon, but I've also seen crazy regressions on AWS Graviton (compared to how it's supposed to perform), on .NET (and node as well), which frankly I have no expertise or time to dig into.

Which was the main reason we ultimately cancelled our migration.

I'm sure this is the same reason why it's important to AWS.

zamalek 89 days ago | | |

It was later reproduced on the same machine without huge pages enabled. PICNIC?

MBCook 89 days ago | | |

So perhaps this is a regression specifically in the arm64 code, or said differently maybe it’s a performance bug that has been there for a long time but covered up by the scheduler part that was removed?

master_crab 89 days ago | | |

For production Postgres, I would assume it has close to no effect?

If someone is running Postgres in a serious backend environment, I doubt they are using Ubuntu or even touching 7.x for months (or years). It’ll be some flavor of Debian or Red Hat still on 6.x (maybe even 5?). Those same users won’t touch 7.x until there has been months of testing by distros.

fxtentacle 88 days ago | |

... which confirms all of my stereotypes. Looks like the AWS engineer who reported it used an m8g.24xlarge instance with 384 GB of RAM, but somehow didn't know or care to enable huge pages. And once they are enabled, the performance regression disappears.
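
For anyone wanting to check their own box: a minimal sketch, assuming a Linux host, of reading the huge-page counters the kernel exposes in /proc/meminfo. The field names `HugePages_Total`, `HugePages_Free` and `Hugepagesize` are the standard ones; the function names `parse_hugepage_info` and `explicit_huge_pages_reserved` are made up for this illustration, and this says nothing about whether PostgreSQL itself was started with `huge_pages=on`.

```python
from pathlib import Path


def parse_hugepage_info(meminfo_text: str) -> dict:
    """Extract huge-page counters from text in /proc/meminfo format."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.strip() in wanted:
            # Values look like "       0" or "    2048 kB"; keep the number only.
            info[key.strip()] = int(rest.split()[0])
    return info


def explicit_huge_pages_reserved(info: dict) -> bool:
    """True if the kernel has a non-empty pool of explicit huge pages."""
    return info.get("HugePages_Total", 0) > 0


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        info = parse_hugepage_info(meminfo.read_text())
        print(info, "reserved:", explicit_huge_pages_reserved(info))
```

An empty pool (`HugePages_Total` of 0) only means no explicit huge pages were reserved; transparent huge pages are tracked separately and are not covered by this sketch.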

bushbaba 88 days ago | | |

Because such settings aren’t obvious to those not familiar with them. LLMs should make discoverability easier though

monocasa 89 days ago |

I feel like using spinlocks in user space at all without kernel support like rseq is just asking for weird performance degradations.

dsr_ 89 days ago |

Nobody sensible runs the latest kernel; nobody running PG in production should be afraid of setting a non-default at either boot time or as a sysctl. So this will, most likely, be another step in building a PG database server (turn off pre-emption if your kernel is 7.0 or later and PG is pre-whatever-version).

At worst it might become a permanent part of building a PG server and a FAQ... but if it affects one thing this badly, it will affect others.

harshreality 89 days ago |

Background on PREEMPT_LAZY:

https://lwn.net/Articles/994322/

longislandguido 89 days ago |

Anyone check to see if Jia Tan has submitted any kernel patches lately?

rs_rs_rs_rs_rs 88 days ago | |

They don't need to, there's about a billion bugs they can exploit.

FireBeyond 89 days ago |

Once upon a time, Linus would shout and yell about how the kernel should never "break" userspace (and I see in some places, some arguments of "It's not broken, it's just a performance regression" - personally I'd argue a 50% hit to performance of a pre-eminent database engine is ... quite the regression).

Now, the kernel engineer who introduced the brand new mechanism (introduced in Linux 7.0) for handling pre-emption says the "fix" is for Postgres to start using this new mechanism (I think the sister comment below links to what one of the Postgres engineers thinks of that, and I'm inclined mostly to agree).

shakna 88 days ago | |

Freund seems to suggest that hugepages is the right way to run a system under this sort of load - which is the fix.

> Hah. I had reflexively used huge_pages=on - as that is the only sane thing to do with 10s to 100s of GB of shared memory and thus part of all my benchmarking infrastructure - during the benchmark runs mentioned above.

> Turns out, if I disable huge pages, I actually can reproduce the contention that Salvatore reported (didn't see whether it's a regression for me though). Not anywhere close to the same degree, because the bottleneck for me is the writes.

But, they can speak for themselves here [0].

[0] https://news.ycombinator.com/item?id=47646332

perching_aix 89 days ago | |

Entertaining perspective - I thought that the whole "it's not an outage it's a (horizontal or vertical) degradation" thing was exclusive to web services, but thinking about it, I guess it does apply even in cases like this.

MBCook 89 days ago | |

It wouldn’t be the first time one of the other maintainers ran afoul of “Linus’s law”.

He may simply be waiting until more is known on exactly what’s causing it.

bear8642 89 days ago | |

> I'd argue a 50% hit to performance [...] is ... quite the regression

Indeed! Especially if said regression happens to impact anything trade/market related...

quietsegfault 89 days ago | |

This was my immediate thought - kernel doesn’t break software, or at least it didn’t use to.

arjie 89 days ago | |

Well, the reason he'd yell about it is that someone did it. If no one ever did it, he'd never yell and we'd never have the rule. So one can only imagine that this is one of those things where someone has to keep holding the line rather than one of those things where you set some rule and it self-holds.

Doubtless someone will have to do the yelling.

cperciva 89 days ago |

This makes me feel better about the 10% performance regression I just measured between FreeBSD 14 and FreeBSD 15.0.

db48x 89 days ago | |

Heh. Did they at least add useful features to balance out that cost?

cperciva 88 days ago | | |

FreeBSD 15 has lots of useful features! And better performance on other benchmarks; I just need to track down what's going wrong with this particular one.

cdelsolar 89 days ago |

https://lkml.org/lkml/2012/12/23/75

bob1029 88 days ago |

I'm struggling a bit with why we need all these fancy dynamic preemption modes. Is this about hyperscalers shoving more VMs per physical machine? What does a person trying to host a software solution gain from this kernel change?

If a user wants to spin in an infinite loop all day every day, I don't see the problem with that. Even if the spinning will provably never do any useful work.

ponco 88 days ago | |

More throughput WITHOUT huge tail latency is my understanding. A user above posted this link https://lwn.net/Articles/994322/ which goes into the background. My mental model is "give the kernel more explicit information" and it will be able to make better decisions.

teleforce 88 days ago |

Does the PostgreSQL 18 performance increase from the new asynchronous I/O and smarter query planning with improved parallelism kind of offset this performance hit? [1]

"Enhanced and smarter parallelisation; initial benchmarks indicate up to 40% faster analytical queries".

[1] PostgreSQL 18 released: Key features & upgrade tips:

https://www.baremon.eu/postgresql-18-released-key-features-u...

anal_reactor 88 days ago |

Can someone explain to me what's the problem? I have very little knowledge of Linux kernel, but I'm curious. I've tried reading a little, but it's jargon over jargon.

alienchow 88 days ago | |

I'm not familiar with the jargon either, but based on some reading it comes down to how the latest kernel handles process preemption.

Postgres uses spinlocks to guard shared memory in very critical sections. A spinlock is a loop that keeps retrying to take a lock without sleeping, hence "spinning". Previous kernels allowed spinlocking processes to run with PREEMPT_NONE. As I understand it, this setting tells the kernel to let the locking process complete its work before doing anything else. Now the latest kernel has removed this functionality and is interrupting spinlocking processes. So if a process that is holding a lock gets interrupted, all other Postgres processes that need the same lock spin in place for much longer, leading to performance degradation.
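
To make "spinning" concrete, here is a toy sketch in Python, not how Postgres implements its spinlocks. A non-blocking `acquire` on a `threading.Lock` stands in for an atomic test-and-set instruction, and the names `SpinLock`, `failed_attempts` and `run_workers` are invented for this example. The point is only that a waiter burns CPU in a loop for as long as the holder has not released, so a holder that gets descheduled mid-section makes every waiter spin longer.

```python
import threading


class SpinLock:
    """Toy test-and-set spinlock: waiters loop instead of sleeping."""

    def __init__(self) -> None:
        # A non-blocking acquire on a threading.Lock stands in for an
        # atomic test-and-set instruction.
        self._flag = threading.Lock()
        # Rough statistic only; it is updated without synchronization.
        self.failed_attempts = 0

    def acquire(self) -> None:
        while not self._flag.acquire(blocking=False):
            self.failed_attempts += 1  # burning CPU while the holder works

    def release(self) -> None:
        self._flag.release()


def run_workers(n_threads: int = 4, increments: int = 2000) -> tuple[int, int]:
    """Increment a shared counter under the spinlock from several threads."""
    lock = SpinLock()
    counter = 0

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            lock.acquire()
            counter += 1  # the critical section: only a few instructions
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, lock.failed_attempts


if __name__ == "__main__":
    total, spins = run_workers()
    print("counter:", total, "failed lock attempts:", spins)
```

The number of failed attempts varies from run to run because it depends on when the scheduler switches threads, which is exactly the sensitivity being discussed.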

anal_reactor 88 days ago | | |

Why does it only appear on arm64 and not x86?

tijsvd 88 days ago | |

From what I understand in the follow-up: Postgres uses shared memory for buffers. This shared memory is read by a new connection while locked.

In postgres, connections are handled with a process fork, not a new thread. If such a fork first reads memory, even if it already exists, that causes a minor page fault, which goes back to the kernel so it can update memory mapping tables.

The operation under lock is only a few instructions, but if it takes longer than expected, then that causes lock contention. Regression in the kernel handling minor faults?

The whole thing is then made worse because it's a spinlock, causing all waiting processes to contend over the CPUs, which adds to kernel processing.

Mitigated by using huge pages, which dramatically reduces the number of mapping entries and faults. I reckon that it could also be mitigated in postgres by pre-faulting all shared memory early?
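
A rough sketch of what "pre-faulting" means, assuming Linux or another Unix where Python's `resource` module reports minor faults in `ru_minflt`. It maps an anonymous shared region, touches one byte per page, and compares the fault count of the first pass with a second pass over the same pages. The helper names `minor_faults`, `touch_every_page` and `measure_prefault` are hypothetical. Note this runs in a single process; whether touching pages early would help processes forked later depends on how the kernel sets up their page tables, which this sketch does not model.

```python
import mmap
import resource


def minor_faults() -> int:
    """Minor page faults charged to this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def touch_every_page(buf: mmap.mmap, page_size: int) -> None:
    """Write one byte per page so the kernel has to map each page."""
    for offset in range(0, len(buf), page_size):
        buf[offset] = 1


def measure_prefault(size_bytes: int = 16 * 1024 * 1024) -> tuple[int, int]:
    """Return (faults on first touch, faults on second touch)."""
    page_size = mmap.PAGESIZE
    buf = mmap.mmap(-1, size_bytes)  # anonymous shared mapping
    try:
        start = minor_faults()
        touch_every_page(buf, page_size)  # first touch: faults expected
        after_first = minor_faults()
        touch_every_page(buf, page_size)  # already mapped: few or no faults
        after_second = minor_faults()
    finally:
        buf.close()
    return after_first - start, after_second - after_first


if __name__ == "__main__":
    first, second = measure_prefault()
    print("faults on first pass:", first, "on second pass:", second)
```

The exact counts depend on the kernel, the page size in use, and the environment (some sandboxes report zero), so treat the numbers as indicative only.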

Deeg9rie9usi 88 days ago |

Once again Phoronix shot out an article without researching further or letting the mail thread in question cool down. The follow-up mails make clear that the issue is more or less a non-issue since the benchmark is wrong.

adrian_b 88 days ago | |

The follow-up mails conclude that the regression happens only when huge pages are not used.

While using huge pages whenever possible is the right solution and this should be enough for PostgreSQL, perhaps there are applications that cannot use huge pages and which are affected by the regression.

So I do not think that it is right to just ignore what happened.

scottlamb 88 days ago | | |

> While using huge pages whenever possible is the right solution and this should be enough for PostgreSQL, perhaps there are applications that cannot use huge pages and which are affected by the regression.

It will be more interesting to talk about those applications if and when they are found. And I wouldn't assume the solutions are limited to reverting this change, starting to use the new spinlock time-slice extension mechanism, and enabling huge pages.

It sounds like using 4K pages with 100G of buffer cache was just the thing that made this spinlock's critical section become longer than PostgreSQL's developers had seen before. So when trying to apply the solution to some hypothetical other software that is suddenly benchmarking poorly, I'd generalize from "enable huge pages" to "look for other differences between your benchmark configuration and what the software's authors tested on".
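
For a sense of scale, a back-of-the-envelope calculation of how many pages it takes to map that buffer cache. Two assumptions here that are not from the thread: "100G" is read as 100 GiB, and the huge page size is taken to be 2 MiB (a common default, but it varies by architecture and configuration).

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages (rounded up) needed to map a region."""
    return -(-region_bytes // page_bytes)


buffer_cache = 100 * GIB                      # assumed reading of "100G"
small = pages_needed(buffer_cache, 4 * KIB)   # regular 4 KiB pages
huge = pages_needed(buffer_cache, 2 * MIB)    # assumed 2 MiB huge pages

print(small)          # 26214400
print(huge)           # 51200
print(small // huge)  # 512
```

Under those assumptions, each process touching the whole cache has roughly 26 million page mappings to fault in with 4K pages, versus about 51 thousand with huge pages.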

Deeg9rie9usi 88 days ago | | |

I agree with you. The lurid headlines of phoronix.com just annoy me...

galbar 89 days ago |

It's not a good look to break userspace applications without a deprecation period where both old and new solutions exist, allowing for a transition period.

up2isomorphism 88 days ago |

Not sure why people have to upgrade to the newest major kernel version as soon as it is released.

conradludgate 88 days ago | |

It's the performance team's job to test these things. Doesn't mean they're going to deploy it immediately.

Someone should be testing these things and reporting regressions.

jeltz 88 days ago | |

If nobody tests and reports these things when the version is released, the regression will not be fixed by the time people start using it in production.

IshKebab 88 days ago | |

Don't make excuses.

dboreham 88 days ago |

THP again?

dmitrygr 88 days ago |

And this is exactly why we need the old Linus. Someone needs to yell “we do not break user space”.

carlsborg 88 days ago |

Perhaps in due time we will see workload specific forks of Linux maintained by a team of agents