* [PATCH 0/1] sched: Restore PREEMPT_NONE as default
@ 2026-04-03 19:19 Salvatore Dipietro
2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:19 UTC (permalink / raw)
To: linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, peterz,
Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior
We are reporting a throughput and latency regression on PostgreSQL
pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851
("sched: Further restrict the preemption modes") introduced in
v7.0-rc1.
The regression manifests as throughput dropping to 0.51x of baseline
on a pgbench simple-update workload with 1024 clients on a 96-vCPU
(AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling
shows 55% of CPU time is consumed spinning in PostgreSQL's
userspace spinlock (s_lock()) under PREEMPT_LAZY:
|- 56.03% - StartReadBuffer
   |- 55.93% - GetVictimBuffer
      |- 55.93% - StrategyGetBuffer
         |- 55.60% - s_lock            <<<< 55% of time
         |  |- 0.39% - el0t_64_irq
         |  |- 0.10% - perform_spin_delay
         |- 0.08% - LockBufHdr
         |- 0.07% - hash_search_with_hash_value
   |- 0.40% - WaitReadBuffers
1. Test environment
___________________
Hardware: 1x AWS EC2 m8g.24xlarge (12x 1TB IO2 32000 iops RAID0 XFS)
OS: AL2023 (ami-03a8d3251f401ffca)
Kernel: next-20260331
Database: PostgreSQL 17
Workload: pgbench simple-update
1024 clients, 96 threads, 1200s duration
scale factor 8470, fillfactor=90, prepared protocol
2. Results
__________
Configuration       Run 1     Run 2      Run 3     Average   vs Base
__________________  ________  _________  ________  ________  _______
BASELINE            47242.39  53369.18   51644.29  50751.96  1.00x
w/ revert           92906.62  103976.03  98814.94  98565.86  1.94x
3. Reproduction
_______________
On the AWS EC2 m8g.24xlarge, install and run the PostgreSQL
database using the repro-collection repository as follows:
# Reproducer code:
git clone https://github.com/aws/repro-collection.git ~/repro-collection
# Setup and start PostgreSQL server in terminal 1:
~/repro-collection/run.sh postgresql SUT --ldg=127.0.0.1
# Run pgbench load generator in terminal 2:
PGBENCH_SCALE=8470 \
PGBENCH_INIT_EXTRA_ARGS="--fillfactor=90" \
PGBENCH_CLIENTS=1024 \
PGBENCH_THREADS=96 \
PGBENCH_DURATION=1200 \
PGBENCH_BUILTIN=simple-update \
PGBENCH_PROTOCOL=prepared \
~/repro-collection/run.sh postgresql LDG --sut=127.0.0.1
Salvatore Dipietro (1):
sched: Restore PREEMPT_NONE as default
kernel/Kconfig.preempt | 3 ---
1 file changed, 3 deletions(-)
base-commit: 9147566d801602c9e7fc7f85e989735735bf38ba
--
2.50.1 (Apple Git-155)
AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico
^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/1] sched: Restore PREEMPT_NONE as default
  2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro
@ 2026-04-03 19:19 ` Salvatore Dipietro
  2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra
  2026-04-05 14:44 ` Mitsumasa KONDO
  2 siblings, 0 replies; 20+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:19 UTC (permalink / raw)
To: linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, peterz,
  Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior

Commit 7dadeaa6e851 ("sched: Further restrict the preemption modes")
changed the default preemption model to PREEMPT_LAZY on architectures
that support it and made PREEMPT_NONE depend on ARCH_NO_PREEMPT.

This drops throughput to 0.51x of baseline on PostgreSQL pgbench
(simple-update) with 1024 clients on a 96-vCPU Graviton4 arm64
system. Perf profiling shows 55% of CPU time spinning in PostgreSQL's
userspace spinlock (s_lock()) under PREEMPT_LAZY.

Restore PREEMPT_NONE as the default preemption model, remove the
ARCH_NO_PREEMPT dependency from PREEMPT_NONE, and remove the
ARCH_HAS_PREEMPT_LAZY restriction from PREEMPT_VOLUNTARY.
Fixes: 7dadeaa6e851 ("sched: Further restrict the preemption modes")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 kernel/Kconfig.preempt | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..da326800c1c9 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -16,13 +16,11 @@ config ARCH_HAS_PREEMPT_LAZY
 
 choice
 	prompt "Preemption Model"
-	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
 	default PREEMPT_NONE
 
 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
 	depends on !PREEMPT_RT
-	depends on ARCH_NO_PREEMPT
 	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
@@ -37,7 +35,6 @@ config PREEMPT_NONE
 
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
-	depends on !ARCH_HAS_PREEMPT_LAZY
 	depends on !ARCH_NO_PREEMPT
 	depends on !PREEMPT_RT
 	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
-- 
2.50.1 (Apple Git-155)

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro
  2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro
@ 2026-04-03 21:32 ` Peter Zijlstra
  2026-04-04 17:42   ` Andres Freund
  2026-04-06  0:43   ` Qais Yousef
  2026-04-05 14:44 ` Mitsumasa KONDO
  2 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2026-04-03 21:32 UTC (permalink / raw)
To: Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
  Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior,
  Mark Rutland

On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> We are reporting a throughput and latency regression on PostgreSQL
> pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851
> ("sched: Further restrict the preemption modes") introduced in
> v7.0-rc1.
>
> The regression manifests as a 0.51x throughput drop on a pgbench
> simple-update workload with 1024 clients on a 96-vCPU
> (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling
> shows 55% of CPU time is consumed spinning in PostgreSQL's
> userspace spinlock (s_lock()) under PREEMPT_LAZY:
>
> |- 56.03% - StartReadBuffer
>    |- 55.93% - GetVictimBuffer
>       |- 55.93% - StrategyGetBuffer
>          |- 55.60% - s_lock            <<<< 55% of time
>          |  |- 0.39% - el0t_64_irq
>          |  |- 0.10% - perform_spin_delay
>          |- 0.08% - LockBufHdr
>          |- 0.07% - hash_search_with_hash_value
>    |- 0.40% - WaitReadBuffers

The fix here is to make PostgreSQL make use of the rseq slice extension:

  https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de

That should limit the exposure to lock holder preemption (unless
PostgreSQL is doing seriously egregious things).

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra
@ 2026-04-04 17:42   ` Andres Freund
  2026-04-05  1:40     ` Andres Freund
  2026-04-07  8:49     ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Andres Freund @ 2026-04-04 17:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
  dipietro.salvatore, Thomas Gleixner, Valentin Schneider,
  Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> > We are reporting a throughput and latency regression on PostgreSQL
> > pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851
> > ("sched: Further restrict the preemption modes") introduced in
> > v7.0-rc1.
> >
> > The regression manifests as a 0.51x throughput drop on a pgbench
> > simple-update workload with 1024 clients on a 96-vCPU
> > (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling
> > shows 55% of CPU time is consumed spinning in PostgreSQL's
> > userspace spinlock (s_lock()) under PREEMPT_LAZY:
> >
> > |- 56.03% - StartReadBuffer
> >    |- 55.93% - GetVictimBuffer
> >       |- 55.93% - StrategyGetBuffer
> >          |- 55.60% - s_lock            <<<< 55% of time
> >          |  |- 0.39% - el0t_64_irq
> >          |  |- 0.10% - perform_spin_delay
> >          |- 0.08% - LockBufHdr
> >          |- 0.07% - hash_search_with_hash_value
> >    |- 0.40% - WaitReadBuffers
>
> The fix here is to make PostgreSQL make use of rseq slice extension:
>
>   https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de
>
> That should limit the exposure to lock holder preemption (unless
> PostgreSQL is doing seriously egregious things).

Maybe we should, but requiring the use of a new low-level facility that
was introduced in the 7.0 kernel, to address a regression that exists
only in 7.0+, seems not great.

It's not like it's a completely trivial thing to add support for
either, so I doubt backpatching it into already released major versions
of postgres is the right thing.

This specific spinlock doesn't actually exist anymore in postgres'
trunk (feature freeze in a few days, release early autumn). But there
is at least one other one that can often be quite hotly contended
(although there is a relatively low limit on the number of backends
that can acquire it concurrently, which might be the saving grace
here).

I'm not quite sure I understand why the spinlock in Salvatore's
benchmark shows up this heavily:

- For something like the benchmark here, it should only be used until
  postgres' buffer pool is fully used, as the freelist only contains
  buffers not in use, and we check without a lock whether it contains
  buffers. Once running, buffers are only added to the freelist if
  tables/indexes are dropped/truncated. And the benchmark seems like it
  runs long enough that we should actually reach the point where the
  freelist is empty?

- The section covered by the spinlock is only a few instructions long,
  and it is only hit if we have to do a somewhat heavyweight operation
  afterwards (read a page into the buffer pool); it seems surprising
  that this short section gets interrupted frequently enough to cause a
  regression of this magnitude.

  For a moment I thought it might be because, while holding the
  spinlock, some memory is touched for the first time, but that is
  actually not the case.

The benchmark script seems to indicate that huge pages aren't in use:
https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15

I wonder if somehow the pages underlying the portions of postgres'
shared memory are getting paged out for some reason, leading to page
faults while holding the spinlock?

Salvatore, could you repeat that benchmark in some variations?

1) Use huge pages

2) 1) + prewarm the buffer pool before running the benchmark

   CREATE EXTENSION pg_prewarm;

   -- prewarm table data
   SELECT pg_prewarm(oid) FROM pg_class
   WHERE relname LIKE 'pgbench_accounts%' and relkind = 'r';

   -- prewarm indexes, do so after tables, as indexes are more
   -- important, and the buffer pool might not be big enough
   SELECT pg_prewarm(oid) FROM pg_class
   WHERE relname LIKE 'pgbench_accounts%' and relkind = 'i';

I assume postgres was built with either an -march sufficient to use
atomic instructions (i.e. -march=armv8.1-a or such) instead of ll/sc?
Or at least -moutline-atomics was used?

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-04 17:42 ` Andres Freund
@ 2026-04-05  1:40   ` Andres Freund
  2026-04-05  4:21     ` Andres Freund
  2026-04-07  8:49   ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Andres Freund @ 2026-04-05 1:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
  dipietro.salvatore, Thomas Gleixner, Valentin Schneider,
  Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> I'm not quite sure I understand why the spinlock in Salvatore's
> benchmark shows up this heavily:
>
> - For something like the benchmark here, it should only be used until
>   postgres' buffer pool is fully used, as the freelist only contains
>   buffers not in use, and we check without a lock whether it contains
>   buffers. Once running, buffers are only added to the freelist if
>   tables/indexes are dropped/truncated. And the benchmark seems like it
>   runs long enough that we should actually reach the point where the
>   freelist is empty?
>
> - The section covered by the spinlock is only a few instructions long,
>   and it is only hit if we have to do a somewhat heavyweight operation
>   afterwards (read a page into the buffer pool); it seems surprising
>   that this short section gets interrupted frequently enough to cause a
>   regression of this magnitude.
>
>   For a moment I thought it might be because, while holding the
>   spinlock, some memory is touched for the first time, but that is
>   actually not the case.

I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB
of memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48
cores / 96 threads, so it's smaller, and it's x86, not arm, but it's
what I can quickly update to an unreleased kernel.

So far I don't see such a regression, and I basically see no time spent
in GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).

Which I don't find surprising: this workload doesn't read enough to
have contention in there. Salvatore reported on the order of 100k
transactions/sec (with one update, one read and one insert). Even if
just about all of those were misses - and they shouldn't be with 25% of
384G as postgres' shared_buffers as the script indicates, and we know
that s_b is not full due to even hitting GetVictimBuffer() - that'd be
just ~200k IOs/sec from the page cache. That's not that much.

Now, this machine is smaller and a different arch, so who knows. The
7.0-rc numbers I am getting are higher than what Salvatore reported on
a bigger machine. It's hard to compare though, as I am testing with
local storage, and this workload should be extremely write-latency
bound (but my storage has crappy fsync latency, so ...).

I *do* see some contention where it's conceivable that rseq slice
extension could help some, but

a) It's a completely different lock: the WALWrite lock, which is
   precisely the lock you'd expect in a commit-latency-bound workload
   with a lot of clients (the lock is used to wait for an in-flight WAL
   flush to complete).

b) So far I have not observed a regression from 6.18.

For me a profile looks like this:

- 60.99% 0.95% postgres postgres [.] PostgresMain
   - 60.04% PostgresMain
      - 22.57% PortalRun
         - 20.88% PortalRunMulti
            - 16.70% standard_ExecutorRun
               - 16.55% ExecModifyTable
                  + 10.78% ExecScan
                  + 3.19% ExecUpdate
                  + 1.53% ExecInsert
            + 2.94% standard_ExecutorStart
              0.54% standard_ExecutorEnd
         + 1.60% PortalRunSelect
      - 15.89% CommitTransactionCommand
         - 15.50% CommitTransaction
            - 11.90% XLogFlush
               - 7.66% LWLockAcquireOrWait
                    6.70% LWLockQueueSelf
                 0.57% perform_spin_delay

Which is about what I would expect.

Salvatore, is there a chance your profile is corrupted and you did
observe contention, but on a different lock? E.g. due to out-of-date
debug symbols or such?

Could you run something like the following while the benchmark is
running:

SELECT backend_type, wait_event_type, wait_event, state, count(*)
FROM pg_stat_activity
WHERE wait_event_type NOT IN ('Activity')
GROUP BY backend_type, wait_event_type, wait_event, state
ORDER BY count(*) DESC
\watch 1

and show what you see at the time your profile shows the bad
contention?

On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > The fix here is to make PostgreSQL make use of rseq slice extension:
> >
> >   https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de
> >
> > That should limit the exposure to lock holder preemption (unless
> > PostgreSQL is doing seriously egregious things).
>
> Maybe we should, but requiring the use of a new low level facility
> that was introduced in the 7.0 kernel, to address a regression that
> exists only in 7.0+, seems not great.
>
> It's not like it's a completely trivial thing to add support for
> either, so I doubt it'll be the right thing to backpatch it into
> already released major versions of postgres.

It's not even suggested to be enabled by default:

  CONFIG_RSEQ_SLICE_EXTENSION:

  Allows userspace to request a limited time slice extension when
  returning from an interrupt to user space via the RSEQ shared data
  ABI. If granted, that allows to complete a critical section, so that
  other threads are not stuck on a conflicted resource, while the task
  is scheduled out.

  If unsure, say N.

And enabling it requires EXPERT=1.

If this somehow does end up being a reproducible performance issue (I
still suspect something more complicated is going on), I don't see how
userspace could be expected to mitigate a substantial perf regression
in 7.0 that can only be mitigated by a default-off, non-trivial
functionality also introduced in 7.0.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05  1:40 ` Andres Freund
@ 2026-04-05  4:21   ` Andres Freund
  2026-04-05  6:08     ` Ritesh Harjani
  2026-04-07 11:19     ` Mark Rutland
  0 siblings, 2 replies; 20+ messages in thread
From: Andres Freund @ 2026-04-05 4:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
  dipietro.salvatore, Thomas Gleixner, Valentin Schneider,
  Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> > I'm not quite sure I understand why the spinlock in Salvatore's
> > benchmark shows up this heavily:
> >
> > - For something like the benchmark here, it should only be used until
> >   postgres' buffer pool is fully used, as the freelist only contains
> >   buffers not in use, and we check without a lock whether it contains
> >   buffers. Once running, buffers are only added to the freelist if
> >   tables/indexes are dropped/truncated. And the benchmark seems like
> >   it runs long enough that we should actually reach the point where
> >   the freelist is empty?
> >
> > - The section covered by the spinlock is only a few instructions
> >   long, and it is only hit if we have to do a somewhat heavyweight
> >   operation afterwards (read a page into the buffer pool); it seems
> >   surprising that this short section gets interrupted frequently
> >   enough to cause a regression of this magnitude.
> >
> >   For a moment I thought it might be because, while holding the
> >   spinlock, some memory is touched for the first time, but that is
> >   actually not the case.
>
> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB
> of memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48
> cores / 96 threads, so it's smaller, and it's x86, not arm, but it's
> what I can quickly update to an unreleased kernel.
>
> So far I don't see such a regression, and I basically see no time
> spent in GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>
> Which I don't find surprising: this workload doesn't read enough to
> have contention in there. Salvatore reported on the order of 100k
> transactions/sec (with one update, one read and one insert). Even if
> just about all of those were misses - and they shouldn't be with 25%
> of 384G as postgres' shared_buffers as the script indicates, and we
> know that s_b is not full due to even hitting GetVictimBuffer() -
> that'd be just ~200k IOs/sec from the page cache. That's not that
> much.

> The benchmark script seems to indicate that huge pages aren't in use:
> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>
> I wonder if somehow the pages underlying the portions of postgres'
> shared memory are getting paged out for some reason, leading to page
> faults while holding the spinlock?

Hah. I had reflexively used huge_pages=on - as that is the only sane
thing to do with 10s to 100s of GB of shared memory and thus part of
all my benchmarking infrastructure - during the benchmark runs
mentioned above.

Turns out, if I *disable* huge pages, I actually can reproduce the
contention that Salvatore reported (I didn't check whether it's a
regression for me, though). Not anywhere close to the same degree,
because the bottleneck for me is the writes.

If I change the workload to a read-only benchmark, which obviously
reads a lot more due to not being bottlenecked by durable-write
latency, I see more contention:

- 12.76% postgres postgres [.] s_lock
   - 12.75% s_lock
      - 12.69% StrategyGetBuffer
           GetVictimBuffer
         - StartReadBuffer
            - 12.69% ReleaseAndReadBuffer
               + 12.65% heapam_index_fetch_tuple

While what I said above is true - the memory touched at the time of
contention isn't the first access to the relevant shared memory
(i.e. it is already backed by memory) - in this workload
GetVictimBuffer()->StrategyGetBuffer() will be the first access of the
connection processes to the relevant 4kB pages.

Thus there will be a *lot* of minor faults and TLB misses while holding
a spinlock. Unsurprisingly that's bad for performance.

I don't see a reason to particularly care about the regression if
that's the sole way to trigger it. Using a buffer pool of ~100GB
without huge pages is not an interesting workload. With a smaller
buffer pool the problem would not happen either.

Note that the performance effect of not using huge pages is terrible
*regardless* of the spinlock. PG 19 does not have the spinlock in this
path anymore, but not using huge pages is still utterly terrible (like
1/3 of the throughput).

I did run some benchmarks here and I don't see a clearly reproducible
regression with huge pages.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05  4:21 ` Andres Freund
@ 2026-04-05  6:08   ` Ritesh Harjani
  2026-04-05 14:09     ` Andres Freund
  2026-04-07  8:20     ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Ritesh Harjani @ 2026-04-05 6:08 UTC (permalink / raw)
To: Andres Freund, Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
  dipietro.salvatore, Thomas Gleixner, Valentin Schneider,
  Sebastian Andrzej Siewior, Mark Rutland

Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
>> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
>> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
>> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
>> > I'm not quite sure I understand why the spinlock in Salvatore's
>> > benchmark shows up this heavily:
>> >
>> > - For something like the benchmark here, it should only be used
>> >   until postgres' buffer pool is fully used, as the freelist only
>> >   contains buffers not in use, and we check without a lock whether
>> >   it contains buffers. Once running, buffers are only added to the
>> >   freelist if tables/indexes are dropped/truncated. And the
>> >   benchmark seems like it runs long enough that we should actually
>> >   reach the point where the freelist is empty?
>> >
>> > - The section covered by the spinlock is only a few instructions
>> >   long, and it is only hit if we have to do a somewhat heavyweight
>> >   operation afterwards (read a page into the buffer pool); it seems
>> >   surprising that this short section gets interrupted frequently
>> >   enough to cause a regression of this magnitude.
>> >
>> >   For a moment I thought it might be because, while holding the
>> >   spinlock, some memory is touched for the first time, but that is
>> >   actually not the case.
>>
>> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB
>> of memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48
>> cores / 96 threads, so it's smaller, and it's x86, not arm, but it's
>> what I can quickly update to an unreleased kernel.
>>
>> So far I don't see such a regression, and I basically see no time
>> spent in GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>>
>> Which I don't find surprising: this workload doesn't read enough to
>> have contention in there. Salvatore reported on the order of 100k
>> transactions/sec (with one update, one read and one insert). Even if
>> just about all of those were misses - and they shouldn't be with 25%
>> of 384G as postgres' shared_buffers as the script indicates, and we
>> know that s_b is not full due to even hitting GetVictimBuffer() -
>> that'd be just ~200k IOs/sec from the page cache. That's not that
>> much.
>
>> The benchmark script seems to indicate that huge pages aren't in use:
>> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>>
>> I wonder if somehow the pages underlying the portions of postgres'
>> shared memory are getting paged out for some reason, leading to page
>> faults while holding the spinlock?
>
> Hah. I had reflexively used huge_pages=on - as that is the only sane
> thing to do with 10s to 100s of GB of shared memory and thus part of
> all my benchmarking infrastructure - during the benchmark runs
> mentioned above.
>
> Turns out, if I *disable* huge pages, I actually can reproduce the
> contention that Salvatore reported (didn't check whether it's a
> regression for me though). Not anywhere close to the same degree,
> because the bottleneck for me is the writes.
>
> If I change the workload to a read-only benchmark, which obviously
> reads a lot more due to not being bottlenecked by durable-write
> latency, I see more contention:
>
> - 12.76% postgres postgres [.] s_lock
>    - 12.75% s_lock
>       - 12.69% StrategyGetBuffer
>            GetVictimBuffer
>          - StartReadBuffer
>             - 12.69% ReleaseAndReadBuffer
>                + 12.65% heapam_index_fetch_tuple
>
> While what I said above is true - the memory touched at the time of
> contention isn't the first access to the relevant shared memory
> (i.e. it is already backed by memory) - in this workload
> GetVictimBuffer()->StrategyGetBuffer() will be the first access of the
> connection processes to the relevant 4kB pages.
>
> Thus there will be a *lot* of minor faults and TLB misses while
> holding a spinlock. Unsurprisingly that's bad for performance.
>
> I don't see a reason to particularly care about the regression if
> that's the sole way to trigger it. Using a buffer pool of ~100GB
> without huge pages is not an interesting workload. With a smaller
> buffer pool the problem would not happen either.
>
> Note that the performance effect of not using huge pages is terrible
> *regardless* of the spinlock. PG 19 does not have the spinlock in this
> path anymore, but not using huge pages is still utterly terrible (like
> 1/3 of the throughput).
>
> I did run some benchmarks here and I don't see a clearly reproducible
> regression with huge pages.

However, out of curiosity, I was hoping someone more familiar with the
scheduler area could explain why PREEMPT_LAZY vs PREEMPT_NONE causes a
performance regression without huge pages.

Minor page fault handling has microsecond latency, whereas sched ticks
are in milliseconds. Besides, both preemption models should anyway
schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
So I was curious to understand how the preemption model causes a
performance regression with no huge pages in this case.

-ritesh

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05  6:08 ` Ritesh Harjani
@ 2026-04-05 14:09   ` Andres Freund
  2026-04-05 14:44     ` Andres Freund
                       ` (2 more replies)
  2026-04-07  8:20   ` Peter Zijlstra
  1 sibling, 3 replies; 20+ messages in thread
From: Andres Freund @ 2026-04-05 14:09 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Peter Zijlstra, Salvatore Dipietro, linux-kernel, alisaidi,
  blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner,
  Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-05 11:38:59 +0530, Ritesh Harjani wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Hah. I had reflexively used huge_pages=on - as that is the only sane
> > thing to do with 10s to 100s of GB of shared memory and thus part of
> > all my benchmarking infrastructure - during the benchmark runs
> > mentioned above.
> >
> > Turns out, if I *disable* huge pages, I actually can reproduce the
> > contention that Salvatore reported (didn't check whether it's a
> > regression for me though). Not anywhere close to the same degree,
> > because the bottleneck for me is the writes.
> >
> > If I change the workload to a read-only benchmark, which obviously
> > reads a lot more due to not being bottlenecked by durable-write
> > latency, I see more contention:
> >
> > - 12.76% postgres postgres [.] s_lock
> >    - 12.75% s_lock
> >       - 12.69% StrategyGetBuffer
> >            GetVictimBuffer
> >          - StartReadBuffer
> >             - 12.69% ReleaseAndReadBuffer
> >                + 12.65% heapam_index_fetch_tuple
> >
> > While what I said above is true - the memory touched at the time of
> > contention isn't the first access to the relevant shared memory
> > (i.e. it is already backed by memory) - in this workload
> > GetVictimBuffer()->StrategyGetBuffer() will be the first access of
> > the connection processes to the relevant 4kB pages.
> >
> > Thus there will be a *lot* of minor faults and TLB misses while
> > holding a spinlock. Unsurprisingly that's bad for performance.
> >
> > I don't see a reason to particularly care about the regression if
> > that's the sole way to trigger it. Using a buffer pool of ~100GB
> > without huge pages is not an interesting workload. With a smaller
> > buffer pool the problem would not happen either.
> >
> > Note that the performance effect of not using huge pages is terrible
> > *regardless* of the spinlock. PG 19 does not have the spinlock in
> > this path anymore, but not using huge pages is still utterly
> > terrible (like 1/3 of the throughput).
> >
> > I did run some benchmarks here and I don't see a clearly
> > reproducible regression with huge pages.
>
> However, out of curiosity, I was hoping someone more familiar with the
> scheduler area could explain why PREEMPT_LAZY vs PREEMPT_NONE causes a
> performance regression without huge pages.
>
> Minor page fault handling has microsecond latency, whereas sched ticks
> are in milliseconds. Besides, both preemption models should anyway
> schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
> So I was curious to understand how the preemption model causes a
> performance regression with no huge pages in this case.

An attempt at answering that, albeit not from the angle of somebody who
knows the scheduler code to a meaningful degree:

I think the effect of 4kB pages (and the associated minor faults and
TLB misses) is just to create contention on a spinlock that would
normally never be contended, due to the faults occurring while the
spinlock is held in this quite extreme workload.

This contention happens with PREEMPT_NONE as well - the performance is
quite bad compared to when using huge pages. My guess is PREEMPT_LAZY
just exacerbates the terrible contention by scheduling out the lock
holder more often. But you're already in deep trouble at this point,
even without PREEMPT_LAZY making it "worse".

On my machine (smaller than Salvatore's) PREEMPT_LAZY is worse, but not
by that much. I suspect for Salvatore PREEMPT_LAZY just made already
terrible contention worse. I think we really need a comparison run from
Salvatore with huge pages with both PREEMPT_NONE and PREEMPT_LAZY.

FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock"
part has just about nothing to do with the problem. My understanding is
that default futexes don't transfer the lock waiter's scheduler slice
to the lock holder (there's no information about who the lock holder is
unless it's a PI futex). Postgres' spinlocks have randomized
exponential backoff and the amount of spinning is adjusted over time,
so you don't actually end up with spinlock waiters preventing the lock
owner from getting scheduled to a significant degree.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05 14:09 ` Andres Freund
@ 2026-04-05 14:44 ` Andres Freund
  2026-04-07 8:29 ` Peter Zijlstra
  2026-04-07 8:27 ` Peter Zijlstra
  2026-04-07 10:17 ` David Laight
  2 siblings, 1 reply; 20+ messages in thread
From: Andres Freund @ 2026-04-05 14:44 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Peter Zijlstra, Salvatore Dipietro, linux-kernel, alisaidi,
  blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner,
  Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-05 10:09:35 -0400, Andres Freund wrote:
> FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock" part
> has just about nothing to do with the problem. My understanding is that
> default futexes don't transfer the lock waiter's scheduler slice to the lock
> holder (there's no information about who the lock holder is unless it's a PI
> futex), Postgres' spinlock have randomized exponential backoff and the amount
> of spinning is adjusted over time, so you don't actually end up with spinlock
> waiters preventing the lock owner from getting scheduled to a significant
> degree.

Confirmed, a hack moving this to a futex based lock has very similar
performance, including somewhat worse performance with PREEMPT_LAZY than
PREEMPT_NONE. Using futexes has a bit lower throughput but also reduces CPU
usage a bit for the same amount of work, which is about what you'd expect.

Just to re-emphasize: That is just due to using 4kB pages in a huge buffer
pool and would vanish once running for a bit longer or when using a smaller
buffer pool.

I'll look at trying out the rseq slice extension in a bit, it looks like it'd
be nice for performance regardless of using spinlocks.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
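A futex-based lock of the sort such a hack would presumably resemble is the
well-known three-state scheme from Ulrich Drepper's "Futexes Are Tricky" --
sketched below as an assumption, not as Andres' actual patch:

```c
/* Sketch of a three-state futex mutex (0 = free, 1 = locked,
 * 2 = locked with waiters), after Drepper's "Futexes Are Tricky".
 * Linux-specific; not the actual PostgreSQL hack discussed above. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static long sys_futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void flock_acquire(atomic_int *f)
{
    int c = 0;
    /* Fast path: uncontended 0 -> 1 transition, no syscall at all. */
    if (atomic_compare_exchange_strong(f, &c, 1))
        return;
    /* Slow path: mark the lock contended (2) and sleep until woken. */
    if (c != 2)
        c = atomic_exchange(f, 2);
    while (c != 0) {
        sys_futex(f, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(f, 2);
    }
}

static void flock_release(atomic_int *f)
{
    /* Only enter the kernel if someone may be sleeping. */
    if (atomic_exchange(f, 0) == 2)
        sys_futex(f, FUTEX_WAKE_PRIVATE, 1);
}
```

Note the point made upthread: nothing in this word tells the kernel who the
owner is, so a preempted holder gets no slice transferred from its waiters.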
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-05 14:44 ` Andres Freund @ 2026-04-07 8:29 ` Peter Zijlstra 0 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2026-04-07 8:29 UTC (permalink / raw) To: Andres Freund Cc: Ritesh Harjani, Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On Sun, Apr 05, 2026 at 10:44:22AM -0400, Andres Freund wrote: > I'll look at trying out the rseq slice extension in a bit, it looks like it > nice for performance regardless of using spinlocks. Thanks! It seems to work well for the Oracle folks, please let us know if you have comments. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-05 14:09 ` Andres Freund 2026-04-05 14:44 ` Andres Freund @ 2026-04-07 8:27 ` Peter Zijlstra 2026-04-07 10:17 ` David Laight 2 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2026-04-07 8:27 UTC (permalink / raw) To: Andres Freund Cc: Ritesh Harjani, Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On Sun, Apr 05, 2026 at 10:09:35AM -0400, Andres Freund wrote: > FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock" part > has just about nothing to do with the problem. :-) > My understanding is that > default futexes don't transfer the lock waiter's scheduler slice to the lock > holder (there's no information about who the lock holder is unless it's a PI > futex), (and robust; both PI and robust store the owner TID in the futex field) The difference is that while mutex lock holder preemption is also bad, it mostly just leads to idle time, which you can sometimes fill with doing other work. Whereas with spinlocks, the time gets soaked up with spinners. So both 'bad', but both different. > Postgres' spinlock have randomized exponential backoff and the amount > of spinning is adjusted over time, so you don't actually end up with spinlock > waiters preventing the lock owner from getting scheduled to a significant > degree. Fair enough. ^ permalink raw reply [flat|nested] 20+ messages in thread
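The owner-TID convention Peter refers to for PI (and robust) futexes can be
sketched as follows. This is a hedged illustration of the protocol documented
in futex(2) -- error handling is omitted, and the kernel-mediated contended
paths are only exercised with multiple threads:

```c
/* Sketch of the PI-futex convention: the futex word holds the owner's
 * TID, so on contention the kernel knows which task to priority-boost.
 * Illustration of the futex(2) protocol only; Linux-specific. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static int my_tid(void)
{
    return (int)syscall(SYS_gettid);
}

static void pi_lock(atomic_int *f)
{
    int expected = 0;
    /* Uncontended fast path: store our TID, kernel not involved. */
    if (atomic_compare_exchange_strong(f, &expected, my_tid()))
        return;
    /* Contended: the kernel queues us and can boost the owner
     * identified by the TID stored in *f. */
    syscall(SYS_futex, f, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void pi_unlock(atomic_int *f)
{
    int expected = my_tid();
    /* The fast path fails if the kernel set FUTEX_WAITERS in the word. */
    if (atomic_compare_exchange_strong(f, &expected, 0))
        return;
    syscall(SYS_futex, f, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

This is exactly the information a plain futex lacks, which is why, as Peter
says, a plain-futex holder's preemption mostly shows up as idle time.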
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05 14:09 ` Andres Freund
  2026-04-05 14:44 ` Andres Freund
  2026-04-07 8:27 ` Peter Zijlstra
@ 2026-04-07 10:17 ` David Laight
  2 siblings, 0 replies; 20+ messages in thread
From: David Laight @ 2026-04-07 10:17 UTC (permalink / raw)
To: Andres Freund
Cc: Ritesh Harjani, Peter Zijlstra, Salvatore Dipietro, linux-kernel,
  alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner,
  Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland

On Sun, 5 Apr 2026 10:09:35 -0400
Andres Freund <andres@anarazel.de> wrote:
...
> FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock" part
> has just about nothing to do with the problem. My understanding is that
> default futexes don't transfer the lock waiter's scheduler slice to the lock
> holder (there's no information about who the lock holder is unless it's a PI
> futex), Postgres' spinlock have randomized exponential backoff and the amount
> of spinning is adjusted over time, so you don't actually end up with spinlock
> waiters preventing the lock owner from getting scheduled to a significant
> degree.

Another problem (which also affects futex) is that the lock holder can get
preempted (or just stolen by network interrupts and softint code). So even if
the lock is only held for a few instructions, sometimes that can take several
milliseconds.

Pretty much the only way to avoid that is changing the code to be lockless.
(eg changing linked lists to arrays and using atomic_increment() on an index.)

A much more subtle effect can also affect performance - especially of a
single-threaded cpu-intensive program. The clock speed of the busy cpu will
be boosted (perhaps to 5GHz). Then, when a lot of other higher priority
processes are all scheduled at the same time, the cpu-intensive program is
preempted.
When the high priority processes sleep (perhaps almost immediately) the
cpu-intensive program is scheduled on a 'random' cpu that has been idle, so it
initially runs at (say) 800MHz, before being boosted again.

That effect increased the compile time for an fpga image from ~10 minutes to
~20 minutes when the renaming of a kernel config option caused one of the
mitigations that significantly slow down system call entry/exit to slow down
an idle daemon that woke threads every 10ms (to process RTP audio) so that
all the threads were active at the same time.

David

>
> Greetings,
>
> Andres Freund
>

^ permalink raw reply	[flat|nested] 20+ messages in thread
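The lockless pattern David suggests -- replacing a lock-protected list with
an array plus an atomically incremented index -- might look like the sketch
below. The names are hypothetical, and a real design would handle wrap-around
or flushing instead of just reporting "full":

```c
/* Sketch of the lockless pattern: producers claim a slot with an
 * atomic fetch-add on an index instead of taking a lock, so there is
 * no critical section for preemption or interrupts to stretch out. */
#include <stdatomic.h>
#include <stddef.h>

#define EVQ_SLOTS 1024

struct evq {
    atomic_size_t next;          /* next free slot index */
    int slot[EVQ_SLOTS];
};

/* Returns 0 on success, -1 if the array is full. */
static int evq_push(struct evq *q, int value)
{
    size_t i = atomic_fetch_add_explicit(&q->next, 1,
                                         memory_order_relaxed);
    if (i >= EVQ_SLOTS)
        return -1;               /* full; real code would wrap or flush */
    q->slot[i] = value;          /* slot i now belongs to us exclusively */
    return 0;
}
```

Each producer gets a distinct slot from the fetch-add, so a producer that is
preempted mid-write delays only its own entry, not every other producer.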
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05 6:08 ` Ritesh Harjani
  2026-04-05 14:09 ` Andres Freund
@ 2026-04-07 8:20 ` Peter Zijlstra
  2026-04-07 9:07 ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2026-04-07 8:20 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Andres Freund, Salvatore Dipietro, linux-kernel, alisaidi,
  blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner,
  Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland

On Sun, Apr 05, 2026 at 11:38:59AM +0530, Ritesh Harjani wrote:

> However, for curiosity, I was hoping if someone more familiar with the
> scheduler area can explain why PREEMPT_LAZY v/s PREEMPT_NONE, causes
> performance regression w/o huge pages?
>
> Minor page fault handling has micro-secs latency, where as sched ticks
> is in milli-secs. Besides, both preemption models should anyway
> schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
>
> So was curious to understand how is the preemption model causing
> performance regression with no hugepages in this case?

So yes, everything can schedule on return-to-user (very much including
NONE). Which is why rseq slice ext is heavily recommended for anything
attempting user space spinlocks.

The thing where the other preemption modes differ is the scheduling while in
kernel mode. So if the workload is spending significant time in the kernel,
this could cause more scheduling.

As you already mentioned, no huge pages gives us more overhead on #PF (and
TLB miss, but that's mostly hidden in access latency rather than immediate
system time). This gives more system time, and more room to schedule.

If we get preempted in the middle of a #PF, rather than finishing it, this
increases the #PF completion time and if userspace is trying to access this
page concurrently.... But we should see that in mmap_lock contention/idle
time :/

I'm not sure I can explain any of this.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-07 8:20 ` Peter Zijlstra @ 2026-04-07 9:07 ` Peter Zijlstra 0 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2026-04-07 9:07 UTC (permalink / raw) To: Ritesh Harjani Cc: Andres Freund, Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On Tue, Apr 07, 2026 at 10:20:18AM +0200, Peter Zijlstra wrote: > On Sun, Apr 05, 2026 at 11:38:59AM +0530, Ritesh Harjani wrote: > > > However, for curiosity, I was hoping if someone more familiar with the > > scheduler area can explain why PREEMPT_LAZY v/s PREEMPT_NONE, causes > > performance regression w/o huge pages? > > > > Minor page fault handling has micro-secs latency, where as sched ticks > > is in milli-secs. Besides, both preemption models should anyway > > schedule() if TIF_NEED_RESCHED is set on return to userspace, right? > > > > So was curious to understand how is the preemption model causing > > performance regression with no hugepages in this case? > > So yes, everything can schedule on return-to-user (very much including > NONE). Which is why rseq slice ext is heavily recommended for anything > attempting user space spinlocks. > > The thing where the other preemption modes differ is the scheduling > while in kernel mode. So if the workload is spending significant time in > the kernel, this could cause more scheduling. > > As you already mentioned, no huge pages, gives us more overhead on #PF > (and TLB miss, but that's mostly hidden in access latency rather than > immediate system time). This gives more system time, and more room to > schedule. > > If we get preempted in the middle of a #PF, rather than finishing it, > this increases the #PF completion time and if userspace is trying to > access this page concurrently.... 
But we should see that in mmap_lock > contention/idle time :/ Sorry, insufficient wake-up juice applied. Concurrent page-faults are serialized on the page-table (spin) locks. Not mmap_lock. So it would increase system time and give more rise to kernel preemption. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-05 4:21 ` Andres Freund 2026-04-05 6:08 ` Ritesh Harjani @ 2026-04-07 11:19 ` Mark Rutland 1 sibling, 0 replies; 20+ messages in thread From: Mark Rutland @ 2026-04-07 11:19 UTC (permalink / raw) To: Salvatore Dipietro Cc: Andres Freund, Peter Zijlstra, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior On Sun, Apr 05, 2026 at 12:21:55AM -0400, Andres Freund wrote: > On 2026-04-04 21:40:29 -0400, Andres Freund wrote: > > On 2026-04-04 13:42:22 -0400, Andres Freund wrote: > > The benchmark script seems to indicate that huge pages aren't in use: > > https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15 For the benefit of those reading mail without a browser, the line in question is: ${PG_HUGE_PAGES:=off} # off, try, on Per the PostgreSQL 17 documentation: https://www.postgresql.org/docs/17/runtime-config-resource.html#GUC-HUGE-PAGES ... the default is 'try', though IIUC some additional system configuration may be necessary, to actually reserve huge pages, which is also documented: https://www.postgresql.org/docs/17/kernel-resources.html#LINUX-HUGE-PAGES > > I wonder if somehow the pages underlying the portions of postgres' shared > > memory are getting paged out for some reason, leading to page faults while > > holding the spinlock? > > Hah. I had reflexively used huge_pages=on - as that is the only sane thing to > do with 10s to 100s of GB of shared memory and thus part of all my > benchmarking infrastructure - during the benchmark runs mentioned above. Salvatore, was there a specific reason to test with PG_HUGE_PAGES=off rather than PG_HUGE_PAGES=try? Was that arbitrary (e.g. because it was the first of the possible options)? 
IIUC from what Andres says here (and in other mails in this thread), that's not a sensible/realistic configuration for this sort of workload, and is the root cause of the contention (which seems to be exacerbated by the scheduler model change). As Andres noted, even ignoring the scheduler model, running with PG_HUGE_PAGES=off results in a substantial performance penalty: > *regardless* the spinlock. PG 19 does have the spinlock in this path anymore, > but not using huge pages is still utterly terrible (like 1/3 of the > throughput). > > I did run some benchmarks here and I don't see a clearly reproducible > regression with huge pages. Is the PG_HUGE_PAGES=off configuration important to you for some reason? Mark. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-04 17:42 ` Andres Freund 2026-04-05 1:40 ` Andres Freund @ 2026-04-07 8:49 ` Peter Zijlstra 1 sibling, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2026-04-07 8:49 UTC (permalink / raw) To: Andres Freund Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On Sat, Apr 04, 2026 at 01:42:22PM -0400, Andres Freund wrote: > Hi, > > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote: > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote: > > > We are reporting a throughput and latency regression on PostgreSQL > > > pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851 > > > ("sched: Further restrict the preemption modes") introduced in > > > v7.0-rc1. > > > > > > The regression manifests as a 0.51x throughput drop on a pgbench > > > simple-update workload with 1024 clients on a 96-vCPU > > > (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling > > > shows 55% of CPU time is consumed spinning in PostgreSQL's > > > userspace spinlock (s_lock()) under PREEMPT_LAZY: > > > > > > |- 56.03% - StartReadBuffer > > > |- 55.93% - GetVictimBuffer > > > |- 55.93% - StrategyGetBuffer > > > |- 55.60% - s_lock <<<< 55% of time > > > | |- 0.39% - el0t_64_irq > > > | |- 0.10% - perform_spin_delay > > > |- 0.08% - LockBufHdr > > > |- 0.07% - hash_search_with_hash_value > > > |- 0.40% - WaitReadBuffers > > > > The fix here is to make PostgreSQL make use of rseq slice extension: > > > > https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de > > > > That should limit the exposure to lock holder preemption (unless > > PostgreSQL is doing seriously egregious things). > > Maybe we should, but requiring the use of a new low level facility that was > introduced in the 7.0 kernel, to address a regression that exists only in > 7.0+, seems not great. 
>
> It's not like it's a completely trivial thing to add support for either, so I
> doubt it'll be the right thing to backpatch it into already released major
> versions of postgres.

Just to clarify my response: all I really saw was 'userspace spinlock' and we
just did the rseq slice ext stuff (with Oracle) for exactly this type of
thing. And even NONE is susceptible to scheduling the lock holder.

It was also the last email I did on Good Friday and thinking hard really
wasn't high on the list of things :-)

Anyway, IF we revert -- and I think you've already made a fine case for not
doing that -- it will be a very temporary thing, NONE will go away.

As to the kernel version thing: why should people upgrade to the very latest
kernel release and not also be expected to upgrade PostgreSQL to the very
latest? If they want to use old PostgreSQL, they can use old kernel too,
right? Both have stable releases that should keep them afloat for a while.

Again, not saying we can't do better, but also sometimes you have to break
eggs to make cake :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra 2026-04-04 17:42 ` Andres Freund @ 2026-04-06 0:43 ` Qais Yousef 1 sibling, 0 replies; 20+ messages in thread From: Qais Yousef @ 2026-04-06 0:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On 04/03/26 23:32, Peter Zijlstra wrote: > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote: > > We are reporting a throughput and latency regression on PostgreSQL > > pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851 > > ("sched: Further restrict the preemption modes") introduced in > > v7.0-rc1. > > > > The regression manifests as a 0.51x throughput drop on a pgbench > > simple-update workload with 1024 clients on a 96-vCPU > > (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling > > shows 55% of CPU time is consumed spinning in PostgreSQL's > > userspace spinlock (s_lock()) under PREEMPT_LAZY: > > > > |- 56.03% - StartReadBuffer > > |- 55.93% - GetVictimBuffer > > |- 55.93% - StrategyGetBuffer > > |- 55.60% - s_lock <<<< 55% of time > > | |- 0.39% - el0t_64_irq > > | |- 0.10% - perform_spin_delay > > |- 0.08% - LockBufHdr > > |- 0.07% - hash_search_with_hash_value > > |- 0.40% - WaitReadBuffers > > The fix here is to make PostgreSQL make use of rseq slice extension: Or perhaps use a longer base_slice_ns in debugfs? I think we end up just short of 4ms in most systems now. 5 or 6 ms might help re-hide it. > > https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de > > That should limit the exposure to lock holder preemption (unless > PostgreSQL is doing seriously egregious things). ^ permalink raw reply [flat|nested] 20+ messages in thread
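The knob Qais mentions is a debugfs scheduler tunable. On kernels exposing
the sched debugfs directory the experiment would look roughly like this --
treat it as a sketch, since the exact path and default value vary by kernel
version and config, and writing it requires root:

```shell
# Inspect the current base slice (in nanoseconds); just short of 4ms
# per Qais' estimate.
cat /sys/kernel/debug/sched/base_slice_ns

# Try a longer slice (~6ms) to see whether it re-hides the regression.
echo 6000000 > /sys/kernel/debug/sched/base_slice_ns
```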
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro 2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro 2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra @ 2026-04-05 14:44 ` Mitsumasa KONDO 2026-04-05 16:43 ` Andres Freund 2 siblings, 1 reply; 20+ messages in thread From: Mitsumasa KONDO @ 2026-04-05 14:44 UTC (permalink / raw) To: andres Cc: abuehaze, alisaidi, bigeasy, blakgeof, dipietro.salvatore, dipiets, linux-kernel, mark.rutland, peterz, ritesh.list, tglx, vschneid, kondo.mitsumasa I believe the root cause is the inadequacy of PostgreSQL's arm64 spin_delay() implementation, which PREEMPT_LAZY merely exposed. PostgreSQL's SPIN_DELAY() uses dramatically different instructions per architecture (src/include/storage/s_lock.h): x86_64: rep; nop (PAUSE, ~140 cycles) arm64: isb (pipeline flush, ~10-20 cycles) Under PREEMPT_NONE, lock holders are rarely preempted, so spin duration is short and ISB's lightweight delay is sufficient. Under PREEMPT_LAZY, lock holder preemption becomes more frequent. When this occurs, waiters enter a sustained spin loop. On arm64, ISB provides negligible delay, so the loop runs at near-full speed, hammering the lock cacheline via TAS_SPIN's *(lock) load on every iteration. This generates massive cache coherency traffic that in turn slows the lock holder's execution after rescheduling, creating a feedback loop that escalates on high-core-count systems. On x86_64, PAUSE throttles this loop sufficiently to prevent the feedback loop, which explains why this is not reproducible there. Patching PostgreSQL's arm64 spin_delay() to use WFE instead of ISB should significantly reduce the regression without kernel changes. That said, this change is likely to cause similar breakage in other user-space applications beyond PostgreSQL that rely on lightweight spin loops on arm64. So I agree that the patch to retain PREEMPT_NONE is the right approach. 
At the same time, this is also something that distributions can resolve by patching their default kernel configuration. Regards, -- Mitsumasa Kondo NTT Software Innovation Center ^ permalink raw reply [flat|nested] 20+ messages in thread
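The per-architecture delay primitives under discussion look roughly like the
sketch below -- modeled on, but not copied from, PostgreSQL's
src/include/storage/s_lock.h, with a fallback branch added so it compiles
anywhere. Note that a WFE-based arm64 variant would additionally need the
exclusive monitor armed (e.g. via LDXR on the lock word), so it is not a
drop-in one-line replacement:

```c
/* Sketch of per-architecture spin-delay primitives. PAUSE throttles
 * the spin loop on x86_64; on arm64 the current ISB gives only a much
 * shorter delay, which is the asymmetry described above. */
static inline void spin_delay(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__(" rep; nop \n");   /* PAUSE */
#elif defined(__aarch64__)
    __asm__ __volatile__(" isb \n");        /* instruction sync barrier */
#else
    /* Generic fallback: no delay at all. */
#endif
}

/* Tiny driver so the delay can be exercised in a loop. */
static int spin_a_while(int iters)
{
    for (int i = 0; i < iters; i++)
        spin_delay();
    return iters;
}
```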
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05 14:44 ` Mitsumasa KONDO
@ 2026-04-05 16:43 ` Andres Freund
  2026-04-06 1:46 ` Mitsumasa KONDO
  0 siblings, 1 reply; 20+ messages in thread
From: Andres Freund @ 2026-04-05 16:43 UTC (permalink / raw)
To: Mitsumasa KONDO
Cc: abuehaze, alisaidi, bigeasy, blakgeof, dipietro.salvatore,
  dipiets, linux-kernel, mark.rutland, peterz, ritesh.list, tglx,
  vschneid

Hi,

On 2026-04-05 23:44:25 +0900, Mitsumasa KONDO wrote:
> I believe the root cause is the inadequacy of PostgreSQL's arm64
> spin_delay() implementation, which PREEMPT_LAZY merely exposed.
>
> PostgreSQL's SPIN_DELAY() uses dramatically different instructions
> per architecture (src/include/storage/s_lock.h):
>
>   x86_64: rep; nop (PAUSE, ~140 cycles)
>   arm64:  isb (pipeline flush, ~10-20 cycles)
>
> Under PREEMPT_NONE, lock holders are rarely preempted, so spin
> duration is short and ISB's lightweight delay is sufficient.
>
> Under PREEMPT_LAZY, lock holder preemption becomes more frequent.
> When this occurs, waiters enter a sustained spin loop.

It's not sustained, the spinning just lasts between 10 and 1000 iterations,
after that there's randomized exponential backoff using nanosleep. Which
actually will happen after a smaller number of cycles with a shorter
SPIN_DELAY. In the 4kB workload, nearly all backends are in the exponential
backoff.

If I remove the rep nop on x86-64, the performance of the 4kB pages workload
is basically unaffected, even with PREEMPT_LAZY.

The spinning helps with workloads that are contended for very short amounts
of time. But that's not the case in this workload without huge pages,
instead of low 10s of cycles, we regularly spend a few orders of magnitude
more cycles holding the lock.

That's not to say the arm64 spin delay implementation is good. It just
doesn't seem like it affects this case much.
As hinted at by my neighboring email, I see some performance differences due to PREEMPT_LAZY even when replacing the spinlock with a futex based lock, as long as I use 4kB pages. Which seems like the expected thing? Greetings, Andres Freund ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-05 16:43 ` Andres Freund @ 2026-04-06 1:46 ` Mitsumasa KONDO 0 siblings, 0 replies; 20+ messages in thread From: Mitsumasa KONDO @ 2026-04-06 1:46 UTC (permalink / raw) To: andres; +Cc: linux-kernel, dipiets, peterz, tglx, kondo.mitsumasa Hi Andres, Thank you for testing this. On 2026-04-06, Andres Freund wrote: > It's not sustained, the spinning just lasts a between 10 and 1000 > iterations, after that there's randomized exponential backoff using > nanosleep. > Which actually will happen after a smaller number of cycles of with a > shorter SPIN_DELAY. > If I remove the rep nop on x86-64, the performance of the 4kB pages > workload is basically unaffected, even with PREEMPT_LAZY. The fact that removing rep nop made no difference suggests that the spinlock is not the bottleneck in your environment. Could you share your storage configuration? Salvatore's setup uses 12x 1TB AWS io2 at 32000 IOPS each (384K IOPS total in RAID0), which effectively eliminates WAL fsync as a bottleneck. In a storage-limited environment, changes to spin delay behavior would naturally be invisible because throughput is capped by I/O before spinlock contention becomes material. Also worth noting: Salvatore's environment is an EC2 instance (m8g.24xlarge), not bare metal. Hypervisor-level vCPU scheduling adds another layer on top of PREEMPT_LAZY -- a lock holder can be descheduled not only by the kernel scheduler but also by the hypervisor, and the guest kernel has no visibility into this. This could amplify the regression in ways that are not reproducible on bare-metal systems, regardless of architecture. If you want to isolate the effect of SPIN_DELAY on throughput under PREEMPT_LAZY, I would suggest: 1. Use synchronous_commit = off or unlogged tables to remove I/O from the critical path entirely. 2. Use a read-only workload (pgbench -S) with shared_buffers sized to force buffer eviction contention. 3. 
Run on a high-core-count system with all CPUs saturated under PREEMPT_LAZY.

This should expose the pure impact of spin loop behavior without I/O or WAL
masking the results.

> The spinning helps with workloads that are contended for very short
> amounts of time. But that's not the case in this workload without
> huge pages, instead of low 10s of cycles, we regularly spend a few
> orders of magnitude more cycles holding the lock.

I agree that the 4kB page / huge page difference is significant. But even
when individual spin durations are short, the cumulative effect across
hundreds of backends matters. Small per-iteration overhead in the spin loop,
multiplied by high concurrency, can add up to measurable throughput loss --
an effect that becomes visible only when I/O is not the dominant bottleneck.

Regards,
--
Mitsumasa KONDO
NTT Software Innovation Center

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2026-04-07 11:19 UTC | newest] Thread overview: 20+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro 2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro 2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra 2026-04-04 17:42 ` Andres Freund 2026-04-05 1:40 ` Andres Freund 2026-04-05 4:21 ` Andres Freund 2026-04-05 6:08 ` Ritesh Harjani 2026-04-05 14:09 ` Andres Freund 2026-04-05 14:44 ` Andres Freund 2026-04-07 8:29 ` Peter Zijlstra 2026-04-07 8:27 ` Peter Zijlstra 2026-04-07 10:17 ` David Laight 2026-04-07 8:20 ` Peter Zijlstra 2026-04-07 9:07 ` Peter Zijlstra 2026-04-07 11:19 ` Mark Rutland 2026-04-07 8:49 ` Peter Zijlstra 2026-04-06 0:43 ` Qais Yousef 2026-04-05 14:44 ` Mitsumasa KONDO 2026-04-05 16:43 ` Andres Freund 2026-04-06 1:46 ` Mitsumasa KONDO
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox