* [PATCH 0/1] sched: Restore PREEMPT_NONE as default
@ 2026-04-03 19:19 Salvatore Dipietro
2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:19 UTC (permalink / raw)
To: linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, peterz,
Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior
We are reporting a throughput and latency regression on PostgreSQL
pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851
("sched: Further restrict the preemption modes") introduced in
v7.0-rc1.
The regression manifests as throughput dropping to 0.51x of baseline
on a pgbench simple-update workload with 1024 clients on a 96-vCPU
(AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling
shows 55% of CPU time is consumed spinning in PostgreSQL's
userspace spinlock (s_lock()) under PREEMPT_LAZY:
|- 56.03% - StartReadBuffer
   |- 55.93% - GetVictimBuffer
      |- 55.93% - StrategyGetBuffer
         |- 55.60% - s_lock            <<<< 55% of time
         |  |- 0.39% - el0t_64_irq
         |  |- 0.10% - perform_spin_delay
         |- 0.08% - LockBufHdr
         |- 0.07% - hash_search_with_hash_value
   |- 0.40% - WaitReadBuffers
1. Test environment
___________________
Hardware: 1x AWS EC2 m8g.24xlarge (12x 1TB IO2 32000 iops RAID0 XFS)
OS: AL2023 (ami-03a8d3251f401ffca)
Kernel: next-20260331
Database: PostgreSQL 17
Workload: pgbench simple-update
1024 clients, 96 threads, 1200s duration
scale factor 8470, fillfactor=90, prepared protocol
2. Results
__________
Configuration       Run 1     Run 2      Run 3     Average   vs Base
__________________  ________  _________  ________  ________  _______
BASELINE            47242.39  53369.18   51644.29  50751.96  1.00x
w/ revert           92906.62  103976.03  98814.94  98565.86  1.94x
3. Reproduction
_______________
On the AWS EC2 m8g.24xlarge, install and run the PostgreSQL
database using the repro-collection repository as follows:
# Reproducer code:
git clone https://github.com/aws/repro-collection.git ~/repro-collection
# Setup and start PostgreSQL server in terminal 1:
~/repro-collection/run.sh postgresql SUT --ldg=127.0.0.1
# Run pgbench load generator in terminal 2:
PGBENCH_SCALE=8470 \
PGBENCH_INIT_EXTRA_ARGS="--fillfactor=90" \
PGBENCH_CLIENTS=1024 \
PGBENCH_THREADS=96 \
PGBENCH_DURATION=1200 \
PGBENCH_BUILTIN=simple-update \
PGBENCH_PROTOCOL=prepared \
~/repro-collection/run.sh postgresql LDG --sut=127.0.0.1
Salvatore Dipietro (1):
sched: Restore PREEMPT_NONE as default
kernel/Kconfig.preempt | 3 ---
1 file changed, 3 deletions(-)
base-commit: 9147566d801602c9e7fc7f85e989735735bf38ba
--
2.50.1 (Apple Git-155)
AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico
^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/1] sched: Restore PREEMPT_NONE as default
  2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro
@ 2026-04-03 19:19 ` Salvatore Dipietro
  2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra
  2026-04-05 14:44 ` Mitsumasa KONDO
  2 siblings, 0 replies; 20+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:19 UTC (permalink / raw)
To: linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, peterz,
  Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior

Commit 7dadeaa6e851 ("sched: Further restrict the preemption modes")
changed the default preemption model to PREEMPT_LAZY on architectures
that support it and made PREEMPT_NONE depend on ARCH_NO_PREEMPT.

This drops throughput to 0.51x of baseline on PostgreSQL pgbench
(simple-update) with 1024 clients on a 96-vCPU Graviton4 arm64
system. Perf profiling shows 55% of CPU time spinning in PostgreSQL's
userspace spinlock (s_lock()) under PREEMPT_LAZY.

Restore PREEMPT_NONE as the default preemption model, remove the
ARCH_NO_PREEMPT dependency from PREEMPT_NONE, and remove the
ARCH_HAS_PREEMPT_LAZY restriction from PREEMPT_VOLUNTARY.
Fixes: 7dadeaa6e851 ("sched: Further restrict the preemption modes")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 kernel/Kconfig.preempt | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..da326800c1c9 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -16,13 +16,11 @@ config ARCH_HAS_PREEMPT_LAZY
 
 choice
 	prompt "Preemption Model"
-	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
 	default PREEMPT_NONE
 
 config PREEMPT_NONE
 	bool "No Forced Preemption (Server)"
 	depends on !PREEMPT_RT
-	depends on ARCH_NO_PREEMPT
 	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
 	help
 	  This is the traditional Linux preemption model, geared towards
@@ -37,7 +35,6 @@ config PREEMPT_NONE
 
 config PREEMPT_VOLUNTARY
 	bool "Voluntary Kernel Preemption (Desktop)"
-	depends on !ARCH_HAS_PREEMPT_LAZY
 	depends on !ARCH_NO_PREEMPT
 	depends on !PREEMPT_RT
 	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
-- 
2.50.1 (Apple Git-155)

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro
  2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro
@ 2026-04-03 21:32 ` Peter Zijlstra
  2026-04-04 17:42   ` Andres Freund
  2026-04-06  0:43   ` Qais Yousef
  2026-04-05 14:44 ` Mitsumasa KONDO
  2 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2026-04-03 21:32 UTC (permalink / raw)
To: Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
  Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior,
  Mark Rutland

On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> We are reporting a throughput and latency regression on PostgreSQL
> pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851
> ("sched: Further restrict the preemption modes") introduced in
> v7.0-rc1.
>
> The regression manifests as a 0.51x throughput drop on a pgbench
> simple-update workload with 1024 clients on a 96-vCPU
> (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling
> shows 55% of CPU time is consumed spinning in PostgreSQL's
> userspace spinlock (s_lock()) under PREEMPT_LAZY:
>
> |- 56.03% - StartReadBuffer
>    |- 55.93% - GetVictimBuffer
>       |- 55.93% - StrategyGetBuffer
>          |- 55.60% - s_lock            <<<< 55% of time
>          |  |- 0.39% - el0t_64_irq
>          |  |- 0.10% - perform_spin_delay
>          |- 0.08% - LockBufHdr
>          |- 0.07% - hash_search_with_hash_value
>    |- 0.40% - WaitReadBuffers

The fix here is to make PostgreSQL make use of the rseq slice extension:

  https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de

That should limit the exposure to lock holder preemption (unless
PostgreSQL is doing seriously egregious things).

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra
@ 2026-04-04 17:42   ` Andres Freund
  2026-04-05  1:40     ` Andres Freund
  2026-04-07  8:49     ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Andres Freund @ 2026-04-04 17:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
  dipietro.salvatore, Thomas Gleixner, Valentin Schneider,
  Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> > We are reporting a throughput and latency regression on PostgreSQL
> > pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851
> > ("sched: Further restrict the preemption modes") introduced in
> > v7.0-rc1.
> >
> > The regression manifests as a 0.51x throughput drop on a pgbench
> > simple-update workload with 1024 clients on a 96-vCPU
> > (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling
> > shows 55% of CPU time is consumed spinning in PostgreSQL's
> > userspace spinlock (s_lock()) under PREEMPT_LAZY:
> >
> > |- 56.03% - StartReadBuffer
> >    |- 55.93% - GetVictimBuffer
> >       |- 55.93% - StrategyGetBuffer
> >          |- 55.60% - s_lock            <<<< 55% of time
> >          |  |- 0.39% - el0t_64_irq
> >          |  |- 0.10% - perform_spin_delay
> >          |- 0.08% - LockBufHdr
> >          |- 0.07% - hash_search_with_hash_value
> >    |- 0.40% - WaitReadBuffers
>
> The fix here is to make PostgreSQL make use of rseq slice extension:
>
>   https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de
>
> That should limit the exposure to lock holder preemption (unless
> PostgreSQL is doing seriously egregious things).

Maybe we should, but requiring the use of a new low-level facility that
was introduced in the 7.0 kernel, to address a regression that exists
only in 7.0+, seems not great.

It's not like it's a completely trivial thing to add support for
either, so I doubt backpatching it into already released major versions
of postgres is the right thing.

This specific spinlock doesn't actually exist anymore in postgres'
trunk (feature freeze in a few days, release early autumn). But there
is at least one other one that can often be quite hotly contended
(although there is a relatively low limit on the number of backends
that can acquire it concurrently, which might be the saving grace
here).

I'm not quite sure I understand why the spinlock in Salvatore's
benchmark shows up this heavily:

- For something like the benchmark here, it should only be used until
  postgres' buffer pool is fully used, as the freelist only contains
  buffers not in use, and we check without a lock whether it contains
  buffers. Once running, buffers are only added to the freelist if
  tables/indexes are dropped/truncated. And the benchmark seems like it
  runs long enough that we should actually reach the point where the
  freelist is empty?

- The section covered by the spinlock is only a few instructions long,
  and it is only hit if we have to do a somewhat heavyweight operation
  afterwards (read a page into the buffer pool); it seems surprising
  that this short section gets interrupted frequently enough to cause a
  regression of this magnitude.

  For a moment I thought it might be because, while holding the
  spinlock, some memory is touched for the first time, but that is
  actually not the case.

The benchmark script seems to indicate that huge pages aren't in use:
https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15

I wonder if somehow the pages underlying the portions of postgres'
shared memory are getting paged out for some reason, leading to page
faults while holding the spinlock?

Salvatore, could you repeat that benchmark in some variations?

1) Use huge pages

2) 1) + prewarm the buffer pool before running the benchmark

   CREATE EXTENSION pg_prewarm;

   -- prewarm table data
   SELECT pg_prewarm(oid) FROM pg_class
   WHERE relname LIKE 'pgbench_accounts%' and relkind = 'r';

   -- prewarm indexes, do so after tables, as indexes are more
   -- important, and the buffer pool might not be big enough
   SELECT pg_prewarm(oid) FROM pg_class
   WHERE relname LIKE 'pgbench_accounts%' and relkind = 'i';

I assume postgres was built with either an -march sufficient to use
atomic instructions (i.e. -march=armv8.1-a or such) instead of ll/sc?
Or at least -moutline-atomics was used?

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-04 17:42 ` Andres Freund
@ 2026-04-05  1:40   ` Andres Freund
  2026-04-05  4:21     ` Andres Freund
  2026-04-07  8:49   ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Andres Freund @ 2026-04-05 1:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
  dipietro.salvatore, Thomas Gleixner, Valentin Schneider,
  Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> I'm not quite sure I understand why the spinlock in Salvatore's
> benchmark shows up this heavily:
>
> - For something like the benchmark here, it should only be used until
>   postgres' buffer pool is fully used, as the freelist only contains
>   buffers not in use, and we check without a lock whether it contains
>   buffers. Once running, buffers are only added to the freelist if
>   tables/indexes are dropped/truncated. And the benchmark seems like it
>   runs long enough that we should actually reach the point where the
>   freelist is empty?
>
> - The section covered by the spinlock is only a few instructions long,
>   and it is only hit if we have to do a somewhat heavyweight operation
>   afterwards (read a page into the buffer pool); it seems surprising
>   that this short section gets interrupted frequently enough to cause a
>   regression of this magnitude.
>
>   For a moment I thought it might be because, while holding the
>   spinlock, some memory is touched for the first time, but that is
>   actually not the case.

I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB
of memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48
cores / 96 threads, so it's smaller, and it's x86, not arm, but it's
what I can quickly update to an unreleased kernel.

So far I don't see such a regression, and I basically see no time spent
in GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).

Which I don't find surprising: this workload doesn't read enough to
have contention in there. Salvatore reported on the order of 100k
transactions/sec (with one update, one read and one insert). Even if
just about all of those were misses - and they shouldn't be with 25% of
384G as postgres' shared_buffers as the script indicates, and we know
that s_b is not full due to even hitting GetVictimBuffer() - that'd be
just ~200k IOs/sec from the page cache. That's not that much.

Now, this machine is smaller and a different arch, so who knows. The
7.0-rc numbers I am getting are higher than what Salvatore reported on
a bigger machine. It's hard to compare though, as I am testing with
local storage, and this workload should be extremely write-latency
bound (but my storage has crappy fsync latency, so ...).

I *do* see some contention where it's conceivable that rseq slice
extension could help some, but

a) It's a completely different lock: the WALWrite lock, which is
   precisely the lock you'd expect in a commit-latency-bound workload
   with a lot of clients (the lock is used to wait for an in-flight WAL
   flush to complete).

b) So far I have not observed a regression from 6.18.

For me a profile looks like this:

- 60.99% 0.95% postgres postgres [.] PostgresMain
   - 60.04% PostgresMain
      - 22.57% PortalRun
         - 20.88% PortalRunMulti
            - 16.70% standard_ExecutorRun
               - 16.55% ExecModifyTable
                  + 10.78% ExecScan
                  + 3.19% ExecUpdate
                  + 1.53% ExecInsert
            + 2.94% standard_ExecutorStart
              0.54% standard_ExecutorEnd
         + 1.60% PortalRunSelect
      - 15.89% CommitTransactionCommand
         - 15.50% CommitTransaction
            - 11.90% XLogFlush
               - 7.66% LWLockAcquireOrWait
                    6.70% LWLockQueueSelf
                 0.57% perform_spin_delay

Which is about what I would expect.

Salvatore, is there a chance your profile is corrupted and you did
observe contention, but on a different lock? E.g. due to out-of-date
debug symbols or such?

Could you run something like the following while the benchmark is
running:

SELECT backend_type, wait_event_type, wait_event, state, count(*)
FROM pg_stat_activity
WHERE wait_event_type NOT IN ('Activity')
GROUP BY backend_type, wait_event_type, wait_event, state
ORDER BY count(*) DESC
\watch 1

and show what you see at the time your profile shows the bad
contention?

On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > The fix here is to make PostgreSQL make use of rseq slice extension:
> >
> >   https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de
> >
> > That should limit the exposure to lock holder preemption (unless
> > PostgreSQL is doing seriously egregious things).
>
> Maybe we should, but requiring the use of a new low level facility
> that was introduced in the 7.0 kernel, to address a regression that
> exists only in 7.0+, seems not great.
>
> It's not like it's a completely trivial thing to add support for
> either, so I doubt it'll be the right thing to backpatch it into
> already released major versions of postgres.

It's not even suggested to be enabled by default:

  CONFIG_RSEQ_SLICE_EXTENSION:

  Allows userspace to request a limited time slice extension when
  returning from an interrupt to user space via the RSEQ shared data
  ABI. If granted, that allows to complete a critical section, so that
  other threads are not stuck on a conflicted resource, while the task
  is scheduled out.

  If unsure, say N.

And enabling it requires EXPERT=1.

If this somehow does end up being a reproducible performance issue (I
still suspect something more complicated is going on), I don't see how
userspace could be expected to mitigate a substantial perf regression
in 7.0 that can only be mitigated by a default-off, non-trivial
functionality also introduced in 7.0.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05  1:40 ` Andres Freund
@ 2026-04-05  4:21   ` Andres Freund
  2026-04-05  6:08     ` Ritesh Harjani
  2026-04-07 11:19     ` Mark Rutland
  0 siblings, 2 replies; 20+ messages in thread
From: Andres Freund @ 2026-04-05 4:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
  dipietro.salvatore, Thomas Gleixner, Valentin Schneider,
  Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> > I'm not quite sure I understand why the spinlock in Salvatore's
> > benchmark shows up this heavily:
> >
> > - For something like the benchmark here, it should only be used until
> >   postgres' buffer pool is fully used, as the freelist only contains
> >   buffers not in use, and we check without a lock whether it contains
> >   buffers. Once running, buffers are only added to the freelist if
> >   tables/indexes are dropped/truncated. And the benchmark seems like
> >   it runs long enough that we should actually reach the point where
> >   the freelist is empty?
> >
> > - The section covered by the spinlock is only a few instructions
> >   long, and it is only hit if we have to do a somewhat heavyweight
> >   operation afterwards (read a page into the buffer pool); it seems
> >   surprising that this short section gets interrupted frequently
> >   enough to cause a regression of this magnitude.
> >
> >   For a moment I thought it might be because, while holding the
> >   spinlock, some memory is touched for the first time, but that is
> >   actually not the case.
>
> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB
> of memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48
> cores / 96 threads, so it's smaller, and it's x86, not arm, but it's
> what I can quickly update to an unreleased kernel.
>
> So far I don't see such a regression, and I basically see no time
> spent in GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>
> Which I don't find surprising: this workload doesn't read enough to
> have contention in there. Salvatore reported on the order of 100k
> transactions/sec (with one update, one read and one insert). Even if
> just about all of those were misses - and they shouldn't be with 25%
> of 384G as postgres' shared_buffers as the script indicates, and we
> know that s_b is not full due to even hitting GetVictimBuffer() -
> that'd be just ~200k IOs/sec from the page cache. That's not that
> much.

> The benchmark script seems to indicate that huge pages aren't in use:
> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>
> I wonder if somehow the pages underlying the portions of postgres'
> shared memory are getting paged out for some reason, leading to page
> faults while holding the spinlock?

Hah. I had reflexively used huge_pages=on - as that is the only sane
thing to do with 10s to 100s of GB of shared memory and thus part of
all my benchmarking infrastructure - during the benchmark runs
mentioned above.

Turns out, if I *disable* huge pages, I actually can reproduce the
contention that Salvatore reported (I didn't check whether it's a
regression for me, though). Not anywhere close to the same degree,
because the bottleneck for me is the writes.

If I change the workload to a read-only benchmark, which obviously
reads a lot more due to not being bottlenecked by durable-write
latency, I see more contention:

- 12.76% postgres postgres [.] s_lock
   - 12.75% s_lock
      - 12.69% StrategyGetBuffer
           GetVictimBuffer
         - StartReadBuffer
            - 12.69% ReleaseAndReadBuffer
               + 12.65% heapam_index_fetch_tuple

While what I said above is true - the memory touched at the time of
contention isn't the first access to the relevant shared memory
(i.e. it is already backed by memory) - in this workload
GetVictimBuffer()->StrategyGetBuffer() will be the first access of the
connection processes to the relevant 4kB pages.

Thus there will be a *lot* of minor faults and TLB misses while holding
a spinlock. Unsurprisingly that's bad for performance.

I don't see a reason to particularly care about the regression if
that's the sole way to trigger it. Using a buffer pool of ~100GB
without huge pages is not an interesting workload. With a smaller
buffer pool the problem would not happen either.

Note that the performance effect of not using huge pages is terrible
*regardless* of the spinlock. PG 19 does not have the spinlock in this
path anymore, but not using huge pages is still utterly terrible (like
1/3 of the throughput).

I did run some benchmarks here and I don't see a clearly reproducible
regression with huge pages.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05  4:21 ` Andres Freund
@ 2026-04-05  6:08   ` Ritesh Harjani
  2026-04-05 14:09     ` Andres Freund
  2026-04-07  8:20     ` Peter Zijlstra
  1 sibling, 2 replies; 20+ messages in thread
From: Ritesh Harjani @ 2026-04-05 6:08 UTC (permalink / raw)
To: Andres Freund, Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
  dipietro.salvatore, Thomas Gleixner, Valentin Schneider,
  Sebastian Andrzej Siewior, Mark Rutland

Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
>> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
>> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
>> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
>> > I'm not quite sure I understand why the spinlock in Salvatore's
>> > benchmark shows up this heavily:
>> >
>> > - For something like the benchmark here, it should only be used
>> >   until postgres' buffer pool is fully used, as the freelist only
>> >   contains buffers not in use, and we check without a lock whether
>> >   it contains buffers. Once running, buffers are only added to the
>> >   freelist if tables/indexes are dropped/truncated. And the
>> >   benchmark seems like it runs long enough that we should actually
>> >   reach the point where the freelist is empty?
>> >
>> > - The section covered by the spinlock is only a few instructions
>> >   long, and it is only hit if we have to do a somewhat heavyweight
>> >   operation afterwards (read a page into the buffer pool); it seems
>> >   surprising that this short section gets interrupted frequently
>> >   enough to cause a regression of this magnitude.
>> >
>> >   For a moment I thought it might be because, while holding the
>> >   spinlock, some memory is touched for the first time, but that is
>> >   actually not the case.
>>
>> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB
>> of memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48
>> cores / 96 threads, so it's smaller, and it's x86, not arm, but it's
>> what I can quickly update to an unreleased kernel.
>>
>> So far I don't see such a regression, and I basically see no time
>> spent in GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>>
>> Which I don't find surprising: this workload doesn't read enough to
>> have contention in there. Salvatore reported on the order of 100k
>> transactions/sec (with one update, one read and one insert). Even if
>> just about all of those were misses - and they shouldn't be with 25%
>> of 384G as postgres' shared_buffers as the script indicates, and we
>> know that s_b is not full due to even hitting GetVictimBuffer() -
>> that'd be just ~200k IOs/sec from the page cache. That's not that
>> much.
>
>> The benchmark script seems to indicate that huge pages aren't in use:
>> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>>
>> I wonder if somehow the pages underlying the portions of postgres'
>> shared memory are getting paged out for some reason, leading to page
>> faults while holding the spinlock?
>
> Hah. I had reflexively used huge_pages=on - as that is the only sane
> thing to do with 10s to 100s of GB of shared memory and thus part of
> all my benchmarking infrastructure - during the benchmark runs
> mentioned above.
>
> Turns out, if I *disable* huge pages, I actually can reproduce the
> contention that Salvatore reported (didn't check whether it's a
> regression for me though). Not anywhere close to the same degree,
> because the bottleneck for me is the writes.
>
> If I change the workload to a read-only benchmark, which obviously
> reads a lot more due to not being bottlenecked by durable-write
> latency, I see more contention:
>
> - 12.76% postgres postgres [.] s_lock
>    - 12.75% s_lock
>       - 12.69% StrategyGetBuffer
>            GetVictimBuffer
>          - StartReadBuffer
>             - 12.69% ReleaseAndReadBuffer
>                + 12.65% heapam_index_fetch_tuple
>
> While what I said above is true - the memory touched at the time of
> contention isn't the first access to the relevant shared memory
> (i.e. it is already backed by memory) - in this workload
> GetVictimBuffer()->StrategyGetBuffer() will be the first access of the
> connection processes to the relevant 4kB pages.
>
> Thus there will be a *lot* of minor faults and TLB misses while
> holding a spinlock. Unsurprisingly that's bad for performance.
>
> I don't see a reason to particularly care about the regression if
> that's the sole way to trigger it. Using a buffer pool of ~100GB
> without huge pages is not an interesting workload. With a smaller
> buffer pool the problem would not happen either.
>
> Note that the performance effect of not using huge pages is terrible
> *regardless* of the spinlock. PG 19 does not have the spinlock in this
> path anymore, but not using huge pages is still utterly terrible (like
> 1/3 of the throughput).
>
> I did run some benchmarks here and I don't see a clearly reproducible
> regression with huge pages.

However, out of curiosity, I was hoping someone more familiar with the
scheduler area could explain why PREEMPT_LAZY vs PREEMPT_NONE causes a
performance regression without huge pages.

Minor page fault handling has microsecond latency, whereas sched ticks
are in milliseconds. Besides, both preemption models should anyway
schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
So I was curious to understand how the preemption model causes a
performance regression with no huge pages in this case.

-ritesh

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05  6:08 ` Ritesh Harjani
@ 2026-04-05 14:09   ` Andres Freund
  2026-04-05 14:44     ` Andres Freund
                       ` (2 more replies)
  2026-04-07  8:20   ` Peter Zijlstra
  1 sibling, 3 replies; 20+ messages in thread
From: Andres Freund @ 2026-04-05 14:09 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Peter Zijlstra, Salvatore Dipietro, linux-kernel, alisaidi,
  blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner,
  Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-05 11:38:59 +0530, Ritesh Harjani wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Hah. I had reflexively used huge_pages=on - as that is the only sane
> > thing to do with 10s to 100s of GB of shared memory and thus part of
> > all my benchmarking infrastructure - during the benchmark runs
> > mentioned above.
> >
> > Turns out, if I *disable* huge pages, I actually can reproduce the
> > contention that Salvatore reported (didn't check whether it's a
> > regression for me though). Not anywhere close to the same degree,
> > because the bottleneck for me is the writes.
> >
> > If I change the workload to a read-only benchmark, which obviously
> > reads a lot more due to not being bottlenecked by durable-write
> > latency, I see more contention:
> >
> > - 12.76% postgres postgres [.] s_lock
> >    - 12.75% s_lock
> >       - 12.69% StrategyGetBuffer
> >            GetVictimBuffer
> >          - StartReadBuffer
> >             - 12.69% ReleaseAndReadBuffer
> >                + 12.65% heapam_index_fetch_tuple
> >
> > While what I said above is true - the memory touched at the time of
> > contention isn't the first access to the relevant shared memory
> > (i.e. it is already backed by memory) - in this workload
> > GetVictimBuffer()->StrategyGetBuffer() will be the first access of
> > the connection processes to the relevant 4kB pages.
> >
> > Thus there will be a *lot* of minor faults and TLB misses while
> > holding a spinlock. Unsurprisingly that's bad for performance.
> >
> > I don't see a reason to particularly care about the regression if
> > that's the sole way to trigger it. Using a buffer pool of ~100GB
> > without huge pages is not an interesting workload. With a smaller
> > buffer pool the problem would not happen either.
> >
> > Note that the performance effect of not using huge pages is terrible
> > *regardless* of the spinlock. PG 19 does not have the spinlock in
> > this path anymore, but not using huge pages is still utterly
> > terrible (like 1/3 of the throughput).
> >
> > I did run some benchmarks here and I don't see a clearly
> > reproducible regression with huge pages.
>
> However, out of curiosity, I was hoping someone more familiar with the
> scheduler area could explain why PREEMPT_LAZY vs PREEMPT_NONE causes a
> performance regression without huge pages.
>
> Minor page fault handling has microsecond latency, whereas sched ticks
> are in milliseconds. Besides, both preemption models should anyway
> schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
> So I was curious to understand how the preemption model causes a
> performance regression with no huge pages in this case.

An attempt at answering that, albeit not from the angle of somebody who
knows the scheduler code to a meaningful degree:

I think the effect of 4kB pages (and the associated minor faults and
TLB misses) is just to create contention on a spinlock that would
normally never be contended, due to the faults occurring while the
spinlock is held in this quite extreme workload.

This contention happens with PREEMPT_NONE as well - the performance is
quite bad compared to when using huge pages. My guess is PREEMPT_LAZY
just exacerbates the terrible contention by scheduling out the lock
holder more often. But you're already in deep trouble at this point,
even without PREEMPT_LAZY making it "worse".

On my machine (smaller than Salvatore's) PREEMPT_LAZY is worse, but not
by that much. I suspect for Salvatore PREEMPT_LAZY just made already
terrible contention worse. I think we really need a comparison run from
Salvatore with huge pages with both PREEMPT_NONE and PREEMPT_LAZY.

FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock"
part has just about nothing to do with the problem. My understanding is
that default futexes don't transfer the lock waiter's scheduler slice
to the lock holder (there's no information about who the lock holder is
unless it's a PI futex). Postgres' spinlocks have randomized
exponential backoff and the amount of spinning is adjusted over time,
so you don't actually end up with spinlock waiters preventing the lock
owner from getting scheduled to a significant degree.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05 14:09 ` Andres Freund
@ 2026-04-05 14:44 ` Andres Freund
  2026-04-07 8:29 ` Peter Zijlstra
  2026-04-07 8:27 ` Peter Zijlstra
  2026-04-07 10:17 ` David Laight
  2 siblings, 1 reply; 20+ messages in thread
From: Andres Freund @ 2026-04-05 14:44 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Peter Zijlstra, Salvatore Dipietro, linux-kernel, alisaidi,
  blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner,
  Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland

Hi,

On 2026-04-05 10:09:35 -0400, Andres Freund wrote:
> FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock" part
> has just about nothing to do with the problem. My understanding is that
> default futexes don't transfer the lock waiter's scheduler slice to the lock
> holder (there's no information about who the lock holder is unless it's a PI
> futex), Postgres' spinlock have randomized exponential backoff and the amount
> of spinning is adjusted over time, so you don't actually end up with spinlock
> waiters preventing the lock owner from getting scheduled to a significant
> degree.

Confirmed, a hack moving this to a futex based lock has very similar
performance, including somewhat worse performance with PREEMPT_LAZY than
PREEMPT_NONE. Using futexes has a bit lower throughput but also reduces CPU
usage a bit for the same amount of work, which is about what you'd expect.

Just to re-emphasize: That is just due to using 4kB pages in a huge buffer
pool and would vanish once running for a bit longer or when using a smaller
buffer pool.

I'll look at trying out the rseq slice extension in a bit, it looks like it'd
be nice for performance regardless of using spinlocks.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 20+ messages in thread
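A futex-based lock of the sort such a hack would presumably resemble is the
well-known three-state scheme from Ulrich Drepper's "Futexes Are Tricky" --
sketched below as an assumption, not as Andres' actual patch:

```c
/* Sketch of a three-state futex mutex (0 = free, 1 = locked,
 * 2 = locked with waiters), after Drepper's "Futexes Are Tricky".
 * Linux-specific; not the actual PostgreSQL hack discussed above. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static long sys_futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void flock_acquire(atomic_int *f)
{
    int c = 0;
    /* Fast path: uncontended 0 -> 1 transition, no syscall at all. */
    if (atomic_compare_exchange_strong(f, &c, 1))
        return;
    /* Slow path: mark the lock contended (2) and sleep until woken. */
    if (c != 2)
        c = atomic_exchange(f, 2);
    while (c != 0) {
        sys_futex(f, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(f, 2);
    }
}

static void flock_release(atomic_int *f)
{
    /* Only enter the kernel if someone may be sleeping. */
    if (atomic_exchange(f, 0) == 2)
        sys_futex(f, FUTEX_WAKE_PRIVATE, 1);
}
```

Note the point made upthread: nothing in this word tells the kernel who the
owner is, so a preempted holder gets no slice transferred from its waiters.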
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-05 14:44 ` Andres Freund @ 2026-04-07 8:29 ` Peter Zijlstra 0 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2026-04-07 8:29 UTC (permalink / raw) To: Andres Freund Cc: Ritesh Harjani, Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On Sun, Apr 05, 2026 at 10:44:22AM -0400, Andres Freund wrote: > I'll look at trying out the rseq slice extension in a bit, it looks like it > nice for performance regardless of using spinlocks. Thanks! It seems to work well for the Oracle folks, please let us know if you have comments. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-05 14:09 ` Andres Freund 2026-04-05 14:44 ` Andres Freund @ 2026-04-07 8:27 ` Peter Zijlstra 2026-04-07 10:17 ` David Laight 2 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2026-04-07 8:27 UTC (permalink / raw) To: Andres Freund Cc: Ritesh Harjani, Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On Sun, Apr 05, 2026 at 10:09:35AM -0400, Andres Freund wrote: > FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock" part > has just about nothing to do with the problem. :-) > My understanding is that > default futexes don't transfer the lock waiter's scheduler slice to the lock > holder (there's no information about who the lock holder is unless it's a PI > futex), (and robust; both PI and robust store the owner TID in the futex field) The difference is that while mutex lock holder preemption is also bad, it mostly just leads to idle time, which you can sometimes fill with doing other work. Whereas with spinlocks, the time gets soaked up with spinners. So both 'bad', but both different. > Postgres' spinlock have randomized exponential backoff and the amount > of spinning is adjusted over time, so you don't actually end up with spinlock > waiters preventing the lock owner from getting scheduled to a significant > degree. Fair enough. ^ permalink raw reply [flat|nested] 20+ messages in thread
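The owner-TID convention Peter refers to for PI (and robust) futexes can be
sketched as follows. This is a hedged illustration of the protocol documented
in futex(2) -- error handling is omitted, and the kernel-mediated contended
paths are only exercised with multiple threads:

```c
/* Sketch of the PI-futex convention: the futex word holds the owner's
 * TID, so on contention the kernel knows which task to priority-boost.
 * Illustration of the futex(2) protocol only; Linux-specific. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static int my_tid(void)
{
    return (int)syscall(SYS_gettid);
}

static void pi_lock(atomic_int *f)
{
    int expected = 0;
    /* Uncontended fast path: store our TID, kernel not involved. */
    if (atomic_compare_exchange_strong(f, &expected, my_tid()))
        return;
    /* Contended: the kernel queues us and can boost the owner
     * identified by the TID stored in *f. */
    syscall(SYS_futex, f, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void pi_unlock(atomic_int *f)
{
    int expected = my_tid();
    /* The fast path fails if the kernel set FUTEX_WAITERS in the word. */
    if (atomic_compare_exchange_strong(f, &expected, 0))
        return;
    syscall(SYS_futex, f, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

This is exactly the information a plain futex lacks, which is why, as Peter
says, a plain-futex holder's preemption mostly shows up as idle time.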
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05 14:09 ` Andres Freund
  2026-04-05 14:44 ` Andres Freund
  2026-04-07 8:27 ` Peter Zijlstra
@ 2026-04-07 10:17 ` David Laight
  2 siblings, 0 replies; 20+ messages in thread
From: David Laight @ 2026-04-07 10:17 UTC (permalink / raw)
To: Andres Freund
Cc: Ritesh Harjani, Peter Zijlstra, Salvatore Dipietro, linux-kernel,
  alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner,
  Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland

On Sun, 5 Apr 2026 10:09:35 -0400
Andres Freund <andres@anarazel.de> wrote:
...
> FWIW, from what I can tell, the whole "WHAA, it's a userspace spinlock" part
> has just about nothing to do with the problem. My understanding is that
> default futexes don't transfer the lock waiter's scheduler slice to the lock
> holder (there's no information about who the lock holder is unless it's a PI
> futex), Postgres' spinlock have randomized exponential backoff and the amount
> of spinning is adjusted over time, so you don't actually end up with spinlock
> waiters preventing the lock owner from getting scheduled to a significant
> degree.

Another problem (which also affects futex) is that the lock holder can get
preempted (or just stolen by network interrupts and softint code). So even if
the lock is only held for a few instructions, sometimes that can take several
milliseconds.

Pretty much the only way to avoid that is changing the code to be lockless.
(eg changing linked lists to arrays and using atomic_increment() on an index.)

A much more subtle effect can also affect performance - especially of a
single-threaded cpu-intensive program. The clock speed of the busy cpu will
be boosted (perhaps to 5GHz). Then, when a lot of other higher priority
processes are all scheduled at the same time, the cpu-intensive program is
preempted.
When the high priority processes sleep (perhaps almost immediately) the
cpu-intensive program is scheduled on a 'random' cpu that has been idle, so it
initially runs at (say) 800MHz, before being boosted again.

That effect increased the compile time for an fpga image from ~10 minutes to
~20 minutes when the renaming of a kernel config option caused one of the
mitigations that significantly slow down system call entry/exit to slow down
an idle daemon that woke threads every 10ms (to process RTP audio) so that
all the threads were active at the same time.

David

>
> Greetings,
>
> Andres Freund
>

^ permalink raw reply	[flat|nested] 20+ messages in thread
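The lockless pattern David suggests -- replacing a lock-protected list with
an array plus an atomically incremented index -- might look like the sketch
below. The names are hypothetical, and a real design would handle wrap-around
or flushing instead of just reporting "full":

```c
/* Sketch of the lockless pattern: producers claim a slot with an
 * atomic fetch-add on an index instead of taking a lock, so there is
 * no critical section for preemption or interrupts to stretch out. */
#include <stdatomic.h>
#include <stddef.h>

#define EVQ_SLOTS 1024

struct evq {
    atomic_size_t next;          /* next free slot index */
    int slot[EVQ_SLOTS];
};

/* Returns 0 on success, -1 if the array is full. */
static int evq_push(struct evq *q, int value)
{
    size_t i = atomic_fetch_add_explicit(&q->next, 1,
                                         memory_order_relaxed);
    if (i >= EVQ_SLOTS)
        return -1;               /* full; real code would wrap or flush */
    q->slot[i] = value;          /* slot i now belongs to us exclusively */
    return 0;
}
```

Each producer gets a distinct slot from the fetch-add, so a producer that is
preempted mid-write delays only its own entry, not every other producer.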
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05 6:08 ` Ritesh Harjani
  2026-04-05 14:09 ` Andres Freund
@ 2026-04-07 8:20 ` Peter Zijlstra
  2026-04-07 9:07 ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2026-04-07 8:20 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Andres Freund, Salvatore Dipietro, linux-kernel, alisaidi,
  blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner,
  Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland

On Sun, Apr 05, 2026 at 11:38:59AM +0530, Ritesh Harjani wrote:

> However, for curiosity, I was hoping if someone more familiar with the
> scheduler area can explain why PREEMPT_LAZY v/s PREEMPT_NONE, causes
> performance regression w/o huge pages?
>
> Minor page fault handling has micro-secs latency, where as sched ticks
> is in milli-secs. Besides, both preemption models should anyway
> schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
>
> So was curious to understand how is the preemption model causing
> performance regression with no hugepages in this case?

So yes, everything can schedule on return-to-user (very much including
NONE). Which is why rseq slice ext is heavily recommended for anything
attempting user space spinlocks.

The thing where the other preemption modes differ is the scheduling while in
kernel mode. So if the workload is spending significant time in the kernel,
this could cause more scheduling.

As you already mentioned, no huge pages gives us more overhead on #PF (and
TLB miss, but that's mostly hidden in access latency rather than immediate
system time). This gives more system time, and more room to schedule.

If we get preempted in the middle of a #PF, rather than finishing it, this
increases the #PF completion time and if userspace is trying to access this
page concurrently.... But we should see that in mmap_lock contention/idle
time :/

I'm not sure I can explain any of this.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-07 8:20 ` Peter Zijlstra @ 2026-04-07 9:07 ` Peter Zijlstra 0 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2026-04-07 9:07 UTC (permalink / raw) To: Ritesh Harjani Cc: Andres Freund, Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On Tue, Apr 07, 2026 at 10:20:18AM +0200, Peter Zijlstra wrote: > On Sun, Apr 05, 2026 at 11:38:59AM +0530, Ritesh Harjani wrote: > > > However, for curiosity, I was hoping if someone more familiar with the > > scheduler area can explain why PREEMPT_LAZY v/s PREEMPT_NONE, causes > > performance regression w/o huge pages? > > > > Minor page fault handling has micro-secs latency, where as sched ticks > > is in milli-secs. Besides, both preemption models should anyway > > schedule() if TIF_NEED_RESCHED is set on return to userspace, right? > > > > So was curious to understand how is the preemption model causing > > performance regression with no hugepages in this case? > > So yes, everything can schedule on return-to-user (very much including > NONE). Which is why rseq slice ext is heavily recommended for anything > attempting user space spinlocks. > > The thing where the other preemption modes differ is the scheduling > while in kernel mode. So if the workload is spending significant time in > the kernel, this could cause more scheduling. > > As you already mentioned, no huge pages, gives us more overhead on #PF > (and TLB miss, but that's mostly hidden in access latency rather than > immediate system time). This gives more system time, and more room to > schedule. > > If we get preempted in the middle of a #PF, rather than finishing it, > this increases the #PF completion time and if userspace is trying to > access this page concurrently.... 
But we should see that in mmap_lock > contention/idle time :/ Sorry, insufficient wake-up juice applied. Concurrent page-faults are serialized on the page-table (spin) locks. Not mmap_lock. So it would increase system time and give more rise to kernel preemption. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-05 4:21 ` Andres Freund 2026-04-05 6:08 ` Ritesh Harjani @ 2026-04-07 11:19 ` Mark Rutland 1 sibling, 0 replies; 20+ messages in thread From: Mark Rutland @ 2026-04-07 11:19 UTC (permalink / raw) To: Salvatore Dipietro Cc: Andres Freund, Peter Zijlstra, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior On Sun, Apr 05, 2026 at 12:21:55AM -0400, Andres Freund wrote: > On 2026-04-04 21:40:29 -0400, Andres Freund wrote: > > On 2026-04-04 13:42:22 -0400, Andres Freund wrote: > > The benchmark script seems to indicate that huge pages aren't in use: > > https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15 For the benefit of those reading mail without a browser, the line in question is: ${PG_HUGE_PAGES:=off} # off, try, on Per the PostgreSQL 17 documentation: https://www.postgresql.org/docs/17/runtime-config-resource.html#GUC-HUGE-PAGES ... the default is 'try', though IIUC some additional system configuration may be necessary, to actually reserve huge pages, which is also documented: https://www.postgresql.org/docs/17/kernel-resources.html#LINUX-HUGE-PAGES > > I wonder if somehow the pages underlying the portions of postgres' shared > > memory are getting paged out for some reason, leading to page faults while > > holding the spinlock? > > Hah. I had reflexively used huge_pages=on - as that is the only sane thing to > do with 10s to 100s of GB of shared memory and thus part of all my > benchmarking infrastructure - during the benchmark runs mentioned above. Salvatore, was there a specific reason to test with PG_HUGE_PAGES=off rather than PG_HUGE_PAGES=try? Was that arbitrary (e.g. because it was the first of the possible options)? 
IIUC from what Andres says here (and in other mails in this thread), that's not a sensible/realistic configuration for this sort of workload, and is the root cause of the contention (which seems to be exacerbated by the scheduler model change). As Andres noted, even ignoring the scheduler model, running with PG_HUGE_PAGES=off results in a substantial performance penalty: > *regardless* the spinlock. PG 19 does have the spinlock in this path anymore, > but not using huge pages is still utterly terrible (like 1/3 of the > throughput). > > I did run some benchmarks here and I don't see a clearly reproducible > regression with huge pages. Is the PG_HUGE_PAGES=off configuration important to you for some reason? Mark. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-04 17:42 ` Andres Freund 2026-04-05 1:40 ` Andres Freund @ 2026-04-07 8:49 ` Peter Zijlstra 1 sibling, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2026-04-07 8:49 UTC (permalink / raw) To: Andres Freund Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On Sat, Apr 04, 2026 at 01:42:22PM -0400, Andres Freund wrote: > Hi, > > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote: > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote: > > > We are reporting a throughput and latency regression on PostgreSQL > > > pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851 > > > ("sched: Further restrict the preemption modes") introduced in > > > v7.0-rc1. > > > > > > The regression manifests as a 0.51x throughput drop on a pgbench > > > simple-update workload with 1024 clients on a 96-vCPU > > > (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling > > > shows 55% of CPU time is consumed spinning in PostgreSQL's > > > userspace spinlock (s_lock()) under PREEMPT_LAZY: > > > > > > |- 56.03% - StartReadBuffer > > > |- 55.93% - GetVictimBuffer > > > |- 55.93% - StrategyGetBuffer > > > |- 55.60% - s_lock <<<< 55% of time > > > | |- 0.39% - el0t_64_irq > > > | |- 0.10% - perform_spin_delay > > > |- 0.08% - LockBufHdr > > > |- 0.07% - hash_search_with_hash_value > > > |- 0.40% - WaitReadBuffers > > > > The fix here is to make PostgreSQL make use of rseq slice extension: > > > > https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de > > > > That should limit the exposure to lock holder preemption (unless > > PostgreSQL is doing seriously egregious things). > > Maybe we should, but requiring the use of a new low level facility that was > introduced in the 7.0 kernel, to address a regression that exists only in > 7.0+, seems not great. 
>
> It's not like it's a completely trivial thing to add support for either, so I
> doubt it'll be the right thing to backpatch it into already released major
> versions of postgres.

Just to clarify my response: all I really saw was 'userspace spinlock' and we
just did the rseq slice ext stuff (with Oracle) for exactly this type of
thing. And even NONE is susceptible to scheduling the lock holder.

It was also the last email I did on Good Friday and thinking hard really
wasn't high on the list of things :-)

Anyway, IF we revert -- and I think you've already made a fine case for not
doing that -- it will be a very temporary thing, NONE will go away.

As to the kernel version thing: why should people upgrade to the very latest
kernel release and not also be expected to upgrade PostgreSQL to the very
latest? If they want to use old PostgreSQL, they can use old kernel too,
right? Both have stable releases that should keep them afloat for a while.

Again, not saying we can't do better, but also sometimes you have to break
eggs to make cake :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra 2026-04-04 17:42 ` Andres Freund @ 2026-04-06 0:43 ` Qais Yousef 1 sibling, 0 replies; 20+ messages in thread From: Qais Yousef @ 2026-04-06 0:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior, Mark Rutland On 04/03/26 23:32, Peter Zijlstra wrote: > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote: > > We are reporting a throughput and latency regression on PostgreSQL > > pgbench (simple-update) on arm64 caused by commit 7dadeaa6e851 > > ("sched: Further restrict the preemption modes") introduced in > > v7.0-rc1. > > > > The regression manifests as a 0.51x throughput drop on a pgbench > > simple-update workload with 1024 clients on a 96-vCPU > > (AWS EC2 m8g.24xlarge) Graviton4 arm64 system. Perf profiling > > shows 55% of CPU time is consumed spinning in PostgreSQL's > > userspace spinlock (s_lock()) under PREEMPT_LAZY: > > > > |- 56.03% - StartReadBuffer > > |- 55.93% - GetVictimBuffer > > |- 55.93% - StrategyGetBuffer > > |- 55.60% - s_lock <<<< 55% of time > > | |- 0.39% - el0t_64_irq > > | |- 0.10% - perform_spin_delay > > |- 0.08% - LockBufHdr > > |- 0.07% - hash_search_with_hash_value > > |- 0.40% - WaitReadBuffers > > The fix here is to make PostgreSQL make use of rseq slice extension: Or perhaps use a longer base_slice_ns in debugfs? I think we end up just short of 4ms in most systems now. 5 or 6 ms might help re-hide it. > > https://lkml.kernel.org/r/20251215155615.870031952@linutronix.de > > That should limit the exposure to lock holder preemption (unless > PostgreSQL is doing seriously egregious things). ^ permalink raw reply [flat|nested] 20+ messages in thread
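The knob Qais mentions is a debugfs scheduler tunable. On kernels exposing
the sched debugfs directory the experiment would look roughly like this --
treat it as a sketch, since the exact path and default value vary by kernel
version and config, and writing it requires root:

```shell
# Inspect the current base slice (in nanoseconds); just short of 4ms
# per Qais' estimate.
cat /sys/kernel/debug/sched/base_slice_ns

# Try a longer slice (~6ms) to see whether it re-hides the regression.
echo 6000000 > /sys/kernel/debug/sched/base_slice_ns
```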
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro 2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro 2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra @ 2026-04-05 14:44 ` Mitsumasa KONDO 2026-04-05 16:43 ` Andres Freund 2 siblings, 1 reply; 20+ messages in thread From: Mitsumasa KONDO @ 2026-04-05 14:44 UTC (permalink / raw) To: andres Cc: abuehaze, alisaidi, bigeasy, blakgeof, dipietro.salvatore, dipiets, linux-kernel, mark.rutland, peterz, ritesh.list, tglx, vschneid, kondo.mitsumasa I believe the root cause is the inadequacy of PostgreSQL's arm64 spin_delay() implementation, which PREEMPT_LAZY merely exposed. PostgreSQL's SPIN_DELAY() uses dramatically different instructions per architecture (src/include/storage/s_lock.h): x86_64: rep; nop (PAUSE, ~140 cycles) arm64: isb (pipeline flush, ~10-20 cycles) Under PREEMPT_NONE, lock holders are rarely preempted, so spin duration is short and ISB's lightweight delay is sufficient. Under PREEMPT_LAZY, lock holder preemption becomes more frequent. When this occurs, waiters enter a sustained spin loop. On arm64, ISB provides negligible delay, so the loop runs at near-full speed, hammering the lock cacheline via TAS_SPIN's *(lock) load on every iteration. This generates massive cache coherency traffic that in turn slows the lock holder's execution after rescheduling, creating a feedback loop that escalates on high-core-count systems. On x86_64, PAUSE throttles this loop sufficiently to prevent the feedback loop, which explains why this is not reproducible there. Patching PostgreSQL's arm64 spin_delay() to use WFE instead of ISB should significantly reduce the regression without kernel changes. That said, this change is likely to cause similar breakage in other user-space applications beyond PostgreSQL that rely on lightweight spin loops on arm64. So I agree that the patch to retain PREEMPT_NONE is the right approach. 
At the same time, this is also something that distributions can resolve by patching their default kernel configuration. Regards, -- Mitsumasa Kondo NTT Software Innovation Center ^ permalink raw reply [flat|nested] 20+ messages in thread
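The per-architecture delay primitives under discussion look roughly like the
sketch below -- modeled on, but not copied from, PostgreSQL's
src/include/storage/s_lock.h, with a fallback branch added so it compiles
anywhere. Note that a WFE-based arm64 variant would additionally need the
exclusive monitor armed (e.g. via LDXR on the lock word), so it is not a
drop-in one-line replacement:

```c
/* Sketch of per-architecture spin-delay primitives. PAUSE throttles
 * the spin loop on x86_64; on arm64 the current ISB gives only a much
 * shorter delay, which is the asymmetry described above. */
static inline void spin_delay(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__(" rep; nop \n");   /* PAUSE */
#elif defined(__aarch64__)
    __asm__ __volatile__(" isb \n");        /* instruction sync barrier */
#else
    /* Generic fallback: no delay at all. */
#endif
}

/* Tiny driver so the delay can be exercised in a loop. */
static int spin_a_while(int iters)
{
    for (int i = 0; i < iters; i++)
        spin_delay();
    return iters;
}
```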
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
  2026-04-05 14:44 ` Mitsumasa KONDO
@ 2026-04-05 16:43 ` Andres Freund
  2026-04-06 1:46 ` Mitsumasa KONDO
  0 siblings, 1 reply; 20+ messages in thread
From: Andres Freund @ 2026-04-05 16:43 UTC (permalink / raw)
To: Mitsumasa KONDO
Cc: abuehaze, alisaidi, bigeasy, blakgeof, dipietro.salvatore,
  dipiets, linux-kernel, mark.rutland, peterz, ritesh.list, tglx,
  vschneid

Hi,

On 2026-04-05 23:44:25 +0900, Mitsumasa KONDO wrote:
> I believe the root cause is the inadequacy of PostgreSQL's arm64
> spin_delay() implementation, which PREEMPT_LAZY merely exposed.
>
> PostgreSQL's SPIN_DELAY() uses dramatically different instructions
> per architecture (src/include/storage/s_lock.h):
>
>   x86_64: rep; nop (PAUSE, ~140 cycles)
>   arm64:  isb (pipeline flush, ~10-20 cycles)
>
> Under PREEMPT_NONE, lock holders are rarely preempted, so spin
> duration is short and ISB's lightweight delay is sufficient.
>
> Under PREEMPT_LAZY, lock holder preemption becomes more frequent.
> When this occurs, waiters enter a sustained spin loop.

It's not sustained, the spinning just lasts between 10 and 1000 iterations,
after that there's randomized exponential backoff using nanosleep. Which
actually will happen after a smaller number of cycles with a shorter
SPIN_DELAY. In the 4kB workload, nearly all backends are in the exponential
backoff.

If I remove the rep nop on x86-64, the performance of the 4kB pages workload
is basically unaffected, even with PREEMPT_LAZY.

The spinning helps with workloads that are contended for very short amounts
of time. But that's not the case in this workload without huge pages,
instead of low 10s of cycles, we regularly spend a few orders of magnitude
more cycles holding the lock.

That's not to say the arm64 spin delay implementation is good. It just
doesn't seem like it affects this case much.
As hinted at by my neighboring email, I see some performance differences due to PREEMPT_LAZY even when replacing the spinlock with a futex based lock, as long as I use 4kB pages. Which seems like the expected thing? Greetings, Andres Freund ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default 2026-04-05 16:43 ` Andres Freund @ 2026-04-06 1:46 ` Mitsumasa KONDO 0 siblings, 0 replies; 20+ messages in thread From: Mitsumasa KONDO @ 2026-04-06 1:46 UTC (permalink / raw) To: andres; +Cc: linux-kernel, dipiets, peterz, tglx, kondo.mitsumasa Hi Andres, Thank you for testing this. On 2026-04-06, Andres Freund wrote: > It's not sustained, the spinning just lasts a between 10 and 1000 > iterations, after that there's randomized exponential backoff using > nanosleep. > Which actually will happen after a smaller number of cycles of with a > shorter SPIN_DELAY. > If I remove the rep nop on x86-64, the performance of the 4kB pages > workload is basically unaffected, even with PREEMPT_LAZY. The fact that removing rep nop made no difference suggests that the spinlock is not the bottleneck in your environment. Could you share your storage configuration? Salvatore's setup uses 12x 1TB AWS io2 at 32000 IOPS each (384K IOPS total in RAID0), which effectively eliminates WAL fsync as a bottleneck. In a storage-limited environment, changes to spin delay behavior would naturally be invisible because throughput is capped by I/O before spinlock contention becomes material. Also worth noting: Salvatore's environment is an EC2 instance (m8g.24xlarge), not bare metal. Hypervisor-level vCPU scheduling adds another layer on top of PREEMPT_LAZY -- a lock holder can be descheduled not only by the kernel scheduler but also by the hypervisor, and the guest kernel has no visibility into this. This could amplify the regression in ways that are not reproducible on bare-metal systems, regardless of architecture. If you want to isolate the effect of SPIN_DELAY on throughput under PREEMPT_LAZY, I would suggest: 1. Use synchronous_commit = off or unlogged tables to remove I/O from the critical path entirely. 2. Use a read-only workload (pgbench -S) with shared_buffers sized to force buffer eviction contention. 3. 
Run on a high-core-count system with all CPUs saturated under PREEMPT_LAZY.

This should expose the pure impact of spin loop behavior without I/O or WAL
masking the results.

> The spinning helps with workloads that are contended for very short
> amounts of time. But that's not the case in this workload without
> huge pages, instead of low 10s of cycles, we regularly spend a few
> orders of magnitude more cycles holding the lock.

I agree that the 4kB page / huge page difference is significant. But even
when individual spin durations are short, the cumulative effect across
hundreds of backends matters. Small per-iteration overhead in the spin loop,
multiplied by high concurrency, can add up to measurable throughput loss --
an effect that becomes visible only when I/O is not the dominant bottleneck.

Regards,
--
Mitsumasa KONDO
NTT Software Innovation Center

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2026-04-07 11:19 UTC | newest] Thread overview: 20+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro 2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro 2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra 2026-04-04 17:42 ` Andres Freund 2026-04-05 1:40 ` Andres Freund 2026-04-05 4:21 ` Andres Freund 2026-04-05 6:08 ` Ritesh Harjani 2026-04-05 14:09 ` Andres Freund 2026-04-05 14:44 ` Andres Freund 2026-04-07 8:29 ` Peter Zijlstra 2026-04-07 8:27 ` Peter Zijlstra 2026-04-07 10:17 ` David Laight 2026-04-07 8:20 ` Peter Zijlstra 2026-04-07 9:07 ` Peter Zijlstra 2026-04-07 11:19 ` Mark Rutland 2026-04-07 8:49 ` Peter Zijlstra 2026-04-06 0:43 ` Qais Yousef 2026-04-05 14:44 ` Mitsumasa KONDO 2026-04-05 16:43 ` Andres Freund 2026-04-06 1:46 ` Mitsumasa KONDO
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox