From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Andres Freund <andres@anarazel.de>,
Peter Zijlstra <peterz@infradead.org>
Cc: Salvatore Dipietro <dipiets@amazon.it>,
linux-kernel@vger.kernel.org, alisaidi@amazon.com,
blakgeof@amazon.com, abuehaze@amazon.de,
dipietro.salvatore@gmail.com, Thomas Gleixner <tglx@kernel.org>,
Valentin Schneider <vschneid@redhat.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Mark Rutland <mark.rutland@arm.com>
Subject: Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
Date: Sun, 05 Apr 2026 11:38:59 +0530
Message-ID: <1pgulz0k.ritesh.list@gmail.com>
In-Reply-To: <xxbnmxqhx4ntc4ztztllbhnral2adogseot2bzu4g5eutxtgza@dzchaqremz32>

Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
>> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
>> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
>> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
>> > I'm not quite sure I understand why the spinlock in Salvatore's benchmark
>> > shows up this heavily:
>> >
>> > - For something like the benchmark here, it should only be used until
>> > postgres' buffer pool is fully used, as the freelist only contains buffers
>> > not in use, and we check without a lock whether it contains buffers. Once
>> > running, buffers are only added to the freelist if tables/indexes are
>> > dropped/truncated. And the benchmark seems like it runs long enough that we
>> > should actually reach the point the freelist should be empty?
>> >
>> > - The section covered by the spinlock is only a few instructions long and it
>> > is only hit if we have to do a somewhat heavyweight operation afterwards
>> > (read in a page into the buffer pool), it seems surprising that this short
>> > section gets interrupted frequently enough to cause a regression of this
>> > magnitude.
>> >
>> > For a moment I thought it might be because, while holding the spinlock, some
>> > memory is touched for the first time, but that is actually not the case.
>> >
>>
>> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB of
>> memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48 cores / 96
>> threads, so it's smaller, and it's x86, not arm, but it's what I can quickly
>> update to an unreleased kernel.
>>
>>
>> So far I don't see such a regression and I basically see no time spent
>> GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>>
>> Which I don't find surprising, this workload doesn't read enough to have
>> contention in there. Salvatore reported on the order of 100k transactions/sec
>> (with one update, one read and one insert). Even if just about all of those
>> were misses - and they shouldn't be with 25% of 384G as postgres'
>> shared_buffers as the script indicates, and we know that s_b is not full due
>> to even hitting GetVictimBuffer() - that'd be just ~200k IOs/sec from the
>> page cache. That's not that much.
>
>
>> The benchmark script seems to indicate that huge pages aren't in use:
>> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>>
>>
>> I wonder if somehow the pages underlying the portions of postgres' shared
>> memory are getting paged out for some reason, leading to page faults while
>> holding the spinlock?
>
> Hah. I had reflexively used huge_pages=on - as that is the only sane thing to
> do with 10s to 100s of GB of shared memory and thus part of all my
> benchmarking infrastructure - during the benchmark runs mentioned above.
>
> Turns out, if I *disable* huge pages, I actually can reproduce the contention
> that Salvatore reported (didn't see whether it's a regression for me
> though). Not anywhere close to the same degree, because the bottleneck for me
> is the writes.
>
> If I change the workload to a read-only benchmark, which obviously reads a lot
> more due to not being bottlenecked by durable-write latency, I see more
> contention:
>
> - 12.76% postgres postgres [.] s_lock
> - 12.75% s_lock
> - 12.69% StrategyGetBuffer
> GetVictimBuffer
> - StartReadBuffer
> - 12.69% ReleaseAndReadBuffer
> + 12.65% heapam_index_fetch_tuple
>
>
> While what I said above is true, i.e. the contended access isn't the first
> access to the relevant shared memory (it is already backed by memory), in
> this workload GetVictimBuffer()->StrategyGetBuffer() will be the first
> access by each connection process to the relevant 4kB pages.
>
> Thus there will be a *lot* of minor faults and tlb misses while holding a
> spinlock. Unsurprisingly that's bad for performance.
>
>
> I don't see a reason to particularly care about the regression if that's the
> sole way to trigger it. Using a buffer pool of ~100GB without huge pages is
> not an interesting workload. With a smaller buffer pool the problem would not
> happen either.
>
> Note that the performance effect of not using huge pages is terrible
> *regardless* of the spinlock. PG 19 does not have the spinlock in this path
> anymore, but not using huge pages is still utterly terrible (like 1/3 of the
> throughput).
>
>
> I did run some benchmarks here and I don't see a clearly reproducible
> regression with huge pages.
>
However, out of curiosity, I was hoping someone more familiar with the
scheduler could explain why PREEMPT_LAZY, as opposed to PREEMPT_NONE, causes
a performance regression without huge pages.

Minor page fault handling has microsecond latency, whereas sched ticks are
on the order of milliseconds. Besides, both preemption models should
schedule() anyway if TIF_NEED_RESCHED is set on return to userspace, right?

So I'm curious how the preemption model causes a performance regression
without huge pages in this case.
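
To put the minor-fault explanation quoted above into rough numbers, here is a
back-of-the-envelope sketch, not a measurement. It takes the "25% of 384G"
shared_buffers figure from the quoted mail and assumes binary units (GiB), a
4kB base page size and a 2MB huge page size; the 2MB size and the helper name
pages_needed() are my assumptions, and the result is only an upper bound on
the pages a single backend would fault in if it eventually touched the whole
buffer pool.

```python
# Rough sketch: how many pages one backend would have to fault in to touch
# the entire buffer pool once, for two page sizes. All sizes are assumptions
# stated in the lead-in, not values measured on the benchmark machine.
GIB = 1 << 30


def pages_needed(pool_bytes: int, page_bytes: int) -> int:
    """Number of pages of page_bytes needed to map pool_bytes, rounded up."""
    return -(-pool_bytes // page_bytes)


# 25% of 384G, per the benchmark script referenced in the quoted mail.
shared_buffers = 384 * GIB // 4

small = pages_needed(shared_buffers, 4 * 1024)        # 4kB base pages
huge = pages_needed(shared_buffers, 2 * 1024 * 1024)  # 2MB huge pages (assumed)

print(f"shared_buffers: {shared_buffers // GIB} GiB")
print(f"4kB pages:      {small}")
print(f"2MB pages:      {huge}")
print(f"ratio:          {small // huge}x")
```

Under these assumptions that is about 25 million 4kB mappings per backend
versus about 49 thousand 2MB mappings, a factor of 512. It says nothing about
which of those first touches happen while the spinlock is held.
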
-ritesh