From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Andres Freund <andres@anarazel.de>,
Peter Zijlstra <peterz@infradead.org>
Cc: Salvatore Dipietro <dipiets@amazon.it>,
linux-kernel@vger.kernel.org, alisaidi@amazon.com,
blakgeof@amazon.com, abuehaze@amazon.de,
dipietro.salvatore@gmail.com, Thomas Gleixner <tglx@kernel.org>,
Valentin Schneider <vschneid@redhat.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Mark Rutland <mark.rutland@arm.com>
Subject: Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
Date: Sun, 05 Apr 2026 11:38:59 +0530 [thread overview]
Message-ID: <1pgulz0k.ritesh.list@gmail.com> (raw)
In-Reply-To: <xxbnmxqhx4ntc4ztztllbhnral2adogseot2bzu4g5eutxtgza@dzchaqremz32>
Andres Freund <andres@anarazel.de> writes:
> Hi,
>
> On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
>> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
>> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
>> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
>> > I'm not quite sure I understand why the spinlock in Salvatore's benchmark
>> > shows up this heavily:
>> >
>> > - For something like the benchmark here, it should only be used until
>> > postgres' buffer pool is fully used, as the freelist only contains buffers
>> > not in use, and we check without a lock whether it contains buffers. Once
>> > running, buffers are only added to the freelist if tables/indexes are
>> > dropped/truncated. And the benchmark seems like it runs long enough that we
>> > should actually reach the point the freelist should be empty?
>> >
>> > - The section covered by the spinlock is only a few instructions long and it
>> > is only hit if we have to do a somewhat heavyweight operation afterwards
>> > (read in a page into the buffer pool), it seems surprising that this short
>> > section gets interrupted frequently enough to cause a regression of this
>> > magnitude.
>> >
>> > For a moment I thought it might be because, while holding the spinlock, some
>> > memory is touched for the first time, but that is actually not the case.
>> >
>>
>> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB of
>> memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48 cores / 96
>> threads, so it's smaller, and it's x86, not arm, but it's what I can quickly
>> update to an unreleased kernel.
>>
>>
>> So far I don't see such a regression, and I basically see no time spent in
>> GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>>
>> Which I don't find surprising: this workload doesn't read enough to have
>> contention in there. Salvatore reported on the order of 100k transactions/sec
>> (with one update, one read and one insert). Even if just about all of those
>> were misses - and they shouldn't be with 25% of 384G as postgres'
>> shared_buffers as the script indicates, and we know that s_b is not full due
>> to even hitting GetVictimBuffer() - that'd just be ~200k IOs/sec from the
>> page cache. That's not that much.
>
>
>> The benchmark script seems to indicate that huge pages aren't in use:
>> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>>
>>
>> I wonder if somehow the pages underlying the portions of postgres' shared
>> memory are getting paged out for some reason, leading to page faults while
>> holding the spinlock?
>
> Hah. I had reflexively used huge_pages=on - as that is the only sane thing to
> do with 10s to 100s of GB of shared memory and thus part of all my
> benchmarking infrastructure - during the benchmark runs mentioned above.
>
> Turns out, if I *disable* huge pages, I actually can reproduce the contention
> that Salvatore reported (didn't check whether it's a regression for me,
> though). Not anywhere close to the same degree, because the bottleneck for me
> is the writes.
>
> If I change the workload to a read-only benchmark, which obviously reads a lot
> more due to not being bottlenecked by durable-write latency, I see more
> contention:
>
> - 12.76% postgres postgres [.] s_lock
> - 12.75% s_lock
> - 12.69% StrategyGetBuffer
> GetVictimBuffer
> - StartReadBuffer
> - 12.69% ReleaseAndReadBuffer
> + 12.65% heapam_index_fetch_tuple
>
>
> While what I said above is true - the memory touched at the time of
> contention isn't the first access to the relevant shared memory (i.e. it is
> already backed by memory) - in this workload
> GetVictimBuffer()->StrategyGetBuffer() will be the first access by the
> connection processes to the relevant 4kB pages.
>
> Thus there will be a *lot* of minor faults and tlb misses while holding a
> spinlock. Unsurprisingly that's bad for performance.
>
>
> I don't see a reason to particularly care about the regression if that's the
> sole way to trigger it. Using a buffer pool of ~100GB without huge pages is
> not an interesting workload. With a smaller buffer pool the problem would not
> happen either.
>
> Note that the performance effect of not using huge pages is terrible
> *regardless* of the spinlock. PG 19 doesn't have the spinlock in this
> path anymore, but not using huge pages is still utterly terrible (like
> 1/3 of the throughput).
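For reference, the setup being contrasted here looks roughly like this (illustrative values only, not taken from the thread; the reservation must be sized to the machine):

```ini
# postgresql.conf - request explicitly reserved 2MB huge pages for
# shared memory; with "on", postgres refuses to start if the
# reservation is missing ("try" would silently fall back to 4kB pages).
shared_buffers = 96GB
huge_pages = on

# Matching kernel-side reservation (e.g. via sysctl):
# vm.nr_hugepages = 49152        # 96GB / 2MB, plus some slack as needed
```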
>
>
> I did run some benchmarks here and I don't see a clearly reproducible
> regression with huge pages.
>
However, out of curiosity, I was hoping someone more familiar with the
scheduler area could explain why PREEMPT_LAZY vs. PREEMPT_NONE causes a
performance regression without huge pages.

Minor page fault handling has microsecond latency, whereas sched ticks
are in milliseconds. Besides, both preemption models should anyway
schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
So I was curious to understand how the preemption model causes a
performance regression with no hugepages in this case.
-ritesh