public inbox for linux-kernel@vger.kernel.org
From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Andres Freund <andres@anarazel.de>,
	Peter Zijlstra <peterz@infradead.org>
Cc: Salvatore Dipietro <dipiets@amazon.it>,
	linux-kernel@vger.kernel.org, alisaidi@amazon.com,
	blakgeof@amazon.com, abuehaze@amazon.de,
	dipietro.salvatore@gmail.com, Thomas Gleixner <tglx@kernel.org>,
	Valentin Schneider <vschneid@redhat.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Mark Rutland <mark.rutland@arm.com>
Subject: Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
Date: Sun, 05 Apr 2026 11:38:59 +0530	[thread overview]
Message-ID: <1pgulz0k.ritesh.list@gmail.com> (raw)
In-Reply-To: <xxbnmxqhx4ntc4ztztllbhnral2adogseot2bzu4g5eutxtgza@dzchaqremz32>

Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
>> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
>> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
>> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
> >> > I'm not quite sure I understand why the spinlock in Salvatore's benchmark
> >> > shows up this heavily:
>> >
>> > - For something like the benchmark here, it should only be used until
>> >   postgres' buffer pool is fully used, as the freelist only contains buffers
>> >   not in use, and we check without a lock whether it contains buffers. Once
> >> >   running, buffers are only added to the freelist if tables/indexes are
> >> >   dropped/truncated.  And the benchmark seems to run long enough that we
> >> >   should actually reach the point where the freelist is empty?
>> >
> >> > - The section covered by the spinlock is only a few instructions long and it
> >> >   is only hit if we have to do a somewhat heavyweight operation afterwards
> >> >   (reading a page into the buffer pool), so it seems surprising that this
> >> >   short section gets interrupted frequently enough to cause a regression of
> >> >   this magnitude.
>> >
>> >   For a moment I thought it might be because, while holding the spinlock, some
>> >   memory is touched for the first time, but that is actually not the case.
>> >
>>
>> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with 256GB of
>> memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just 48 cores / 96
>> threads, so it's smaller, and it's x86, not arm, but it's what I can quickly
>> update to an unreleased kernel.
>>
>>
>> So far I don't see such a regression and I basically see no time spent
>> GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>>
> >> Which I don't find surprising: this workload doesn't read enough to have
> >> contention in there. Salvatore reported on the order of 100k transactions/sec
> >> (with one update, one read and one insert). Even if just about all of those
> >> were misses - and they shouldn't be with 25% of 384G as postgres'
> >> shared_buffers as the script indicates, and we know that s_b is not full due
> >> to even hitting GetVictimBuffer() - that'd just be ~200k IOs/sec from the
> >> page cache.  That's not that much.
>
>
>> The benchmark script seems to indicate that huge pages aren't in use:
>> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>>
>>
>> I wonder if somehow the pages underlying the portions of postgres' shared
>> memory are getting paged out for some reason, leading to page faults while
>> holding the spinlock?
>
> Hah. I had reflexively used huge_pages=on - as that is the only sane thing to
> do with 10s to 100s of GB of shared memory and thus part of all my
> benchmarking infrastructure - during the benchmark runs mentioned above.
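(For context on what huge_pages=on entails: enabling it is a two-step affair,
a kernel-side reservation plus a postgres setting. The sizes below are
placeholders for illustration, not the benchmark's actual values:)

```
# kernel side: reserve 2MB huge pages (count is a placeholder)
echo 51200 > /proc/sys/vm/nr_hugepages

# postgresql.conf
huge_pages = on          # 'on' fails startup rather than silently falling back
shared_buffers = '96GB'
```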
>
> Turns out, if I *disable* huge pages, I actually can reproduce the contention
> that Salvatore reported (didn't see whether it's a regression for me
> though). Not anywhere close to the same degree, because the bottleneck for me
> is the writes.
>
> If I change the workload to a read-only benchmark, which obviously reads a lot
> more due to not being bottlenecked by durable-write latency, I see more
> contention:
>
> -   12.76%  postgres         postgres                   [.] s_lock
>    - 12.75% s_lock
>       - 12.69% StrategyGetBuffer
>            GetVictimBuffer
>          - StartReadBuffer
>             - 12.69% ReleaseAndReadBuffer
>                + 12.65% heapam_index_fetch_tuple
>
>
> While what I said above is true - the memory touched at the time of
> contention isn't the first access to the relevant shared memory (i.e. it
> is already backed by memory) - in this workload
> GetVictimBuffer()->StrategyGetBuffer() will be the first access of the
> connection processes to the relevant 4kB pages.
>
> Thus there will be a *lot* of minor faults and tlb misses while holding a
> spinlock. Unsurprisingly that's bad for performance.
>
>
> I don't see a reason to particularly care about the regression if that's the
> sole way to trigger it.  Using a buffer pool of ~100GB without huge pages is
> not an interesting workload.  With a smaller buffer pool the problem would not
> happen either.
>
> Note that the performance effect of not using huge pages is terrible
> *regardless* of the spinlock. PG 19 doesn't have the spinlock in this
> path anymore, but not using huge pages is still utterly terrible (like
> 1/3 of the throughput).
>
>
> I did run some benchmarks here and I don't see a clearly reproducible
> regression with huge pages.
>

However, out of curiosity, I was hoping someone more familiar with the
scheduler area could explain why PREEMPT_LAZY vs. PREEMPT_NONE causes a
performance regression without huge pages.

Minor page fault handling has microsecond-scale latency, whereas the
sched tick fires on a millisecond scale. Besides, both preemption models
should schedule() anyway if TIF_NEED_RESCHED is set on return to
userspace, right?

So I was curious to understand how the preemption model causes a
performance regression without huge pages in this case.

-ritesh

  reply	other threads:[~2026-04-05  8:03 UTC|newest]

Thread overview: 19+ messages
2026-04-03 19:19 [PATCH 0/1] sched: Restore PREEMPT_NONE as default Salvatore Dipietro
2026-04-03 19:19 ` [PATCH 1/1] " Salvatore Dipietro
2026-04-03 21:32 ` [PATCH 0/1] " Peter Zijlstra
2026-04-04 17:42   ` Andres Freund
2026-04-05  1:40     ` Andres Freund
2026-04-05  4:21       ` Andres Freund
2026-04-05  6:08         ` Ritesh Harjani [this message]
2026-04-05 14:09           ` Andres Freund
2026-04-05 14:44             ` Andres Freund
2026-04-07  8:29               ` Peter Zijlstra
2026-04-07  8:27             ` Peter Zijlstra
2026-04-07 10:17             ` David Laight
2026-04-07  8:20           ` Peter Zijlstra
2026-04-07  9:07             ` Peter Zijlstra
2026-04-07  8:49     ` Peter Zijlstra
2026-04-06  0:43   ` Qais Yousef
2026-04-05 14:44 ` Mitsumasa KONDO
2026-04-05 16:43   ` Andres Freund
2026-04-06  1:46     ` Mitsumasa KONDO
