From: Ritesh Harjani (IBM)
To: Andres Freund, Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel@vger.kernel.org, alisaidi@amazon.com,
    blakgeof@amazon.com, abuehaze@amazon.de, dipietro.salvatore@gmail.com,
    Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior,
    Mark Rutland
Subject: Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
Date: Sun, 05 Apr 2026 11:38:59 +0530
Message-ID: <1pgulz0k.ritesh.list@gmail.com>
References: <20260403191942.21410-1-dipiets@amazon.it>
            <20260403213207.GF2872@noisy.programming.kicks-ass.net>

Andres Freund writes:

> Hi,
>
> On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
>> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
>> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
>> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
>> >
>> > I'm not quite sure I understand why the spinlock in Salvatore's
>> > benchmark shows up this heavily:
>> >
>> > - For something like the benchmark here, it should only be used
>> >   until postgres' buffer pool is fully used, as the freelist only
>> >   contains buffers not in use, and we check without a lock whether
>> >   it contains buffers. Once running, buffers are only added to the
>> >   freelist if tables/indexes are dropped/truncated. And the
>> >   benchmark seems to run long enough that we should actually reach
>> >   the point where the freelist is empty?
>> >
>> > - The section covered by the spinlock is only a few instructions
>> >   long, and it is only hit if we have to do a somewhat heavyweight
>> >   operation afterwards (reading a page into the buffer pool); it
>> >   seems surprising that this short section gets interrupted
>> >   frequently enough to cause a regression of this magnitude.
>> >
>> > For a moment I thought it might be because, while holding the
>> > spinlock, some memory is touched for the first time, but that is
>> > actually not the case.
>>
>> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with
>> 256GB of memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just
>> 48 cores / 96 threads, so it's smaller, and it's x86, not arm, but
>> it's what I can quickly update to an unreleased kernel.
>>
>> So far I don't see such a regression, and I basically see no time
>> spent in GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>>
>> Which I don't find surprising; this workload doesn't read enough to
>> have contention in there. Salvatore reported on the order of 100k
>> transactions/sec (with one update, one read and one insert). Even if
>> just about all of those were misses - and they shouldn't be, with 25%
>> of 384G as postgres' shared_buffers as the script indicates, and we
>> know that s_b is not full due to even hitting GetVictimBuffer() -
>> that'd just be ~200k IOs/sec from the page cache. That's not that
>> much.
>>
>> The benchmark script seems to indicate that huge pages aren't in use:
>> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>>
>> I wonder if somehow the pages underlying the portions of postgres'
>> shared memory are getting paged out for some reason, leading to page
>> faults while holding the spinlock?
>
> Hah. I had reflexively used huge_pages=on - as that is the only sane
> thing to do with 10s to 100s of GB of shared memory, and thus part of
> all my benchmarking infrastructure - during the benchmark runs
> mentioned above.
>
> Turns out, if I *disable* huge pages, I actually can reproduce the
> contention that Salvatore reported (I didn't check whether it's a
> regression for me, though). Not anywhere close to the same degree,
> because the bottleneck for me is the writes.
>
> If I change the workload to a read-only benchmark, which obviously
> reads a lot more due to not being bottlenecked by durable-write
> latency, I see more contention:
>
> - 12.76%  postgres  postgres  [.] s_lock
>    - 12.75% s_lock
>       - 12.69% StrategyGetBuffer
>            GetVictimBuffer
>          - StartReadBuffer
>             - 12.69% ReleaseAndReadBuffer
>                + 12.65% heapam_index_fetch_tuple
>
> While what I said above is true - the memory touched at the time of
> contention isn't the first access to the relevant shared memory (i.e.
> it is already backed by memory) - in this workload
> GetVictimBuffer()->StrategyGetBuffer() will be the first access by the
> connection processes to the relevant 4kB pages.
>
> Thus there will be a *lot* of minor faults and TLB misses while
> holding a spinlock. Unsurprisingly that's bad for performance.
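To convince myself of this failure mode, I put together a tiny
standalone demo - my own sketch, not PostgreSQL code - where every
first-ever store to a 4kB page of a shared mapping happens inside a
userspace spinlock's critical section, so each minor fault is taken
with the lock held (under contention, every other spinner would
busy-wait through that fault latency):

    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define POOL_SIZE (1UL << 30)   /* 1 GiB, never pre-touched */
    #define PG_SZ     4096UL        /* assume 4kB base pages */

    static pthread_spinlock_t lock;
    static char *pool;

    static void touch_page_under_lock(size_t off)
    {
        pthread_spin_lock(&lock);
        pool[off] = 1;              /* first access: the minor fault and
                                       TLB miss happen while the lock
                                       is held */
        pthread_spin_unlock(&lock);
    }

    int main(void)
    {
        pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (pool == MAP_FAILED)
            return 1;
        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        for (size_t off = 0; off < POOL_SIZE; off += PG_SZ)
            touch_page_under_lock(off);
        return 0;
    }

Running this under "perf stat -e minor-faults" should show roughly one
fault per 4kB page, all incurred inside the critical section; with huge
pages the fault count (and the per-process TLB pressure) drops by
around 512x.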
> I don't see a reason to particularly care about the regression if
> that's the sole way to trigger it. Using a buffer pool of ~100GB
> without huge pages is not an interesting workload. With a smaller
> buffer pool the problem would not happen either.
>
> Note that the performance effect of not using huge pages is terrible
> *regardless of* the spinlock. PG 19 does not have the spinlock in this
> path anymore, but not using huge pages is still utterly terrible (like
> 1/3 of the throughput).
>
> I did run some benchmarks here and I don't see a clearly reproducible
> regression with huge pages.

However, out of curiosity, I was hoping someone more familiar with the
scheduler area could explain why PREEMPT_LAZY vs PREEMPT_NONE causes a
performance regression without huge pages. Minor page fault handling
has microsecond-scale latency, whereas the sched tick only fires every
few milliseconds. Besides, both preemption models should schedule()
anyway if TIF_NEED_RESCHED is set on return to userspace, right? So I
was curious to understand how the preemption model causes a performance
regression without huge pages in this case.

-ritesh
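P.S. To make the question concrete, here is my (quite possibly wrong)
mental model of what the tick does under the two models, written as
C-like pseudo-code - the helper names are made up for illustration,
not the actual kernel functions:

    /* tick under PREEMPT_NONE, as I understand it: */
    set_need_resched(curr);             /* TIF_NEED_RESCHED, acted on
                                           only at return-to-user or at
                                           a voluntary cond_resched()-
                                           style point */

    /* tick under PREEMPT_LAZY, as I understand it: */
    if (!need_resched_lazy(curr))
        set_need_resched_lazy(curr);    /* TIF_NEED_RESCHED_LAZY, also
                                           acted on at return-to-user */
    else
        set_need_resched(curr);         /* still running a tick later:
                                           escalate, and since lazy
                                           builds carry the full
                                           preemption machinery, this
                                           can now preempt the task
                                           inside the kernel, e.g. in
                                           the middle of fault
                                           handling */

If that escalation path is what allows the fault-handling lock holder
to be preempted in-kernel, that might explain why the 4kB-page case
gets so much worse, but I may well be missing something - corrections
welcome.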