From: Ritesh Harjani (IBM)
To: Andres Freund, Peter Zijlstra
Cc: Salvatore Dipietro, linux-kernel@vger.kernel.org, alisaidi@amazon.com,
    blakgeof@amazon.com, abuehaze@amazon.de, dipietro.salvatore@gmail.com,
    Thomas Gleixner, Valentin Schneider, Sebastian Andrzej Siewior,
    Mark Rutland
Subject: Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default
Date: Sun, 05 Apr 2026 11:38:59 +0530
Message-ID: <1pgulz0k.ritesh.list@gmail.com>
References: <20260403191942.21410-1-dipiets@amazon.it>
            <20260403213207.GF2872@noisy.programming.kicks-ass.net>

Andres Freund writes:

> Hi,
>
> On 2026-04-04 21:40:29 -0400, Andres Freund wrote:
>> On 2026-04-04 13:42:22 -0400, Andres Freund wrote:
>> > On 2026-04-03 23:32:07 +0200, Peter Zijlstra wrote:
>> > > On Fri, Apr 03, 2026 at 07:19:36PM +0000, Salvatore Dipietro wrote:
>> >
>> > I'm not quite sure I understand why the spinlock in Salvatore's
>> > benchmark shows up this heavily:
>> >
>> > - For something like the benchmark here, it should only be used
>> >   until postgres' buffer pool is fully used, as the freelist only
>> >   contains buffers not in use, and we check without a lock whether
>> >   it contains buffers. Once running, buffers are only added to the
>> >   freelist if tables/indexes are dropped/truncated. And the
>> >   benchmark seems to run long enough that we should actually reach
>> >   the point where the freelist is empty?
>> >
>> > - The section covered by the spinlock is only a few instructions
>> >   long, and it is only hit if we have to do a somewhat heavyweight
>> >   operation afterwards (reading a page into the buffer pool); it
>> >   seems surprising that this short section gets interrupted
>> >   frequently enough to cause a regression of this magnitude.
>> >
>> > For a moment I thought it might be because, while holding the
>> > spinlock, some memory is touched for the first time, but that is
>> > actually not the case.
>>
>> I tried to reproduce the regression on a 2x Xeon Gold 6442Y with
>> 256GB of memory, running 3aae9383f42f (7.0.0-rc6 + some). That's just
>> 48 cores / 96 threads, so it's smaller, and it's x86, not arm, but
>> it's what I can quickly update to an unreleased kernel.
>>
>> So far I don't see such a regression, and I basically see no time
>> spent in GetVictimBuffer()->StrategyGetBuffer()->s_lock() (< 0.2%).
>>
>> Which I don't find surprising; this workload doesn't read enough to
>> have contention in there. Salvatore reported on the order of 100k
>> transactions/sec (with one update, one read and one insert). Even if
>> just about all of those were misses - and they shouldn't be, with 25%
>> of 384G as postgres' shared_buffers as the script indicates, and we
>> know that s_b is not full due to even hitting GetVictimBuffer() -
>> that'd just be ~200k IOs/sec from the page cache. That's not that
>> much.
>>
>> The benchmark script seems to indicate that huge pages aren't in use:
>> https://github.com/aws/repro-collection/blob/main/workloads/postgresql/main.sh#L15
>>
>> I wonder if somehow the pages underlying the portions of postgres'
>> shared memory are getting paged out for some reason, leading to page
>> faults while holding the spinlock?
>
> Hah. I had reflexively used huge_pages=on - as that is the only sane
> thing to do with 10s to 100s of GB of shared memory, and thus part of
> all my benchmarking infrastructure - during the benchmark runs
> mentioned above.
>
> Turns out, if I *disable* huge pages, I actually can reproduce the
> contention that Salvatore reported (I didn't check whether it's a
> regression for me, though). Not anywhere close to the same degree,
> because the bottleneck for me is the writes.
>
> If I change the workload to a read-only benchmark, which obviously
> reads a lot more due to not being bottlenecked by durable-write
> latency, I see more contention:
>
> - 12.76%  postgres  postgres  [.] s_lock
>    - 12.75% s_lock
>       - 12.69% StrategyGetBuffer
>            GetVictimBuffer
>          - StartReadBuffer
>             - 12.69% ReleaseAndReadBuffer
>                + 12.65% heapam_index_fetch_tuple
>
> While what I said above is true - the memory touched at the time of
> contention isn't the first access to the relevant shared memory (i.e.
> it is already backed by memory) - in this workload
> GetVictimBuffer()->StrategyGetBuffer() will be the first access by the
> connection processes to the relevant 4kB pages.
>
> Thus there will be a *lot* of minor faults and TLB misses while
> holding a spinlock. Unsurprisingly that's bad for performance.
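To convince myself of this failure mode, I put together a tiny
standalone demo - my own sketch, not PostgreSQL code - where every
first-ever store to a 4kB page of a shared mapping happens inside a
userspace spinlock's critical section, so each minor fault is taken
with the lock held (under contention, every other spinner would
busy-wait through that fault latency):

    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define POOL_SIZE (1UL << 30)   /* 1 GiB, never pre-touched */
    #define PG_SZ     4096UL        /* assume 4kB base pages */

    static pthread_spinlock_t lock;
    static char *pool;

    static void touch_page_under_lock(size_t off)
    {
        pthread_spin_lock(&lock);
        pool[off] = 1;              /* first access: the minor fault and
                                       TLB miss happen while the lock
                                       is held */
        pthread_spin_unlock(&lock);
    }

    int main(void)
    {
        pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (pool == MAP_FAILED)
            return 1;
        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        for (size_t off = 0; off < POOL_SIZE; off += PG_SZ)
            touch_page_under_lock(off);
        return 0;
    }

Running this under "perf stat -e minor-faults" should show roughly one
fault per 4kB page, all incurred inside the critical section; with huge
pages the fault count (and the per-process TLB pressure) drops by
around 512x.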
> I don't see a reason to particularly care about the regression if
> that's the sole way to trigger it. Using a buffer pool of ~100GB
> without huge pages is not an interesting workload. With a smaller
> buffer pool the problem would not happen either.
>
> Note that the performance effect of not using huge pages is terrible
> *regardless of* the spinlock. PG 19 does not have the spinlock in this
> path anymore, but not using huge pages is still utterly terrible (like
> 1/3 of the throughput).
>
> I did run some benchmarks here and I don't see a clearly reproducible
> regression with huge pages.

However, out of curiosity, I was hoping someone more familiar with the
scheduler area could explain why PREEMPT_LAZY vs PREEMPT_NONE causes a
performance regression without huge pages. Minor page fault handling
has microsecond-scale latency, whereas the sched tick only fires every
few milliseconds. Besides, both preemption models should schedule()
anyway if TIF_NEED_RESCHED is set on return to userspace, right? So I
was curious to understand how the preemption model causes a performance
regression without huge pages in this case.

-ritesh
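P.S. To make the question concrete, here is my (quite possibly wrong)
mental model of what the tick does under the two models, written as
C-like pseudo-code - the helper names are made up for illustration,
not the actual kernel functions:

    /* tick under PREEMPT_NONE, as I understand it: */
    set_need_resched(curr);             /* TIF_NEED_RESCHED, acted on
                                           only at return-to-user or at
                                           a voluntary cond_resched()-
                                           style point */

    /* tick under PREEMPT_LAZY, as I understand it: */
    if (!need_resched_lazy(curr))
        set_need_resched_lazy(curr);    /* TIF_NEED_RESCHED_LAZY, also
                                           acted on at return-to-user */
    else
        set_need_resched(curr);         /* still running a tick later:
                                           escalate, and since lazy
                                           builds carry the full
                                           preemption machinery, this
                                           can now preempt the task
                                           inside the kernel, e.g. in
                                           the middle of fault
                                           handling */

If that escalation path is what allows the fault-handling lock holder
to be preempted in-kernel, that might explain why the 4kB-page case
gets so much worse, but I may well be missing something - corrections
welcome.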