Date: Tue, 16 Jun 2026 11:51:12 +0200
From: Ingo Molnar
To: Linus Torvalds
Cc: Peter Zijlstra, "H. Peter Anvin", tglx@kernel.org, mingo@redhat.com,
    bp@alien8.de, Nathan Chancellor, Calvin Owens, Dave Hansen, x86-ML, LKML
Subject: Re: 8aeb879baf12 - significant system call latency regression, bisected

* Linus Torvalds wrote:

> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra wrote:
> >
> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> > make sense. However, Gemini, or whatever AI sits in google search, is
> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
>
> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
> to not be 64-byte aligned - simply because you may need to fetch more
> cachelines (assuming fairly linear code).
>
> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
> bytes as three 16-byte fetches.
>
> But I don't know if they can do the old "split line access" that older
> cores could do, where a Pentium would do two 8-byte accesses at the
> same time, and they didn't have to be in the same cache line.
>
> So 64-byte alignment would always be the best option if you only look
> at a *particular* piece of code.
>
> But it obviously is very wasteful and hurts when there is code around
> it that could be loaded into the cache at the same time.
>
> So almost certainly not a good idea in general.
>
> But 64-byte alignment is probably what things like interrupt and
> system call entrypoints should use, because those things would make
> sense to look at as isolated things, not part of a bigger load. And
> they are quite likely to start from a fairly cold-cache situation.
>
> So *not* some general compiler option in a config file, but maybe a
> special "entry point alignment" macro?

Yeah, agreed on that approach - but before/while we fix it, I'm also still
somewhat baffled by the numbers hpa reported:

>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
>>> increase in latency, not 13%...

Now that we know that this regression is caused by entry function alignment
changes, do we know *why* it causes an 80-cycle shift in system call entry
performance?

What does the benchmark measure, cache-cold or cache-hot execution?

1) Cache-cold performance:

   If it is cold-cache performance, does the misaligned case fetch one
   more cold cacheline?

   From which cache does it miss? Fetching from the 2-4MB Panther Lake L2
   shouldn't be 80 cycles, it should be ~17 cycles. If it's fetching from
   the 18MB L3 (which I'd say is the norm for most workloads), then the
   L3->L1I latency is around 55 cycles on Panther Lake, with everything
   included.

   It cannot really be DRAM latency, i.e. true cache-cold latency, as that
   would be much more severe, in the 400 cycles range even with premium
   DRAM modules - and more like 500 cycles with mainstream DRAM modules
   and layouts. (Unless we are *lucky* with alignment and sizing and the
   alignment regression doesn't trigger full DRAM latency.)

   The on-die DRAM MSC cache's latency should be around 300 cycles - that
   too is too high.

2) Cache-hot performance:

   While cache-hot performance is less relevant for system calls (which
   tend to be cache-cold in practice), if the benchmark measures cache-hot
   performance, why is there an 80-cycle shift from just a single
   misaligned symbol?

I.e. the specific and rather stable figure of 80 cycles of overhead does not
seem to match any of the Panther Lake latencies that ought to be relevant to
this regression, if we use the simplest mental model of what's going on when
alignment changes.

So it is either some other uarch pathology, triggered by bad alignment, or
something doesn't add up in my mental model of the root cause of this
problem. :-)

Side notes:

 - The 6 cycles of noise in the 478±6 cycles measurement do suggest that we
   might have missed out to a deeper cache hierarchy level, versus the
   rather stable 397.5±0.4 pre-regression figure.

 - I'm also assuming that 'cycles' here is a frequency-invariant
   standardized constant 5.1 GHz TSC value or so?

Thanks,

	Ingo
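
For reference, a minimal sketch of the arithmetic discussed in this mail. It
does two things: it counts how many 64-byte cachelines a linear piece of code
touches at a given start offset, and it compares the measured shift with the
latency figures quoted above. The helper names and the 100-byte stub length
are assumptions made purely for illustration; they do not come from the
thread or from the kernel sources.

```python
CACHELINE = 64  # bytes; the x86 cacheline size mentioned in the thread


def cachelines_touched(start: int, length: int, line: int = CACHELINE) -> int:
    """Count the distinct cachelines covered by the byte range [start, start + length)."""
    if length <= 0:
        return 0
    first = start // line
    last = (start + length - 1) // line
    return last - first + 1


def relative_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Averages reported in the thread, in cycles.
    before, after = 397.5, 478.0
    delta = after - before
    print(f"shift: {delta:.1f} cycles (+{relative_increase(before, after):.2f}%)")

    # Approximate latencies as quoted in this mail, in cycles.
    quoted = {"L2": 17, "L3": 55, "MSC": 300, "DRAM": 400}
    for level, cycles in quoted.items():
        print(f"{level:>4}: {cycles:3d} cycles, shift/latency = {delta / cycles:.2f}")

    # Hypothetical 100-byte entry stub placed at different offsets
    # within a cacheline (the length is an assumption, not a measurement).
    for offset in (0, 16, 32, 48):
        print(f"offset {offset:2d}: {cachelines_touched(offset, 100)} cachelines")
```

With these assumed numbers, the stub touches one additional cacheline once its
start offset reaches 32 bytes, and the measured shift is not a clean multiple
of any single quoted latency, which is the mismatch described above.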