From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from stravinsky.debian.org (stravinsky.debian.org [82.195.75.108]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5C0244A13BC for ; Tue, 9 Jun 2026 15:28:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=82.195.75.108 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781018917; cv=none; b=UO1ihL7dcF50DBbHZgQlu9lRr9mKnwa54zGSmjqoS56PKvrowBYUKLx+OKWUC+ORNmQk1dMOGZxPFKt55U6/46W26/cbgYupS1s7DQ1FO1IhwFfTb6eWGrPhTKvD4Xgr4Vv3ze2WUZmbTKB8P5voTys1Zf53Lbbpzob1lIrWjW4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781018917; c=relaxed/simple; bh=beLFej057r38qhz3oeSqLZFtzmCD1y4CA2R995MPUMQ=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=l7oHX65zXGtaf3Yp6IaXcYE299wigI9UV34SQkLbAFFZnx8m+5J3jmjO02yoDI27/E09oncfMg4AXkXkIorlpKtMuy97+Fp0FDDntWnhO/Q8joHtNYR2UNVzertrCZK40JZu46wm4fVDUF1RbmZMIcgrYaq1UdGnS84JURnzpnA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org; spf=pass smtp.mailfrom=debian.org; dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b=pIRkGhQB; arc=none smtp.client-ip=82.195.75.108 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=debian.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b="pIRkGhQB" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=debian.org; s=smtpauto.stravinsky; h=X-Debian-User:In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=T1QflihHdefDxnbe5IdLqYwDKwMl4Z6pl2T4jRX5Td8=; b=pIRkGhQBqPxpoNl943BLqtRDAk lGPY3V/Olbo+yqzBsLZNjzFB6twTBbmxBLSn8oYbsNEDOsNzMxViKz8rI+CnLkItFkZ92Gpb+9i+u qlFfHdQ9fzk694jl+GatN8G5tmq748/+OJS0EJOfbAt3nOjD0uScqvx3pEWzSSfoHQaXcBRvAp4t3 XD/qbe/VR5KWQaXob3207Z2ZeWqpTcXev9H1FGcdj8g2jwhf6Grp3tlg/BcOzC2TygnOGZsnAGweX iFd5Qgw/TMvlUg5L34Z4SecFe1xTTL/kvXdX7ZFZkLkvMk9+0pTv9oFYYuI1EzjsVGejWsl4koTy3 hZkfClEQ==; Received: from authenticated-user by stravinsky.debian.org with esmtpsa (TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim 4.96) (envelope-from ) id 1wWyNJ-008UqV-0f; Tue, 09 Jun 2026 15:28:21 +0000 Date: Tue, 9 Jun 2026 08:28:16 -0700 From: Breno Leitao To: Peter Zijlstra Cc: Thomas Gleixner , Ingo Molnar , Darren Hart , Davidlohr Bueso , =?utf-8?B?QW5kcsOp?= Almeida , linux-kernel@vger.kernel.org, puranjay@kernel.org, rmikey@meta.com, stuclar@meta.com, namhyung@kernel.org, kernel-team@meta.com Subject: Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock Message-ID: References: <20260605-futex-v1-1-4ad4a0d6f265@debian.org> <20260609104603.GA48970@noisy.programming.kicks-ass.net> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260609104603.GA48970@noisy.programming.kicks-ass.net> X-Debian-User: leitao Hello Peter, On Tue, Jun 09, 2026 at 12:46:03PM +0200, Peter Zijlstra wrote: > On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote: > > struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock, > > struct plist_head chain, struct futex_private_hash *priv) into a > > single ____cacheline_aligned_in_smp 64-byte block. Three distinct > > access patterns hit that line: > > > > 1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending() > > on the fast path before taking the lock. > > 2. spin_lock(&hb->lock) contenders writing the lock word. > > 3. The lock holder modifying chain.{next,prev} on every futex_wake, > > futex_q_unlock, plist_add, __futex_unqueue. > > > > This was first noticed on a Meta cache (ucache) production workload: > > perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as > > the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct > > CPUs in a second. > > > > The contention is not specific to that workload, though. Our very own > > "perf bench futex" hash exercises the same buckets and shows the same > > false sharing, so the rest of this changelog quantifies the fix with > > perf bench futex. > > So I can't see this. After 'fixing' the benchmark to run with a fixed > number of buckets (see below), a perf c2c record shows the > futex_hash_bucket::priv load to be the 'expensive' (when doing perf > report on that, rather than perf c2c report, because this latter is > total garbage) I am able to confirm with both. Keep in mind that I am using a multi CCD CPU (AMD EPYC 9D64). I ran perf c2c record on an EPYC 9D64 (88C/176T, 1 socket, multi-CCD) under `perf bench futex hash -b 256`, on baseline and patched. Top hot kernel HITM lines: Baseline (b99ae45861ec): offset 0x00 futex_q_lock core.c:865 (waiters) offset 0x04 queued_spin_lock_slowpath qspinlock.c (lock) offset 0x04 _raw_spin_lock atomic.h (lock) offset 0x18 futex_hash core.c:312 (priv) + With this patch offset 0x00 futex_q_lock core.c:865 (waiters) offset 0x04 queued_spin_lock_slowpath qspinlock.c (lock) offset 0x04 _raw_spin_lock atomic.h (lock) [no priv entry on this cacheline] `futex_hash` is literally the lockless `fph = hb->priv` read. On baseline it sits on the lock cacheline at offset 0x18 and is a top HITM source - exactly what you saw. On this patch that entry is gone from the lock cacheline. What remains at offsets 0x00 and 0x04 is intrinsic lock contention (waiters_pending fast path + queued spinlock hand-off); the patch can't reduce that without changing the lock itself. Throughput on the same run: baseline : 1,267,863 ops/sec +This patch: 1,460,971 ops/sec > > Move chain to its own cacheline so: > > - Lockless waiters_pending() readers no longer invalidate the line > > that lock contenders are spinning to acquire. > > - Cross-CCD lock handoffs ship only the (waiters, lock) line; the > > next holder reads chain from its own L2/L3 instead of fetching > > chain entries together with the lock byte. > > > > This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by > > 15%: > > > > baseline +fix delta > > average 1,394,938 1,616,781 +15.9 % > > median 1,430,012 1,617,072 +13.1 % > > min 1,214,488 1,501,741 +23.5 % > > max 1,488,167 1,730,734 +16.3 % > > > > The distributions do not overlap: the slowest +fix run (1.50 M) is > > faster than every baseline run except the single fastest (1.49 M). > > When I run: "perf bench futex hash", I do see massive contention, but > not on the line you mention. Instead we hammer mm->futex.phash.atomic in > futex_ref_{get,put}(). > > These are the atomic_long_inc_not_zero() / atomic_long_dec_and_test(). > > The reason this happens is unfortunate, you would want this thing to hit > the PERCPU fast-path, but due to the per-thread auto scaling, the > benchmark startup phase allocates a (2 thread) small hash, then a bigger > and a bigger, for each next thread that comes in. > > Per there being a pending new hash, we drop to ATOMIC mode, such that we > can actually observe the 0 references. > > However, because the benchmark is in fact hammering the buckets (per > design), it will never actually hit 0 references and swap in the larger > hash. Ack. the auto-scaling pathology you described reproduces here too > If one were to specific an explicit number of buckets, the benchmark > will function correctly: > > v7.1-rc7 +patch > > perf bench futex hash 192479 195523 +1.5% > perf bench futex hash -b 256 3453734 3987880 +15.5% > > And then I do see the improvement from your patch, but I really cannot > make sense of your reasoning for it. So, let me rephrase it. The bucket cacheline takes hits from four access patterns - the three I listed (waiters_pending readers, lock spinners, lock-holder chain writes) plus the lockless `fph = hb->priv` load on the futex_hash() fast path, which is what c2c surfaced. That priv load is the dominant HITM source on baseline, not the chain writes I emphasized. > > Cost: one extra cacheline (56 B padding) per bucket. Would it be > > acceptable? > > I'm really not sure, it *doubles* the futex memory cost. I think it's worth the trade. The global hash scales linearly with num_possible_cpus(), so the extra bytes track the same curve as the machines that actually need the fix in simpler words, a box big enough to feel this contention has plenty of RAM headroom to absorb it. Thanks for the review, --breno