From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from stravinsky.debian.org (stravinsky.debian.org [82.195.75.108])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5C0244A13BC
	for <linux-kernel@vger.kernel.org>; Tue,  9 Jun 2026 15:28:35 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=82.195.75.108
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1781018917; cv=none; b=UO1ihL7dcF50DBbHZgQlu9lRr9mKnwa54zGSmjqoS56PKvrowBYUKLx+OKWUC+ORNmQk1dMOGZxPFKt55U6/46W26/cbgYupS1s7DQ1FO1IhwFfTb6eWGrPhTKvD4Xgr4Vv3ze2WUZmbTKB8P5voTys1Zf53Lbbpzob1lIrWjW4=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1781018917; c=relaxed/simple;
	bh=beLFej057r38qhz3oeSqLZFtzmCD1y4CA2R995MPUMQ=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=l7oHX65zXGtaf3Yp6IaXcYE299wigI9UV34SQkLbAFFZnx8m+5J3jmjO02yoDI27/E09oncfMg4AXkXkIorlpKtMuy97+Fp0FDDntWnhO/Q8joHtNYR2UNVzertrCZK40JZu46wm4fVDUF1RbmZMIcgrYaq1UdGnS84JURnzpnA=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org; spf=pass smtp.mailfrom=debian.org; dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b=pIRkGhQB; arc=none smtp.client-ip=82.195.75.108
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=debian.org
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=debian.org
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=debian.org header.i=@debian.org header.b="pIRkGhQB"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=debian.org;
	s=smtpauto.stravinsky; h=X-Debian-User:In-Reply-To:Content-Type:MIME-Version:
	References:Message-ID:Subject:Cc:To:From:Date:Reply-To:
	Content-Transfer-Encoding:Content-ID:Content-Description;
	bh=T1QflihHdefDxnbe5IdLqYwDKwMl4Z6pl2T4jRX5Td8=; b=pIRkGhQBqPxpoNl943BLqtRDAk
	lGPY3V/Olbo+yqzBsLZNjzFB6twTBbmxBLSn8oYbsNEDOsNzMxViKz8rI+CnLkItFkZ92Gpb+9i+u
	qlFfHdQ9fzk694jl+GatN8G5tmq748/+OJS0EJOfbAt3nOjD0uScqvx3pEWzSSfoHQaXcBRvAp4t3
	XD/qbe/VR5KWQaXob3207Z2ZeWqpTcXev9H1FGcdj8g2jwhf6Grp3tlg/BcOzC2TygnOGZsnAGweX
	iFd5Qgw/TMvlUg5L34Z4SecFe1xTTL/kvXdX7ZFZkLkvMk9+0pTv9oFYYuI1EzjsVGejWsl4koTy3
	hZkfClEQ==;
Received: from authenticated-user
	by stravinsky.debian.org with esmtpsa (TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_256_GCM:256)
	(Exim 4.96)
	(envelope-from <leitao@debian.org>)
	id 1wWyNJ-008UqV-0f;
	Tue, 09 Jun 2026 15:28:21 +0000
Date: Tue, 9 Jun 2026 08:28:16 -0700
From: Breno Leitao <leitao@debian.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@kernel.org>, Ingo Molnar <mingo@redhat.com>, 
	Darren Hart <dvhart@infradead.org>, Davidlohr Bueso <dave@stgolabs.net>, 
	=?utf-8?B?QW5kcsOp?= Almeida <andrealmeid@igalia.com>, linux-kernel@vger.kernel.org, puranjay@kernel.org, 
	rmikey@meta.com, stuclar@meta.com, namhyung@kernel.org, kernel-team@meta.com
Subject: Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the
 bucket lock
Message-ID: <aigTu8PqXn8wyhiK@gmail.com>
References: <20260605-futex-v1-1-4ad4a0d6f265@debian.org>
 <20260609104603.GA48970@noisy.programming.kicks-ass.net>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260609104603.GA48970@noisy.programming.kicks-ass.net>
X-Debian-User: leitao

Hello Peter,

On Tue, Jun 09, 2026 at 12:46:03PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> > struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
> > struct plist_head chain, struct futex_private_hash *priv) into a
> > single ____cacheline_aligned_in_smp 64-byte block. Three distinct
> > access patterns hit that line:
> > 
> >   1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
> >      on the fast path before taking the lock.
> >   2. spin_lock(&hb->lock) contenders writing the lock word.
> >   3. The lock holder modifying chain.{next,prev} on every futex_wake,
> >      futex_q_unlock, plist_add, __futex_unqueue.
> > 
> > This was first noticed on a Meta cache (ucache) production workload:
> > perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
> > the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
> > CPUs in a second.
> > 
> > The contention is not specific to that workload, though. Our very own
> > "perf bench futex" hash exercises the same buckets and shows the same
> > false sharing, so the rest of this changelog quantifies the fix with
> > perf bench futex.
> 
> So I can't see this. After 'fixing' the benchmark to run with a fixed
> number of buckets (see below), a perf c2c record shows the
> futex_hash_bucket::priv load to be the 'expensive' (when doing perf
> report on that, rather than perf c2c report, because this latter is
> total garbage)

I am able to confirm with both. Keep in mind that I am using a multi CCD
CPU (AMD EPYC 9D64).

I ran perf c2c record on an EPYC 9D64 (88C/176T, 1 socket, multi-CCD)
under `perf bench futex hash -b 256`, on baseline and patched. Top hot
kernel HITM lines:

  Baseline (b99ae45861ec):
    offset 0x00  futex_q_lock                  core.c:865   (waiters)
    offset 0x04  queued_spin_lock_slowpath     qspinlock.c  (lock)
    offset 0x04  _raw_spin_lock                atomic.h     (lock)
    offset 0x18  futex_hash                    core.c:312   (priv)

  + With this patch 
    offset 0x00  futex_q_lock                  core.c:865   (waiters)
    offset 0x04  queued_spin_lock_slowpath     qspinlock.c  (lock)
    offset 0x04  _raw_spin_lock                atomic.h     (lock)
    [no priv entry on this cacheline]

`futex_hash` is literally the lockless `fph = hb->priv`
read. On baseline it sits on the lock cacheline at offset 0x18 and is
a top HITM source - exactly what you saw. On this patch that entry is
gone from the lock cacheline.

What remains at offsets 0x00 and 0x04 is intrinsic lock contention
(waiters_pending fast path + queued spinlock hand-off); the patch can't reduce
that without changing the lock itself.

Throughput on the same run:

  baseline   : 1,267,863 ops/sec
  +This patch: 1,460,971 ops/sec

> > Move chain to its own cacheline so:
> >   - Lockless waiters_pending() readers no longer invalidate the line
> >     that lock contenders are spinning to acquire.
> >   - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
> >     next holder reads chain from its own L2/L3 instead of fetching
> >     chain entries together with the lock byte.
> > 
> > This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
> > 15%:
> > 
> >                    baseline    +fix       delta
> >   average      1,394,938   1,616,781    +15.9 %
> >   median       1,430,012   1,617,072    +13.1 %
> >   min          1,214,488   1,501,741    +23.5 %
> >   max          1,488,167   1,730,734    +16.3 %
> > 
> > The distributions do not overlap: the slowest +fix run (1.50 M) is
> > faster than every baseline run except the single fastest (1.49 M).
> 
> When I run: "perf bench futex hash", I do see massive contention, but
> not on the line you mention. Instead we hammer mm->futex.phash.atomic in
> futex_ref_{get,put}().
> 
> These are the atomic_long_inc_not_zero() / atomic_long_dec_and_test().
> 
> The reason this happens is unfortunate, you would want this thing to hit
> the PERCPU fast-path, but due to the per-thread auto scaling, the
> benchmark startup phase allocates a (2 thread) small hash, then a bigger
> and a bigger, for each next thread that comes in.
> 
> Per there being a pending new hash, we drop to ATOMIC mode, such that we
> can actually observe the 0 references.
> 
> However, because the benchmark is in fact hammering the buckets (per
> design), it will never actually hit 0 references and swap in the larger
> hash.

Ack.  the auto-scaling pathology you described reproduces here too

> If one were to specific an explicit number of buckets, the benchmark
> will function correctly:
> 
>   				       v7.1-rc7	+patch
> 
>   perf bench futex hash			 192479  195523  +1.5%
>   perf bench futex hash -b 256		3453734 3987880 +15.5%
> 
> And then I do see the improvement from your patch, but I really cannot
> make sense of your reasoning for it.

So, let me rephrase it. The bucket cacheline takes hits from four access
patterns - the three I listed (waiters_pending readers, lock spinners,
lock-holder chain writes) plus the lockless `fph = hb->priv` load on the
futex_hash() fast path, which is what c2c surfaced. That priv load is the
dominant HITM source on baseline, not the chain writes I emphasized. 

> > Cost: one extra cacheline (56 B padding) per bucket. Would it be
> > acceptable?
> 
> I'm really not sure, it *doubles* the futex memory cost.

I think it's worth the trade. The global hash scales linearly with
num_possible_cpus(), so the extra bytes track the same curve as the machines
that actually need the fix

in simpler words, a box big enough to feel this contention has plenty of RAM
headroom to absorb it.

Thanks for the review,
--breno