From: Breno Leitao <leitao@debian.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Thomas Gleixner" <tglx@kernel.org>,
"Ingo Molnar" <mingo@redhat.com>,
"Darren Hart" <dvhart@infradead.org>,
"Davidlohr Bueso" <dave@stgolabs.net>,
"André Almeida" <andrealmeid@igalia.com>,
linux-kernel@vger.kernel.org, puranjay@kernel.org,
rmikey@meta.com, stuclar@meta.com, namhyung@kernel.org,
kernel-team@meta.com
Subject: Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
Date: Tue, 9 Jun 2026 08:28:16 -0700 [thread overview]
Message-ID: <aigTu8PqXn8wyhiK@gmail.com> (raw)
In-Reply-To: <20260609104603.GA48970@noisy.programming.kicks-ass.net>
Hello Peter,
On Tue, Jun 09, 2026 at 12:46:03PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> > struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
> > struct plist_head chain, struct futex_private_hash *priv) into a
> > single ____cacheline_aligned_in_smp 64-byte block. Three distinct
> > access patterns hit that line:
> >
> > 1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
> > on the fast path before taking the lock.
> > 2. spin_lock(&hb->lock) contenders writing the lock word.
> > 3. The lock holder modifying chain.{next,prev} on every futex_wake,
> > futex_q_unlock, plist_add, __futex_unqueue.
> >
> > This was first noticed on a Meta cache (ucache) production workload:
> > perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
> > the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
> > CPUs in a second.
> >
> > The contention is not specific to that workload, though. Our very own
> > "perf bench futex" hash exercises the same buckets and shows the same
> > false sharing, so the rest of this changelog quantifies the fix with
> > perf bench futex.
>
> So I can't see this. After 'fixing' the benchmark to run with a fixed
> number of buckets (see below), a perf c2c record shows the
> futex_hash_bucket::priv load to be the 'expensive' (when doing perf
> report on that, rather than perf c2c report, because this latter is
> total garbage)
I am able to confirm with both. Keep in mind that I am using a multi CCD
CPU (AMD EPYC 9D64).
I ran perf c2c record on an EPYC 9D64 (88C/176T, 1 socket, multi-CCD)
under `perf bench futex hash -b 256`, on baseline and patched. Top hot
kernel HITM lines:
Baseline (b99ae45861ec):
offset 0x00 futex_q_lock core.c:865 (waiters)
offset 0x04 queued_spin_lock_slowpath qspinlock.c (lock)
offset 0x04 _raw_spin_lock atomic.h (lock)
offset 0x18 futex_hash core.c:312 (priv)
+ With this patch
offset 0x00 futex_q_lock core.c:865 (waiters)
offset 0x04 queued_spin_lock_slowpath qspinlock.c (lock)
offset 0x04 _raw_spin_lock atomic.h (lock)
[no priv entry on this cacheline]
`futex_hash` is literally the lockless `fph = hb->priv`
read. On baseline it sits on the lock cacheline at offset 0x18 and is
a top HITM source - exactly what you saw. On this patch that entry is
gone from the lock cacheline.
What remains at offsets 0x00 and 0x04 is intrinsic lock contention
(waiters_pending fast path + queued spinlock hand-off); the patch can't reduce
that without changing the lock itself.
Throughput on the same run:
baseline : 1,267,863 ops/sec
+This patch: 1,460,971 ops/sec
> > Move chain to its own cacheline so:
> > - Lockless waiters_pending() readers no longer invalidate the line
> > that lock contenders are spinning to acquire.
> > - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
> > next holder reads chain from its own L2/L3 instead of fetching
> > chain entries together with the lock byte.
> >
> > This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
> > 15%:
> >
> > baseline +fix delta
> > average 1,394,938 1,616,781 +15.9 %
> > median 1,430,012 1,617,072 +13.1 %
> > min 1,214,488 1,501,741 +23.5 %
> > max 1,488,167 1,730,734 +16.3 %
> >
> > The distributions do not overlap: the slowest +fix run (1.50 M) is
> > faster than every baseline run except the single fastest (1.49 M).
>
> When I run: "perf bench futex hash", I do see massive contention, but
> not on the line you mention. Instead we hammer mm->futex.phash.atomic in
> futex_ref_{get,put}().
>
> These are the atomic_long_inc_not_zero() / atomic_long_dec_and_test().
>
> The reason this happens is unfortunate, you would want this thing to hit
> the PERCPU fast-path, but due to the per-thread auto scaling, the
> benchmark startup phase allocates a (2 thread) small hash, then a bigger
> and a bigger, for each next thread that comes in.
>
> Per there being a pending new hash, we drop to ATOMIC mode, such that we
> can actually observe the 0 references.
>
> However, because the benchmark is in fact hammering the buckets (per
> design), it will never actually hit 0 references and swap in the larger
> hash.
Ack. the auto-scaling pathology you described reproduces here too
> If one were to specific an explicit number of buckets, the benchmark
> will function correctly:
>
> v7.1-rc7 +patch
>
> perf bench futex hash 192479 195523 +1.5%
> perf bench futex hash -b 256 3453734 3987880 +15.5%
>
> And then I do see the improvement from your patch, but I really cannot
> make sense of your reasoning for it.
So, let me rephrase it. The bucket cacheline takes hits from four access
patterns - the three I listed (waiters_pending readers, lock spinners,
lock-holder chain writes) plus the lockless `fph = hb->priv` load on the
futex_hash() fast path, which is what c2c surfaced. That priv load is the
dominant HITM source on baseline, not the chain writes I emphasized.
> > Cost: one extra cacheline (56 B padding) per bucket. Would it be
> > acceptable?
>
> I'm really not sure, it *doubles* the futex memory cost.
I think it's worth the trade. The global hash scales linearly with
num_possible_cpus(), so the extra bytes track the same curve as the machines
that actually need the fix
in simpler words, a box big enough to feel this contention has plenty of RAM
headroom to absorb it.
Thanks for the review,
--breno
next prev parent reply other threads:[~2026-06-09 15:28 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-05 16:53 [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock Breno Leitao
2026-06-09 10:46 ` Peter Zijlstra
2026-06-09 15:28 ` Breno Leitao [this message]
2026-06-09 20:11 ` Peter Zijlstra
2026-06-09 20:18 ` Peter Zijlstra
2026-06-09 20:16 ` Thomas Gleixner
2026-06-09 20:23 ` Peter Zijlstra
2026-06-09 20:25 ` Peter Zijlstra
2026-06-09 20:32 ` Thomas Gleixner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aigTu8PqXn8wyhiK@gmail.com \
--to=leitao@debian.org \
--cc=andrealmeid@igalia.com \
--cc=dave@stgolabs.net \
--cc=dvhart@infradead.org \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=namhyung@kernel.org \
--cc=peterz@infradead.org \
--cc=puranjay@kernel.org \
--cc=rmikey@meta.com \
--cc=stuclar@meta.com \
--cc=tglx@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.