[PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock

* [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
@ 2026-06-05 16:53 Breno Leitao
  2026-06-09 10:46 ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-06-05 16:53 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, André Almeida
  Cc: linux-kernel, puranjay, rmikey, stuclar, namhyung, kernel-team,
	Breno Leitao

struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
struct plist_head chain, struct futex_private_hash *priv) into a
single ____cacheline_aligned_in_smp 64-byte block. Three distinct
access patterns hit that line:

  1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
     on the fast path before taking the lock.
  2. spin_lock(&hb->lock) contenders writing the lock word.
  3. The lock holder modifying chain.{next,prev} on every futex_wake,
     futex_q_unlock, plist_add, __futex_unqueue.

This was first noticed on a Meta cache (ucache) production workload:
perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
CPUs in a second.

The contention is not specific to that workload, though. Our very own
"perf bench futex" hash exercises the same buckets and shows the same
false sharing, so the rest of this changelog quantifies the fix with
perf bench futex.

Move chain to its own cacheline so:
  - Lockless waiters_pending() readers no longer invalidate the line
    that lock contenders are spinning to acquire.
  - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
    next holder reads chain from its own L2/L3 instead of fetching
    chain entries together with the lock byte.

This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
15%:

                baseline        +fix      delta
  average      1,394,938   1,616,781    +15.9 %
  median       1,430,012   1,617,072    +13.1 %
  min          1,214,488   1,501,741    +23.7 %
  max          1,488,167   1,730,734    +16.3 %

The distributions do not overlap: the slowest +fix run (1.50 M) is
faster than even the fastest baseline run (1.49 M).

This improves wake up latency as well:

perf bench futex wake -s (broadcast wakeup latency, lower is better):
  baseline:   0.300 / 0.329 / 0.266 ms   (avg 0.298)
  +fix:       0.292 / 0.253 / 0.270 ms   (avg 0.272, -9 %)

Cost: one extra cacheline (56 B padding) per bucket. Would it be
acceptable?

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/futex/futex.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 79ef2c709c81..4981dcf465a9 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -142,7 +142,16 @@ static inline bool should_fail_futex(bool fshared)
 struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
-	struct plist_head chain;
+	/*
+	 * Keep the plist_head chain on its own cacheline. Lockless
+	 * futex_hb_waiters_pending() readers and lock contenders touch
+	 * the (waiters, lock) line; the lock holder modifies chain on
+	 * every wake/queue. perf c2c on a busy 176-core AMD host showed
+	 * this bucket cacheline as the #1 HITM source (129 Lcl + 31 Rmt
+	 * in 5s), hit by 156 distinct CPUs at offset 0x4 (lock) and
+	 * 0x8/0x10 (chain.{next,prev}).
+	 */
+	struct plist_head chain ____cacheline_aligned_in_smp;
 	struct futex_private_hash *priv;
 } ____cacheline_aligned_in_smp;
 

---
base-commit: b99ae45861eccff1e1d8c7b05a13650be805d437
change-id: 20260605-futex-c5478d627985

Best regards,
-- 
Breno Leitao <leitao@debian.org>


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-05 16:53 [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock Breno Leitao
@ 2026-06-09 10:46 ` Peter Zijlstra
  2026-06-09 15:28   ` Breno Leitao
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2026-06-09 10:46 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
> struct plist_head chain, struct futex_private_hash *priv) into a
> single ____cacheline_aligned_in_smp 64-byte block. Three distinct
> access patterns hit that line:
> 
>   1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
>      on the fast path before taking the lock.
>   2. spin_lock(&hb->lock) contenders writing the lock word.
>   3. The lock holder modifying chain.{next,prev} on every futex_wake,
>      futex_q_unlock, plist_add, __futex_unqueue.
> 
> This was first noticed on a Meta cache (ucache) production workload:
> perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
> the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
> CPUs in a second.
> 
> The contention is not specific to that workload, though. Our very own
> "perf bench futex" hash exercises the same buckets and shows the same
> false sharing, so the rest of this changelog quantifies the fix with
> perf bench futex.

So I can't see this. After 'fixing' the benchmark to run with a fixed
number of buckets (see below), a perf c2c record shows the
futex_hash_bucket::priv load to be the 'expensive' one (when doing perf
report on that, rather than perf c2c report, because the latter is
total garbage).

> Move chain to its own cacheline so:
>   - Lockless waiters_pending() readers no longer invalidate the line
>     that lock contenders are spinning to acquire.
>   - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
>     next holder reads chain from its own L2/L3 instead of fetching
>     chain entries together with the lock byte.
> 
> This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
> 15%:
> 
>                    baseline    +fix       delta
>   average      1,394,938   1,616,781    +15.9 %
>   median       1,430,012   1,617,072    +13.1 %
>   min          1,214,488   1,501,741    +23.5 %
>   max          1,488,167   1,730,734    +16.3 %
> 
> The distributions do not overlap: the slowest +fix run (1.50 M) is
> faster than every baseline run except the single fastest (1.49 M).

When I run: "perf bench futex hash", I do see massive contention, but
not on the line you mention. Instead we hammer mm->futex.phash.atomic in
futex_ref_{get,put}().

These are the atomic_long_inc_not_zero() / atomic_long_dec_and_test().

The reason this happens is unfortunate: you would want this thing to hit
the PERCPU fast-path, but due to the per-thread auto scaling, the
benchmark startup phase allocates a small (2 thread) hash, then a bigger
one, and then a bigger one still, for each next thread that comes in.

Because a new hash is pending, we drop to ATOMIC mode, such that we
can actually observe the 0 references.

However, because the benchmark is in fact hammering the buckets (per
design), it will never actually hit 0 references and swap in the larger
hash.
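
A toy, single-threaded model of the behaviour described above, in
userspace C11. This is not the kernel's futex_ref_{get,put}() code:
struct toy_hash_ref and all toy_* helpers are made up for illustration,
and they only imitate the inc-not-zero / dec-and-test semantics named
above. It assumes every user drops a reference and immediately takes the
next one while the other users still hold theirs.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared counter used in "atomic" mode. */
struct toy_hash_ref {
	atomic_long refs;
};

/* Take a reference unless the count has already dropped to zero. */
static inline bool toy_ref_get(struct toy_hash_ref *r)
{
	long old = atomic_load(&r->refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; true means this put observed the count hitting zero. */
static inline bool toy_ref_put(struct toy_hash_ref *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}

/*
 * Every user starts out holding one reference, then repeatedly does
 * put + get. Returns how many puts saw zero, i.e. how many chances
 * there were to swap in a pending larger hash.
 */
static inline int toy_count_swap_chances(int users, int rounds)
{
	struct toy_hash_ref r;
	int chances = 0;

	atomic_init(&r.refs, users);

	for (int i = 0; i < rounds; i++) {
		for (int u = 0; u < users; u++) {
			if (toy_ref_put(&r))
				return ++chances;	/* zero observed, stop */
			toy_ref_get(&r);	/* next operation starts at once */
		}
	}
	return chances;
}
```

In this model toy_count_swap_chances(4, 1000) never sees zero, while a
single user (toy_count_swap_chances(1, 1000)) hits zero on its first
put. Real threads interleave differently, so treat this only as an
intuition for why a benchmark that keeps the buckets busy can stay stuck
on the small hash.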

If one were to specify an explicit number of buckets, the benchmark
will function correctly:

  				       v7.1-rc7	+patch

  perf bench futex hash			 192479  195523  +1.5%
  perf bench futex hash -b 256		3453734 3987880 +15.5%

And then I do see the improvement from your patch, but I really cannot
make sense of your reasoning for it.

> Cost: one extra cacheline (56 B padding) per bucket. Would it be
> acceptable?

I'm really not sure, it *doubles* the futex memory cost.
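
The "56 B padding" and "doubles" figures can be checked with a userspace
mock of the two layouts. This is only a sketch: the toy_* types are
stand-ins whose sizes assume a 64-bit, non-debug build (4-byte atomic_t
and spinlock_t, 16-byte plist_head, 8-byte pointer), the 64-byte line
size is the one quoted in the changelog, and the aligned attribute is
the GCC/Clang spelling rather than the kernel macro.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed cacheline size */

/* Userspace stand-ins for the kernel types (sizes are assumptions). */
typedef struct { int counter; } toy_atomic_t;
typedef struct { unsigned int val; } toy_spinlock_t;
struct toy_list_head { struct toy_list_head *next, *prev; };
struct toy_plist_head { struct toy_list_head node_list; };

/* Baseline: all four fields share one cacheline. */
struct toy_bucket_old {
	toy_atomic_t waiters;
	toy_spinlock_t lock;
	struct toy_plist_head chain;
	void *priv;
} __attribute__((aligned(CACHELINE)));

/* Patched: chain starts a second line, and priv follows it there. */
struct toy_bucket_new {
	toy_atomic_t waiters;
	toy_spinlock_t lock;
	struct toy_plist_head chain __attribute__((aligned(CACHELINE)));
	void *priv;
} __attribute__((aligned(CACHELINE)));

static_assert(offsetof(struct toy_bucket_old, lock) == 0x04, "lock");
static_assert(offsetof(struct toy_bucket_old, chain) == 0x08, "chain");
static_assert(offsetof(struct toy_bucket_old, priv) == 0x18, "priv");
static_assert(sizeof(struct toy_bucket_old) == 64, "one line");

static_assert(offsetof(struct toy_bucket_new, chain) == 64, "chain");
static_assert(offsetof(struct toy_bucket_new, priv) == 80, "priv");
static_assert(sizeof(struct toy_bucket_new) == 128, "two lines");

/* Bytes of padding inserted between lock and chain by the alignment. */
static inline size_t toy_padding_bytes(void)
{
	return offsetof(struct toy_bucket_new, chain) -
	       (offsetof(struct toy_bucket_new, lock) + sizeof(toy_spinlock_t));
}
```

Under these assumptions toy_padding_bytes() is 56 and the bucket grows
from 64 to 128 bytes. Note that in this mock priv moves to the second
line together with chain, because it is declared after the aligned
member.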

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-09 10:46 ` Peter Zijlstra
@ 2026-06-09 15:28   ` Breno Leitao
  2026-06-09 20:11     ` Peter Zijlstra
  2026-06-09 20:16     ` Thomas Gleixner
  0 siblings, 2 replies; 13+ messages in thread
From: Breno Leitao @ 2026-06-09 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

Hello Peter,

On Tue, Jun 09, 2026 at 12:46:03PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> > struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
> > struct plist_head chain, struct futex_private_hash *priv) into a
> > single ____cacheline_aligned_in_smp 64-byte block. Three distinct
> > access patterns hit that line:
> > 
> >   1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
> >      on the fast path before taking the lock.
> >   2. spin_lock(&hb->lock) contenders writing the lock word.
> >   3. The lock holder modifying chain.{next,prev} on every futex_wake,
> >      futex_q_unlock, plist_add, __futex_unqueue.
> > 
> > This was first noticed on a Meta cache (ucache) production workload:
> > perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
> > the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
> > CPUs in a second.
> > 
> > The contention is not specific to that workload, though. Our very own
> > "perf bench futex" hash exercises the same buckets and shows the same
> > false sharing, so the rest of this changelog quantifies the fix with
> > perf bench futex.
> 
> So I can't see this. After 'fixing' the benchmark to run with a fixed
> number of buckets (see below), a perf c2c record shows the
> futex_hash_bucket::priv load to be the 'expensive' (when doing perf
> report on that, rather than perf c2c report, because this latter is
> total garbage)

I am able to confirm with both. Keep in mind that I am using a multi CCD
CPU (AMD EPYC 9D64).

I ran perf c2c record on an EPYC 9D64 (88C/176T, 1 socket, multi-CCD)
under `perf bench futex hash -b 256`, on baseline and patched. Top hot
kernel HITM lines:

  Baseline (b99ae45861ec):
    offset 0x00  futex_q_lock                  core.c:865   (waiters)
    offset 0x04  queued_spin_lock_slowpath     qspinlock.c  (lock)
    offset 0x04  _raw_spin_lock                atomic.h     (lock)
    offset 0x18  futex_hash                    core.c:312   (priv)

  + With this patch:
    offset 0x00  futex_q_lock                  core.c:865   (waiters)
    offset 0x04  queued_spin_lock_slowpath     qspinlock.c  (lock)
    offset 0x04  _raw_spin_lock                atomic.h     (lock)
    [no priv entry on this cacheline]

`futex_hash` is literally the lockless `fph = hb->priv` read. On
baseline it sits on the lock cacheline at offset 0x18 and is a top HITM
source - exactly what you saw. With this patch that entry is gone from
the lock cacheline.

What remains at offsets 0x00 and 0x04 is intrinsic lock contention
(waiters_pending fast path + queued spinlock hand-off); the patch can't reduce
that without changing the lock itself.

Throughput on the same run:

  baseline   : 1,267,863 ops/sec
  +This patch: 1,460,971 ops/sec

> > Move chain to its own cacheline so:
> >   - Lockless waiters_pending() readers no longer invalidate the line
> >     that lock contenders are spinning to acquire.
> >   - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
> >     next holder reads chain from its own L2/L3 instead of fetching
> >     chain entries together with the lock byte.
> > 
> > This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
> > 15%:
> > 
> >                    baseline    +fix       delta
> >   average      1,394,938   1,616,781    +15.9 %
> >   median       1,430,012   1,617,072    +13.1 %
> >   min          1,214,488   1,501,741    +23.5 %
> >   max          1,488,167   1,730,734    +16.3 %
> > 
> > The distributions do not overlap: the slowest +fix run (1.50 M) is
> > faster than every baseline run except the single fastest (1.49 M).
> 
> When I run: "perf bench futex hash", I do see massive contention, but
> not on the line you mention. Instead we hammer mm->futex.phash.atomic in
> futex_ref_{get,put}().
> 
> These are the atomic_long_inc_not_zero() / atomic_long_dec_and_test().
> 
> The reason this happens is unfortunate, you would want this thing to hit
> the PERCPU fast-path, but due to the per-thread auto scaling, the
> benchmark startup phase allocates a (2 thread) small hash, then a bigger
> and a bigger, for each next thread that comes in.
> 
> Per there being a pending new hash, we drop to ATOMIC mode, such that we
> can actually observe the 0 references.
> 
> However, because the benchmark is in fact hammering the buckets (per
> design), it will never actually hit 0 references and swap in the larger
> hash.

Ack. The auto-scaling pathology you described reproduces here too.

> If one were to specify an explicit number of buckets, the benchmark
> will function correctly:
> 
>   				       v7.1-rc7	+patch
> 
>   perf bench futex hash			 192479  195523  +1.5%
>   perf bench futex hash -b 256		3453734 3987880 +15.5%
> 
> And then I do see the improvement from your patch, but I really cannot
> make sense of your reasoning for it.

So, let me rephrase it. The bucket cacheline takes hits from four access
patterns - the three I listed (waiters_pending readers, lock spinners,
lock-holder chain writes) plus the lockless `fph = hb->priv` load on the
futex_hash() fast path, which is what c2c surfaced. That priv load is the
dominant HITM source on baseline, not the chain writes I emphasized. 

> > Cost: one extra cacheline (56 B padding) per bucket. Would it be
> > acceptable?
> 
> I'm really not sure, it *doubles* the futex memory cost.

I think it's worth the trade. The global hash scales linearly with
num_possible_cpus(), so the extra bytes track the same curve as the machines
that actually need the fix.

In simpler words, a box big enough to feel this contention has plenty of RAM
headroom to absorb it.
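
To put rough numbers on that, here is a small sketch of the arithmetic.
The helpers are hypothetical, and both the buckets-per-CPU factor passed
in and the power-of-two rounding are assumptions made for illustration;
only the 64 B vs 128 B bucket sizes come from this thread.

```c
#include <stddef.h>

/* Round n up to the next power of two (n >= 1). */
static inline size_t toy_roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Bytes used by a bucket array sized for `cpus` possible CPUs.
 * buckets_per_cpu is an assumed scaling factor; bucket_size is 64 on
 * the baseline and 128 with the patch.
 */
static inline size_t toy_hash_bytes(size_t cpus, size_t buckets_per_cpu,
				    size_t bucket_size)
{
	return toy_roundup_pow2(cpus * buckets_per_cpu) * bucket_size;
}
```

With an assumed 256 buckets per CPU on the 176-CPU box, this gives
toy_hash_bytes(176, 256, 64) = 4 MiB on the baseline and
toy_hash_bytes(176, 256, 128) = 8 MiB with the patch. A 256-bucket
table, as in `-b 256`, goes from 16 KiB to 32 KiB.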

Thanks for the review,
--breno

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-09 15:28   ` Breno Leitao
@ 2026-06-09 20:11     ` Peter Zijlstra
  2026-06-09 20:18       ` Peter Zijlstra
  2026-06-09 20:16     ` Thomas Gleixner
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2026-06-09 20:11 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Tue, Jun 09, 2026 at 08:28:16AM -0700, Breno Leitao wrote:

> > I'm really not sure, it *doubles* the futex memory cost.
> 
> I think it's worth the trade. The global hash scales linearly with
> num_possible_cpus(), so the extra bytes track the same curve as the machines
> that actually need the fix
> 
> in simpler words, a box big enough to feel this contention has plenty of RAM
> headroom to absorb it.

You might not have heard, but RAM has gotten ludicrously expensive.

Anyway, how does something like the below work for you? It's a total
hack job, but it (sorta) builds and runs.


---
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index ff2a4fb2993f..8555c76077af 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -124,7 +124,7 @@ late_initcall(fail_futex_debugfs);
 #endif /* CONFIG_FAIL_FUTEX */
 
 static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph);
+__futex_hash(union futex_key *key, struct futex_private_hash *fph, struct futex_private_hash **fph_p);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 static bool futex_ref_get(struct futex_private_hash *fph);
@@ -183,14 +183,6 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 {
 	u32 hash;
 
-	if (!futex_key_is_private(key))
-		return NULL;
-
-	if (!fph)
-		fph = rcu_dereference(key->private.mm->futex_phash);
-	if (!fph || !fph->hash_mask)
-		return NULL;
-
 	hash = jhash2((void *)&key->private.address,
 		      sizeof(key->private.address) / 4,
 		      key->both.offset);
@@ -211,13 +203,12 @@ static void futex_rehash_private(struct futex_private_hash *old,
 
 		spin_lock(&hb_old->lock);
 		plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
-
 			plist_del(&this->list, &hb_old->chain);
 			futex_hb_waiters_dec(hb_old);
 
 			WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
 
-			hb_new = __futex_hash(&this->key, new);
+			hb_new = __futex_hash(&this->key, new, NULL);
 			futex_hb_waiters_inc(hb_new);
 			/*
 			 * The new pointer isn't published yet but an already
@@ -299,18 +290,17 @@ struct futex_private_hash *futex_private_hash(void)
 	goto again;
 }
 
-struct futex_hash_bucket *futex_hash(union futex_key *key)
+struct futex_bucket_ref futex_hash(union futex_key *key)
 {
-	struct futex_private_hash *fph;
+	struct futex_private_hash *fph = NULL;
 	struct futex_hash_bucket *hb;
 
 again:
 	scoped_guard(rcu) {
-		hb = __futex_hash(key, NULL);
-		fph = hb->priv;
+		hb = __futex_hash(key, NULL, &fph);
 
 		if (!fph || futex_private_hash_get(fph))
-			return hb;
+			return (struct futex_bucket_ref){ .hb = hb, .fph = fph };
 	}
 	futex_pivot_hash(key->private.mm);
 	goto again;
@@ -412,17 +402,19 @@ static int futex_mpol(struct mm_struct *mm, unsigned long addr)
  * global hash is returned.
  */
 static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph)
+__futex_hash(union futex_key *key, struct futex_private_hash *fph, struct futex_private_hash **fph_p)
 {
 	int node = key->both.node;
 	u32 hash;
 
-	if (node == FUTEX_NO_NODE) {
-		struct futex_hash_bucket *hb;
-
-		hb = __futex_hash_private(key, fph);
-		if (hb)
-			return hb;
+	if (node == FUTEX_NO_NODE && futex_key_is_private(key)) {
+		if (!fph)
+			fph = rcu_dereference(key->private.mm->futex_phash);
+		if (fph && fph->hash_mask) {
+			if (fph_p)
+				*fph_p = fph;
+			return __futex_hash_private(key, fph);
+		}
 	}
 
 	hash = jhash2((u32 *)key,
@@ -1348,7 +1340,8 @@ static void exit_pi_state_list(struct task_struct *curr)
 		pi_state = list_entry(next, struct futex_pi_state, list);
 		key = pi_state->key;
 		if (1) {
-			CLASS(hb, hb)(&key);
+			CLASS(hb, hbr)(&key);
+			struct futex_hash_bucket *hb = hbr.hb;
 
 			/*
 			 * We can race against put_pi_state() removing itself from the
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 9f6bf6f585fc..37fc944edeb9 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -2,6 +2,7 @@
 #ifndef _FUTEX_H
 #define _FUTEX_H
 
+#include "linux/mm_types.h"
 #include <linux/futex.h>
 #include <linux/rtmutex.h>
 #include <linux/sched/wake_q.h>
@@ -222,7 +223,6 @@ extern struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns);
 
-extern struct futex_hash_bucket *futex_hash(union futex_key *key);
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
@@ -237,8 +237,15 @@ static inline struct futex_private_hash *futex_private_hash(void) { return NULL;
 static inline void futex_private_hash_put(struct futex_private_hash *fph) { }
 #endif
 
-DEFINE_CLASS(hb, struct futex_hash_bucket *,
-	     if (_T) futex_hash_put(_T),
+struct futex_bucket_ref {
+	struct futex_hash_bucket *hb;
+	struct futex_private_hash *fph;
+};
+
+extern struct futex_bucket_ref futex_hash(union futex_key *key);
+
+DEFINE_CLASS(hb, struct futex_bucket_ref,
+	     if (_T.fph) futex_private_hash_put(_T.fph),
 	     futex_hash(key), union futex_key *key);
 
 DEFINE_CLASS(private_hash, struct futex_private_hash *,
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 643199fdbe62..5c227a4d963d 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -945,7 +945,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb)(&q.key);
+		CLASS(hb, hbr)(&q.key);
+		struct futex_hash_bucket *hb = hbr.hb;
 
 		futex_q_lock(&q, hb);
 
@@ -1101,9 +1102,9 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		futex_unqueue_pi(&q);
 		spin_unlock(q.lock_ptr);
 		if (q.drop_hb_ref) {
-			CLASS(hb, hb)(&q.key);
+			CLASS(hb, hbr)(&q.key);
 			/* Additional reference from futex_unlock_pi() */
-			futex_hash_put(hb);
+			futex_hash_put(hbr.hb);
 		}
 		goto out;
 
@@ -1162,7 +1163,8 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	if (ret)
 		return ret;
 
-	CLASS(hb, hb)(&key);
+	CLASS(hb, hbr)(&key);
+	struct futex_hash_bucket *hb = hbr.hb;
 	spin_lock(&hb->lock);
 retry_hb:
 
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 1d99a84dc9ad..8ae99b7cb873 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -459,8 +459,10 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb1)(&key1);
-		CLASS(hb, hb2)(&key2);
+		CLASS(hb, hbr1)(&key1);
+		CLASS(hb, hbr2)(&key2);
+		struct futex_hash_bucket *hb1 = hbr1.hb;
+		struct futex_hash_bucket *hb2 = hbr2.hb;
 
 		futex_hb_waiters_inc(hb2);
 		double_lock_hb(hb1, hb2);
@@ -838,7 +840,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	switch (futex_requeue_pi_wakeup_sync(&q)) {
 	case Q_REQUEUE_PI_IGNORE:
 		{
-			CLASS(hb, hb)(&q.key);
+			CLASS(hb, hbr)(&q.key);
+			struct futex_hash_bucket *hb = hbr.hb;
 			/* The waiter is still on uaddr1 */
 			spin_lock(&hb->lock);
 			ret = handle_early_requeue_pi_wakeup(hb, &q, to);
@@ -909,9 +912,9 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		BUG();
 	}
 	if (q.drop_hb_ref) {
-		CLASS(hb, hb)(&q.key);
+		CLASS(hb, hbr)(&q.key);
 		/* Additional reference from requeue_pi_wake_futex() */
-		futex_hash_put(hb);
+		futex_hash_put(hbr.hb);
 	}
 
 out:
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index ceed9d879059..8c8e3ae899cb 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -169,7 +169,8 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	if ((flags & FLAGS_STRICT) && !nr_wake)
 		return 0;
 
-	CLASS(hb, hb)(&key);
+	CLASS(hb, hbr)(&key);
+	struct futex_hash_bucket *hb = hbr.hb;
 
 	/* Make sure we really have tasks to wakeup */
 	if (!futex_hb_waiters_pending(hb))
@@ -266,8 +267,10 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb1)(&key1);
-		CLASS(hb, hb2)(&key2);
+		CLASS(hb, hbr1)(&key1);
+		CLASS(hb, hbr2)(&key2);
+		struct futex_hash_bucket *hb1 = hbr1.hb;
+		struct futex_hash_bucket *hb2 = hbr2.hb;
 
 		double_lock_hb(hb1, hb2);
 		op_ret = futex_atomic_op_inuser(op, uaddr2);
@@ -446,7 +449,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 		u32 val = vs[i].w.val;
 
 		if (1) {
-			CLASS(hb, hb)(&q->key);
+			CLASS(hb, hbr)(&q->key);
+			struct futex_hash_bucket *hb = hbr.hb;
 
 			futex_q_lock(q, hb);
 			ret = futex_get_value_locked(&uval, uaddr);
@@ -621,7 +625,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb)(&q->key);
+		CLASS(hb, hbr)(&q->key);
+		struct futex_hash_bucket *hb = hbr.hb;
 
 		futex_q_lock(q, hb);
 

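The core idea of the patch above, as far as it can be read from the
diff, is that the lookup hands back the owning private hash together
with the bucket, so the caller no longer has to load hb->priv from the
contended bucket line. A minimal userspace sketch of that shape follows.
All toy_* names are hypothetical, the plain long counter stands in for
the real reference counting, and RCU and the pivot/retry path are left
out entirely.

```c
#include <stddef.h>

struct toy_private_hash;

struct toy_bucket {
	int waiters;			/* stand-in for the contended fields */
	struct toy_private_hash *priv;	/* back-pointer; not read below */
};

struct toy_private_hash {
	unsigned int hash_mask;
	long refs;			/* simplified reference count */
	struct toy_bucket queues[4];
};

/* Same shape as struct futex_bucket_ref in the patch above. */
struct toy_bucket_ref {
	struct toy_bucket *hb;
	struct toy_private_hash *fph;	/* NULL when the global hash is used */
};

static struct toy_bucket toy_global[4];

/* Look up a bucket and report which private hash (if any) owns it. */
static inline struct toy_bucket_ref
toy_hash(struct toy_private_hash *fph, unsigned int hash)
{
	if (fph && fph->hash_mask) {
		/* The owner is known from the lookup; hb->priv is not touched. */
		fph->refs++;
		return (struct toy_bucket_ref){
			.hb = &fph->queues[hash & fph->hash_mask],
			.fph = fph,
		};
	}
	return (struct toy_bucket_ref){ .hb = &toy_global[hash & 3], .fph = NULL };
}

/* Drop the reference through the returned owner, not through hb->priv. */
static inline void toy_put(struct toy_bucket_ref ref)
{
	if (ref.fph)
		ref.fph->refs--;
}
```

A caller would keep the returned struct around for the duration of the
operation and pass it to toy_put() at the end, which is roughly what the
CLASS(hb, hbr) conversions in the diff do via the class destructor.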
* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-09 15:28   ` Breno Leitao
  2026-06-09 20:11     ` Peter Zijlstra
@ 2026-06-09 20:16     ` Thomas Gleixner
  2026-06-09 20:23       ` Peter Zijlstra
  2026-06-09 20:25       ` Peter Zijlstra
  1 sibling, 2 replies; 13+ messages in thread
From: Thomas Gleixner @ 2026-06-09 20:16 UTC (permalink / raw)
  To: Breno Leitao, Peter Zijlstra
  Cc: Ingo Molnar, Darren Hart, Davidlohr Bueso, André Almeida,
	linux-kernel, puranjay, rmikey, stuclar, namhyung, kernel-team

Breno!

On Tue, Jun 09 2026 at 08:28, Breno Leitao wrote:
> On Tue, Jun 09, 2026 at 12:46:03PM +0200, Peter Zijlstra wrote:
>> On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
>>   perf bench futex hash			 192479  195523  +1.5%
>>   perf bench futex hash -b 256		3453734 3987880 +15.5%
>> 
>> And then I do see the improvement from your patch, but I really cannot
>> make sense of your reasoning for it.
>
> So, let me rephrase it. The bucket cacheline takes hits from four access
> patterns - the three I listed (waiters_pending readers, lock spinners,
> lock-holder chain writes) plus the lockless `fph = hb->priv` load on the
> futex_hash() fast path, which is what c2c surfaced. That priv load is the
> dominant HITM source on baseline, not the chain writes I emphasized. 

Ok. That makes a lot more sense now.

>> > Cost: one extra cacheline (56 B padding) per bucket. Would it be
>> > acceptable?
>> 
>> I'm really not sure, it *doubles* the futex memory cost.
>
> I think it's worth the trade. The global hash scales linearly with
> num_possible_cpus(), so the extra bytes track the same curve as the machines
> that actually need the fix
>
> in simpler words, a box big enough to feel this contention has plenty of RAM
> headroom to absorb it.

Well, it's not only about the global hash. The per process private hash
is affected too.

Can you try the completely untested below?

Thanks,

        tglx
---
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -124,7 +124,7 @@ late_initcall(fail_futex_debugfs);
 #endif /* CONFIG_FAIL_FUTEX */
 
 static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph);
+__futex_hash(union futex_key *key, struct futex_private_hash **fph);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 static bool futex_ref_get(struct futex_private_hash *fph);
@@ -179,22 +179,25 @@ void futex_hash_put(struct futex_hash_bu
 }
 
 static struct futex_hash_bucket *
-__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
+__futex_hash_private(union futex_key *key, struct futex_private_hash **fph)
 {
+	struct futex_private_hash *lfph = *fph;
 	u32 hash;
 
 	if (!futex_key_is_private(key))
 		return NULL;
 
-	if (!fph)
-		fph = rcu_dereference(key->private.mm->futex_phash);
-	if (!fph || !fph->hash_mask)
+	if (!lfph)
+		lfph = rcu_dereference(key->private.mm->futex_phash);
+	if (!lfph || !lfph->hash_mask)
 		return NULL;
 
+	*fph = lfph;
+
 	hash = jhash2((void *)&key->private.address,
 		      sizeof(key->private.address) / 4,
 		      key->both.offset);
-	return &fph->queues[hash & fph->hash_mask];
+	return &lfph->queues[hash & lfph->hash_mask];
 }
 
 static void futex_rehash_private(struct futex_private_hash *old,
@@ -217,7 +220,7 @@ static void futex_rehash_private(struct
 
 			WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
 
-			hb_new = __futex_hash(&this->key, new);
+			hb_new = __futex_hash(&this->key, &new);
 			futex_hb_waiters_inc(hb_new);
 			/*
 			 * The new pointer isn't published yet but an already
@@ -301,13 +304,12 @@ struct futex_private_hash *futex_private
 
 struct futex_hash_bucket *futex_hash(union futex_key *key)
 {
-	struct futex_private_hash *fph;
+	struct futex_private_hash *fph = NULL;
 	struct futex_hash_bucket *hb;
 
 again:
 	scoped_guard(rcu) {
-		hb = __futex_hash(key, NULL);
-		fph = hb->priv;
+		hb = __futex_hash(key, &fph);
 
 		if (!fph || futex_private_hash_get(fph))
 			return hb;
@@ -319,7 +321,7 @@ struct futex_hash_bucket *futex_hash(uni
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
 
 static struct futex_hash_bucket *
-__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
+__futex_hash_private(union futex_key *key, struct futex_private_hash **fph)
 {
 	return NULL;
 }
@@ -412,7 +414,7 @@ static int futex_mpol(struct mm_struct *
  * global hash is returned.
  */
 static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph)
+__futex_hash(union futex_key *key, struct futex_private_hash **fph)
 {
 	int node = key->both.node;
 	u32 hash;

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-09 20:11     ` Peter Zijlstra
@ 2026-06-09 20:18       ` Peter Zijlstra
  2026-06-10 11:22         ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2026-06-09 20:18 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Tue, Jun 09, 2026 at 10:11:17PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 09, 2026 at 08:28:16AM -0700, Breno Leitao wrote:
> 
> > > I'm really not sure, it *doubles* the futex memory cost.
> > 
> > I think it's worth the trade. The global hash scales linearly with
> > num_possible_cpus(), so the extra bytes track the same curve as the machines
> > that actually need the fix
> > 
> > in simpler words, a box big enough to feel this contention has plenty of RAM
> > headroom to absorb it.
> 
> You might not have heard, but RAM has gotten ludicrously expensive.
> 
> Anyway, how does something like the below work for you? It's a total
> hack job, but it (sorta) builds and runs.
> 

Please use this one, I spotted a silly bug.

---
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index ff2a4fb2993f..fa0674e5d058 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -124,7 +124,7 @@ late_initcall(fail_futex_debugfs);
 #endif /* CONFIG_FAIL_FUTEX */
 
 static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph);
+__futex_hash(union futex_key *key, struct futex_private_hash *fph, struct futex_private_hash **fph_p);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 static bool futex_ref_get(struct futex_private_hash *fph);
@@ -183,14 +183,6 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 {
 	u32 hash;
 
-	if (!futex_key_is_private(key))
-		return NULL;
-
-	if (!fph)
-		fph = rcu_dereference(key->private.mm->futex_phash);
-	if (!fph || !fph->hash_mask)
-		return NULL;
-
 	hash = jhash2((void *)&key->private.address,
 		      sizeof(key->private.address) / 4,
 		      key->both.offset);
@@ -211,13 +203,12 @@ static void futex_rehash_private(struct futex_private_hash *old,
 
 		spin_lock(&hb_old->lock);
 		plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
-
 			plist_del(&this->list, &hb_old->chain);
 			futex_hb_waiters_dec(hb_old);
 
 			WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
 
-			hb_new = __futex_hash(&this->key, new);
+			hb_new = __futex_hash(&this->key, new, NULL);
 			futex_hb_waiters_inc(hb_new);
 			/*
 			 * The new pointer isn't published yet but an already
@@ -299,18 +290,17 @@ struct futex_private_hash *futex_private_hash(void)
 	goto again;
 }
 
-struct futex_hash_bucket *futex_hash(union futex_key *key)
+struct futex_bucket_ref futex_hash(union futex_key *key)
 {
-	struct futex_private_hash *fph;
-	struct futex_hash_bucket *hb;
-
 again:
 	scoped_guard(rcu) {
-		hb = __futex_hash(key, NULL);
-		fph = hb->priv;
+		struct futex_private_hash *fph = NULL;
+		struct futex_hash_bucket *hb;
+
+		hb = __futex_hash(key, NULL, &fph);
 
 		if (!fph || futex_private_hash_get(fph))
-			return hb;
+			return (struct futex_bucket_ref){ .hb = hb, .fph = fph };
 	}
 	futex_pivot_hash(key->private.mm);
 	goto again;
@@ -412,17 +402,19 @@ static int futex_mpol(struct mm_struct *mm, unsigned long addr)
  * global hash is returned.
  */
 static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph)
+__futex_hash(union futex_key *key, struct futex_private_hash *fph, struct futex_private_hash **fph_p)
 {
 	int node = key->both.node;
 	u32 hash;
 
-	if (node == FUTEX_NO_NODE) {
-		struct futex_hash_bucket *hb;
-
-		hb = __futex_hash_private(key, fph);
-		if (hb)
-			return hb;
+	if (node == FUTEX_NO_NODE && futex_key_is_private(key)) {
+		if (!fph)
+			fph = rcu_dereference(key->private.mm->futex_phash);
+		if (fph && fph->hash_mask) {
+			if (fph_p)
+				*fph_p = fph;
+			return __futex_hash_private(key, fph);
+		}
 	}
 
 	hash = jhash2((u32 *)key,
@@ -1348,7 +1340,8 @@ static void exit_pi_state_list(struct task_struct *curr)
 		pi_state = list_entry(next, struct futex_pi_state, list);
 		key = pi_state->key;
 		if (1) {
-			CLASS(hb, hb)(&key);
+			CLASS(hb, hbr)(&key);
+			struct futex_hash_bucket *hb = hbr.hb;
 
 			/*
 			 * We can race against put_pi_state() removing itself from the
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 9f6bf6f585fc..4cab346067fe 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -222,7 +222,6 @@ extern struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns);
 
-extern struct futex_hash_bucket *futex_hash(union futex_key *key);
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
@@ -237,8 +236,15 @@ static inline struct futex_private_hash *futex_private_hash(void) { return NULL;
 static inline void futex_private_hash_put(struct futex_private_hash *fph) { }
 #endif
 
-DEFINE_CLASS(hb, struct futex_hash_bucket *,
-	     if (_T) futex_hash_put(_T),
+struct futex_bucket_ref {
+	struct futex_hash_bucket *hb;
+	struct futex_private_hash *fph;
+};
+
+extern struct futex_bucket_ref futex_hash(union futex_key *key);
+
+DEFINE_CLASS(hb, struct futex_bucket_ref,
+	     if (_T.fph) futex_private_hash_put(_T.fph),
 	     futex_hash(key), union futex_key *key);
 
 DEFINE_CLASS(private_hash, struct futex_private_hash *,
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 643199fdbe62..5c227a4d963d 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -945,7 +945,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb)(&q.key);
+		CLASS(hb, hbr)(&q.key);
+		struct futex_hash_bucket *hb = hbr.hb;
 
 		futex_q_lock(&q, hb);
 
@@ -1101,9 +1102,9 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		futex_unqueue_pi(&q);
 		spin_unlock(q.lock_ptr);
 		if (q.drop_hb_ref) {
-			CLASS(hb, hb)(&q.key);
+			CLASS(hb, hbr)(&q.key);
 			/* Additional reference from futex_unlock_pi() */
-			futex_hash_put(hb);
+			futex_hash_put(hbr.hb);
 		}
 		goto out;
 
@@ -1162,7 +1163,8 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	if (ret)
 		return ret;
 
-	CLASS(hb, hb)(&key);
+	CLASS(hb, hbr)(&key);
+	struct futex_hash_bucket *hb = hbr.hb;
 	spin_lock(&hb->lock);
 retry_hb:
 
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 1d99a84dc9ad..8ae99b7cb873 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -459,8 +459,10 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb1)(&key1);
-		CLASS(hb, hb2)(&key2);
+		CLASS(hb, hbr1)(&key1);
+		CLASS(hb, hbr2)(&key2);
+		struct futex_hash_bucket *hb1 = hbr1.hb;
+		struct futex_hash_bucket *hb2 = hbr2.hb;
 
 		futex_hb_waiters_inc(hb2);
 		double_lock_hb(hb1, hb2);
@@ -838,7 +840,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	switch (futex_requeue_pi_wakeup_sync(&q)) {
 	case Q_REQUEUE_PI_IGNORE:
 		{
-			CLASS(hb, hb)(&q.key);
+			CLASS(hb, hbr)(&q.key);
+			struct futex_hash_bucket *hb = hbr.hb;
 			/* The waiter is still on uaddr1 */
 			spin_lock(&hb->lock);
 			ret = handle_early_requeue_pi_wakeup(hb, &q, to);
@@ -909,9 +912,9 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		BUG();
 	}
 	if (q.drop_hb_ref) {
-		CLASS(hb, hb)(&q.key);
+		CLASS(hb, hbr)(&q.key);
 		/* Additional reference from requeue_pi_wake_futex() */
-		futex_hash_put(hb);
+		futex_hash_put(hbr.hb);
 	}
 
 out:
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index ceed9d879059..8c8e3ae899cb 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -169,7 +169,8 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	if ((flags & FLAGS_STRICT) && !nr_wake)
 		return 0;
 
-	CLASS(hb, hb)(&key);
+	CLASS(hb, hbr)(&key);
+	struct futex_hash_bucket *hb = hbr.hb;
 
 	/* Make sure we really have tasks to wakeup */
 	if (!futex_hb_waiters_pending(hb))
@@ -266,8 +267,10 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb1)(&key1);
-		CLASS(hb, hb2)(&key2);
+		CLASS(hb, hbr1)(&key1);
+		CLASS(hb, hbr2)(&key2);
+		struct futex_hash_bucket *hb1 = hbr1.hb;
+		struct futex_hash_bucket *hb2 = hbr2.hb;
 
 		double_lock_hb(hb1, hb2);
 		op_ret = futex_atomic_op_inuser(op, uaddr2);
@@ -446,7 +449,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 		u32 val = vs[i].w.val;
 
 		if (1) {
-			CLASS(hb, hb)(&q->key);
+			CLASS(hb, hbr)(&q->key);
+			struct futex_hash_bucket *hb = hbr.hb;
 
 			futex_q_lock(q, hb);
 			ret = futex_get_value_locked(&uval, uaddr);
@@ -621,7 +625,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb)(&q->key);
+		CLASS(hb, hbr)(&q->key);
+		struct futex_hash_bucket *hb = hbr.hb;
 
 		futex_q_lock(q, hb);
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-09 20:16     ` Thomas Gleixner
@ 2026-06-09 20:23       ` Peter Zijlstra
  2026-06-09 20:25       ` Peter Zijlstra
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2026-06-09 20:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Breno Leitao, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Tue, Jun 09, 2026 at 10:16:31PM +0200, Thomas Gleixner wrote:
> Breno!
> 
> On Tue, Jun 09 2026 at 08:28, Breno Leitao wrote:
> > On Tue, Jun 09, 2026 at 12:46:03PM +0200, Peter Zijlstra wrote:
> >> On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> >>   perf bench futex hash			 192479  195523  +1.5%
> >>   perf bench futex hash -b 256		3453734 3987880 +15.5%
> >> 
> >> And then I do see the improvement from your patch, but I really cannot
> >> make sense of your reasoning for it.
> >
> > So, let me rephrase it. The bucket cacheline takes hits from four access
> > patterns - the three I listed (waiters_pending readers, lock spinners,
> > lock-holder chain writes) plus the lockless `fph = hb->priv` load on the
> > futex_hash() fast path, which is what c2c surfaced. That priv load is the
> > dominant HITM source on baseline, not the chain writes I emphasized. 
> 
> Ok. That makes a lot more sense now.
> 
> >> > Cost: one extra cacheline (56 B padding) per bucket. Would it be
> >> > acceptable?
> >> 
> >> I'm really not sure, it *doubles* the futex memory cost.
> >
> > I think it's worth the trade. The global hash scales linearly with
> > num_possible_cpus(), so the extra bytes track the same curve as the machines
> > that actually need the fix.
> >
> > In simpler words, a box big enough to feel this contention has plenty of RAM
> > headroom to absorb it.
> 
> Well, it's not only about the global hash. The per process private hash
> is affected too.
> 
> Can you try the completely untested below?

This moves the access to futex_hash_put() :-)
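
To spell out the point: as long as the put path finds the private hash
through the bucket, it still loads from the contended bucket line. A
minimal userspace sketch of the two shapes follows. All names ending in
_model, and put_via_bucket()/put_via_ref(), are hypothetical stand-ins
for illustration, not the kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct futex_private_hash: only a reference count. */
struct priv_hash_model {
	int refs;
};

/* Stand-in for the bucket with the back-pointer kept in the hot line. */
struct bucket_model {
	int waiters;
	int lock;
	struct priv_hash_model *priv;
};

/* Bucket plus the private hash pointer captured once at lookup time. */
struct bucket_ref_model {
	struct bucket_model *hb;
	struct priv_hash_model *fph;
};

/* Dropping the reference through the bucket reads hb->priv again. */
static void put_via_bucket(struct bucket_model *hb)
{
	struct priv_hash_model *fph = hb->priv;	/* load from the bucket line */

	if (fph)
		fph->refs--;
}

/* Dropping it through the by-value pair never touches the bucket. */
static void put_via_ref(struct bucket_ref_model ref)
{
	if (ref.fph)
		ref.fph->refs--;
}
```

Carrying the pointer alongside the bucket is the shape that keeps the
load off the bucket line on both the get and the put side.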

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-09 20:16     ` Thomas Gleixner
  2026-06-09 20:23       ` Peter Zijlstra
@ 2026-06-09 20:25       ` Peter Zijlstra
  2026-06-09 20:32         ` Thomas Gleixner
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2026-06-09 20:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Breno Leitao, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Tue, Jun 09, 2026 at 10:16:31PM +0200, Thomas Gleixner wrote:
> @@ -301,13 +304,12 @@ struct futex_private_hash *futex_private
>  
>  struct futex_hash_bucket *futex_hash(union futex_key *key)
>  {
> -	struct futex_private_hash *fph;
> +	struct futex_private_hash *fph = NULL;
>  	struct futex_hash_bucket *hb;
>  
>  again:
>  	scoped_guard(rcu) {
> -		hb = __futex_hash(key, NULL);
> -		fph = hb->priv;
> +		hb = __futex_hash(key, &fph);
>  
>  		if (!fph || futex_private_hash_get(fph))
>  			return hb;

Also, same bug I had in my first patch, you need to re-set fph to NULL
on the goto again :-)
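
For anyone following along, a small userspace model of that retry bug.
It assumes the lookup only writes the out-parameter when a private hash
is in use, and that the pivot can leave the mm without one. Every name
here (mm_model, lookup(), pivot(), hash_buggy(), hash_fixed()) is
hypothetical and only mirrors the control flow:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct priv_hash_model {
	bool live;			/* false: reference can no longer be taken */
};

struct mm_model {
	struct priv_hash_model *phash;	/* published private hash, or NULL */
};

/* Writes *out only when a private hash exists, like the fph out-parameter. */
static void lookup(struct mm_model *mm, struct priv_hash_model **out)
{
	if (mm->phash)
		*out = mm->phash;
}

static bool try_get(struct priv_hash_model *ph)
{
	return ph->live;
}

/* Model of the pivot: the dying hash is unpublished. */
static void pivot(struct mm_model *mm)
{
	mm->phash = NULL;
}

/* fph initialised once: a stale pointer survives into the retry. */
static int hash_buggy(struct mm_model *mm, int max_tries)
{
	struct priv_hash_model *fph = NULL;

	for (int tries = 1; tries <= max_tries; tries++) {
		lookup(mm, &fph);
		if (!fph || try_get(fph))
			return tries;
		pivot(mm);
	}
	return -1;			/* gave up: kept retrying on the stale fph */
}

/* fph reset on every attempt: the retry sees the global fallback. */
static int hash_fixed(struct mm_model *mm, int max_tries)
{
	for (int tries = 1; tries <= max_tries; tries++) {
		struct priv_hash_model *fph = NULL;

		lookup(mm, &fph);
		if (!fph || try_get(fph))
			return tries;
		pivot(mm);
	}
	return -1;
}
```

Declaring the variable inside the retried scope makes the reset
impossible to forget.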

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-09 20:25       ` Peter Zijlstra
@ 2026-06-09 20:32         ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2026-06-09 20:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Breno Leitao, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Tue, Jun 09 2026 at 22:25, Peter Zijlstra wrote:
> On Tue, Jun 09, 2026 at 10:16:31PM +0200, Thomas Gleixner wrote:
>> @@ -301,13 +304,12 @@ struct futex_private_hash *futex_private
>>  
>>  struct futex_hash_bucket *futex_hash(union futex_key *key)
>>  {
>> -	struct futex_private_hash *fph;
>> +	struct futex_private_hash *fph = NULL;
>>  	struct futex_hash_bucket *hb;
>>  
>>  again:
>>  	scoped_guard(rcu) {
>> -		hb = __futex_hash(key, NULL);
>> -		fph = hb->priv;
>> +		hb = __futex_hash(key, &fph);
>>  
>>  		if (!fph || futex_private_hash_get(fph))
>>  			return hb;
>
> Also, same bug I had in my first patch, you need to re-set fph to NULL
> on the goto again :-)

Figured that out by now :)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-09 20:18       ` Peter Zijlstra
@ 2026-06-10 11:22         ` Thomas Gleixner
  2026-06-10 11:25           ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2026-06-10 11:22 UTC (permalink / raw)
  To: Peter Zijlstra, Breno Leitao
  Cc: Ingo Molnar, Darren Hart, Davidlohr Bueso, André Almeida,
	linux-kernel, puranjay, rmikey, stuclar, namhyung, kernel-team

On Tue, Jun 09 2026 at 22:18, Peter Zijlstra wrote:
> On Tue, Jun 09, 2026 at 10:11:17PM +0200, Peter Zijlstra wrote:
>> Anyway, how does something like the below work for you? It's a total
>> hack job, but it (sorta) builds and runs.
>> 
>
> Please use this one, I spotted a silly bug.

So I ran this on two machines.

SKL dual socket 112 threads:

		Baseline	Patched

shared (16k)	1571857 	1641435         + 4.4%
autosize (512)	 646390 	 903371         +39.7%
-b 256		 464395 	 587014         +26.4%
-b 512		 715687 	 995943         +39.2%
-b 1024		 995085 	1396328         +40.3%
-b 2048		1293114         1668395         +29.0%
-b 4096		2124438 	2240228         + 5.5%

Zen3 dual socket 256 threads:

		Baseline	Patched

shared	(16k)	1275840		1381279	 	+ 8.2%
autosize (512)	1252745		1482179		+18.3%
-b 256		 856274		 955455		+11.5%
-b 512		1267490		1544010		+21.8%
-b 1024		1424013		1625424		+14.1%
-b 2048		1505181		1669342		+10.9%
-b 4096		1465993		1688932		+15.2%


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-10 11:22         ` Thomas Gleixner
@ 2026-06-10 11:25           ` Peter Zijlstra
  2026-06-10 13:55             ` Peter Zijlstra
  2026-06-10 13:56             ` Breno Leitao
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2026-06-10 11:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Breno Leitao, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Wed, Jun 10, 2026 at 01:22:34PM +0200, Thomas Gleixner wrote:
> On Tue, Jun 09 2026 at 22:18, Peter Zijlstra wrote:
> > On Tue, Jun 09, 2026 at 10:11:17PM +0200, Peter Zijlstra wrote:
> >> Anyway, how does something like the below work for you? It's a total
> >> hack job, but it (sorta) builds and runs.
> >> 
> >
> > Please use this one, I spotted a silly bug.
> 
> So I ran this on two machines.
> 
> SKL dual socket 112 threads:
> 
> 		Baseline	Patched
> 
> shared (16k)	1571857 	1641435         + 4.4%
> autosize (512)	 646390 	 903371         +39.7%
> -b 256		 464395 	 587014         +26.4%
> -b 512		 715687 	 995943         +39.2%
> -b 1024		 995085 	1396328         +40.3%
> -b 2048		1293114         1668395         +29.0%
> -b 4096		2124438 	2240228         + 5.5%
> 
> Zen3 dual socket 256 threads:
> 
> 		Baseline	Patched
> 
> shared	(16k)	1275840		1381279	 	+ 8.2%
> autosize (512)	1252745		1482179		+18.3%
> -b 256		 856274		 955455		+11.5%
> -b 512		1267490		1544010		+21.8%
> -b 1024		1424013		1625424		+14.1%
> -b 2048		1505181		1669342		+10.9%
> -b 4096		1465993		1688932		+15.2%

I suppose that means I'd better go make it prettier and survive
randconfig :-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-10 11:25           ` Peter Zijlstra
@ 2026-06-10 13:55             ` Peter Zijlstra
  2026-06-10 13:56             ` Breno Leitao
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2026-06-10 13:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Breno Leitao, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Wed, Jun 10, 2026 at 01:25:46PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 10, 2026 at 01:22:34PM +0200, Thomas Gleixner wrote:
> > On Tue, Jun 09 2026 at 22:18, Peter Zijlstra wrote:
> > > On Tue, Jun 09, 2026 at 10:11:17PM +0200, Peter Zijlstra wrote:
> > >> Anyway, how does something like the below work for you? It's a total
> > >> hack job, but it (sorta) builds and runs.
> > >> 
> > >
> > > Please use this one, I spotted a silly bug.
> > 
> > So I ran this on two machines.
> > 
> > SKL dual socket 112 threads:
> > 
> > 		Baseline	Patched
> > 
> > shared (16k)	1571857 	1641435         + 4.4%
> > autosize (512)	 646390 	 903371         +39.7%
> > -b 256		 464395 	 587014         +26.4%
> > -b 512		 715687 	 995943         +39.2%
> > -b 1024		 995085 	1396328         +40.3%
> > -b 2048		1293114         1668395         +29.0%
> > -b 4096		2124438 	2240228         + 5.5%
> > 
> > Zen3 dual socket 256 threads:
> > 
> > 		Baseline	Patched
> > 
> > shared	(16k)	1275840		1381279	 	+ 8.2%
> > autosize (512)	1252745		1482179		+18.3%
> > -b 256		 856274		 955455		+11.5%
> > -b 512		1267490		1544010		+21.8%
> > -b 1024		1424013		1625424		+14.1%
> > -b 2048		1505181		1669342		+10.9%
> > -b 4096		1465993		1688932		+15.2%
> 
> I suppose that means I'd better go make it prettier and survive
> randconfig :-)

I ended up with the below. I'll invent a Changelog and throw it to the
robots later tonight.

---
 kernel/futex/core.c     | 108 +++++++++++++-----------------------------------
 kernel/futex/futex.h    |  37 ++++++++++-------
 kernel/futex/pi.c       |  21 +++++-----
 kernel/futex/requeue.c  |  20 ++++-----
 kernel/futex/waitwake.c |  17 +++++---
 5 files changed, 84 insertions(+), 119 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index ff2a4fb2993f..844fe893646f 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -124,7 +124,7 @@ late_initcall(fail_futex_debugfs);
 #endif /* CONFIG_FAIL_FUTEX */
 
 static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph);
+__futex_hash(union futex_key *key, struct futex_private_hash *fph, struct futex_private_hash **fph_p);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 static bool futex_ref_get(struct futex_private_hash *fph);
@@ -133,15 +133,6 @@ static bool futex_ref_is_dead(struct futex_private_hash *fph);
 
 enum { FR_PERCPU = 0, FR_ATOMIC };
 
-static inline bool futex_key_is_private(union futex_key *key)
-{
-	/*
-	 * Relies on get_futex_key() to set either bit for shared
-	 * futexes -- see comment with union futex_key.
-	 */
-	return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
-}
-
 static bool futex_private_hash_get(struct futex_private_hash *fph)
 {
 	return futex_ref_get(fph);
@@ -149,48 +140,15 @@ static bool futex_private_hash_get(struct futex_private_hash *fph)
 
 void futex_private_hash_put(struct futex_private_hash *fph)
 {
-	if (futex_ref_put(fph))
+	if (fph && futex_ref_put(fph))
 		wake_up_var(fph->mm);
 }
 
-/**
- * futex_hash_get - Get an additional reference for the local hash.
- * @hb:                    ptr to the private local hash.
- *
- * Obtain an additional reference for the already obtained hash bucket. The
- * caller must already own an reference.
- */
-void futex_hash_get(struct futex_hash_bucket *hb)
-{
-	struct futex_private_hash *fph = hb->priv;
-
-	if (!fph)
-		return;
-	WARN_ON_ONCE(!futex_private_hash_get(fph));
-}
-
-void futex_hash_put(struct futex_hash_bucket *hb)
-{
-	struct futex_private_hash *fph = hb->priv;
-
-	if (!fph)
-		return;
-	futex_private_hash_put(fph);
-}
-
 static struct futex_hash_bucket *
 __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 {
 	u32 hash;
 
-	if (!futex_key_is_private(key))
-		return NULL;
-
-	if (!fph)
-		fph = rcu_dereference(key->private.mm->futex_phash);
-	if (!fph || !fph->hash_mask)
-		return NULL;
-
 	hash = jhash2((void *)&key->private.address,
 		      sizeof(key->private.address) / 4,
 		      key->both.offset);
@@ -211,13 +169,12 @@ static void futex_rehash_private(struct futex_private_hash *old,
 
 		spin_lock(&hb_old->lock);
 		plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
-
 			plist_del(&this->list, &hb_old->chain);
 			futex_hb_waiters_dec(hb_old);
 
 			WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
 
-			hb_new = __futex_hash(&this->key, new);
+			hb_new = __futex_hash(&this->key, new, NULL);
 			futex_hb_waiters_inc(hb_new);
 			/*
 			 * The new pointer isn't published yet but an already
@@ -271,9 +228,8 @@ static void futex_pivot_hash(struct mm_struct *mm)
 	}
 }
 
-struct futex_private_hash *futex_private_hash(void)
+struct futex_private_hash *futex_private_hash(struct mm_struct *mm)
 {
-	struct mm_struct *mm = current->mm;
 	/*
 	 * Ideally we don't loop. If there is a replacement in progress
 	 * then a new private hash is already prepared and a reference can't be
@@ -299,18 +255,17 @@ struct futex_private_hash *futex_private_hash(void)
 	goto again;
 }
 
-struct futex_hash_bucket *futex_hash(union futex_key *key)
+struct futex_bucket_ref futex_hash(union futex_key *key)
 {
-	struct futex_private_hash *fph;
-	struct futex_hash_bucket *hb;
-
 again:
 	scoped_guard(rcu) {
-		hb = __futex_hash(key, NULL);
-		fph = hb->priv;
+		struct futex_private_hash *fph = NULL;
+		struct futex_hash_bucket *hb;
+
+		hb = __futex_hash(key, NULL, &fph);
 
 		if (!fph || futex_private_hash_get(fph))
-			return hb;
+			return (struct futex_bucket_ref){ .hb = hb, .fph = fph };
 	}
 	futex_pivot_hash(key->private.mm);
 	goto again;
@@ -318,15 +273,9 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
 
-static struct futex_hash_bucket *
-__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
+struct futex_bucket_ref futex_hash(union futex_key *key)
 {
-	return NULL;
-}
-
-struct futex_hash_bucket *futex_hash(union futex_key *key)
-{
-	return __futex_hash(key, NULL);
+	return (struct futex_bucket_ref){ .hb = __futex_hash(key, NULL, NULL), .fph = NULL };
 }
 
 #endif /* CONFIG_FUTEX_PRIVATE_HASH */
@@ -412,18 +361,22 @@ static int futex_mpol(struct mm_struct *mm, unsigned long addr)
  * global hash is returned.
  */
 static struct futex_hash_bucket *
-__futex_hash(union futex_key *key, struct futex_private_hash *fph)
+__futex_hash(union futex_key *key, struct futex_private_hash *fph, struct futex_private_hash **fph_p)
 {
 	int node = key->both.node;
 	u32 hash;
 
-	if (node == FUTEX_NO_NODE) {
-		struct futex_hash_bucket *hb;
-
-		hb = __futex_hash_private(key, fph);
-		if (hb)
-			return hb;
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+	if (node == FUTEX_NO_NODE && futex_key_is_private(key)) {
+		if (!fph)
+			fph = rcu_dereference(key->private.mm->futex_phash);
+		if (fph && fph->hash_mask) {
+			if (fph_p)
+				*fph_p = fph;
+			return __futex_hash_private(key, fph);
+		}
 	}
+#endif
 
 	hash = jhash2((u32 *)key,
 		      offsetof(typeof(*key), both.offset) / sizeof(u32),
@@ -1336,7 +1289,7 @@ static void exit_pi_state_list(struct task_struct *curr)
 	 * on the mutex.
 	 */
 	WARN_ON(curr != current);
-	guard(private_hash)();
+	guard(private_hash)(current->mm);
 	/*
 	 * We are a ZOMBIE and nobody can enqueue itself on
 	 * pi_state_list anymore, but we have to be careful
@@ -1348,7 +1301,8 @@ static void exit_pi_state_list(struct task_struct *curr)
 		pi_state = list_entry(next, struct futex_pi_state, list);
 		key = pi_state->key;
 		if (1) {
-			CLASS(hb, hb)(&key);
+			CLASS(hbr, hbr)(&key);
+			auto hb = hbr.hb;
 
 			/*
 			 * We can race against put_pi_state() removing itself from the
@@ -1516,12 +1470,8 @@ void futex_exit_release(struct task_struct *tsk)
 	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
 }
 
-static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
-				   struct futex_private_hash *fph)
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
 {
-#ifdef CONFIG_FUTEX_PRIVATE_HASH
-	fhb->priv = fph;
-#endif
 	atomic_set(&fhb->waiters, 0);
 	plist_head_init(&fhb->chain);
 	spin_lock_init(&fhb->lock);
@@ -1822,7 +1772,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 	fph->mm = mm;
 
 	for (i = 0; i < hash_slots; i++)
-		futex_hash_bucket_init(&fph->queues[i], fph);
+		futex_hash_bucket_init(&fph->queues[i]);
 
 	if (custom) {
 		/*
@@ -2001,7 +1951,7 @@ static int __init futex_init(void)
 		BUG_ON(!table);
 
 		for (i = 0; i < hashsize; i++)
-			futex_hash_bucket_init(&table[i], NULL);
+			futex_hash_bucket_init(&table[i]);
 
 		futex_queues[n] = table;
 	}
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 9f6bf6f585fc..7aec5e85e039 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -126,6 +126,15 @@ static inline bool should_fail_futex(bool fshared)
 }
 #endif
 
+static inline bool futex_key_is_private(union futex_key *key)
+{
+	/*
+	 * Relies on get_futex_key() to set either bit for shared
+	 * futexes -- see comment with union futex_key.
+	 */
+	return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
+}
+
 /*
  * Hash buckets are shared by all the futex_keys that hash to the same
  * location.  Each key may have multiple futex_q structures, one for each task
@@ -135,7 +144,6 @@ struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
-	struct futex_private_hash *priv;
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -175,7 +183,7 @@ typedef void (futex_wake_fn)(struct wake_q_head *wake_q, struct futex_q *q);
  * @requeue_pi_key:	the requeue_pi target futex key
  * @bitset:		bitset for the optional bitmasked wakeup
  * @requeue_state:	State field for futex_requeue_pi()
- * @drop_hb_ref:	Waiter should drop the extra hash bucket reference if true
+ * @drop_fph:		Waiter should drop the extra private hash reference when set
  * @requeue_wait:	RCU wait for futex_requeue_pi() (RT only)
  *
  * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
@@ -202,7 +210,7 @@ struct futex_q {
 	union futex_key *requeue_pi_key;
 	u32 bitset;
 	atomic_t requeue_state;
-	bool drop_hb_ref;
+	struct futex_private_hash *drop_fph;
 #ifdef CONFIG_PREEMPT_RT
 	struct rcuwait requeue_wait;
 #endif
@@ -222,28 +230,29 @@ extern struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns);
 
-extern struct futex_hash_bucket *futex_hash(union futex_key *key);
-#ifdef CONFIG_FUTEX_PRIVATE_HASH
-extern void futex_hash_get(struct futex_hash_bucket *hb);
-extern void futex_hash_put(struct futex_hash_bucket *hb);
+struct futex_bucket_ref {
+	struct futex_hash_bucket *hb;
+	struct futex_private_hash *fph;
+};
 
-extern struct futex_private_hash *futex_private_hash(void);
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+extern struct futex_private_hash *futex_private_hash(struct mm_struct *mm);
 extern void futex_private_hash_put(struct futex_private_hash *fph);
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
-static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
-static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
-static inline struct futex_private_hash *futex_private_hash(void) { return NULL; }
+static inline struct futex_private_hash *futex_private_hash(struct mm_struct *mm) { return NULL; }
 static inline void futex_private_hash_put(struct futex_private_hash *fph) { }
 #endif
 
-DEFINE_CLASS(hb, struct futex_hash_bucket *,
-	     if (_T) futex_hash_put(_T),
+extern struct futex_bucket_ref futex_hash(union futex_key *key);
+
+DEFINE_CLASS(hbr, struct futex_bucket_ref,
+	     if (_T.fph) futex_private_hash_put(_T.fph),
 	     futex_hash(key), union futex_key *key);
 
 DEFINE_CLASS(private_hash, struct futex_private_hash *,
 	     if (_T) futex_private_hash_put(_T),
-	     futex_private_hash(), void);
+	     futex_private_hash(mm), struct mm_struct *mm);
 
 /**
  * futex_match - Check whether two futex keys are equal
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 643199fdbe62..acc8f715c7da 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -945,7 +945,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb)(&q.key);
+		CLASS(hbr, hbr)(&q.key);
+		auto hb = hbr.hb;
 
 		futex_q_lock(&q, hb);
 
@@ -1009,7 +1010,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		 * the thread, performing resize, will block on hb->lock during
 		 * the requeue.
 		 */
-		futex_hash_put(no_free_ptr(hb));
+		futex_private_hash_put(no_free_ptr(hbr.fph));
 		/*
 		 * Must be done before we enqueue the waiter, here is unfortunately
 		 * under the hb lock, but that *should* work because it does nothing.
@@ -1100,11 +1101,9 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		__release(&hb->lock);
 		futex_unqueue_pi(&q);
 		spin_unlock(q.lock_ptr);
-		if (q.drop_hb_ref) {
-			CLASS(hb, hb)(&q.key);
-			/* Additional reference from futex_unlock_pi() */
-			futex_hash_put(hb);
-		}
+
+		/* Additional reference from futex_unlock_pi() */
+		futex_private_hash_put(q.drop_fph);
 		goto out;
 
 out_unlock_put_key:
@@ -1162,7 +1161,8 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	if (ret)
 		return ret;
 
-	CLASS(hb, hb)(&key);
+	CLASS(hbr, hbr)(&key);
+	auto hb = hbr.hb;
 	spin_lock(&hb->lock);
 retry_hb:
 
@@ -1219,8 +1219,9 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 			 * Acquire a reference for the leaving waiter to ensure
 			 * valid futex_q::lock_ptr.
 			 */
-			futex_hash_get(hb);
-			top_waiter->drop_hb_ref = true;
+			if (futex_key_is_private(&key))
+				top_waiter->drop_fph = futex_private_hash(key.private.mm);
+
 			__futex_unqueue(top_waiter);
 			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 			goto retry_hb;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 1d99a84dc9ad..7384672916fb 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -241,8 +241,8 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
 	 * Acquire a reference for the waiter to ensure valid
 	 * futex_q::lock_ptr.
 	 */
-	futex_hash_get(hb);
-	q->drop_hb_ref = true;
+	if (futex_key_is_private(key))
+		q->drop_fph = futex_private_hash(key->private.mm);
 	q->lock_ptr = &hb->lock;
 	task = READ_ONCE(q->task);
 
@@ -459,8 +459,10 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb1)(&key1);
-		CLASS(hb, hb2)(&key2);
+		CLASS(hbr, hbr1)(&key1);
+		CLASS(hbr, hbr2)(&key2);
+		auto hb1 = hbr1.hb;
+		auto hb2 = hbr2.hb;
 
 		futex_hb_waiters_inc(hb2);
 		double_lock_hb(hb1, hb2);
@@ -838,7 +840,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	switch (futex_requeue_pi_wakeup_sync(&q)) {
 	case Q_REQUEUE_PI_IGNORE:
 		{
-			CLASS(hb, hb)(&q.key);
+			CLASS(hbr, hbr)(&q.key);
+			auto hb = hbr.hb;
 			/* The waiter is still on uaddr1 */
 			spin_lock(&hb->lock);
 			ret = handle_early_requeue_pi_wakeup(hb, &q, to);
@@ -908,11 +911,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	default:
 		BUG();
 	}
-	if (q.drop_hb_ref) {
-		CLASS(hb, hb)(&q.key);
-		/* Additional reference from requeue_pi_wake_futex() */
-		futex_hash_put(hb);
-	}
+	/* Additional reference from requeue_pi_wake_futex() */
+	futex_private_hash_put(q.drop_fph);
 
 out:
 	if (to) {
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index ceed9d879059..1f15fff4ada5 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -169,7 +169,8 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	if ((flags & FLAGS_STRICT) && !nr_wake)
 		return 0;
 
-	CLASS(hb, hb)(&key);
+	CLASS(hbr, hbr)(&key);
+	auto hb = hbr.hb;
 
 	/* Make sure we really have tasks to wakeup */
 	if (!futex_hb_waiters_pending(hb))
@@ -266,8 +267,10 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb1)(&key1);
-		CLASS(hb, hb2)(&key2);
+		CLASS(hbr, hbr1)(&key1);
+		CLASS(hbr, hbr2)(&key2);
+		auto hb1 = hbr1.hb;
+		auto hb2 = hbr2.hb;
 
 		double_lock_hb(hb1, hb2);
 		op_ret = futex_atomic_op_inuser(op, uaddr2);
@@ -409,7 +412,7 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 	 * Make sure to have a reference on the private_hash such that we
 	 * don't block on rehash after changing the task state below.
 	 */
-	guard(private_hash)();
+	guard(private_hash)(current->mm);
 
 	/*
 	 * Enqueuing multiple futexes is tricky, because we need to enqueue
@@ -446,7 +449,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 		u32 val = vs[i].w.val;
 
 		if (1) {
-			CLASS(hb, hb)(&q->key);
+			CLASS(hbr, hbr)(&q->key);
+			auto hb = hbr.hb;
 
 			futex_q_lock(q, hb);
 			ret = futex_get_value_locked(&uval, uaddr);
@@ -621,7 +625,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 
 retry_private:
 	if (1) {
-		CLASS(hb, hb)(&q->key);
+		CLASS(hbr, hbr)(&q->key);
+		auto hb = hbr.hb;
 
 		futex_q_lock(q, hb);
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
  2026-06-10 11:25           ` Peter Zijlstra
  2026-06-10 13:55             ` Peter Zijlstra
@ 2026-06-10 13:56             ` Breno Leitao
  1 sibling, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-06-10 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
	André Almeida, linux-kernel, puranjay, rmikey, stuclar,
	namhyung, kernel-team

On Wed, Jun 10, 2026 at 01:25:46PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 10, 2026 at 01:22:34PM +0200, Thomas Gleixner wrote:
> > On Tue, Jun 09 2026 at 22:18, Peter Zijlstra wrote:
> > > On Tue, Jun 09, 2026 at 10:11:17PM +0200, Peter Zijlstra wrote:
> > >> Anyway, how does something like the below work for you? It's a total
> > >> hack job, but it (sorta) builds and runs.
> > >> 
> > >
> > > Please use this one, I spotted a silly bug.
> > 
> > So I ran this on two machines.
> > 
> > SKL dual socket 112 threads:
> > 
> > 		Baseline	Patched
> > 
> > shared (16k)	1571857 	1641435         + 4.4%
> > autosize (512)	 646390 	 903371         +39.7%
> > -b 256		 464395 	 587014         +26.4%
> > -b 512		 715687 	 995943         +39.2%
> > -b 1024		 995085 	1396328         +40.3%
> > -b 2048		1293114         1668395         +29.0%
> > -b 4096		2124438 	2240228         + 5.5%
> > 
> > Zen3 dual socket 256 threads:
> > 
> > 		Baseline	Patched
> > 
> > shared	(16k)	1275840		1381279	 	+ 8.2%
> > autosize (512)	1252745		1482179		+18.3%
> > -b 256		 856274		 955455		+11.5%
> > -b 512		1267490		1544010		+21.8%
> > -b 1024		1424013		1625424		+14.1%
> > -b 2048		1505181		1669342		+10.9%
> > -b 4096		1465993		1688932		+15.2%
> 
> I suppose that means I'd better go make it prettier and survive
> randconfig :-)

I've tested it here on the same machine I used earlier (176-thread AMD EPYC
host), 10s perf bench futex hash per run, baseline = parent commit
(acb7500801e98):

                       Baseline       Patched      Delta
  shared (16 buckets)  1,230,599      1,368,655    +11.2%
  autosize (1024)      1,285,440      1,556,946    +21.1%
  -b 256               1,341,471      1,520,303    +13.3%
  -b 512               1,438,330      1,599,319    +11.2%
  -b 1024              1,443,772      1,622,493    +12.4%
  -b 2048              1,472,108      1,643,975    +11.7%
  -b 4096              1,333,098      1,570,897    +17.8%

Stderr was 0.06%-0.22% across the board, so the deltas are well
outside noise.
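
For anyone recomputing the Delta column, here is a minimal sketch of the
arithmetic, assuming Delta is simply (patched - baseline) / baseline in
percent. delta_pct() and print_row() are hypothetical helper names used
for illustration, not part of perf or of the patch:

```c
#include <stdio.h>

/* Relative throughput change in percent: (patched - baseline) / baseline. */
double delta_pct(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}

/* Hypothetical helper: print one row in roughly the shape of the table above. */
void print_row(const char *name, double baseline, double patched)
{
	printf("%-20s %10.0f %10.0f  %+.1f%%\n", name, baseline, patched,
	       delta_pct(baseline, patched));
}
```

Calling print_row("-b 4096", 1333098, 1570897) reproduces the last row of
the table; a delta only counts as signal if it is much larger than the
per-run stderr quoted above.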

The trade Peter sketched holds up here: no extra futex memory
cost, and we still recover most of what padding the bucket would
have bought.
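
To make the memory cost of the padding alternative concrete, here is a
rough user-space sketch. It is not the kernel layout: the mock_* and
bucket_* types are hypothetical stand-ins, the lock is modelled as a
plain int, and a 64-byte cacheline is assumed:

```c
#include <stddef.h>

#define CACHELINE 64	/* assumed cacheline size */

/* Stand-in for the list head inside the bucket chain (assumption). */
struct mock_list_head { void *next, *prev; };

/* Everything packed into one cacheline-aligned block. */
struct bucket_packed {
	_Alignas(CACHELINE) int waiters;
	int lock;			/* stand-in for the bucket lock */
	struct mock_list_head chain;
	void *priv;
};

/* Padding variant: chain pushed onto its own cacheline. */
struct bucket_padded {
	_Alignas(CACHELINE) int waiters;
	int lock;
	void *priv;
	_Alignas(CACHELINE) struct mock_list_head chain;
};

/* In this mock, padding doubles the per-bucket footprint. */
_Static_assert(sizeof(struct bucket_packed) == CACHELINE, "one line");
_Static_assert(sizeof(struct bucket_padded) == 2 * CACHELINE, "two lines");
_Static_assert(offsetof(struct bucket_padded, chain) == CACHELINE,
	       "chain starts on the second line");
```

In this mock the padded variant costs an extra 64 bytes per bucket, which
is the kind of overhead referred to above as the price of padding.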

Really good, thanks for this patch,
--breno



end of thread, other threads:[~2026-06-10 13:56 UTC | newest]

Thread overview: 13+ messages
2026-06-05 16:53 [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock Breno Leitao
2026-06-09 10:46 ` Peter Zijlstra
2026-06-09 15:28   ` Breno Leitao
2026-06-09 20:11     ` Peter Zijlstra
2026-06-09 20:18       ` Peter Zijlstra
2026-06-10 11:22         ` Thomas Gleixner
2026-06-10 11:25           ` Peter Zijlstra
2026-06-10 13:55             ` Peter Zijlstra
2026-06-10 13:56             ` Breno Leitao
2026-06-09 20:16     ` Thomas Gleixner
2026-06-09 20:23       ` Peter Zijlstra
2026-06-09 20:25       ` Peter Zijlstra
2026-06-09 20:32         ` Thomas Gleixner
