* [PATCH 0/3] futex: Use RCU-based per-CPU reference counting
@ 2025-07-07 14:36 Sebastian Andrzej Siewior
  2025-07-07 14:36 ` [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes Sebastian Andrzej Siewior
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-07 14:36 UTC
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

I picked up PeterZ's futex patch from
    https://lore.kernel.org/all/20250624190118.GB1490279@noisy.programming.kicks-ass.net/
 
and I am posting it here now so it can be staged for v6.17.

Changes since its initial posting:
- A patch description has been added
- The testsuite is "fixed" slightly differently and has been split out
- futex_mm_init() is fixed up.
- The guard(preempt) has been replaced with guard(rcu) since there is
  no reason to disable preemption.

Since it was not yet released, should we rip out the IMMUTABLE bits and
just stick with GET/SET slots?
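
For reference, the interface in question, as it currently stands (a
sketch; constants per include/uapi/linux/prctl.h):

#include <sys/prctl.h>
#include <linux/prctl.h>

int main(void)
{
	/* Resize the process-private futex hash to 16 buckets. */
	prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 16, 0);

	/* Query the current number of buckets. */
	int slots = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0);

	/*
	 * The IMMUTABLE bits in question: pin the current hash and
	 * forbid any later resize for the lifetime of the process.
	 */
	prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, FH_FLAG_IMMUTABLE);
	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE, 0, 0);
}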

Sebastian

--------

Peter Zijlstra (1):
  futex: Use RCU-based per-CPU reference counting instead of rcuref_t

Sebastian Andrzej Siewior (2):
  selftests/futex: Adapt the private hash test to RCU related changes
  futex: Make futex_private_hash_get() static

 include/linux/futex.h                         |  14 +-
 include/linux/mm_types.h                      |   5 +
 init/Kconfig                                  |   4 -
 kernel/fork.c                                 |   6 +-
 kernel/futex/core.c                           | 239 ++++++++++++++++--
 kernel/futex/futex.h                          |   2 -
 .../futex/functional/futex_priv_hash.c        |  42 ++-
 7 files changed, 276 insertions(+), 36 deletions(-)

-- 
2.50.0


* [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes
  2025-07-07 14:36 [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Sebastian Andrzej Siewior
@ 2025-07-07 14:36 ` Sebastian Andrzej Siewior
  2025-07-08  7:41   ` Peter Zijlstra
  2025-07-07 14:36 ` [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-07 14:36 UTC
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

The auto scaling on thread creation used to automatically assign the new
hash because the private hash was unused and could be replaced right
away.

With the upcoming change to wait for an RCU grace period before the hash
can be assigned, the test will always fail.

If the reported number of hash buckets is not updated after an
auto scaling event, block on an already acquired lock with a timeout. The
timeout is the delay to wait for a grace period, and blocking on a locked
pthread_mutex_t ensures that glibc calls into the kernel via a futex
operation, which will assign the new private hash if available.
This is retried every 100ms, up to 2 seconds in total.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 .../futex/functional/futex_priv_hash.c        | 42 ++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index 24a92dc94eb86..625e3be4129c3 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -111,6 +111,30 @@ static void join_max_threads(void)
 	}
 }
 
+#define SEC_IN_NSEC	1000000000
+#define MSEC_IN_NSEC	1000000
+
+static void futex_dummy_op(void)
+{
+	pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+	struct timespec timeout;
+	int ret;
+
+	pthread_mutex_lock(&lock);
+	clock_gettime(CLOCK_REALTIME, &timeout);
+	timeout.tv_nsec += 100 * MSEC_IN_NSEC;
+	if (timeout.tv_nsec >= SEC_IN_NSEC) {
+		timeout.tv_nsec -= SEC_IN_NSEC;
+		timeout.tv_sec++;
+	}
+	ret = pthread_mutex_timedlock(&lock, &timeout);
+	if (ret == 0)
+		ksft_exit_fail_msg("Succeffuly locked an already locked mutex.\n");
+
+	if (ret != ETIMEDOUT)
+		ksft_exit_fail_msg("pthread_mutex_timedlock() did not timeout: %d.\n", ret);
+}
+
 static void usage(char *prog)
 {
 	printf("Usage: %s\n", prog);
@@ -129,7 +153,7 @@ int main(int argc, char *argv[])
 	int futex_slots1, futex_slotsn, online_cpus;
 	pthread_mutexattr_t mutex_attr_pi;
 	int use_global_hash = 0;
-	int ret;
+	int ret, retry = 20;
 	int c;
 
 	while ((c = getopt(argc, argv, "cghv:")) != -1) {
@@ -208,8 +232,24 @@ int main(int argc, char *argv[])
 	 */
 	ksft_print_msg("Online CPUs: %d\n", online_cpus);
 	if (online_cpus > 16) {
+retry_getslots:
 		futex_slotsn = futex_hash_slots_get();
 		if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) {
+			retry--;
+			/*
+			 * Auto scaling on thread creation can be slightly delayed
+			 * because it waits for an RCU grace period twice. The new
+			 * private hash is assigned upon the first futex operation
+			 * after the grace period.
+			 * To cover all this for testing purposes, the function
+			 * below will acquire a lock and then try to acquire it
+			 * again with a 100ms timeout, which must time out. This
+			 * ensures we sleep for 100ms and issue a futex operation.
+			 */
+			if (retry > 0) {
+				futex_dummy_op();
+				goto retry_getslots;
+			}
 			ksft_print_msg("Expected increase of hash buckets but got: %d -> %d\n",
 				       futex_slots1, futex_slotsn);
 			ksft_exit_fail_msg(test_msg_auto_inc);
-- 
2.50.0


* [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-07 14:36 [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Sebastian Andrzej Siewior
  2025-07-07 14:36 ` [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes Sebastian Andrzej Siewior
@ 2025-07-07 14:36 ` Sebastian Andrzej Siewior
  2025-07-08  6:51   ` Sebastian Andrzej Siewior
                     ` (3 more replies)
  2025-07-07 14:36 ` [PATCH 3/3] futex: Make futex_private_hash_get() static Sebastian Andrzej Siewior
  2025-07-08  7:50 ` [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Peter Zijlstra
  3 siblings, 4 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-07 14:36 UTC
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

The use of rcuref_t for reference counting introduces a performance bottleneck
when accessed concurrently by multiple threads during futex operations.

Replace rcuref_t with specially crafted per-CPU reference counters. The
lifetime logic remains the same.

The newly allocated private hash starts in the FR_PERCPU state. In this
state, each futex operation that requires the private hash uses a per-CPU
counter (an unsigned int) for incrementing or decrementing the reference
count.

When the private hash is about to be replaced, the per-CPU counters are
migrated to an atomic_long_t counter, mm_struct::futex_atomic.
The migration process:
- Wait for one RCU grace period to ensure all users observe the
  current private hash. This can be skipped if a grace period elapsed
  since the private hash was assigned.

- futex_private_hash::state is set to FR_ATOMIC, forcing all users to
  use mm_struct::futex_atomic for reference counting.

- After an RCU grace period, all users are guaranteed to be using the
  atomic counter. The per-CPU counters can now be summed up and added to
  the atomic_long_t counter. If the resulting count is zero, the hash
  can be safely replaced. Otherwise, active users still hold a valid
  reference.

- Once the atomic reference count drops to zero, the next futex
  operation will switch to the new private hash.

call_rcu_hurry() is used to speed up the transition, which otherwise
might be delayed with RCU_LAZY. There is nothing wrong with using
call_rcu(); the side effects would be that with auto scaling the new
hash is used later and the SET_SLOTS prctl() will block longer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/futex.h    |  14 +--
 include/linux/mm_types.h |   5 +
 init/Kconfig             |   4 -
 kernel/fork.c            |   6 +-
 kernel/futex/core.c      | 237 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 234 insertions(+), 32 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index b37193653e6b5..cd773febd497b 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -85,18 +85,12 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
 void futex_hash_free(struct mm_struct *mm);
-
-static inline void futex_mm_init(struct mm_struct *mm)
-{
-	RCU_INIT_POINTER(mm->futex_phash, NULL);
-	mm->futex_phash_new = NULL;
-	mutex_init(&mm->futex_hash_lock);
-}
+int futex_mm_init(struct mm_struct *mm);
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
 static inline int futex_hash_allocate_default(void) { return 0; }
-static inline void futex_hash_free(struct mm_struct *mm) { }
-static inline void futex_mm_init(struct mm_struct *mm) { }
+static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
+static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
 #endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
 #else /* !CONFIG_FUTEX */
@@ -118,7 +112,7 @@ static inline int futex_hash_allocate_default(void)
 {
 	return 0;
 }
-static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
 static inline void futex_mm_init(struct mm_struct *mm) { }
 
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6b91e8a66d6d..0f0662157066a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1070,6 +1070,11 @@ struct mm_struct {
 		struct mutex			futex_hash_lock;
 		struct futex_private_hash	__rcu *futex_phash;
 		struct futex_private_hash	*futex_phash_new;
+		/* futex-ref */
+		unsigned long			futex_batches;
+		struct rcu_head			futex_rcu;
+		atomic_long_t			futex_atomic;
+		unsigned int			__percpu *futex_ref;
 #endif
 
 		unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/init/Kconfig b/init/Kconfig
index 666783eb50abd..af4c2f0854554 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1716,13 +1716,9 @@ config FUTEX_PI
 	depends on FUTEX && RT_MUTEXES
 	default y
 
-#
-# marked broken for performance reasons; gives us one more cycle to sort things out.
-#
 config FUTEX_PRIVATE_HASH
 	bool
 	depends on FUTEX && !BASE_SMALL && MMU
-	depends on BROKEN
 	default y
 
 config FUTEX_MPOL
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38ba..66c4d4cc2340b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1046,7 +1046,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	RCU_INIT_POINTER(mm->exe_file, NULL);
 	mmu_notifier_subscriptions_init(mm);
 	init_tlb_flush_pending(mm);
-	futex_mm_init(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
 	mm->pmd_huge_pte = NULL;
 #endif
@@ -1061,6 +1060,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		mm->def_flags = 0;
 	}
 
+	if (futex_mm_init(mm))
+		goto fail_mm_init;
+
 	if (mm_alloc_pgd(mm))
 		goto fail_nopgd;
 
@@ -1090,6 +1092,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 fail_noid:
 	mm_free_pgd(mm);
 fail_nopgd:
+	futex_hash_free(mm);
+fail_mm_init:
 	free_mm(mm);
 	return NULL;
 }
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 90d53fb0ee9e1..b578a536a4fe2 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -42,7 +42,6 @@
 #include <linux/fault-inject.h>
 #include <linux/slab.h>
 #include <linux/prctl.h>
-#include <linux/rcuref.h>
 #include <linux/mempolicy.h>
 #include <linux/mmap_lock.h>
 
@@ -65,7 +64,7 @@ static struct {
 #define futex_queues	(__futex_data.queues)
 
 struct futex_private_hash {
-	rcuref_t	users;
+	int		state;
 	unsigned int	hash_mask;
 	struct rcu_head	rcu;
 	void		*mm;
@@ -129,6 +128,12 @@ static struct futex_hash_bucket *
 __futex_hash(union futex_key *key, struct futex_private_hash *fph);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
+static bool futex_ref_get(struct futex_private_hash *fph);
+static bool futex_ref_put(struct futex_private_hash *fph);
+static bool futex_ref_is_dead(struct futex_private_hash *fph);
+
+enum { FR_PERCPU = 0, FR_ATOMIC };
+
 static inline bool futex_key_is_private(union futex_key *key)
 {
 	/*
@@ -142,15 +147,14 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
 {
 	if (fph->immutable)
 		return true;
-	return rcuref_get(&fph->users);
+	return futex_ref_get(fph);
 }
 
 void futex_private_hash_put(struct futex_private_hash *fph)
 {
-	/* Ignore return value, last put is verified via rcuref_is_dead() */
 	if (fph->immutable)
 		return;
-	if (rcuref_put(&fph->users))
+	if (futex_ref_put(fph))
 		wake_up_var(fph->mm);
 }
 
@@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
 	fph = rcu_dereference_protected(mm->futex_phash,
 					lockdep_is_held(&mm->futex_hash_lock));
 	if (fph) {
-		if (!rcuref_is_dead(&fph->users)) {
+		if (!futex_ref_is_dead(fph)) {
 			mm->futex_phash_new = new;
 			return false;
 		}
 
 		futex_rehash_private(fph, new);
 	}
-	rcu_assign_pointer(mm->futex_phash, new);
+	new->state = FR_PERCPU;
+	scoped_guard(rcu) {
+		mm->futex_batches = get_state_synchronize_rcu();
+		rcu_assign_pointer(mm->futex_phash, new);
+	}
 	kvfree_rcu(fph, rcu);
 	return true;
 }
@@ -289,9 +297,7 @@ struct futex_private_hash *futex_private_hash(void)
 		if (!fph)
 			return NULL;
 
-		if (fph->immutable)
-			return fph;
-		if (rcuref_get(&fph->users))
+		if (futex_private_hash_get(fph))
 			return fph;
 	}
 	futex_pivot_hash(mm);
@@ -1527,16 +1533,213 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 #define FH_IMMUTABLE	0x02
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
+
+/*
+ * futex-ref
+ *
+ * Heavily inspired by percpu-rwsem/percpu-refcount; not reusing any of that
+ * code because it just doesn't fit right.
+ *
+ * Dual counter, per-cpu / atomic approach like percpu-refcount, except it
+ * re-initializes the state automatically, such that the fph swizzle is also a
+ * transition back to per-cpu.
+ */
+
+static void futex_ref_rcu(struct rcu_head *head);
+
+static void __futex_ref_atomic_begin(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	/*
+	 * The counter we're about to switch to must have fully switched;
+	 * otherwise it would be impossible for it to have reported success
+	 * from futex_ref_is_dead().
+	 */
+	WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
+
+	/*
+	 * Set the atomic to the bias value such that futex_ref_{get,put}()
+	 * will never observe 0. Will be fixed up in __futex_ref_atomic_end()
+	 * when folding in the percpu count.
+	 */
+	atomic_long_set(&mm->futex_atomic, LONG_MAX);
+	smp_store_release(&fph->state, FR_ATOMIC);
+
+	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+}
+
+static void __futex_ref_atomic_end(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+	unsigned int count = 0;
+	long ret;
+	int cpu;
+
+	/*
+	 * Per __futex_ref_atomic_begin() the state of the fph must be ATOMIC
+	 * and per this RCU callback, everybody must now observe this state and
+	 * use the atomic variable.
+	 */
+	WARN_ON_ONCE(fph->state != FR_ATOMIC);
+
+	/*
+	 * Therefore the per-cpu counter is now stable, sum and reset.
+	 */
+	for_each_possible_cpu(cpu) {
+		unsigned int *ptr = per_cpu_ptr(mm->futex_ref, cpu);
+		count += *ptr;
+		*ptr = 0;
+	}
+
+	/*
+	 * Re-init for the next cycle.
+	 */
+	this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+
+	/*
+	 * Add actual count, subtract bias and initial refcount.
+	 *
+	 * The moment this atomic operation happens, futex_ref_is_dead() can
+	 * become true.
+	 */
+	ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex_atomic);
+	if (!ret)
+		wake_up_var(mm);
+
+	WARN_ON_ONCE(ret < 0);
+}
+
+static void futex_ref_rcu(struct rcu_head *head)
+{
+	struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
+	struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
+
+	if (fph->state == FR_PERCPU) {
+		/*
+		 * Per this extra grace-period, everybody must now observe
+		 * fph as the current fph and no previously observed fph's
+		 * are in-flight.
+		 *
+		 * Notably, nobody will now rely on the atomic
+		 * futex_ref_is_dead() state anymore so we can begin the
+		 * migration of the per-cpu counter into the atomic.
+		 */
+		__futex_ref_atomic_begin(fph);
+		return;
+	}
+
+	__futex_ref_atomic_end(fph);
+}
+
+/*
+ * Drop the initial refcount and transition to atomics.
+ */
+static void futex_ref_drop(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	/*
+	 * Can only transition the current fph;
+	 */
+	WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph);
+
+	/*
+	 * In order to avoid the following scenario:
+	 *
+	 * futex_hash()			__futex_pivot_hash()
+	 *   guard(rcu);		  guard(mm->futex_hash_lock);
+	 *   fph = mm->futex_phash;
+	 *				  rcu_assign_pointer(&mm->futex_phash, new);
+	 *				futex_hash_allocate()
+	 *				  futex_ref_drop()
+	 *				    fph->state = FR_ATOMIC;
+	 *				    atomic_set(, BIAS);
+	 *
+	 *   futex_private_hash_get(fph); // OOPS
+	 *
+	 * Where an old fph (which is FR_ATOMIC and should fail on
+	 * inc_not_zero) will succeed, because a new transition is started
+	 * and the atomic is bias'ed away from 0.
+	 *
+	 * There must be at least one full grace-period between publishing a
+	 * new fph and trying to replace it.
+	 */
+	if (poll_state_synchronize_rcu(mm->futex_batches)) {
+		/*
+		 * There was a grace-period, we can begin now.
+		 */
+		__futex_ref_atomic_begin(fph);
+		return;
+	}
+
+	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+}
+
+static bool futex_ref_get(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	guard(rcu)();
+
+	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
+		this_cpu_inc(*mm->futex_ref);
+		return true;
+	}
+
+	return atomic_long_inc_not_zero(&mm->futex_atomic);
+}
+
+static bool futex_ref_put(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	guard(rcu)();
+
+	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
+		this_cpu_dec(*mm->futex_ref);
+		return false;
+	}
+
+	return atomic_long_dec_and_test(&mm->futex_atomic);
+}
+
+static bool futex_ref_is_dead(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	guard(rcu)();
+
+	if (smp_load_acquire(&fph->state) == FR_PERCPU)
+		return false;
+
+	return atomic_long_read(&mm->futex_atomic) == 0;
+}
+
+int futex_mm_init(struct mm_struct *mm)
+{
+	mutex_init(&mm->futex_hash_lock);
+	RCU_INIT_POINTER(mm->futex_phash, NULL);
+	mm->futex_phash_new = NULL;
+	/* futex-ref */
+	atomic_long_set(&mm->futex_atomic, 0);
+	mm->futex_batches = get_state_synchronize_rcu();
+	mm->futex_ref = alloc_percpu(unsigned int);
+	if (!mm->futex_ref)
+		return -ENOMEM;
+	this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+	return 0;
+}
+
 void futex_hash_free(struct mm_struct *mm)
 {
 	struct futex_private_hash *fph;
 
+	free_percpu(mm->futex_ref);
 	kvfree(mm->futex_phash_new);
 	fph = rcu_dereference_raw(mm->futex_phash);
-	if (fph) {
-		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+	if (fph)
 		kvfree(fph);
-	}
 }
 
 static bool futex_pivot_pending(struct mm_struct *mm)
@@ -1549,7 +1752,7 @@ static bool futex_pivot_pending(struct mm_struct *mm)
 		return true;
 
 	fph = rcu_dereference(mm->futex_phash);
-	return rcuref_is_dead(&fph->users);
+	return futex_ref_is_dead(fph);
 }
 
 static bool futex_hash_less(struct futex_private_hash *a,
@@ -1598,11 +1801,11 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 		}
 	}
 
-	fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+	fph = kvzalloc(struct_size(fph, queues, hash_slots),
+		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (!fph)
 		return -ENOMEM;
 
-	rcuref_init(&fph->users, 1);
 	fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
 	fph->custom = custom;
 	fph->immutable = !!(flags & FH_IMMUTABLE);
@@ -1645,7 +1848,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 				 * allocated a replacement hash, drop the initial
 				 * reference on the existing hash.
 				 */
-				futex_private_hash_put(cur);
+				futex_ref_drop(cur);
 			}
 
 			if (new) {
-- 
2.50.0


* [PATCH 3/3] futex: Make futex_private_hash_get() static
  2025-07-07 14:36 [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Sebastian Andrzej Siewior
  2025-07-07 14:36 ` [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes Sebastian Andrzej Siewior
  2025-07-07 14:36 ` [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t Sebastian Andrzej Siewior
@ 2025-07-07 14:36 ` Sebastian Andrzej Siewior
  2025-07-08  7:50 ` [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Peter Zijlstra
  3 siblings, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-07 14:36 UTC
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

futex_private_hash_get() is not used outside of its compilation unit.
Make it static.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c  | 2 +-
 kernel/futex/futex.h | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b578a536a4fe2..f184d92ed7264 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -143,7 +143,7 @@ static inline bool futex_key_is_private(union futex_key *key)
 	return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
 }
 
-bool futex_private_hash_get(struct futex_private_hash *fph)
+static bool futex_private_hash_get(struct futex_private_hash *fph)
 {
 	if (fph->immutable)
 		return true;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index fcd1617212eed..c74eac572acd7 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -228,14 +228,12 @@ extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
 
 extern struct futex_private_hash *futex_private_hash(void);
-extern bool futex_private_hash_get(struct futex_private_hash *fph);
 extern void futex_private_hash_put(struct futex_private_hash *fph);
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
 static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
 static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
 static inline struct futex_private_hash *futex_private_hash(void) { return NULL; }
-static inline bool futex_private_hash_get(void) { return false; }
 static inline void futex_private_hash_put(struct futex_private_hash *fph) { }
 #endif
 
-- 
2.50.0


* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-07 14:36 ` [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t Sebastian Andrzej Siewior
@ 2025-07-08  6:51   ` Sebastian Andrzej Siewior
  2025-07-08  8:56   ` Hillf Danton
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-08  6:51 UTC
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-07-07 16:36:22 [+0200], To linux-kernel@vger.kernel.org wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> The use of rcuref_t for reference counting introduces a performance bottleneck
> when accessed concurrently by multiple threads during futex operations.

I just folded this bit in after the kernel test robot complained:

diff --git a/include/linux/futex.h b/include/linux/futex.h
index cd773febd497b..9e9750f049805 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -113,7 +113,7 @@ static inline int futex_hash_allocate_default(void)
 	return 0;
 }
 static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
-static inline void futex_mm_init(struct mm_struct *mm) { }
+static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
 
 #endif
 

Sebastian

* Re: [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes
  2025-07-07 14:36 ` [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes Sebastian Andrzej Siewior
@ 2025-07-08  7:41   ` Peter Zijlstra
  2025-07-08  8:01     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2025-07-08  7:41 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On Mon, Jul 07, 2025 at 04:36:21PM +0200, Sebastian Andrzej Siewior wrote:
> The auto scaling on thread creation used to automatically assign the new
> hash because the private hash was unused and could be replaced right
> away.
> 
> With the upcoming change to wait for an RCU grace period before the hash
> can be assigned, the test will always fail.
> 
> If the reported number of hash buckets is not updated after an
> auto scaling event, block on an already acquired lock with a timeout. The
> timeout is the delay to wait for a grace period, and blocking on a locked
> pthread_mutex_t ensures that glibc calls into the kernel via a futex
> operation, which will assign the new private hash if available.
> This is retried every 100ms, up to 2 seconds in total.

So the auto scaling thing is 'broken' in that if you do a 'final'
pthread_create() it will try and stage this new hash. If for whatever
reason the refcount isn't '0' -- and this can already happen today due
to a concurrent futex operation -- nothing will re-try the rehash.

This RCU business made this all but a certainty, but the race was there.

I briefly wondered if we should have a workqueue re-try the rehash.
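
For illustration, such a re-try could look roughly like this -- a sketch
only; the work item in mm_struct and queueing it from a failed pivot are
assumptions, not something this series does:

/* Hypothetical: re-try a pending rehash from workqueue context. */
static void futex_rehash_workfn(struct work_struct *work)
{
	/* assumes a hypothetical mm_struct::futex_rehash_work member */
	struct mm_struct *mm = container_of(work, struct mm_struct,
					    futex_rehash_work);
	struct futex_private_hash *new;

	guard(mutex)(&mm->futex_hash_lock);

	new = mm->futex_phash_new;
	if (!new)
		return;

	mm->futex_phash_new = NULL;
	/*
	 * __futex_pivot_hash() re-stages 'new' into futex_phash_new if
	 * the old hash still has users; the work would then have to be
	 * queued again from the final put.
	 */
	__futex_pivot_hash(mm, new);
}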

* Re: [PATCH 0/3] futex: Use RCU-based per-CPU reference counting
  2025-07-07 14:36 [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2025-07-07 14:36 ` [PATCH 3/3] futex: Make futex_private_hash_get() static Sebastian Andrzej Siewior
@ 2025-07-08  7:50 ` Peter Zijlstra
  2025-07-08  8:02   ` Sebastian Andrzej Siewior
  3 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2025-07-08  7:50 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On Mon, Jul 07, 2025 at 04:36:20PM +0200, Sebastian Andrzej Siewior wrote:
> Since it was not yet released, should we rip out the IMMUTABLE bits and
> just stick with GET/SET slots?

I suppose that makes sense. If someone ever comes up with another
use-case for it we can always add it back in.

* Re: [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes
  2025-07-08  7:41   ` Peter Zijlstra
@ 2025-07-08  8:01     ` Sebastian Andrzej Siewior
  2025-07-10  8:08       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-08  8:01 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-07-08 09:41:24 [+0200], Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 04:36:21PM +0200, Sebastian Andrzej Siewior wrote:
> > The auto scaling on thread creation used to automatically assign the new
> > hash because the private hash was unused and could be replaced right
> > away.
> > 
> > With the upcoming change to wait for an RCU grace period before the hash
> > can be assigned, the test will always fail.
> > 
> > If the reported number of hash buckets is not updated after an
> > auto scaling event, block on an already acquired lock with a timeout. The
> > timeout is the delay to wait for a grace period, and blocking on a locked
> > pthread_mutex_t ensures that glibc calls into the kernel via a futex
> > operation, which will assign the new private hash if available.
> > This is retried every 100ms, up to 2 seconds in total.
> 
> So the auto scaling thing is 'broken' in that if you do a 'final'
> pthread_create() it will try and stage this new hash. If for whatever
> reason the refcount isn't '0' -- and this can already happen today due
> to a concurrent futex operation -- nothing will re-try the rehash.

Sure, it was there, but not in the way the test case was set up. I *think*
it is okay because in real life the new hash will be put to use unless
you terminate shortly after, at which point you don't need it.

> This RCU business made this all but a certainty, but the race was there.
>
> I briefly wondered if we should have a workqueue re-try the rehash.

I don't think we need to worry about it.

Sebastian

* Re: [PATCH 0/3] futex: Use RCU-based per-CPU reference counting
  2025-07-08  7:50 ` [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Peter Zijlstra
@ 2025-07-08  8:02   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-08  8:02 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-07-08 09:50:25 [+0200], Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 04:36:20PM +0200, Sebastian Andrzej Siewior wrote:
> > Since it was not yet released, should we rip out the IMMUTABLE bits and
> > just stick with GET/SET slots?
> 
> I suppose that makes sense. If someone ever comes up with another
> use-case for it we can always add it back in.

I'll add a follow-up patch to rip it out and update the docs.
_One_ time we have the docs in time and then…

Sebastian

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-07 14:36 ` [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t Sebastian Andrzej Siewior
  2025-07-08  6:51   ` Sebastian Andrzej Siewior
@ 2025-07-08  8:56   ` Hillf Danton
  2025-07-08  9:16     ` Sebastian Andrzej Siewior
  2025-07-08 13:09   ` Sebastian Andrzej Siewior
  2025-07-08 13:43   ` Waiman Long
  3 siblings, 1 reply; 20+ messages in thread
From: Hillf Danton @ 2025-07-08  8:56 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Davidlohr Bueso, Peter Zijlstra, Waiman Long

On Mon,  7 Jul 2025 16:36:22 +0200 Sebastian Andrzej Siewior wrote:
> +static bool futex_ref_get(struct futex_private_hash *fph)
> +{
> +	struct mm_struct *mm = fph->mm;
> +
> +	guard(rcu)();
> +
Like regular refcount_t, it is buggy to touch fph if futex_atomic drops
to 0. And more importantly, guard(rcu) does not prevent it from dropping to 0.

> +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> +		this_cpu_inc(*mm->futex_ref);
> +		return true;
> +	}
> +
> +	return atomic_long_inc_not_zero(&mm->futex_atomic);
> +}
> +
> +static bool futex_ref_put(struct futex_private_hash *fph)
> +{
> +	struct mm_struct *mm = fph->mm;
> +
> +	guard(rcu)();
> +
> +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> +		this_cpu_dec(*mm->futex_ref);
> +		return false;
> +	}
> +
> +	return atomic_long_dec_and_test(&mm->futex_atomic);
> +}

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-08  8:56   ` Hillf Danton
@ 2025-07-08  9:16     ` Sebastian Andrzej Siewior
  2025-07-08 12:01       ` Hillf Danton
  0 siblings, 1 reply; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-08  9:16 UTC
  To: Hillf Danton; +Cc: linux-kernel, Davidlohr Bueso, Peter Zijlstra, Waiman Long

On 2025-07-08 16:56:39 [+0800], Hillf Danton wrote:
> On Mon,  7 Jul 2025 16:36:22 +0200 Sebastian Andrzej Siewior wrote:
> > +static bool futex_ref_get(struct futex_private_hash *fph)
> > +{
> > +	struct mm_struct *mm = fph->mm;
> > +
> > +	guard(rcu)();
> > +
> Like regular refcount_t, it is buggy to touch fph if futex_atomic drops
> to 0. And more importantly, guard(rcu) does not prevent it from dropping to 0.

What is your intention with this? There is an inc-if-not-zero to ensure
this does not happen. And it has to drop to zero in order to get
replaced.

> > +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > +		this_cpu_inc(*mm->futex_ref);
> > +		return true;
> > +	}
> > +
> > +	return atomic_long_inc_not_zero(&mm->futex_atomic);
> > +}
> > +
> > +static bool futex_ref_put(struct futex_private_hash *fph)
> > +{
> > +	struct mm_struct *mm = fph->mm;
> > +
> > +	guard(rcu)();
> > +
> > +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > +		this_cpu_dec(*mm->futex_ref);
> > +		return false;
> > +	}
> > +
> > +	return atomic_long_dec_and_test(&mm->futex_atomic);
> > +}

Sebastian

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-08  9:16     ` Sebastian Andrzej Siewior
@ 2025-07-08 12:01       ` Hillf Danton
  2025-07-08 13:15         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 20+ messages in thread
From: Hillf Danton @ 2025-07-08 12:01 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Davidlohr Bueso, Peter Zijlstra, Waiman Long

On Tue, 8 Jul 2025 11:16:26 +0200 Sebastian Andrzej Siewior wrote:
> On 2025-07-08 16:56:39 [+0800], Hillf Danton wrote:
> > On Mon,  7 Jul 2025 16:36:22 +0200 Sebastian Andrzej Siewior wrote:
> > > +static bool futex_ref_get(struct futex_private_hash *fph)
> > > +{
> > > +	struct mm_struct *mm = fph->mm;
> > > +
> > > +	guard(rcu)();
> > > +
> > Like regular refcount_t, it is buggy to touch fph if futex_atomic drops
> > to 0. And more importantly, guard(rcu) does not prevent it from dropping to 0.
> 
> What is your intention with this? There is an inc-if-not-zero to ensure

I am just simply wondering why get and put do not work without the rcu guard?

> this does not happen. And it has to drop to zero in order to get
> replaced.
> 
> > > +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > > +		this_cpu_inc(*mm->futex_ref);
> > > +		return true;
> > > +	}
> > > +
> > > +	return atomic_long_inc_not_zero(&mm->futex_atomic);
> > > +}
> > > +
> > > +static bool futex_ref_put(struct futex_private_hash *fph)
> > > +{
> > > +	struct mm_struct *mm = fph->mm;
> > > +
> > > +	guard(rcu)();
> > > +
> > > +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > > +		this_cpu_dec(*mm->futex_ref);
> > > +		return false;
> > > +	}
> > > +
> > > +	return atomic_long_dec_and_test(&mm->futex_atomic);
> > > +}
> 
> Sebastian

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-07 14:36 ` [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t Sebastian Andrzej Siewior
  2025-07-08  6:51   ` Sebastian Andrzej Siewior
  2025-07-08  8:56   ` Hillf Danton
@ 2025-07-08 13:09   ` Sebastian Andrzej Siewior
  2025-07-08 13:43   ` Waiman Long
  3 siblings, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-08 13:09 UTC
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-07-07 16:36:22 [+0200], To linux-kernel@vger.kernel.org wrote:

so a box was doing innocent things and then this happened:

|  slab mm_struct start ffff888549a50580 pointer offset 280 size 1352
| BUG: kernel NULL pointer dereference, address: 0000000000000000
| #PF: supervisor instruction fetch in kernel mode
| #PF: error_code(0x0010) - not-present page
| PGD 0 P4D 0
| Oops: Oops: 0010 [#1] SMP
| CPU: 11 UID: 1001 PID: 125007 Comm: clang Not tainted 6.16.0-rc5+ #262 PREEMPT(lazy)  3bf8bc6327fe388c2a27e778516b456f280aa854
| Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
| RIP: 0010:0x0
| Code: Unable to access opcode bytes at 0xffffffffffffffd6.
| RSP: 0000:ffffc90020317e60 EFLAGS: 00010282
| RAX: 0000000000000001 RBX: 0000000000000006 RCX: 0000000000000000
| RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888549a50698
| RBP: ffff888a3faeab00 R08: 0000000000000000 R09: ffffc90020317bc8
| R10: ffffffff8296bdc8 R11: 0000000000000003 R12: ffff8881b6f80000
| R13: ffffc90020317e98 R14: 0000000000000005 R15: 0000000000000000
| FS:  00007fd37b766c40(0000) GS:ffff888abc9ef000(0000) knlGS:0000000000000000
| CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| CR2: ffffffffffffffd6 CR3: 00000005c3d5f001 CR4: 00000000000626f0
| Call Trace:
|  <TASK>
|  rcu_core+0x27c/0x720
|  ? rcu_core+0x21c/0x720
|  handle_softirqs+0xc5/0x260
|  irq_exit_rcu+0x85/0xa0
|  sysvec_apic_timer_interrupt+0x3d/0x90
|  asm_sysvec_apic_timer_interrupt+0x1a/0x20
| RIP: 0033:0x7fd3862e1190
| Code: 41 56 53 48 89 d3 49 89 f6 49 89 ff 48 c7 47 08 00 00 00 00 8b 47 10 48 85 c0 74 52 49 8b 0f 48 c1 e0 04 31 d2 0f 1f 44 00 00 <48> c7 04 11 00 f0 ff ff 48 83 c2  10 48 39 d0 75 ef eb 31 4d 85 c9
| RSP: 002b:00007fff1e519aa0 EFLAGS: 00010202
| RAX: 0000000000040000 RBX: 00007fd37b71f010 RCX: 00007fd37b5c3010
| RDX: 000000000000fff0 RSI: 00007fd37b6ff010 RDI: 000055836a11e290
| RBP: 0000000000150050 R08: 00000000ffffffff R09: 0000000000000000
| R10: 0000000000000022 R11: 0000000000000246 R12: 000055836a597420
| R13: 000055836a593030 R14: 00007fd37b6ff010 R15: 000055836a11e290
|  </TASK>
| Modules linked in:
| Dumping ftrace buffer:
…
| CR2: 0000000000000000
| ---[ end trace 0000000000000000 ]---
| RIP: 0010:0x0
| Code: Unable to access opcode bytes at 0xffffffffffffffd6.
| RSP: 0000:ffffc90020317e60 EFLAGS: 00010282
| RAX: 0000000000000001 RBX: 0000000000000006 RCX: 0000000000000000
| RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888549a50698

on the plus side there is no evidence that this could be futex related
:)

However, I was wondering whether this could be because nothing ensures
that the mm stays around after the RCU callback has been enqueued.

What about this:

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index b13474825130f..2201da0afecc5 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -140,7 +140,7 @@ static inline bool mmget_not_zero(struct mm_struct *mm)
 
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
-#ifdef CONFIG_MMU
+#if defined(CONFIG_MMU) || defined(CONFIG_FUTEX_PRIVATE_HASH)
 /* same as above but performs the slow path from the async context. Can
  * be called from the atomic context as well
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index 66c4d4cc2340b..0b885dcbde9af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1149,7 +1149,7 @@ void mmput(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mmput);
 
-#ifdef CONFIG_MMU
+#if defined(CONFIG_MMU) || defined(CONFIG_FUTEX_PRIVATE_HASH)
 static void mmput_async_fn(struct work_struct *work)
 {
 	struct mm_struct *mm = container_of(work, struct mm_struct,
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d1877abbb7147..cd8463f3d1026 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1602,6 +1602,7 @@ static void __futex_ref_atomic_end(struct futex_private_hash *fph)
 		wake_up_var(mm);
 
 	WARN_ON_ONCE(ret < 0);
+	mmput_async(mm);
 }
 
 static void futex_ref_rcu(struct rcu_head *head)
@@ -1637,6 +1638,11 @@ static void futex_ref_drop(struct futex_private_hash *fph)
 	 * Can only transition the current fph;
 	 */
 	WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph);
+	/*
+	 * We enqueue at least one RCU callback. Ensure the mm stays around
+	 * if the task exits before the transition is completed.
+	 */
+	mmget(mm);
 
 	/*
 	 * In order to avoid the following scenario:
-- 
2.50.0

Sebastian

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-08 12:01       ` Hillf Danton
@ 2025-07-08 13:15         ` Sebastian Andrzej Siewior
  2025-07-08 22:00           ` Hillf Danton
  0 siblings, 1 reply; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-08 13:15 UTC
  To: Hillf Danton; +Cc: linux-kernel, Davidlohr Bueso, Peter Zijlstra, Waiman Long

On 2025-07-08 20:01:56 [+0800], Hillf Danton wrote:
> On Tue, 8 Jul 2025 11:16:26 +0200 Sebastian Andrzej Siewior wrote:
> > On 2025-07-08 16:56:39 [+0800], Hillf Danton wrote:
> > > On Mon,  7 Jul 2025 16:36:22 +0200 Sebastian Andrzej Siewior wrote:
> > > > +static bool futex_ref_get(struct futex_private_hash *fph)
> > > > +{
> > > > +	struct mm_struct *mm = fph->mm;
> > > > +
> > > > +	guard(rcu)();
> > > > +
> > > Like regular refcount_t, it is buggy to touch fph if futex_atomic drops
> > > to 0. And more important guard(rcu) does not prevent it from dropping to 0.
> > 
> > What is your intention with this? There is an inc-if-not-zero to ensure
> 
> I am just simply wondering why get and put do not work without the rcu guard?

To ensure every get/put user within this section observes the switch to
atomics. There is a call_rcu() callback which performs the switch. It
will be invoked only after every user that was using the per-CPU counter
is gone and everyone is using the atomic one.

Sebastian

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-07 14:36 ` [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t Sebastian Andrzej Siewior
                     ` (2 preceding siblings ...)
  2025-07-08 13:09   ` Sebastian Andrzej Siewior
@ 2025-07-08 13:43   ` Waiman Long
  2025-07-08 13:47     ` Sebastian Andrzej Siewior
  3 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2025-07-08 13:43 UTC
  To: Sebastian Andrzej Siewior, linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider

On 7/7/25 10:36 AM, Sebastian Andrzej Siewior wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> The use of rcuref_t for reference counting introduces a performance bottleneck
> when accessed concurrently by multiple threads during futex operations.
>
> Replace rcuref_t with specially crafted per-CPU reference counters. The
> lifetime logic remains the same.
>
> The newly allocated private hash starts in the FR_PERCPU state. In this
> state, each futex operation that requires the private hash uses a per-CPU
> counter (an unsigned int) for incrementing or decrementing the reference
> count.
>
> When the private hash is about to be replaced, the per-CPU counters are
> migrated to an atomic_long_t counter, mm_struct::futex_atomic.
> The migration process:
> - Wait for one RCU grace period to ensure all users observe the
>    current private hash. This can be skipped if a grace period elapsed
>    since the private hash was assigned.
>
> - futex_private_hash::state is set to FR_ATOMIC, forcing all users to
>    use mm_struct::futex_atomic for reference counting.
>
> - After an RCU grace period, all users are guaranteed to be using the
>    atomic counter. The per-CPU counters can now be summed up and added to
>    the atomic_long_t counter. If the resulting count is zero, the hash
>    can be safely replaced. Otherwise, active users still hold a valid
>    reference.
>
> - Once the atomic reference count drops to zero, the next futex
>    operation will switch to the new private hash.
>
> call_rcu_hurry() is used to speed up the transition, which otherwise
> might be delayed with RCU_LAZY. There is nothing wrong with using
> call_rcu(); the side effects would be that with auto scaling the new
> hash is used later and the SET_SLOTS prctl() will block longer.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

This looks somewhat like what the percpu refcount does (see 
lib/percpu-refcount.c). Could this be used instead of reinventing the 
wheel again?
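
For readers following along, the lib/percpu-refcount.c lifecycle looks
roughly like this (a sketch; 'users' mirrors the rcuref_t field this
series removes):

#include <linux/percpu-refcount.h>

static struct percpu_ref users;

static void users_release(struct percpu_ref *ref)
{
	/* Last reference is gone; the hash could be freed here. */
}

static int example_init(void)
{
	/* Starts in per-cpu mode, holding one initial reference. */
	return percpu_ref_init(&users, users_release, 0, GFP_KERNEL);
}

static void example_get_put(void)
{
	percpu_ref_get(&users);		/* per-cpu fast path */
	percpu_ref_put(&users);
}

static void example_retire(void)
{
	/* Switch to atomic mode and drop the initial reference. */
	percpu_ref_kill(&users);
}

The sticking point, going by the comment in the patch, is the way back:
after percpu_ref_kill() the ref stays in atomic mode, while the futex
code wants every hash replacement to start the next generation in
per-cpu mode again; percpu_ref_reinit() exists but is a separate,
explicit step.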

Cheers,
Longman


* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-08 13:43   ` Waiman Long
@ 2025-07-08 13:47     ` Sebastian Andrzej Siewior
  2025-07-08 19:06       ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-08 13:47 UTC
  To: Waiman Long
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider

On 2025-07-08 09:43:44 [-0400], Waiman Long wrote:
> This looks somewhat like what the percpu refcount does (see
> lib/percpu-refcount.c). Could this be used instead of reinventing the wheel
> again?

From the comment:

  * futex-ref
  *
  * Heavily inspired by percpu-rwsem/percpu-refcount; not reusing any of that
  * code because it just doesn't fit right.
  *
  * Dual counter, per-cpu / atomic approach like percpu-refcount, except it
  * re-initializes the state automatically, such that the fph swizzle is also a
  * transition back to per-cpu.

but I leave it up to Peter if he considers merging that.

> Cheers,
> Longman

Sebastian

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-08 13:47     ` Sebastian Andrzej Siewior
@ 2025-07-08 19:06       ` Peter Zijlstra
  2025-07-10 15:58         ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2025-07-08 19:06 UTC
  To: Sebastian Andrzej Siewior
  Cc: Waiman Long, linux-kernel, André Almeida, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Juri Lelli, Thomas Gleixner,
	Valentin Schneider

On Tue, Jul 08, 2025 at 03:47:08PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-07-08 09:43:44 [-0400], Waiman Long wrote:
> > This looks somewhat like what the percpu refcount does (see
> > lib/percpu-refcount.c). Could this be used instead of reinventing the wheel
> > again?
> 
> From the comment:
> 
>   * futex-ref
>   *
>   * Heavily inspired by percpu-rwsem/percpu-refcount; not reusing any of that
>   * code because it just doesn't fit right.
>   *
>   * Dual counter, per-cpu / atomic approach like percpu-refcount, except it
>   * re-initializes the state automatically, such that the fph swizzle is also a
>   * transition back to per-cpu.
> 
> but I leave it up to Peter if he considers merging that.

Basically what the comment says. Trying to reuse things ended up in a
mess. It really isn't much code, most of it is comments.

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-08 13:15         ` Sebastian Andrzej Siewior
@ 2025-07-08 22:00           ` Hillf Danton
  0 siblings, 0 replies; 20+ messages in thread
From: Hillf Danton @ 2025-07-08 22:00 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Davidlohr Bueso, Peter Zijlstra, Waiman Long

On Tue, 8 Jul 2025 15:15:58 +0200 Sebastian Andrzej Siewior wrote:
> On 2025-07-08 20:01:56 [+0800], Hillf Danton wrote:
> > On Tue, 8 Jul 2025 11:16:26 +0200 Sebastian Andrzej Siewior wrote:
> > > On 2025-07-08 16:56:39 [+0800], Hillf Danton wrote:
> > > > On Mon,  7 Jul 2025 16:36:22 +0200 Sebastian Andrzej Siewior wrote:
> > > > > +static bool futex_ref_get(struct futex_private_hash *fph)
> > > > > +{
> > > > > +	struct mm_struct *mm = fph->mm;
> > > > > +
> > > > > +	guard(rcu)();
> > > > > +
> > > > Like regular refcount_t, it is buggy to touch fph if futex_atomic drops
> > > > to 0. And more importantly, guard(rcu) does not prevent it from dropping to 0.
> > > 
> > > What is your intention with this? There is an inc-if-not-zero to ensure
> > 
> > I am just simply wondering why get and put do not work without the rcu guard?
> 
> To ensure every get/put user within this section observes the switch to
> atomics. There is a call_rcu() callback which performs the switch. It
> will be invoked only after every user that was using the per-CPU counter
> is gone and everyone is using the atomic one.
> 
Then percpu refcount sounds like a better option because it is free at least.

* Re: [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes
  2025-07-08  8:01     ` Sebastian Andrzej Siewior
@ 2025-07-10  8:08       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 20+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-10  8:08 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-07-08 10:01:42 [+0200], To Peter Zijlstra wrote:
> > So the auto scaling thing is 'broken' in that if you do a 'final'
> > pthread_create() it will try and stage this new hash. If for whatever
> > reason the refcount isn't '0' -- and this can already happen today due
> > to a concurrent futex operation -- nothing will re-try the rehash.
> 
> Sure, it was there, but not in the way the test case was set up. I *think*
> it is okay because in real life the new hash will be put to use unless
> you terminate shortly after, at which point you don't need it.
> 
> > This RCU business made this all but a certainty, but the race was there.
> >
> > I briefly wondered if we should have a workqueue re-try the rehash.
> 
> I don't think we need to worry about it.

Just to replace the *think* arguments with some technical foundation: I
added a few trace printks to see how it is going and this came out:

|    zstd-2308004 [012] ..... 01.857262: __futex_pivot_hash: mm ffff88854c3f9b80 fph ffff88854ed3f000 (16)
first thread/ assignment
|    zstd-2308004 [012] ..... 01.858722: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
|    zstd-2308004 [012] ..... 01.858760: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
|    zstd-2308004 [012] ..... 01.858792: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
|    zstd-2308004 [012] ..... 01.858830: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff88854a795000
|    zstd-2308004 [012] ..... 01.858865: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.858913: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.858963: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859019: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859057: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859092: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859125: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859160: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff888782280000
|    zstd-2308004 [012] ..... 01.859202: futex_hash_allocate: mm ffff88854c3f9b80 current ffff88854ed3f000 pending ffff8885b65c8000

One thread after another is created. So it ends up allocating a new
futex_private_hash because the current mask is less than what we want to
have. We then acquire mm_struct::futex_hash_lock, notice that
mm_struct::futex_phash_new is already set (pending in the trace) and
free the newly allocated one.
As an optimization we could check mm_struct::futex_phash_new under the
lock and skip the allocation if the already allocated fph is okay; a
rough sketch follows below. We will acquire the lock for assignment
anyway so we could skip the allocation in this case. As you can see, the
pending address changed a few times because, as the number of threads
increased, the size also increased three times.
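
The check could look like this -- placement at the top of
futex_hash_allocate() and the exact condition are assumptions:

	/* Hypothetical: skip the allocation if a pending hash already fits. */
	scoped_guard(mutex, &mm->futex_hash_lock) {
		struct futex_private_hash *pending = mm->futex_phash_new;

		if (pending && pending->hash_mask >= hash_slots - 1)
			return 0;
	}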

|  <idle>-0       [012] ..s1. 01.890423: futex_ref_rcu: mm ffff88854c3f9b80 [0] ffff88854ed3f000
|  <idle>-0       [012] ..s1. 01.926421: futex_ref_rcu: mm ffff88854c3f9b80 [1] ffff88854ed3f000
the state transition, two times because setting up the threads was so
quick.

|    zstd-2308007 [031] ..... 02.724004: __futex_pivot_hash: mm ffff88854c3f9b80 fph ffff8885b65c8000 (128)
almost a second later the first futex op led to the assignment of the hash.

|    zstd-2308004 [012] ..... 41.162795: futex_hash_free: mm ffff88854c3f9b80 fph ffff8885b65c8000 new 0000000000000000
almost 40 secs later, the application is done.

This is an example of "let me set up the threads, I need them for a while".
A different one:

|         lz4-2309026 [008] ..... 2.654404: __futex_pivot_hash: mm ffff888664977380 fph ffff8885b2000800 (16)
first assignment.

|         lz4-2309026 [008] ..... 2.654597: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
|         lz4-2309026 [008] ..... 2.654641: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
|         lz4-2309026 [008] ..... 2.654690: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
|         lz4-2309026 [008] ..... 2.654732: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff88854d0bc000
|         lz4-2309026 [008] ..... 2.654774: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654812: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654847: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654884: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654919: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.654962: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.655002: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.655039: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff888615e3e000
|         lz4-2309026 [008] ..... 2.655202: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655250: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655288: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655329: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655370: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655411: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655449: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655488: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655525: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655562: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
|         lz4-2309026 [008] ..... 2.655603: futex_hash_allocate: mm ffff888664977380 current ffff8885b2000800 pending ffff8885ac4b4000
a bunch of threads is created.

|      <idle>-0       [008] ..s1. 2.689306: futex_ref_rcu: mm ffff888664977380 [0] ffff8885b2000800
|      <idle>-0       [008] ..s1. 2.725324: futex_ref_rcu: mm ffff888664977380 [1] ffff8885b2000800
state transition over

|kworker/8:38-2307372 [008] ..... 2.727104: futex_hash_free: mm ffff888664977380 fph ffff8885b2000800 new ffff8885ac4b4000

The hash is released from the kworker due to mmput_async(). It frees the
current hash and the new one, which was never assigned.
I have a few examples similar to the lz4 one where the new hash was not
assigned, either because the application terminated early or because it
did not invoke any futex opcodes after that.

Therefore it does not make sense to eagerly assign the new hash as soon
as it becomes possible and to mobilize a kworker for it. Avoiding the
alloc/free dance, however, would make sense.

Sebastian

* Re: [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t
  2025-07-08 19:06       ` Peter Zijlstra
@ 2025-07-10 15:58         ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2025-07-10 15:58 UTC
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: Waiman Long, linux-kernel, André Almeida, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Juri Lelli, Thomas Gleixner,
	Valentin Schneider

On 7/8/25 3:06 PM, Peter Zijlstra wrote:
> On Tue, Jul 08, 2025 at 03:47:08PM +0200, Sebastian Andrzej Siewior wrote:
>> On 2025-07-08 09:43:44 [-0400], Waiman Long wrote:
>>> This looks somewhat like what the percpu refcount does (see
>>> lib/percpu-refcount.c). Could this be used instead of reinventing the wheel
>>> again?
>>  From the comment:
>>
>>    * futex-ref
>>    *
>>    * Heavily inspired by percpu-rwsem/percpu-refcount; not reusing any of that
>>    * code because it just doesn't fit right.
>>    *
>>    * Dual counter, per-cpu / atomic approach like percpu-refcount, except it
>>    * re-initializes the state automatically, such that the fph swizzle is also a
>>    * transition back to per-cpu.
>>
>> but I leave it up to Peter if he considers merging that.
> Basically what the comment says. Trying to reuse things ended up in a
> mess. It really isn't much code, most of it is comments.
>
I got it now. I am not against adding a variant specific to this code
given that we want to fix the performance regression ASAP. Merging it
into an existing set of helpers may be something we want to do in the
future.

Cheers,
Longman


Thread overview: 20+ messages
2025-07-07 14:36 [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Sebastian Andrzej Siewior
2025-07-07 14:36 ` [PATCH 1/3] selftests/futex: Adapt the private hash test to RCU related changes Sebastian Andrzej Siewior
2025-07-08  7:41   ` Peter Zijlstra
2025-07-08  8:01     ` Sebastian Andrzej Siewior
2025-07-10  8:08       ` Sebastian Andrzej Siewior
2025-07-07 14:36 ` [PATCH 2/3] futex: Use RCU-based per-CPU reference counting instead of rcuref_t Sebastian Andrzej Siewior
2025-07-08  6:51   ` Sebastian Andrzej Siewior
2025-07-08  8:56   ` Hillf Danton
2025-07-08  9:16     ` Sebastian Andrzej Siewior
2025-07-08 12:01       ` Hillf Danton
2025-07-08 13:15         ` Sebastian Andrzej Siewior
2025-07-08 22:00           ` Hillf Danton
2025-07-08 13:09   ` Sebastian Andrzej Siewior
2025-07-08 13:43   ` Waiman Long
2025-07-08 13:47     ` Sebastian Andrzej Siewior
2025-07-08 19:06       ` Peter Zijlstra
2025-07-10 15:58         ` Waiman Long
2025-07-07 14:36 ` [PATCH 3/3] futex: Make futex_private_hash_get() static Sebastian Andrzej Siewior
2025-07-08  7:50 ` [PATCH 0/3] futex: Use RCU-based per-CPU reference counting Peter Zijlstra
2025-07-08  8:02   ` Sebastian Andrzej Siewior
