public inbox for linux-kernel@vger.kernel.org
* [PATCH v10 00/21] futex: Add support for task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
@ 2025-03-12 15:16 Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
                   ` (22 more replies)
  0 siblings, 23 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

Hi,

this is a follow-up on
        https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb

and adds support for a task local futex_hash_bucket.

This is the local hash map series based on v9, extended with PeterZ's
FUTEX2_NUMA and FUTEX2_MPOL work, plus a few fixes on top.

The complete tree is at
	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10

v9…v10: https://lore.kernel.org/all/20250225170914.289358-1-bigeasy@linutronix.de/
  - The rcuref_read() check in __futex_hash_private() has been replaced
    with rcuref_is_dead(), which is added as part of the series.

  - The local hash support depended on !CONFIG_BASE_SMALL, which has
    been replaced with CONFIG_FUTEX_PRIVATE_HASH. This is defined as
    "!BASE_SMALL && MMU" because, as part of the rework,
    futex_key::private::mm is used, which is not set on !CONFIG_MMU
    builds (a Kconfig sketch follows this list).

  - Added CONFIG_FUTEX_MPOL to build on !NUMA configs.

  - Replaced direct access to mm_struct::futex_phash with an RCU
    accessor.

  - futex_hash_allocate() for !CONFIG_FUTEX_PRIVATE_HASH returns an
    error. This does not affect fork() but is noticed by
    PR_FUTEX_HASH_SET_SLOTS (a prctl() usage sketch follows the
    changelog below).

  - futex_init() ensures the computed hashsize is not less than 4 after
    the division by num_possible_nodes().

  - futex_init() again prints info about the global hash: the number of
    hash table entries used, the memory occupied and the allocation
    method. This output had vanished with the removal of
    alloc_large_system_hash().

  - There is again a WARN_ON in the futex_hash_free() path if the task
    failed to drop all references (that would be a leak).

  - vmalloc_huge_node_noprof():
    - Replaced __vmalloc_node_range() with __vmalloc_node_range_noprof()
      to skip the alloc_hooks() layer, which is already part of
      vmalloc_huge_node().
    - Added vmalloc_huge_node_noprof() for !MMU.
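
The CONFIG_FUTEX_PRIVATE_HASH entry, sketched from the description above
(the authoritative wording is in the init/Kconfig hunk of the series):

	config FUTEX_PRIVATE_HASH
		bool
		depends on FUTEX && !BASE_SMALL && MMU
		default y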

v8…v9: https://lore.kernel.org/all/20250203135935.440018-1-bigeasy@linutronix.de
  - Rebased on top of PeterZ's futex_class rework.
  - A few patches vanished due to the class rework.
  - struct futex_hash_bucket now has a pointer to futex_private_hash
    instead of a slot number.
  - CONFIG_BASE_SMALL now removes support for the "futex local hash"
    instead of restricting it to 2 slots.
  - The number of threads, used to determine the number of slots, is
    capped at num_online_cpus().
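
For illustration, a minimal user of the new prctl() interface. This is a
hypothetical sketch: the PR_FUTEX_HASH constant names and values are
assumptions based on this series' include/uapi/linux/prctl.h hunk and
may differ in the final version:

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_FUTEX_HASH
	# define PR_FUTEX_HASH			78
	# define PR_FUTEX_HASH_SET_SLOTS	1
	# define PR_FUTEX_HASH_GET_SLOTS	2
	#endif

	int main(void)
	{
		/* Ask for a process-private futex hash with 16 slots. */
		if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 16, 0, 0) < 0)
			perror("PR_FUTEX_HASH_SET_SLOTS");

		/* Read back how many slots are currently in use. */
		printf("slots: %d\n",
		       prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0, 0, 0));
		return 0;
	}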

Peter Zijlstra (11):
  futex: Move futex_queue() into futex_wait_setup()
  futex: Pull futex_hash() out of futex_q_lock()
  futex: Create hb scopes
  futex: Create futex_hash() get/put class
  futex: s/hb_p/fph/
  futex: Remove superfluous state
  futex: Untangle and naming
  futex: Rework SET_SLOTS
  mm: Add vmalloc_huge_node()
  futex: Implement FUTEX2_NUMA
  futex: Implement FUTEX2_MPOL

Sebastian Andrzej Siewior (10):
  rcuref: Provide rcuref_is_dead().
  futex: Create helper function to initialize a hash slot.
  futex: Add basic infrastructure for task local hash.
  futex: Hash only the address for private futexes.
  futex: Allow automatic allocation of process wide futex hash.
  futex: Decrease the waiter count before the unlock operation.
  futex: Introduce futex_q_lockptr_lock().
  futex: Acquire a hash reference in futex_wait_multiple_setup().
  futex: Allow to re-allocate the private local hash.
  futex: Resize local futex hash table based on number of threads.

 include/linux/futex.h      |  34 +-
 include/linux/mm_types.h   |   7 +-
 include/linux/mmap_lock.h  |   4 +
 include/linux/rcuref.h     |  22 +-
 include/linux/vmalloc.h    |   3 +
 include/uapi/linux/futex.h |  10 +-
 include/uapi/linux/prctl.h |   5 +
 init/Kconfig               |  10 +
 io_uring/futex.c           |   4 +-
 kernel/fork.c              |  24 ++
 kernel/futex/core.c        | 746 +++++++++++++++++++++++++++++++++----
 kernel/futex/futex.h       |  81 +++-
 kernel/futex/pi.c          | 300 ++++++++-------
 kernel/futex/requeue.c     | 480 ++++++++++++------------
 kernel/futex/waitwake.c    | 203 +++++-----
 kernel/sys.c               |   4 +
 mm/nommu.c                 |   5 +
 mm/vmalloc.c               |   7 +
 18 files changed, 1396 insertions(+), 553 deletions(-)

-- 
2.47.2



* [PATCH v10 01/21] rcuref: Provide rcuref_is_dead().
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support for task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-13  4:23   ` Joel Fernandes
  2025-03-14 10:36   ` Peter Zijlstra
  2025-03-12 15:16 ` [PATCH v10 02/21] futex: Move futex_queue() into futex_wait_setup() Sebastian Andrzej Siewior
                   ` (21 subsequent siblings)
  22 siblings, 2 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

rcuref_read() returns the number of references that are currently held.
If 0 is returned, it is not safe to assume that the object can be
scheduled for deconstruction, because the counter may not be marked
DEAD. This happens if the return value of rcuref_put() is ignored and
assumptions are made based on rcuref_read() instead.

If 0 is returned, the counter transitioned from 0 to RCUREF_NOREF. If
rcuref_put() has not yet returned to the caller, the counter has not yet
transitioned from RCUREF_NOREF to RCUREF_DEAD. This means there is still
a chance that the counter will transition from RCUREF_NOREF back to 0,
meaning it is still valid and must not be deconstructed. In this brief
window rcuref_read() will return 0.

Provide rcuref_is_dead() to determine if the counter is marked as
RCUREF_DEAD.
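
A minimal sketch of the intended use; struct gadget and its helpers are
hypothetical:

	struct gadget {
		rcuref_t	ref;
		struct rcu_head	rcu;
	};

	static void gadget_put(struct gadget *g)
	{
		/* Final put: the counter goes RCUREF_NOREF -> RCUREF_DEAD. */
		if (rcuref_put(&g->ref))
			kfree_rcu(g, rcu);
	}

	static bool gadget_unused(struct gadget *g)
	{
		/*
		 * Checking rcuref_read() == 0 would be wrong here: 0 is also
		 * observable in the brief window before the RCUREF_NOREF ->
		 * RCUREF_DEAD transition, while a concurrent rcuref_get()
		 * may still revive the counter. Only the DEAD marking is
		 * final.
		 */
		return rcuref_is_dead(&g->ref);
	}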

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/rcuref.h | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c1c6b42..2fb2af6d98249 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
  * rcuref_read - Read the number of held reference counts of a rcuref
  * @ref:	Pointer to the reference count
  *
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuref_is_dead().
  */
 static inline unsigned int rcuref_read(rcuref_t *ref)
 {
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
 	return c >= RCUREF_RELEASED ? 0 : c + 1;
 }
 
+/**
+ * rcuref_is_dead -	Check if the rcuref has already been marked dead
+ * @ref:		Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+	unsigned int c = atomic_read(&ref->refcnt);
+
+	return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
 extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
 
 /**
-- 
2.47.2



* [PATCH v10 02/21] futex: Move futex_queue() into futex_wait_setup()
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support for task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 03/21] futex: Pull futex_hash() out of futex_q_lock() Sebastian Andrzej Siewior
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

futex_wait_setup() has a weird calling convention in order to return
hb for use as an argument to futex_queue(), mostly so that requeue can
have an extra test in between.

Reorder the code a little to get rid of this and keep the hb usage
inside futex_wait_setup().
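
In a nutshell (a sketch distilled from the hunks below):

	/* Old convention: hb escaped to the caller, which then queued: */
	ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
	if (!ret)
		futex_queue(&q, hb, NULL);

	/*
	 * New convention: futex_wait_setup() queues internally and hb
	 * never leaves the function. Regular waiters pass current as
	 * @task; io_uring passes NULL because the request, not the
	 * submitter, waits.
	 */
	ret = futex_wait_setup(uaddr, val, flags, &q, NULL, current);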

[bigeasy: fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 io_uring/futex.c        |  4 +---
 kernel/futex/futex.h    |  6 +++---
 kernel/futex/requeue.c  | 28 ++++++++++--------------
 kernel/futex/waitwake.c | 47 +++++++++++++++++++++++------------------
 4 files changed, 42 insertions(+), 43 deletions(-)

diff --git a/io_uring/futex.c b/io_uring/futex.c
index 43e2143255f57..e7c264db0818e 100644
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -311,7 +311,6 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_futex_data *ifd = NULL;
-	struct futex_hash_bucket *hb;
 	int ret;
 
 	if (!iof->futex_mask) {
@@ -333,12 +332,11 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
 	ifd->req = req;
 
 	ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
-			       &ifd->q, &hb);
+			       &ifd->q, NULL, NULL);
 	if (!ret) {
 		hlist_add_head(&req->hash_node, &ctx->futex_list);
 		io_ring_submit_unlock(ctx, issue_flags);
 
-		futex_queue(&ifd->q, hb, NULL);
 		return IOU_ISSUE_SKIP_COMPLETE;
 	}
 
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 6b2f4c7eb720f..16aafd0113442 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -219,9 +219,9 @@ static inline int futex_match(union futex_key *key1, union futex_key *key2)
 }
 
 extern int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
-			    struct futex_q *q, struct futex_hash_bucket **hb);
-extern void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
-				   struct hrtimer_sleeper *timeout);
+			    struct futex_q *q, union futex_key *key2,
+			    struct task_struct *task);
+extern void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout);
 extern bool __futex_wake_mark(struct futex_q *q);
 extern void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q);
 
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b47bb764b3520..0e55975af515c 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -769,7 +769,6 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct rt_mutex_waiter rt_waiter;
-	struct futex_hash_bucket *hb;
 	union futex_key key2 = FUTEX_KEY_INIT;
 	struct futex_q q = futex_q_init;
 	struct rt_mutex_base *pi_mutex;
@@ -805,29 +804,24 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	 * Prepare to wait on uaddr. On success, it holds hb->lock and q
 	 * is initialized.
 	 */
-	ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+	ret = futex_wait_setup(uaddr, val, flags, &q, &key2, current);
 	if (ret)
 		goto out;
 
-	/*
-	 * The check above which compares uaddrs is not sufficient for
-	 * shared futexes. We need to compare the keys:
-	 */
-	if (futex_match(&q.key, &key2)) {
-		futex_q_unlock(hb);
-		ret = -EINVAL;
-		goto out;
-	}
-
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
-	futex_wait_queue(hb, &q, to);
+	futex_do_wait(&q, to);
 
 	switch (futex_requeue_pi_wakeup_sync(&q)) {
 	case Q_REQUEUE_PI_IGNORE:
-		/* The waiter is still on uaddr1 */
-		spin_lock(&hb->lock);
-		ret = handle_early_requeue_pi_wakeup(hb, &q, to);
-		spin_unlock(&hb->lock);
+		{
+			struct futex_hash_bucket *hb;
+
+			hb = futex_hash(&q.key);
+			/* The waiter is still on uaddr1 */
+			spin_lock(&hb->lock);
+			ret = handle_early_requeue_pi_wakeup(hb, &q, to);
+			spin_unlock(&hb->lock);
+		}
 		break;
 
 	case Q_REQUEUE_PI_LOCKED:
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 25877d4f2f8f3..6cf10701294b4 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -339,18 +339,8 @@ static long futex_wait_restart(struct restart_block *restart);
  * @q:		the futex_q to queue up on
  * @timeout:	the prepared hrtimer_sleeper, or null for no timeout
  */
-void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
-			    struct hrtimer_sleeper *timeout)
+void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout)
 {
-	/*
-	 * The task state is guaranteed to be set before another task can
-	 * wake it. set_current_state() is implemented using smp_store_mb() and
-	 * futex_queue() calls spin_unlock() upon completion, both serializing
-	 * access to the hash list and forcing another memory barrier.
-	 */
-	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
-	futex_queue(q, hb, current);
-
 	/* Arm the timer */
 	if (timeout)
 		hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
@@ -578,7 +568,8 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
  * @val:	the expected value
  * @flags:	futex flags (FLAGS_SHARED, etc.)
  * @q:		the associated futex_q
- * @hb:		storage for hash_bucket pointer to be returned to caller
+ * @key2:	the second futex_key if used for requeue PI
+ * @task:	Task queueing this futex
  *
  * Setup the futex_q and locate the hash_bucket.  Get the futex value and
  * compare it with the expected value.  Handle atomic faults internally.
@@ -589,8 +580,10 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
  *  - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
  */
 int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
-		     struct futex_q *q, struct futex_hash_bucket **hb)
+		     struct futex_q *q, union futex_key *key2,
+		     struct task_struct *task)
 {
+	struct futex_hash_bucket *hb;
 	u32 uval;
 	int ret;
 
@@ -618,12 +611,12 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 		return ret;
 
 retry_private:
-	*hb = futex_q_lock(q);
+	hb = futex_q_lock(q);
 
 	ret = futex_get_value_locked(&uval, uaddr);
 
 	if (ret) {
-		futex_q_unlock(*hb);
+		futex_q_unlock(hb);
 
 		ret = get_user(uval, uaddr);
 		if (ret)
@@ -636,10 +629,25 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 	}
 
 	if (uval != val) {
-		futex_q_unlock(*hb);
-		ret = -EWOULDBLOCK;
+		futex_q_unlock(hb);
+		return -EWOULDBLOCK;
 	}
 
+	if (key2 && futex_match(&q->key, key2)) {
+		futex_q_unlock(hb);
+		return -EINVAL;
+	}
+
+	/*
+	 * The task state is guaranteed to be set before another task can
+	 * wake it. set_current_state() is implemented using smp_store_mb() and
+	 * futex_queue() calls spin_unlock() upon completion, both serializing
+	 * access to the hash list and forcing another memory barrier.
+	 */
+	if (task == current)
+		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+	futex_queue(q, hb, task);
+
 	return ret;
 }
 
@@ -647,7 +655,6 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		 struct hrtimer_sleeper *to, u32 bitset)
 {
 	struct futex_q q = futex_q_init;
-	struct futex_hash_bucket *hb;
 	int ret;
 
 	if (!bitset)
@@ -660,12 +667,12 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 	 * Prepare to wait on uaddr. On success, it holds hb->lock and q
 	 * is initialized.
 	 */
-	ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+	ret = futex_wait_setup(uaddr, val, flags, &q, NULL, current);
 	if (ret)
 		return ret;
 
 	/* futex_queue and wait for wakeup, timeout, or a signal. */
-	futex_wait_queue(hb, &q, to);
+	futex_do_wait(&q, to);
 
 	/* If we were woken (and unqueued), we succeeded, whatever. */
 	if (!futex_unqueue(&q))
-- 
2.47.2



* [PATCH v10 03/21] futex: Pull futex_hash() out of futex_q_lock()
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support for task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 02/21] futex: Move futex_queue() into futex_wait_setup() Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 04/21] futex: Create hb scopes Sebastian Andrzej Siewior
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c     | 7 +------
 kernel/futex/futex.h    | 2 +-
 kernel/futex/pi.c       | 3 ++-
 kernel/futex/waitwake.c | 6 ++++--
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cca15859a50be..7adc914878933 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -502,13 +502,9 @@ void __futex_unqueue(struct futex_q *q)
 }
 
 /* The key must be already stored in q->key. */
-struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
+void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb)
 	__acquires(&hb->lock)
 {
-	struct futex_hash_bucket *hb;
-
-	hb = futex_hash(&q->key);
-
 	/*
 	 * Increment the counter before taking the lock so that
 	 * a potential waker won't miss a to-be-slept task that is
@@ -522,7 +518,6 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock);
-	return hb;
 }
 
 void futex_q_unlock(struct futex_hash_bucket *hb)
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 16aafd0113442..a219903e52084 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -354,7 +354,7 @@ static inline int futex_hb_waiters_pending(struct futex_hash_bucket *hb)
 #endif
 }
 
-extern struct futex_hash_bucket *futex_q_lock(struct futex_q *q);
+extern void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb);
 extern void futex_q_unlock(struct futex_hash_bucket *hb);
 
 
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 7a941845f7eee..3bf942e9400ac 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -939,7 +939,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		goto out;
 
 retry_private:
-	hb = futex_q_lock(&q);
+	hb = futex_hash(&q.key);
+	futex_q_lock(&q, hb);
 
 	ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
 				   &exiting, 0);
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 6cf10701294b4..1108f373fd315 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -441,7 +441,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 		struct futex_q *q = &vs[i].q;
 		u32 val = vs[i].w.val;
 
-		hb = futex_q_lock(q);
+		hb = futex_hash(&q->key);
+		futex_q_lock(q, hb);
 		ret = futex_get_value_locked(&uval, uaddr);
 
 		if (!ret && uval == val) {
@@ -611,7 +612,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 		return ret;
 
 retry_private:
-	hb = futex_q_lock(q);
+	hb = futex_hash(&q->key);
+	futex_q_lock(q, hb);
 
 	ret = futex_get_value_locked(&uval, uaddr);
 
-- 
2.47.2



* [PATCH v10 04/21] futex: Create hb scopes
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support for task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 03/21] futex: Pull futex_hash() out of futex_q_lock() Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 05/21] futex: Create futex_hash() get/put class Sebastian Andrzej Siewior
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

Create explicit scopes for hb variables; almost pure re-indent.
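
The shape of the change in miniature; the "if (1)" exists only to open a
scope, which a later patch replaces with a scoped get/put class:

	if (1) {
		struct futex_hash_bucket *hb;

		hb = futex_hash(&key);
		spin_lock(&hb->lock);
		/* ... work against the bucket ... */
		spin_unlock(&hb->lock);
	}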

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c     |  81 ++++----
 kernel/futex/pi.c       | 282 +++++++++++++-------------
 kernel/futex/requeue.c  | 433 ++++++++++++++++++++--------------------
 kernel/futex/waitwake.c | 193 +++++++++---------
 4 files changed, 504 insertions(+), 485 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 7adc914878933..e4cb5ce9785b1 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -944,7 +944,6 @@ static void exit_pi_state_list(struct task_struct *curr)
 {
 	struct list_head *next, *head = &curr->pi_state_list;
 	struct futex_pi_state *pi_state;
-	struct futex_hash_bucket *hb;
 	union futex_key key = FUTEX_KEY_INIT;
 
 	/*
@@ -957,50 +956,54 @@ static void exit_pi_state_list(struct task_struct *curr)
 		next = head->next;
 		pi_state = list_entry(next, struct futex_pi_state, list);
 		key = pi_state->key;
-		hb = futex_hash(&key);
+		if (1) {
+			struct futex_hash_bucket *hb;
 
-		/*
-		 * We can race against put_pi_state() removing itself from the
-		 * list (a waiter going away). put_pi_state() will first
-		 * decrement the reference count and then modify the list, so
-		 * its possible to see the list entry but fail this reference
-		 * acquire.
-		 *
-		 * In that case; drop the locks to let put_pi_state() make
-		 * progress and retry the loop.
-		 */
-		if (!refcount_inc_not_zero(&pi_state->refcount)) {
+			hb = futex_hash(&key);
+
+			/*
+			 * We can race against put_pi_state() removing itself from the
+			 * list (a waiter going away). put_pi_state() will first
+			 * decrement the reference count and then modify the list, so
+			 * its possible to see the list entry but fail this reference
+			 * acquire.
+			 *
+			 * In that case; drop the locks to let put_pi_state() make
+			 * progress and retry the loop.
+			 */
+			if (!refcount_inc_not_zero(&pi_state->refcount)) {
+				raw_spin_unlock_irq(&curr->pi_lock);
+				cpu_relax();
+				raw_spin_lock_irq(&curr->pi_lock);
+				continue;
+			}
 			raw_spin_unlock_irq(&curr->pi_lock);
-			cpu_relax();
-			raw_spin_lock_irq(&curr->pi_lock);
-			continue;
-		}
-		raw_spin_unlock_irq(&curr->pi_lock);
 
-		spin_lock(&hb->lock);
-		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
-		raw_spin_lock(&curr->pi_lock);
-		/*
-		 * We dropped the pi-lock, so re-check whether this
-		 * task still owns the PI-state:
-		 */
-		if (head->next != next) {
-			/* retain curr->pi_lock for the loop invariant */
-			raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+			spin_lock(&hb->lock);
+			raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+			raw_spin_lock(&curr->pi_lock);
+			/*
+			 * We dropped the pi-lock, so re-check whether this
+			 * task still owns the PI-state:
+			 */
+			if (head->next != next) {
+				/* retain curr->pi_lock for the loop invariant */
+				raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+				spin_unlock(&hb->lock);
+				put_pi_state(pi_state);
+				continue;
+			}
+
+			WARN_ON(pi_state->owner != curr);
+			WARN_ON(list_empty(&pi_state->list));
+			list_del_init(&pi_state->list);
+			pi_state->owner = NULL;
+
+			raw_spin_unlock(&curr->pi_lock);
+			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 			spin_unlock(&hb->lock);
-			put_pi_state(pi_state);
-			continue;
 		}
 
-		WARN_ON(pi_state->owner != curr);
-		WARN_ON(list_empty(&pi_state->list));
-		list_del_init(&pi_state->list);
-		pi_state->owner = NULL;
-
-		raw_spin_unlock(&curr->pi_lock);
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		spin_unlock(&hb->lock);
-
 		rt_mutex_futex_unlock(&pi_state->pi_mutex);
 		put_pi_state(pi_state);
 
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 3bf942e9400ac..62ce5ecaeddd6 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -920,7 +920,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	struct hrtimer_sleeper timeout, *to;
 	struct task_struct *exiting = NULL;
 	struct rt_mutex_waiter rt_waiter;
-	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	DEFINE_WAKE_Q(wake_q);
 	int res, ret;
@@ -939,152 +938,169 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		goto out;
 
 retry_private:
-	hb = futex_hash(&q.key);
-	futex_q_lock(&q, hb);
+	if (1) {
+		struct futex_hash_bucket *hb;
 
-	ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
-				   &exiting, 0);
-	if (unlikely(ret)) {
-		/*
-		 * Atomic work succeeded and we got the lock,
-		 * or failed. Either way, we do _not_ block.
-		 */
-		switch (ret) {
-		case 1:
-			/* We got the lock. */
-			ret = 0;
-			goto out_unlock_put_key;
-		case -EFAULT:
-			goto uaddr_faulted;
-		case -EBUSY:
-		case -EAGAIN:
+		hb = futex_hash(&q.key);
+		futex_q_lock(&q, hb);
+
+		ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
+					   &exiting, 0);
+		if (unlikely(ret)) {
 			/*
-			 * Two reasons for this:
-			 * - EBUSY: Task is exiting and we just wait for the
-			 *   exit to complete.
-			 * - EAGAIN: The user space value changed.
+			 * Atomic work succeeded and we got the lock,
+			 * or failed. Either way, we do _not_ block.
 			 */
-			futex_q_unlock(hb);
-			/*
-			 * Handle the case where the owner is in the middle of
-			 * exiting. Wait for the exit to complete otherwise
-			 * this task might loop forever, aka. live lock.
-			 */
-			wait_for_owner_exiting(ret, exiting);
-			cond_resched();
-			goto retry;
-		default:
-			goto out_unlock_put_key;
+			switch (ret) {
+			case 1:
+				/* We got the lock. */
+				ret = 0;
+				goto out_unlock_put_key;
+			case -EFAULT:
+				goto uaddr_faulted;
+			case -EBUSY:
+			case -EAGAIN:
+				/*
+				 * Two reasons for this:
+				 * - EBUSY: Task is exiting and we just wait for the
+				 *   exit to complete.
+				 * - EAGAIN: The user space value changed.
+				 */
+				futex_q_unlock(hb);
+				/*
+				 * Handle the case where the owner is in the middle of
+				 * exiting. Wait for the exit to complete otherwise
+				 * this task might loop forever, aka. live lock.
+				 */
+				wait_for_owner_exiting(ret, exiting);
+				cond_resched();
+				goto retry;
+			default:
+				goto out_unlock_put_key;
+			}
 		}
-	}
 
-	WARN_ON(!q.pi_state);
+		WARN_ON(!q.pi_state);
 
-	/*
-	 * Only actually queue now that the atomic ops are done:
-	 */
-	__futex_queue(&q, hb, current);
+		/*
+		 * Only actually queue now that the atomic ops are done:
+		 */
+		__futex_queue(&q, hb, current);
 
-	if (trylock) {
-		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
-		/* Fixup the trylock return value: */
-		ret = ret ? 0 : -EWOULDBLOCK;
-		goto no_block;
-	}
+		if (trylock) {
+			ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
+			/* Fixup the trylock return value: */
+			ret = ret ? 0 : -EWOULDBLOCK;
+			goto no_block;
+		}
 
-	/*
-	 * Must be done before we enqueue the waiter, here is unfortunately
-	 * under the hb lock, but that *should* work because it does nothing.
-	 */
-	rt_mutex_pre_schedule();
+		/*
+		 * Must be done before we enqueue the waiter, here is unfortunately
+		 * under the hb lock, but that *should* work because it does nothing.
+		 */
+		rt_mutex_pre_schedule();
 
-	rt_mutex_init_waiter(&rt_waiter);
+		rt_mutex_init_waiter(&rt_waiter);
 
-	/*
-	 * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
-	 * hold it while doing rt_mutex_start_proxy(), because then it will
-	 * include hb->lock in the blocking chain, even through we'll not in
-	 * fact hold it while blocking. This will lead it to report -EDEADLK
-	 * and BUG when futex_unlock_pi() interleaves with this.
-	 *
-	 * Therefore acquire wait_lock while holding hb->lock, but drop the
-	 * latter before calling __rt_mutex_start_proxy_lock(). This
-	 * interleaves with futex_unlock_pi() -- which does a similar lock
-	 * handoff -- such that the latter can observe the futex_q::pi_state
-	 * before __rt_mutex_start_proxy_lock() is done.
-	 */
-	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
-	spin_unlock(q.lock_ptr);
-	/*
-	 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
-	 * such that futex_unlock_pi() is guaranteed to observe the waiter when
-	 * it sees the futex_q::pi_state.
-	 */
-	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
-	raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q);
+		/*
+		 * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
+		 * hold it while doing rt_mutex_start_proxy(), because then it will
+		 * include hb->lock in the blocking chain, even though we'll not in
+		 * fact hold it while blocking. This will lead it to report -EDEADLK
+		 * and BUG when futex_unlock_pi() interleaves with this.
+		 *
+		 * Therefore acquire wait_lock while holding hb->lock, but drop the
+		 * latter before calling __rt_mutex_start_proxy_lock(). This
+		 * interleaves with futex_unlock_pi() -- which does a similar lock
+		 * handoff -- such that the latter can observe the futex_q::pi_state
+		 * before __rt_mutex_start_proxy_lock() is done.
+		 */
+		raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
+		spin_unlock(q.lock_ptr);
+		/*
+		 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+		 * such that futex_unlock_pi() is guaranteed to observe the waiter when
+		 * it sees the futex_q::pi_state.
+		 */
+		ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
+		raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q);
 
-	if (ret) {
-		if (ret == 1)
-			ret = 0;
-		goto cleanup;
-	}
+		if (ret) {
+			if (ret == 1)
+				ret = 0;
+			goto cleanup;
+		}
 
-	if (unlikely(to))
-		hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
+		if (unlikely(to))
+			hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
 
-	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+		ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
 cleanup:
-	/*
-	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
-	 * must unwind the above, however we canont lock hb->lock because
-	 * rt_mutex already has a waiter enqueued and hb->lock can itself try
-	 * and enqueue an rt_waiter through rtlock.
-	 *
-	 * Doing the cleanup without holding hb->lock can cause inconsistent
-	 * state between hb and pi_state, but only in the direction of not
-	 * seeing a waiter that is leaving.
-	 *
-	 * See futex_unlock_pi(), it deals with this inconsistency.
-	 *
-	 * There be dragons here, since we must deal with the inconsistency on
-	 * the way out (here), it is impossible to detect/warn about the race
-	 * the other way around (missing an incoming waiter).
-	 *
-	 * What could possibly go wrong...
-	 */
-	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
-		ret = 0;
+		/*
+		 * If we failed to acquire the lock (deadlock/signal/timeout), we
+		 * must unwind the above, however we cannot lock hb->lock because
+		 * rt_mutex already has a waiter enqueued and hb->lock can itself try
+		 * and enqueue an rt_waiter through rtlock.
+		 *
+		 * Doing the cleanup without holding hb->lock can cause inconsistent
+		 * state between hb and pi_state, but only in the direction of not
+		 * seeing a waiter that is leaving.
+		 *
+		 * See futex_unlock_pi(), it deals with this inconsistency.
+		 *
+		 * There be dragons here, since we must deal with the inconsistency on
+		 * the way out (here), it is impossible to detect/warn about the race
+		 * the other way around (missing an incoming waiter).
+		 *
+		 * What could possibly go wrong...
+		 */
+		if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
+			ret = 0;
 
-	/*
-	 * Now that the rt_waiter has been dequeued, it is safe to use
-	 * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
-	 * the
-	 */
-	spin_lock(q.lock_ptr);
-	/*
-	 * Waiter is unqueued.
-	 */
-	rt_mutex_post_schedule();
+		/*
+		 * Now that the rt_waiter has been dequeued, it is safe to use
+		 * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
+		 * the
+		 */
+		spin_lock(q.lock_ptr);
+		/*
+		 * Waiter is unqueued.
+		 */
+		rt_mutex_post_schedule();
 no_block:
-	/*
-	 * Fixup the pi_state owner and possibly acquire the lock if we
-	 * haven't already.
-	 */
-	res = fixup_pi_owner(uaddr, &q, !ret);
-	/*
-	 * If fixup_pi_owner() returned an error, propagate that.  If it acquired
-	 * the lock, clear our -ETIMEDOUT or -EINTR.
-	 */
-	if (res)
-		ret = (res < 0) ? res : 0;
+		/*
+		 * Fixup the pi_state owner and possibly acquire the lock if we
+		 * haven't already.
+		 */
+		res = fixup_pi_owner(uaddr, &q, !ret);
+		/*
+		 * If fixup_pi_owner() returned an error, propagate that.  If it acquired
+		 * the lock, clear our -ETIMEDOUT or -EINTR.
+		 */
+		if (res)
+			ret = (res < 0) ? res : 0;
 
-	futex_unqueue_pi(&q);
-	spin_unlock(q.lock_ptr);
-	goto out;
+		futex_unqueue_pi(&q);
+		spin_unlock(q.lock_ptr);
+		goto out;
 
 out_unlock_put_key:
-	futex_q_unlock(hb);
+		futex_q_unlock(hb);
+		goto out;
+
+uaddr_faulted:
+		futex_q_unlock(hb);
+
+		ret = fault_in_user_writeable(uaddr);
+		if (ret)
+			goto out;
+
+		if (!(flags & FLAGS_SHARED))
+			goto retry_private;
+
+		goto retry;
+	}
 
 out:
 	if (to) {
@@ -1092,18 +1108,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		destroy_hrtimer_on_stack(&to->timer);
 	}
 	return ret != -EINTR ? ret : -ERESTARTNOINTR;
-
-uaddr_faulted:
-	futex_q_unlock(hb);
-
-	ret = fault_in_user_writeable(uaddr);
-	if (ret)
-		goto out;
-
-	if (!(flags & FLAGS_SHARED))
-		goto retry_private;
-
-	goto retry;
 }
 
 /*
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 0e55975af515c..209794cad6f2f 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -371,7 +371,6 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
 	int task_count = 0, ret;
 	struct futex_pi_state *pi_state = NULL;
-	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
 	DEFINE_WAKE_Q(wake_q);
 
@@ -443,240 +442,244 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 	if (requeue_pi && futex_match(&key1, &key2))
 		return -EINVAL;
 
-	hb1 = futex_hash(&key1);
-	hb2 = futex_hash(&key2);
-
 retry_private:
-	futex_hb_waiters_inc(hb2);
-	double_lock_hb(hb1, hb2);
+	if (1) {
+		struct futex_hash_bucket *hb1, *hb2;
 
-	if (likely(cmpval != NULL)) {
-		u32 curval;
+		hb1 = futex_hash(&key1);
+		hb2 = futex_hash(&key2);
 
-		ret = futex_get_value_locked(&curval, uaddr1);
+		futex_hb_waiters_inc(hb2);
+		double_lock_hb(hb1, hb2);
 
-		if (unlikely(ret)) {
-			double_unlock_hb(hb1, hb2);
-			futex_hb_waiters_dec(hb2);
+		if (likely(cmpval != NULL)) {
+			u32 curval;
 
-			ret = get_user(curval, uaddr1);
-			if (ret)
-				return ret;
+			ret = futex_get_value_locked(&curval, uaddr1);
 
-			if (!(flags1 & FLAGS_SHARED))
-				goto retry_private;
+			if (unlikely(ret)) {
+				double_unlock_hb(hb1, hb2);
+				futex_hb_waiters_dec(hb2);
 
-			goto retry;
-		}
-		if (curval != *cmpval) {
-			ret = -EAGAIN;
-			goto out_unlock;
-		}
-	}
+				ret = get_user(curval, uaddr1);
+				if (ret)
+					return ret;
 
-	if (requeue_pi) {
-		struct task_struct *exiting = NULL;
+				if (!(flags1 & FLAGS_SHARED))
+					goto retry_private;
 
-		/*
-		 * Attempt to acquire uaddr2 and wake the top waiter. If we
-		 * intend to requeue waiters, force setting the FUTEX_WAITERS
-		 * bit.  We force this here where we are able to easily handle
-		 * faults rather in the requeue loop below.
-		 *
-		 * Updates topwaiter::requeue_state if a top waiter exists.
-		 */
-		ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
-						 &key2, &pi_state,
-						 &exiting, nr_requeue);
-
-		/*
-		 * At this point the top_waiter has either taken uaddr2 or
-		 * is waiting on it. In both cases pi_state has been
-		 * established and an initial refcount on it. In case of an
-		 * error there's nothing.
-		 *
-		 * The top waiter's requeue_state is up to date:
-		 *
-		 *  - If the lock was acquired atomically (ret == 1), then
-		 *    the state is Q_REQUEUE_PI_LOCKED.
-		 *
-		 *    The top waiter has been dequeued and woken up and can
-		 *    return to user space immediately. The kernel/user
-		 *    space state is consistent. In case that there must be
-		 *    more waiters requeued the WAITERS bit in the user
-		 *    space futex is set so the top waiter task has to go
-		 *    into the syscall slowpath to unlock the futex. This
-		 *    will block until this requeue operation has been
-		 *    completed and the hash bucket locks have been
-		 *    dropped.
-		 *
-		 *  - If the trylock failed with an error (ret < 0) then
-		 *    the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
-		 *    happened", or Q_REQUEUE_PI_IGNORE when there was an
-		 *    interleaved early wakeup.
-		 *
-		 *  - If the trylock did not succeed (ret == 0) then the
-		 *    state is either Q_REQUEUE_PI_IN_PROGRESS or
-		 *    Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
-		 *    This will be cleaned up in the loop below, which
-		 *    cannot fail because futex_proxy_trylock_atomic() did
-		 *    the same sanity checks for requeue_pi as the loop
-		 *    below does.
-		 */
-		switch (ret) {
-		case 0:
-			/* We hold a reference on the pi state. */
-			break;
-
-		case 1:
-			/*
-			 * futex_proxy_trylock_atomic() acquired the user space
-			 * futex. Adjust task_count.
-			 */
-			task_count++;
-			ret = 0;
-			break;
-
-		/*
-		 * If the above failed, then pi_state is NULL and
-		 * waiter::requeue_state is correct.
-		 */
-		case -EFAULT:
-			double_unlock_hb(hb1, hb2);
-			futex_hb_waiters_dec(hb2);
-			ret = fault_in_user_writeable(uaddr2);
-			if (!ret)
 				goto retry;
-			return ret;
-		case -EBUSY:
-		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - EBUSY: Owner is exiting and we just wait for the
-			 *   exit to complete.
-			 * - EAGAIN: The user space value changed.
-			 */
-			double_unlock_hb(hb1, hb2);
-			futex_hb_waiters_dec(hb2);
-			/*
-			 * Handle the case where the owner is in the middle of
-			 * exiting. Wait for the exit to complete otherwise
-			 * this task might loop forever, aka. live lock.
-			 */
-			wait_for_owner_exiting(ret, exiting);
-			cond_resched();
-			goto retry;
-		default:
-			goto out_unlock;
-		}
-	}
-
-	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
-		if (task_count - nr_wake >= nr_requeue)
-			break;
-
-		if (!futex_match(&this->key, &key1))
-			continue;
-
-		/*
-		 * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
-		 * be paired with each other and no other futex ops.
-		 *
-		 * We should never be requeueing a futex_q with a pi_state,
-		 * which is awaiting a futex_unlock_pi().
-		 */
-		if ((requeue_pi && !this->rt_waiter) ||
-		    (!requeue_pi && this->rt_waiter) ||
-		    this->pi_state) {
-			ret = -EINVAL;
-			break;
+			}
+			if (curval != *cmpval) {
+				ret = -EAGAIN;
+				goto out_unlock;
+			}
 		}
 
-		/* Plain futexes just wake or requeue and are done */
-		if (!requeue_pi) {
-			if (++task_count <= nr_wake)
-				this->wake(&wake_q, this);
-			else
+		if (requeue_pi) {
+			struct task_struct *exiting = NULL;
+
+			/*
+			 * Attempt to acquire uaddr2 and wake the top waiter. If we
+			 * intend to requeue waiters, force setting the FUTEX_WAITERS
+			 * bit.  We force this here where we are able to easily handle
+			 * faults rather in the requeue loop below.
+			 *
+			 * Updates topwaiter::requeue_state if a top waiter exists.
+			 */
+			ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
+							 &key2, &pi_state,
+							 &exiting, nr_requeue);
+
+			/*
+			 * At this point the top_waiter has either taken uaddr2 or
+			 * is waiting on it. In both cases pi_state has been
+			 * established and an initial refcount on it. In case of an
+			 * error there's nothing.
+			 *
+			 * The top waiter's requeue_state is up to date:
+			 *
+			 *  - If the lock was acquired atomically (ret == 1), then
+			 *    the state is Q_REQUEUE_PI_LOCKED.
+			 *
+			 *    The top waiter has been dequeued and woken up and can
+			 *    return to user space immediately. The kernel/user
+			 *    space state is consistent. In case that there must be
+			 *    more waiters requeued the WAITERS bit in the user
+			 *    space futex is set so the top waiter task has to go
+			 *    into the syscall slowpath to unlock the futex. This
+			 *    will block until this requeue operation has been
+			 *    completed and the hash bucket locks have been
+			 *    dropped.
+			 *
+			 *  - If the trylock failed with an error (ret < 0) then
+			 *    the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
+			 *    happened", or Q_REQUEUE_PI_IGNORE when there was an
+			 *    interleaved early wakeup.
+			 *
+			 *  - If the trylock did not succeed (ret == 0) then the
+			 *    state is either Q_REQUEUE_PI_IN_PROGRESS or
+			 *    Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
+			 *    This will be cleaned up in the loop below, which
+			 *    cannot fail because futex_proxy_trylock_atomic() did
+			 *    the same sanity checks for requeue_pi as the loop
+			 *    below does.
+			 */
+			switch (ret) {
+			case 0:
+				/* We hold a reference on the pi state. */
+				break;
+
+			case 1:
+				/*
+				 * futex_proxy_trylock_atomic() acquired the user space
+				 * futex. Adjust task_count.
+				 */
+				task_count++;
+				ret = 0;
+				break;
+
+				/*
+				 * If the above failed, then pi_state is NULL and
+				 * waiter::requeue_state is correct.
+				 */
+			case -EFAULT:
+				double_unlock_hb(hb1, hb2);
+				futex_hb_waiters_dec(hb2);
+				ret = fault_in_user_writeable(uaddr2);
+				if (!ret)
+					goto retry;
+				return ret;
+			case -EBUSY:
+			case -EAGAIN:
+				/*
+				 * Two reasons for this:
+				 * - EBUSY: Owner is exiting and we just wait for the
+				 *   exit to complete.
+				 * - EAGAIN: The user space value changed.
+				 */
+				double_unlock_hb(hb1, hb2);
+				futex_hb_waiters_dec(hb2);
+				/*
+				 * Handle the case where the owner is in the middle of
+				 * exiting. Wait for the exit to complete otherwise
+				 * this task might loop forever, aka. live lock.
+				 */
+				wait_for_owner_exiting(ret, exiting);
+				cond_resched();
+				goto retry;
+			default:
+				goto out_unlock;
+			}
+		}
+
+		plist_for_each_entry_safe(this, next, &hb1->chain, list) {
+			if (task_count - nr_wake >= nr_requeue)
+				break;
+
+			if (!futex_match(&this->key, &key1))
+				continue;
+
+			/*
+			 * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
+			 * be paired with each other and no other futex ops.
+			 *
+			 * We should never be requeueing a futex_q with a pi_state,
+			 * which is awaiting a futex_unlock_pi().
+			 */
+			if ((requeue_pi && !this->rt_waiter) ||
+			    (!requeue_pi && this->rt_waiter) ||
+			    this->pi_state) {
+				ret = -EINVAL;
+				break;
+			}
+
+			/* Plain futexes just wake or requeue and are done */
+			if (!requeue_pi) {
+				if (++task_count <= nr_wake)
+					this->wake(&wake_q, this);
+				else
+					requeue_futex(this, hb1, hb2, &key2);
+				continue;
+			}
+
+			/* Ensure we requeue to the expected futex for requeue_pi. */
+			if (!futex_match(this->requeue_pi_key, &key2)) {
+				ret = -EINVAL;
+				break;
+			}
+
+			/*
+			 * Requeue nr_requeue waiters and possibly one more in the case
+			 * of requeue_pi if we couldn't acquire the lock atomically.
+			 *
+			 * Prepare the waiter to take the rt_mutex. Take a refcount
+			 * on the pi_state and store the pointer in the futex_q
+			 * object of the waiter.
+			 */
+			get_pi_state(pi_state);
+
+			/* Don't requeue when the waiter is already on the way out. */
+			if (!futex_requeue_pi_prepare(this, pi_state)) {
+				/*
+				 * Early woken waiter signaled that it is on the
+				 * way out. Drop the pi_state reference and try the
+				 * next waiter. @this->pi_state is still NULL.
+				 */
+				put_pi_state(pi_state);
+				continue;
+			}
+
+			ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
+							this->rt_waiter,
+							this->task);
+
+			if (ret == 1) {
+				/*
+				 * We got the lock. We do neither drop the refcount
+				 * on pi_state nor clear this->pi_state because the
+				 * waiter needs the pi_state for cleaning up the
+				 * user space value. It will drop the refcount
+				 * after doing so. this::requeue_state is updated
+				 * in the wakeup as well.
+				 */
+				requeue_pi_wake_futex(this, &key2, hb2);
+				task_count++;
+			} else if (!ret) {
+				/* Waiter is queued, move it to hb2 */
 				requeue_futex(this, hb1, hb2, &key2);
-			continue;
-		}
-
-		/* Ensure we requeue to the expected futex for requeue_pi. */
-		if (!futex_match(this->requeue_pi_key, &key2)) {
-			ret = -EINVAL;
-			break;
+				futex_requeue_pi_complete(this, 0);
+				task_count++;
+			} else {
+				/*
+				 * rt_mutex_start_proxy_lock() detected a potential
+				 * deadlock when we tried to queue that waiter.
+				 * Drop the pi_state reference which we took above
+				 * and remove the pointer to the state from the
+				 * waiters futex_q object.
+				 */
+				this->pi_state = NULL;
+				put_pi_state(pi_state);
+				futex_requeue_pi_complete(this, ret);
+				/*
+				 * We stop queueing more waiters and let user space
+				 * deal with the mess.
+				 */
+				break;
+			}
 		}
 
 		/*
-		 * Requeue nr_requeue waiters and possibly one more in the case
-		 * of requeue_pi if we couldn't acquire the lock atomically.
-		 *
-		 * Prepare the waiter to take the rt_mutex. Take a refcount
-		 * on the pi_state and store the pointer in the futex_q
-		 * object of the waiter.
+		 * We took an extra initial reference to the pi_state in
+		 * futex_proxy_trylock_atomic(). We need to drop it here again.
 		 */
-		get_pi_state(pi_state);
-
-		/* Don't requeue when the waiter is already on the way out. */
-		if (!futex_requeue_pi_prepare(this, pi_state)) {
-			/*
-			 * Early woken waiter signaled that it is on the
-			 * way out. Drop the pi_state reference and try the
-			 * next waiter. @this->pi_state is still NULL.
-			 */
-			put_pi_state(pi_state);
-			continue;
-		}
-
-		ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
-						this->rt_waiter,
-						this->task);
-
-		if (ret == 1) {
-			/*
-			 * We got the lock. We do neither drop the refcount
-			 * on pi_state nor clear this->pi_state because the
-			 * waiter needs the pi_state for cleaning up the
-			 * user space value. It will drop the refcount
-			 * after doing so. this::requeue_state is updated
-			 * in the wakeup as well.
-			 */
-			requeue_pi_wake_futex(this, &key2, hb2);
-			task_count++;
-		} else if (!ret) {
-			/* Waiter is queued, move it to hb2 */
-			requeue_futex(this, hb1, hb2, &key2);
-			futex_requeue_pi_complete(this, 0);
-			task_count++;
-		} else {
-			/*
-			 * rt_mutex_start_proxy_lock() detected a potential
-			 * deadlock when we tried to queue that waiter.
-			 * Drop the pi_state reference which we took above
-			 * and remove the pointer to the state from the
-			 * waiters futex_q object.
-			 */
-			this->pi_state = NULL;
-			put_pi_state(pi_state);
-			futex_requeue_pi_complete(this, ret);
-			/*
-			 * We stop queueing more waiters and let user space
-			 * deal with the mess.
-			 */
-			break;
-		}
-	}
-
-	/*
-	 * We took an extra initial reference to the pi_state in
-	 * futex_proxy_trylock_atomic(). We need to drop it here again.
-	 */
-	put_pi_state(pi_state);
+		put_pi_state(pi_state);
 
 out_unlock:
-	double_unlock_hb(hb1, hb2);
+		double_unlock_hb(hb1, hb2);
+		futex_hb_waiters_dec(hb2);
+	}
 	wake_up_q(&wake_q);
-	futex_hb_waiters_dec(hb2);
 	return ret ? ret : task_count;
 }
 
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 1108f373fd315..4bf839c85b66c 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -253,7 +253,6 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 		  int nr_wake, int nr_wake2, int op)
 {
 	union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
-	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
 	int ret, op_ret;
 	DEFINE_WAKE_Q(wake_q);
@@ -266,67 +265,71 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 	if (unlikely(ret != 0))
 		return ret;
 
-	hb1 = futex_hash(&key1);
-	hb2 = futex_hash(&key2);
-
 retry_private:
-	double_lock_hb(hb1, hb2);
-	op_ret = futex_atomic_op_inuser(op, uaddr2);
-	if (unlikely(op_ret < 0)) {
-		double_unlock_hb(hb1, hb2);
+	if (1) {
+		struct futex_hash_bucket *hb1, *hb2;
 
-		if (!IS_ENABLED(CONFIG_MMU) ||
-		    unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
-			/*
-			 * we don't get EFAULT from MMU faults if we don't have
-			 * an MMU, but we might get them from range checking
-			 */
-			ret = op_ret;
-			return ret;
-		}
+		hb1 = futex_hash(&key1);
+		hb2 = futex_hash(&key2);
 
-		if (op_ret == -EFAULT) {
-			ret = fault_in_user_writeable(uaddr2);
-			if (ret)
+		double_lock_hb(hb1, hb2);
+		op_ret = futex_atomic_op_inuser(op, uaddr2);
+		if (unlikely(op_ret < 0)) {
+			double_unlock_hb(hb1, hb2);
+
+			if (!IS_ENABLED(CONFIG_MMU) ||
+			    unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
+				/*
+				 * we don't get EFAULT from MMU faults if we don't have
+				 * an MMU, but we might get them from range checking
+				 */
+				ret = op_ret;
 				return ret;
-		}
-
-		cond_resched();
-		if (!(flags & FLAGS_SHARED))
-			goto retry_private;
-		goto retry;
-	}
-
-	plist_for_each_entry_safe(this, next, &hb1->chain, list) {
-		if (futex_match (&this->key, &key1)) {
-			if (this->pi_state || this->rt_waiter) {
-				ret = -EINVAL;
-				goto out_unlock;
 			}
-			this->wake(&wake_q, this);
-			if (++ret >= nr_wake)
-				break;
-		}
-	}
 
-	if (op_ret > 0) {
-		op_ret = 0;
-		plist_for_each_entry_safe(this, next, &hb2->chain, list) {
-			if (futex_match (&this->key, &key2)) {
+			if (op_ret == -EFAULT) {
+				ret = fault_in_user_writeable(uaddr2);
+				if (ret)
+					return ret;
+			}
+
+			cond_resched();
+			if (!(flags & FLAGS_SHARED))
+				goto retry_private;
+			goto retry;
+		}
+
+		plist_for_each_entry_safe(this, next, &hb1->chain, list) {
+			if (futex_match (&this->key, &key1)) {
 				if (this->pi_state || this->rt_waiter) {
 					ret = -EINVAL;
 					goto out_unlock;
 				}
 				this->wake(&wake_q, this);
-				if (++op_ret >= nr_wake2)
+				if (++ret >= nr_wake)
 					break;
 			}
 		}
-		ret += op_ret;
-	}
+
+		if (op_ret > 0) {
+			op_ret = 0;
+			plist_for_each_entry_safe(this, next, &hb2->chain, list) {
+				if (futex_match (&this->key, &key2)) {
+					if (this->pi_state || this->rt_waiter) {
+						ret = -EINVAL;
+						goto out_unlock;
+					}
+					this->wake(&wake_q, this);
+					if (++op_ret >= nr_wake2)
+						break;
+				}
+			}
+			ret += op_ret;
+		}
 
 out_unlock:
-	double_unlock_hb(hb1, hb2);
+		double_unlock_hb(hb1, hb2);
+	}
 	wake_up_q(&wake_q);
 	return ret;
 }
@@ -402,7 +405,6 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
  */
 int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 {
-	struct futex_hash_bucket *hb;
 	bool retry = false;
 	int ret, i;
 	u32 uval;
@@ -441,21 +443,25 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 		struct futex_q *q = &vs[i].q;
 		u32 val = vs[i].w.val;
 
-		hb = futex_hash(&q->key);
-		futex_q_lock(q, hb);
-		ret = futex_get_value_locked(&uval, uaddr);
+		if (1) {
+			struct futex_hash_bucket *hb;
 
-		if (!ret && uval == val) {
-			/*
-			 * The bucket lock can't be held while dealing with the
-			 * next futex. Queue each futex at this moment so hb can
-			 * be unlocked.
-			 */
-			futex_queue(q, hb, current);
-			continue;
+			hb = futex_hash(&q->key);
+			futex_q_lock(q, hb);
+			ret = futex_get_value_locked(&uval, uaddr);
+
+			if (!ret && uval == val) {
+				/*
+				 * The bucket lock can't be held while dealing with the
+				 * next futex. Queue each futex at this moment so hb can
+				 * be unlocked.
+				 */
+				futex_queue(q, hb, current);
+				continue;
+			}
+
+			futex_q_unlock(hb);
 		}
-
-		futex_q_unlock(hb);
 		__set_current_state(TASK_RUNNING);
 
 		/*
@@ -584,7 +590,6 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 		     struct futex_q *q, union futex_key *key2,
 		     struct task_struct *task)
 {
-	struct futex_hash_bucket *hb;
 	u32 uval;
 	int ret;
 
@@ -612,44 +617,48 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 		return ret;
 
 retry_private:
-	hb = futex_hash(&q->key);
-	futex_q_lock(q, hb);
+	if (1) {
+		struct futex_hash_bucket *hb;
 
-	ret = futex_get_value_locked(&uval, uaddr);
+		hb = futex_hash(&q->key);
+		futex_q_lock(q, hb);
 
-	if (ret) {
-		futex_q_unlock(hb);
+		ret = futex_get_value_locked(&uval, uaddr);
 
-		ret = get_user(uval, uaddr);
-		if (ret)
-			return ret;
+		if (ret) {
+			futex_q_unlock(hb);
 
-		if (!(flags & FLAGS_SHARED))
-			goto retry_private;
+			ret = get_user(uval, uaddr);
+			if (ret)
+				return ret;
 
-		goto retry;
+			if (!(flags & FLAGS_SHARED))
+				goto retry_private;
+
+			goto retry;
+		}
+
+		if (uval != val) {
+			futex_q_unlock(hb);
+			return -EWOULDBLOCK;
+		}
+
+		if (key2 && futex_match(&q->key, key2)) {
+			futex_q_unlock(hb);
+			return -EINVAL;
+		}
+
+		/*
+		 * The task state is guaranteed to be set before another task can
+		 * wake it. set_current_state() is implemented using smp_store_mb() and
+		 * futex_queue() calls spin_unlock() upon completion, both serializing
+		 * access to the hash list and forcing another memory barrier.
+		 */
+		if (task == current)
+			set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+		futex_queue(q, hb, task);
 	}
 
-	if (uval != val) {
-		futex_q_unlock(hb);
-		return -EWOULDBLOCK;
-	}
-
-	if (key2 && futex_match(&q->key, key2)) {
-		futex_q_unlock(hb);
-		return -EINVAL;
-	}
-
-	/*
-	 * The task state is guaranteed to be set before another task can
-	 * wake it. set_current_state() is implemented using smp_store_mb() and
-	 * futex_queue() calls spin_unlock() upon completion, both serializing
-	 * access to the hash list and forcing another memory barrier.
-	 */
-	if (task == current)
-		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
-	futex_queue(q, hb, task);
-
 	return ret;
 }
 
-- 
2.47.2



* [PATCH v10 05/21] futex: Create futex_hash() get/put class
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support for task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 04/21] futex: Create hb scopes Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 06/21] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c     |  7 +++----
 kernel/futex/futex.h    |  8 +++++++-
 kernel/futex/pi.c       | 10 ++++++----
 kernel/futex/requeue.c  | 10 +++-------
 kernel/futex/waitwake.c | 15 +++++----------
 5 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index e4cb5ce9785b1..08cf54567aeb6 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -114,7 +114,7 @@ late_initcall(fail_futex_debugfs);
  * We hash on the keys returned from get_futex_key (see below) and return the
  * corresponding hash bucket in the global hash.
  */
-struct futex_hash_bucket *futex_hash(union futex_key *key)
+struct futex_hash_bucket *__futex_hash(union futex_key *key)
 {
 	u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
 			  key->both.offset);
@@ -122,6 +122,7 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
 	return &futex_queues[hash & futex_hashmask];
 }
 
+void futex_hash_put(struct futex_hash_bucket *hb) { }
 
 /**
  * futex_setup_timer - set up the sleeping hrtimer.
@@ -957,9 +958,7 @@ static void exit_pi_state_list(struct task_struct *curr)
 		pi_state = list_entry(next, struct futex_pi_state, list);
 		key = pi_state->key;
 		if (1) {
-			struct futex_hash_bucket *hb;
-
-			hb = futex_hash(&key);
+			CLASS(hb, hb)(&key);
 
 			/*
 			 * We can race against put_pi_state() removing itself from the
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index a219903e52084..eac6de6ed563a 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -7,6 +7,7 @@
 #include <linux/sched/wake_q.h>
 #include <linux/compat.h>
 #include <linux/uaccess.h>
+#include <linux/cleanup.h>
 
 #ifdef CONFIG_PREEMPT_RT
 #include <linux/rcuwait.h>
@@ -201,7 +202,12 @@ extern struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns);
 
-extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
+extern void futex_hash_put(struct futex_hash_bucket *hb);
+
+DEFINE_CLASS(hb, struct futex_hash_bucket *,
+	     if (_T) futex_hash_put(_T),
+	     __futex_hash(key), union futex_key *key);
 
 /**
  * futex_match - Check whether two futex keys are equal
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 62ce5ecaeddd6..4cee9ec5d97d6 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -939,9 +939,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 
 retry_private:
 	if (1) {
-		struct futex_hash_bucket *hb;
+		CLASS(hb, hb)(&q.key);
 
-		hb = futex_hash(&q.key);
 		futex_q_lock(&q, hb);
 
 		ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
@@ -1017,6 +1016,10 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		 */
 		raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
 		spin_unlock(q.lock_ptr);
+		/*
+		 * Caution; releasing @hb in-scope.
+		 */
+		futex_hash_put(no_free_ptr(hb));
 		/*
 		 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
 		 * such that futex_unlock_pi() is guaranteed to observe the waiter when
@@ -1119,7 +1122,6 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 {
 	u32 curval, uval, vpid = task_pid_vnr(current);
 	union futex_key key = FUTEX_KEY_INIT;
-	struct futex_hash_bucket *hb;
 	struct futex_q *top_waiter;
 	int ret;
 
@@ -1139,7 +1141,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	if (ret)
 		return ret;
 
-	hb = futex_hash(&key);
+	CLASS(hb, hb)(&key);
 	spin_lock(&hb->lock);
 retry_hb:
 
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 209794cad6f2f..992e3ce005c6f 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -444,10 +444,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 
 retry_private:
 	if (1) {
-		struct futex_hash_bucket *hb1, *hb2;
-
-		hb1 = futex_hash(&key1);
-		hb2 = futex_hash(&key2);
+		CLASS(hb, hb1)(&key1);
+		CLASS(hb, hb2)(&key2);
 
 		futex_hb_waiters_inc(hb2);
 		double_lock_hb(hb1, hb2);
@@ -817,9 +815,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	switch (futex_requeue_pi_wakeup_sync(&q)) {
 	case Q_REQUEUE_PI_IGNORE:
 		{
-			struct futex_hash_bucket *hb;
-
-			hb = futex_hash(&q.key);
+			CLASS(hb, hb)(&q.key);
 			/* The waiter is still on uaddr1 */
 			spin_lock(&hb->lock);
 			ret = handle_early_requeue_pi_wakeup(hb, &q, to);
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 4bf839c85b66c..44034dee7a48c 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -154,7 +154,6 @@ void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q)
  */
 int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 {
-	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
 	DEFINE_WAKE_Q(wake_q);
@@ -170,7 +169,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	if ((flags & FLAGS_STRICT) && !nr_wake)
 		return 0;
 
-	hb = futex_hash(&key);
+	CLASS(hb, hb)(&key);
 
 	/* Make sure we really have tasks to wakeup */
 	if (!futex_hb_waiters_pending(hb))
@@ -267,10 +266,8 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 
 retry_private:
 	if (1) {
-		struct futex_hash_bucket *hb1, *hb2;
-
-		hb1 = futex_hash(&key1);
-		hb2 = futex_hash(&key2);
+		CLASS(hb, hb1)(&key1);
+		CLASS(hb, hb2)(&key2);
 
 		double_lock_hb(hb1, hb2);
 		op_ret = futex_atomic_op_inuser(op, uaddr2);
@@ -444,9 +441,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 		u32 val = vs[i].w.val;
 
 		if (1) {
-			struct futex_hash_bucket *hb;
+			CLASS(hb, hb)(&q->key);
 
-			hb = futex_hash(&q->key);
 			futex_q_lock(q, hb);
 			ret = futex_get_value_locked(&uval, uaddr);
 
@@ -618,9 +614,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 
 retry_private:
 	if (1) {
-		struct futex_hash_bucket *hb;
+		CLASS(hb, hb)(&q->key);
 
-		hb = futex_hash(&q->key);
 		futex_q_lock(q, hb);
 
 		ret = futex_get_value_locked(&uval, uaddr);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 06/21] futex: Create helper function to initialize a hash slot.
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (4 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 05/21] futex: Create futex_hash() get/put class Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 07/21] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

Factor out the futex_hash_bucket initialisation into a helper function.
The helper function will be used in a follow-up patch implementing
process private hash buckets.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 08cf54567aeb6..c6c5cde78e0cb 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1122,6 +1122,13 @@ void futex_exit_release(struct task_struct *tsk)
 	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
 }
 
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+{
+	atomic_set(&fhb->waiters, 0);
+	plist_head_init(&fhb->chain);
+	spin_lock_init(&fhb->lock);
+}
+
 static int __init futex_init(void)
 {
 	unsigned long hashsize, i;
@@ -1139,11 +1146,8 @@ static int __init futex_init(void)
 					       hashsize, hashsize);
 	hashsize = 1UL << futex_shift;
 
-	for (i = 0; i < hashsize; i++) {
-		atomic_set(&futex_queues[i].waiters, 0);
-		plist_head_init(&futex_queues[i].chain);
-		spin_lock_init(&futex_queues[i].lock);
-	}
+	for (i = 0; i < hashsize; i++)
+		futex_hash_bucket_init(&futex_queues[i]);
 
 	futex_hashmask = hashsize - 1;
 	return 0;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 07/21] futex: Add basic infrastructure for local task local hash.
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (5 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 06/21] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 08/21] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

The futex hashmap is system wide and shared by all tasks. Each futex is
hashed into a slot based on its address and the VMA. Due to randomized
VMAs (and memory allocations) the same logical lock (pointer) can end
up in a different hash bucket on each invocation of the application.
This in turn means that different applications may share a hash bucket
on the first invocation but not on the second, and it is not always
clear which applications will be involved. This can result in high
latencies when acquiring the futex_hash_bucket::lock, especially if the
lock owner is limited to a CPU and cannot be effectively PI boosted.

Introduce a task local hash map. The hashmap can be allocated via
	prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 0)

The `0' argument allocates a default number of 16 slots; a higher number
can be specified if desired. The current upper limit is 131072.
The allocated hashmap is used by all threads within a process.
A thread can check whether the private map has been allocated via
	prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);

which returns the current number of slots.
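
As an illustration (not part of the patch), a minimal hypothetical
userspace consumer of the interface could look like this; it assumes a
libc exposing prctl(2) and falls back to defining the new constants
from the patched <linux/prctl.h>:

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_FUTEX_HASH
	#define PR_FUTEX_HASH			77
	#define PR_FUTEX_HASH_SET_SLOTS	1
	#define PR_FUTEX_HASH_GET_SLOTS	2
	#endif

	int main(void)
	{
		/* Request the default private hash (16 slots). */
		if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 0))
			perror("PR_FUTEX_HASH_SET_SLOTS");

		/* Returns the slot count, 0 if no private hash exists. */
		printf("slots: %d\n",
		       prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS));
		return 0;
	}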

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/futex.h      | 20 ++++++++
 include/linux/mm_types.h   |  6 ++-
 include/uapi/linux/prctl.h |  5 ++
 kernel/fork.c              |  2 +
 kernel/futex/core.c        | 99 ++++++++++++++++++++++++++++++++++++--
 kernel/sys.c               |  4 ++
 6 files changed, 132 insertions(+), 4 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85c..943828db52234 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -77,6 +77,15 @@ void futex_exec_release(struct task_struct *tsk);
 
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3);
+int futex_hash_allocate_default(void);
+void futex_hash_free(struct mm_struct *mm);
+
+static inline void futex_mm_init(struct mm_struct *mm)
+{
+	mm->futex_hash_bucket = NULL;
+}
+
 #else
 static inline void futex_init_task(struct task_struct *tsk) { }
 static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -88,6 +97,17 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
 {
 	return -EINVAL;
 }
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
+{
+	return -EINVAL;
+}
+static inline int futex_hash_allocate_default(void)
+{
+	return 0;
+}
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+
 #endif
 
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6b..769cd77364e2d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -30,6 +30,7 @@
 #define INIT_PASID	0
 
 struct address_space;
+struct futex_hash_bucket;
 struct mem_cgroup;
 
 /*
@@ -937,7 +938,10 @@ struct mm_struct {
 		 */
 		seqcount_t mm_lock_seq;
 #endif
-
+#ifdef CONFIG_FUTEX
+		unsigned int			futex_hash_mask;
+		struct futex_hash_bucket	*futex_hash_bucket;
+#endif
 
 		unsigned long hiwater_rss; /* High-watermark of RSS usage */
 		unsigned long hiwater_vm;  /* High-water virtual memory usage */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 5c6080680cb27..55b843644c51a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -353,4 +353,9 @@ struct prctl_mm_map {
  */
 #define PR_LOCK_SHADOW_STACK_STATUS      76
 
+/* FUTEX hash management */
+#define PR_FUTEX_HASH			77
+# define PR_FUTEX_HASH_SET_SLOTS	1
+# define PR_FUTEX_HASH_GET_SLOTS	2
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index e27fe5d5a15c9..95d48dbc90934 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1287,6 +1287,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	RCU_INIT_POINTER(mm->exe_file, NULL);
 	mmu_notifier_subscriptions_init(mm);
 	init_tlb_flush_pending(mm);
+	futex_mm_init(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
 	mm->pmd_huge_pte = NULL;
 #endif
@@ -1364,6 +1365,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	lru_gen_del_mm(mm);
+	futex_hash_free(mm);
 	mmdrop(mm);
 }
 
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index c6c5cde78e0cb..1feb7092635d0 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -39,6 +39,7 @@
 #include <linux/memblock.h>
 #include <linux/fault-inject.h>
 #include <linux/slab.h>
+#include <linux/prctl.h>
 
 #include "futex.h"
 #include "../locking/rtmutex_common.h"
@@ -107,18 +108,40 @@ late_initcall(fail_futex_debugfs);
 
 #endif /* CONFIG_FAIL_FUTEX */
 
+static inline bool futex_key_is_private(union futex_key *key)
+{
+	/*
+	 * Relies on get_futex_key() to set either bit for shared
+	 * futexes -- see comment with union futex_key.
+	 */
+	return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
+}
+
 /**
  * futex_hash - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
  *
  * We hash on the keys returned from get_futex_key (see below) and return the
- * corresponding hash bucket in the global hash.
+ * corresponding hash bucket in the global hash. If the futex is private and
+ * a local hash table is provided then that one is used.
  */
 struct futex_hash_bucket *__futex_hash(union futex_key *key)
 {
-	u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
-			  key->both.offset);
+	struct futex_hash_bucket *fhb;
+	u32 hash;
 
+	fhb = current->mm->futex_hash_bucket;
+	if (fhb && futex_key_is_private(key)) {
+		u32 hash_mask = current->mm->futex_hash_mask;
+
+		hash = jhash2((u32 *)key,
+			      offsetof(typeof(*key), both.offset) / 4,
+			      key->both.offset);
+		return &fhb[hash & hash_mask];
+	}
+	hash = jhash2((u32 *)key,
+		      offsetof(typeof(*key), both.offset) / 4,
+		      key->both.offset);
 	return &futex_queues[hash & futex_hashmask];
 }
 
@@ -1129,6 +1152,76 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
 	spin_lock_init(&fhb->lock);
 }
 
+void futex_hash_free(struct mm_struct *mm)
+{
+	kvfree(mm->futex_hash_bucket);
+}
+
+static int futex_hash_allocate(unsigned int hash_slots)
+{
+	struct futex_hash_bucket *fhb;
+	int i;
+
+	if (current->mm->futex_hash_bucket)
+		return -EALREADY;
+
+	if (!thread_group_leader(current))
+		return -EINVAL;
+
+	if (hash_slots == 0)
+		hash_slots = 16;
+	if (hash_slots < 2)
+		hash_slots = 2;
+	if (hash_slots > 131072)
+		hash_slots = 131072;
+	if (!is_power_of_2(hash_slots))
+		hash_slots = rounddown_pow_of_two(hash_slots);
+
+	fhb = kvmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL_ACCOUNT);
+	if (!fhb)
+		return -ENOMEM;
+
+	current->mm->futex_hash_mask = hash_slots - 1;
+
+	for (i = 0; i < hash_slots; i++)
+		futex_hash_bucket_init(&fhb[i]);
+
+	current->mm->futex_hash_bucket = fhb;
+	return 0;
+}
+
+int futex_hash_allocate_default(void)
+{
+	return futex_hash_allocate(0);
+}
+
+static int futex_hash_get_slots(void)
+{
+	if (current->mm->futex_hash_bucket)
+		return current->mm->futex_hash_mask + 1;
+	return 0;
+}
+
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
+{
+	int ret;
+
+	switch (arg2) {
+	case PR_FUTEX_HASH_SET_SLOTS:
+		ret = futex_hash_allocate(arg3);
+		break;
+
+	case PR_FUTEX_HASH_GET_SLOTS:
+		ret = futex_hash_get_slots();
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
 static int __init futex_init(void)
 {
 	unsigned long hashsize, i;
diff --git a/kernel/sys.c b/kernel/sys.c
index cb366ff8703af..e509ad9795103 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
 #include <linux/user_namespace.h>
 #include <linux/time_namespace.h>
 #include <linux/binfmts.h>
+#include <linux/futex.h>
 
 #include <linux/sched.h>
 #include <linux/sched/autogroup.h>
@@ -2811,6 +2812,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = arch_lock_shadow_stack_status(me, arg2);
 		break;
+	case PR_FUTEX_HASH:
+		error = futex_hash_prctl(arg2, arg3);
+		break;
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 08/21] futex: Hash only the address for private futexes.
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (6 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 07/21] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 09/21] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

futex_hash() passes the whole futex_key to jhash2. The first two members
are passed as the first argument and the offset as the "initial value".

For private futexes, the mm-part is always the same and it is used only
within the process. By excluding the mm part from the hash, we reduce
the length passed to jhash2 from 4 words (16 / 4) to 2 (8 / 4). This avoids
the __jhash_mix() part of jhash.

The resulting code is smaller and, based on testing, this variant
performs as well as the original or slightly better.
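
For reference, the layout assumed here (a sketch from memory; see
futex.h for the authoritative definition) puts the constant mm part in
the leading u64 of the private variant, so hashing from
&key->private.address covers only the address:

	union futex_key {
		struct {
			union {
				struct mm_struct *mm;	/* constant within a process */
				u64 __tmp;
			};
			unsigned long address;	/* hashed: 8 bytes = 2 u32 words */
			unsigned int offset;	/* used as the jhash2() initval */
		} private;
		/* ... shared and both variants elided ... */
	};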

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 1feb7092635d0..8561c41df7dc5 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -117,6 +117,18 @@ static inline bool futex_key_is_private(union futex_key *key)
 	return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
 }
 
+static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
+						    struct futex_hash_bucket *fhb,
+						    u32 hash_mask)
+{
+	u32 hash;
+
+	hash = jhash2((void *)&key->private.address,
+		      sizeof(key->private.address) / 4,
+		      key->both.offset);
+	return &fhb[hash & hash_mask];
+}
+
 /**
  * futex_hash - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
@@ -131,14 +143,9 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
 	u32 hash;
 
 	fhb = current->mm->futex_hash_bucket;
-	if (fhb && futex_key_is_private(key)) {
-		u32 hash_mask = current->mm->futex_hash_mask;
+	if (fhb && futex_key_is_private(key))
+		return futex_hash_private(key, fhb, current->mm->futex_hash_mask);
 
-		hash = jhash2((u32 *)key,
-			      offsetof(typeof(*key), both.offset) / 4,
-			      key->both.offset);
-		return &fhb[hash & hash_mask];
-	}
 	hash = jhash2((u32 *)key,
 		      offsetof(typeof(*key), both.offset) / 4,
 		      key->both.offset);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 09/21] futex: Allow automatic allocation of process wide futex hash.
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (7 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 08/21] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 10/21] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

Allocate a default futex hash if a task forks its first thread.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/futex.h | 12 ++++++++++++
 kernel/fork.c         | 24 ++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 943828db52234..bad377c30de5e 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -86,6 +86,13 @@ static inline void futex_mm_init(struct mm_struct *mm)
 	mm->futex_hash_bucket = NULL;
 }
 
+static inline bool futex_hash_requires_allocation(void)
+{
+	if (current->mm->futex_hash_bucket)
+		return false;
+	return true;
+}
+
 #else
 static inline void futex_init_task(struct task_struct *tsk) { }
 static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -108,6 +115,11 @@ static inline int futex_hash_allocate_default(void)
 static inline void futex_hash_free(struct mm_struct *mm) { }
 static inline void futex_mm_init(struct mm_struct *mm) { }
 
+static inline bool futex_hash_requires_allocation(void)
+{
+	return false;
+}
+
 #endif
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 95d48dbc90934..440c5808f70a2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2137,6 +2137,15 @@ static void rv_task_fork(struct task_struct *p)
 #define rv_task_fork(p) do {} while (0)
 #endif
 
+static bool need_futex_hash_allocate_default(u64 clone_flags)
+{
+	if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
+		return false;
+	if (!thread_group_empty(current))
+		return false;
+	return futex_hash_requires_allocation();
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -2514,6 +2523,21 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cancel_cgroup;
 
+	/*
+	 * Allocate a default futex hash for the user process once the first
+	 * thread spawns.
+	 */
+	if (need_futex_hash_allocate_default(clone_flags)) {
+		retval = futex_hash_allocate_default();
+		if (retval)
+			goto bad_fork_core_free;
+		/*
+		 * If we fail beyond this point we don't free the allocated
+		 * futex hash map. We assume that another thread will be created
+		 * and will make use of it. The hash map will be freed once the
+		 * main thread terminates.
+		 */
+	}
 	/*
 	 * From this point on we must avoid any synchronous user-space
 	 * communication until we take the tasklist-lock. In particular, we do
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 10/21] futex: Decrease the waiter count before the unlock operation.
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (8 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 09/21] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 11/21] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

To support runtime resizing of the process private hash, the obtained
hash bucket must not be used once the reference count has been dropped.
The reference will be dropped after the unlock of the hash bucket.
The waiter count is decremented after the unlock operation. There
is no requirement that this needs to happen after the unlock. The
increment happens before acquiring the lock to signal early that there
will be a waiter. A waker can then avoid taking the lock if it is
known that there are no waiters.
There is no difference in terms of ordering if the decrement happens
before or after the unlock.

Decrease the waiter count before the unlock operation.
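
The resulting pattern around the hash bucket lock is sketched below
(illustrative only, not a quote of the code):

	futex_hb_waiters_inc(hb);	/* announce the waiter early */
	spin_lock(&hb->lock);
	/* queue or unqueue the futex_q ... */
	futex_hb_waiters_dec(hb);	/* now before the unlock */
	spin_unlock(&hb->lock);

A waker observing futex_hb_waiters_pending(hb) == 0 skips taking the
lock either way, so the wake fast path is unaffected.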

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c    | 2 +-
 kernel/futex/requeue.c | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 8561c41df7dc5..063d733181783 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -554,8 +554,8 @@ void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb)
 void futex_q_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
-	spin_unlock(&hb->lock);
 	futex_hb_waiters_dec(hb);
+	spin_unlock(&hb->lock);
 }
 
 void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb,
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 992e3ce005c6f..023c028d2fce3 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -456,8 +456,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 			ret = futex_get_value_locked(&curval, uaddr1);
 
 			if (unlikely(ret)) {
-				double_unlock_hb(hb1, hb2);
 				futex_hb_waiters_dec(hb2);
+				double_unlock_hb(hb1, hb2);
 
 				ret = get_user(curval, uaddr1);
 				if (ret)
@@ -542,8 +542,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 				 * waiter::requeue_state is correct.
 				 */
 			case -EFAULT:
-				double_unlock_hb(hb1, hb2);
 				futex_hb_waiters_dec(hb2);
+				double_unlock_hb(hb1, hb2);
 				ret = fault_in_user_writeable(uaddr2);
 				if (!ret)
 					goto retry;
@@ -556,8 +556,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 				 *   exit to complete.
 				 * - EAGAIN: The user space value changed.
 				 */
-				double_unlock_hb(hb1, hb2);
 				futex_hb_waiters_dec(hb2);
+				double_unlock_hb(hb1, hb2);
 				/*
 				 * Handle the case where the owner is in the middle of
 				 * exiting. Wait for the exit to complete otherwise
@@ -674,8 +674,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
 		put_pi_state(pi_state);
 
 out_unlock:
-		double_unlock_hb(hb1, hb2);
 		futex_hb_waiters_dec(hb2);
+		double_unlock_hb(hb1, hb2);
 	}
 	wake_up_q(&wake_q);
 	return ret ? ret : task_count;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 11/21] futex: Introduce futex_q_lockptr_lock().
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (9 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 10/21] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 12/21] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

futex_lock_pi() and __fixup_pi_state_owner() acquire the
futex_q::lock_ptr without holding a reference assuming the previously
obtained hash bucket and the assigned lock_ptr are still valid. This
isn't the case once the private hash can be resized and becomes invalid
after the reference drop.

Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in
futex_q::lock_ptr. The lock pointer is read in an RCU section to ensure
that it does not go away if the hash bucket has been replaced and the
old pointer has been observed. After locking, the pointer is compared
to check whether it changed. If so, the hash bucket has been replaced,
the user has been moved to the new one, and lock_ptr has been updated.
The lock operation needs to be redone in this case.

The locked hash bucket is not returned.

A special case is an early return in futex_lock_pi() (due to signal or
timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
futex_q::lock_ptr is expected (and its matching hash bucket) but since
the waiter has been removed from the hash this can no longer be
guaranteed. Therefore, before the waiter is removed, a reference is
acquired, which is later dropped by the waiter to avoid a resize.

Add futex_q_lockptr_lock() and use it.
Acquire an additional reference in requeue_pi_wake_futex() and
futex_unlock_pi() while the futex_q is removed, denote this extra
reference in futex_q::drop_hb_ref and let the waiter drop the reference
in this case.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c    | 29 +++++++++++++++++++++++++++++
 kernel/futex/futex.h   |  4 +++-
 kernel/futex/pi.c      | 15 +++++++++++++--
 kernel/futex/requeue.c | 16 +++++++++++++---
 4 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 063d733181783..4d8912daffe83 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -152,6 +152,17 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
 	return &futex_queues[hash & futex_hashmask];
 }
 
+/**
+ * futex_hash_get - Get an additional reference for the local hash.
+ * @hb:		    ptr to the private local hash.
+ *
+ * Obtain an additional reference for the already obtained hash bucket. The
+ * caller must already own a reference.
+ */
+void futex_hash_get(struct futex_hash_bucket *hb)
+{
+}
+
 void futex_hash_put(struct futex_hash_bucket *hb) { }
 
 /**
@@ -632,6 +643,24 @@ int futex_unqueue(struct futex_q *q)
 	return ret;
 }
 
+void futex_q_lockptr_lock(struct futex_q *q)
+{
+	spinlock_t *lock_ptr;
+
+	/*
+	 * See futex_unqueue() why lock_ptr can change.
+	 */
+	guard(rcu)();
+retry:
+	lock_ptr = READ_ONCE(q->lock_ptr);
+	spin_lock(lock_ptr);
+
+	if (unlikely(lock_ptr != q->lock_ptr)) {
+		spin_unlock(lock_ptr);
+		goto retry;
+	}
+}
+
 /*
  * PI futexes can not be requeued and must remove themselves from the hash
  * bucket. The hash bucket lock (i.e. lock_ptr) is held.
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index eac6de6ed563a..e6f8f2f9281aa 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -183,6 +183,7 @@ struct futex_q {
 	union futex_key *requeue_pi_key;
 	u32 bitset;
 	atomic_t requeue_state;
+	bool drop_hb_ref;
 #ifdef CONFIG_PREEMPT_RT
 	struct rcuwait requeue_wait;
 #endif
@@ -197,12 +198,13 @@ enum futex_access {
 
 extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
 			 enum futex_access rw);
-
+extern void futex_q_lockptr_lock(struct futex_q *q);
 extern struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns);
 
 extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
+extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
 
 DEFINE_CLASS(hb, struct futex_hash_bucket *,
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 4cee9ec5d97d6..51c69e8808152 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -806,7 +806,7 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 		break;
 	}
 
-	spin_lock(q->lock_ptr);
+	futex_q_lockptr_lock(q);
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 
 	/*
@@ -1066,7 +1066,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		 * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
 		 * the
 		 */
-		spin_lock(q.lock_ptr);
+		futex_q_lockptr_lock(&q);
 		/*
 		 * Waiter is unqueued.
 		 */
@@ -1086,6 +1086,11 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 
 		futex_unqueue_pi(&q);
 		spin_unlock(q.lock_ptr);
+		if (q.drop_hb_ref) {
+			CLASS(hb, hb)(&q.key);
+			/* Additional reference from futex_unlock_pi() */
+			futex_hash_put(hb);
+		}
 		goto out;
 
 out_unlock_put_key:
@@ -1194,6 +1199,12 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		 */
 		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
 		if (!rt_waiter) {
+			/*
+			 * Acquire a reference for the leaving waiter to ensure
+			 * valid futex_q::lock_ptr.
+			 */
+			futex_hash_get(hb);
+			top_waiter->drop_hb_ref = true;
 			__futex_unqueue(top_waiter);
 			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 			goto retry_hb;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 023c028d2fce3..b0e64fd454d96 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -231,7 +231,12 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
 
 	WARN_ON(!q->rt_waiter);
 	q->rt_waiter = NULL;
-
+	/*
+	 * Acquire a reference for the waiter to ensure valid
+	 * futex_q::lock_ptr.
+	 */
+	futex_hash_get(hb);
+	q->drop_hb_ref = true;
 	q->lock_ptr = &hb->lock;
 
 	/* Signal locked state to the waiter */
@@ -826,7 +831,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	case Q_REQUEUE_PI_LOCKED:
 		/* The requeue acquired the lock */
 		if (q.pi_state && (q.pi_state->owner != current)) {
-			spin_lock(q.lock_ptr);
+			futex_q_lockptr_lock(&q);
 			ret = fixup_pi_owner(uaddr2, &q, true);
 			/*
 			 * Drop the reference to the pi state which the
@@ -853,7 +858,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
 			ret = 0;
 
-		spin_lock(q.lock_ptr);
+		futex_q_lockptr_lock(&q);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
@@ -885,6 +890,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	default:
 		BUG();
 	}
+	if (q.drop_hb_ref) {
+		CLASS(hb, hb)(&q.key);
+		/* Additional reference from requeue_pi_wake_futex() */
+		futex_hash_put(hb);
+	}
 
 out:
 	if (to) {
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 12/21] futex: Acquire a hash reference in futex_wait_multiple_setup().
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (10 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 11/21] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 13/21] futex: Allow to re-allocate the private local hash Sebastian Andrzej Siewior
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

futex_wait_multiple_setup() changes task_struct::__state to
!TASK_RUNNING and then enqueues on multiple futexes. Every
futex_q_lock() acquires a reference on the global hash which is dropped
later.
If a rehash is in progress then the loop will block on
mm_struct::futex_hash_bucket for the rehash to complete and this will
lose the previously set task_struct::__state.

Acquire a reference on the local hash to avoid blocking on
mm_struct::futex_hash_bucket.
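
The hazard boils down to the classic pattern below (illustrative, not
from the patch): any sleeping lock taken between setting the task state
and the final schedule() clobbers that state:

	set_current_state(TASK_INTERRUPTIBLE);
	/* may sleep during a rehash; returns in TASK_RUNNING */
	mutex_lock(&mm->futex_hash_lock);

Holding a reference across the whole setup loop ensures no rehash, and
therefore no blocking, can happen in between.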

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c     | 10 ++++++++++
 kernel/futex/futex.h    |  2 ++
 kernel/futex/waitwake.c | 21 +++++++++++++++++++--
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 4d8912daffe83..700a24d796acb 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -129,6 +129,11 @@ static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
 	return &fhb[hash & hash_mask];
 }
 
+struct futex_private_hash *futex_get_private_hash(void)
+{
+	return NULL;
+}
+
 /**
  * futex_hash - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
@@ -152,6 +157,11 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
 	return &futex_queues[hash & futex_hashmask];
 }
 
+bool futex_put_private_hash(struct futex_private_hash *hb_p)
+{
+	return false;
+}
+
 /**
  * futex_hash_get - Get an additional reference for the local hash.
  * @hb:		    ptr to the private local hash.
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index e6f8f2f9281aa..0a76ee6e7dc10 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -206,6 +206,8 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
+extern struct futex_private_hash *futex_get_private_hash(void);
+extern bool futex_put_private_hash(struct futex_private_hash *hb_p);
 
 DEFINE_CLASS(hb, struct futex_hash_bucket *,
 	     if (_T) futex_hash_put(_T),
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 44034dee7a48c..67eebb5b4b212 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -385,7 +385,7 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
 }
 
 /**
- * futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
+ * __futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
  * @vs:		The futex list to wait on
  * @count:	The size of the list
  * @woken:	Index of the last woken futex, if any. Used to notify the
@@ -400,7 +400,7 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
  *  -  0 - Success
  *  - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
  */
-int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
+static int __futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 {
 	bool retry = false;
 	int ret, i;
@@ -491,6 +491,23 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 	return 0;
 }
 
+int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
+{
+	struct futex_private_hash *hb_p;
+	int ret;
+
+	/*
+	 * Assume we have a private futex and acquire a reference on the private
+	 * hash to avoid blocking on mm_struct::futex_hash_bucket during rehash
+	 * after changing the task state.
+	 */
+	hb_p = futex_get_private_hash();
+	ret = __futex_wait_multiple_setup(vs, count, woken);
+	if (hb_p)
+		futex_put_private_hash(hb_p);
+	return ret;
+}
+
 /**
  * futex_sleep_multiple - Check sleeping conditions and sleep
  * @vs:    List of futexes to wait for
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 13/21] futex: Allow to re-allocate the private local hash.
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (11 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 12/21] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 14/21] futex: Resize local futex hash table based on number of threads Sebastian Andrzej Siewior
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

The mm_struct::futex_hash_lock guards the futex_hash_bucket
assignment/replacement. The futex_hash_allocate()/PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime and resize an already existing
internal private futex_hash_bucket to another size.

The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it has enqueued itself on the hash bucket. No
reference is held while the task is scheduled out waiting for the
wake up.
The resize allocates a new struct futex_private_hash and drops the
initial reference under the mm_struct::futex_hash_lock. If the reference
drop results in destruction of the object then users currently queued on
the local hash will be requeued on the new local hash. At the end
mm_struct::futex_phash is updated, the old pointer is RCU freed
and the mutex is dropped.
If the reference drop does not result in destruction of the object then
the new pointer is saved as mm_struct::futex_phash_new. In this case
replacement is delayed. The user dropping the last reference is not
always the best choice to perform the replacement. For instance
futex_wait_queue() drops the reference after setting its task state,
and that state would be clobbered by blocking on the futex_hash_lock.
Therefore the replacement is delayed to the next task acquiring a reference
on the current local hash.

This scheme keeps the requirement that all waiters and wakers of the same
address always block on the same futex_hash_bucket::lock.
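
The per-user reference lifecycle described above boils down to the
following bracket (a sketch restating the scheme, not new code):

	hb_p = futex_get_private_hash();	/* rcuref_get(&hb_p->users) */
	hb = &hb_p->queues[hash & hb_p->hash_mask];
	spin_lock(&hb->lock);
	/* enqueue the futex_q; its lock_ptr now pins the waiter */
	spin_unlock(&hb->lock);
	futex_put_private_hash(hb_p);	/* the last put enables the resize */

Sleeping waiters hold no reference; futex_rehash_current_users() moves
them to the new hash and updates their lock_ptr.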

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/futex.h    |   5 +-
 include/linux/mm_types.h |   7 +-
 kernel/futex/core.c      | 248 +++++++++++++++++++++++++++++++++++----
 kernel/futex/futex.h     |   1 +
 kernel/futex/requeue.c   |   5 +
 5 files changed, 237 insertions(+), 29 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index bad377c30de5e..bfb38764bac7a 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -83,12 +83,13 @@ void futex_hash_free(struct mm_struct *mm);
 
 static inline void futex_mm_init(struct mm_struct *mm)
 {
-	mm->futex_hash_bucket = NULL;
+	rcu_assign_pointer(mm->futex_phash, NULL);
+	mutex_init(&mm->futex_hash_lock);
 }
 
 static inline bool futex_hash_requires_allocation(void)
 {
-	if (current->mm->futex_hash_bucket)
+	if (current->mm->futex_phash)
 		return false;
 	return true;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 769cd77364e2d..46abaf1ce1c0a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -30,7 +30,7 @@
 #define INIT_PASID	0
 
 struct address_space;
-struct futex_hash_bucket;
+struct futex_private_hash;
 struct mem_cgroup;
 
 /*
@@ -939,8 +939,9 @@ struct mm_struct {
 		seqcount_t mm_lock_seq;
 #endif
 #ifdef CONFIG_FUTEX
-		unsigned int			futex_hash_mask;
-		struct futex_hash_bucket	*futex_hash_bucket;
+		struct mutex			futex_hash_lock;
+		struct futex_private_hash	__rcu *futex_phash;
+		struct futex_private_hash	*futex_phash_new;
 #endif
 
 		unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 700a24d796acb..c5a9db946b421 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
 #include <linux/fault-inject.h>
 #include <linux/slab.h>
 #include <linux/prctl.h>
+#include <linux/rcuref.h>
 
 #include "futex.h"
 #include "../locking/rtmutex_common.h"
@@ -56,6 +57,14 @@ static struct {
 #define futex_queues   (__futex_data.queues)
 #define futex_hashmask (__futex_data.hashmask)
 
+struct futex_private_hash {
+	rcuref_t	users;
+	unsigned int	hash_mask;
+	struct rcu_head	rcu;
+	bool		initial_ref_dropped;
+	bool		released;
+	struct futex_hash_bucket queues[];
+};
 
 /*
  * Fault injections for futexes.
@@ -129,9 +138,122 @@ static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
 	return &fhb[hash & hash_mask];
 }
 
+static void futex_rehash_current_users(struct futex_private_hash *old,
+				       struct futex_private_hash *new)
+{
+	struct futex_hash_bucket *hb_old, *hb_new;
+	unsigned int slots = old->hash_mask + 1;
+	u32 hash_mask = new->hash_mask;
+	unsigned int i;
+
+	for (i = 0; i < slots; i++) {
+		struct futex_q *this, *tmp;
+
+		hb_old = &old->queues[i];
+
+		spin_lock(&hb_old->lock);
+		plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
+
+			plist_del(&this->list, &hb_old->chain);
+			futex_hb_waiters_dec(hb_old);
+
+			WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
+
+			hb_new = futex_hash_private(&this->key, new->queues, hash_mask);
+			futex_hb_waiters_inc(hb_new);
+			/*
+			 * The new pointer isn't published yet but an already
+			 * moved user can be unqueued due to timeout or signal.
+			 */
+			spin_lock_nested(&hb_new->lock, SINGLE_DEPTH_NESTING);
+			plist_add(&this->list, &hb_new->chain);
+			this->lock_ptr = &hb_new->lock;
+			spin_unlock(&hb_new->lock);
+		}
+		spin_unlock(&hb_old->lock);
+	}
+}
+
+static void futex_assign_new_hash(struct futex_private_hash *hb_p_new,
+				  struct mm_struct *mm)
+{
+	bool drop_init_ref = hb_p_new != NULL;
+	struct futex_private_hash *hb_p;
+
+	if (!hb_p_new) {
+		hb_p_new = mm->futex_phash_new;
+		mm->futex_phash_new = NULL;
+	}
+	/* Someone was quicker, the current mask is valid */
+	if (!hb_p_new)
+		return;
+
+	hb_p = rcu_dereference_check(mm->futex_phash,
+				     lockdep_is_held(&mm->futex_hash_lock));
+	if (hb_p) {
+		if (hb_p->hash_mask >= hb_p_new->hash_mask) {
+			/* It was increased again while we were waiting */
+			kvfree(hb_p_new);
+			return;
+		}
+		/*
+		 * If the caller started the resize then the initial reference
+		 * needs to be dropped. If the object can not be deconstructed
+		 * we save hb_p_new for later and ensure the reference counter
+		 * is not dropped again.
+		 */
+		if (drop_init_ref &&
+		    (hb_p->initial_ref_dropped || !futex_put_private_hash(hb_p))) {
+			mm->futex_phash_new = hb_p_new;
+			hb_p->initial_ref_dropped = true;
+			return;
+		}
+		if (!READ_ONCE(hb_p->released)) {
+			mm->futex_phash_new = hb_p_new;
+			return;
+		}
+
+		futex_rehash_current_users(hb_p, hb_p_new);
+	}
+	rcu_assign_pointer(mm->futex_phash, hb_p_new);
+	kvfree_rcu(hb_p, rcu);
+}
+
 struct futex_private_hash *futex_get_private_hash(void)
 {
-	return NULL;
+	struct mm_struct *mm = current->mm;
+	/*
+	 * Ideally we don't loop. If there is a replacement in progress
+	 * then a new private hash is already prepared and a reference can't be
+	 * obtained once the last user dropped its reference.
+	 * In that case we block on mm_struct::futex_hash_lock and either have
+	 * to perform the replacement or wait while someone else is doing the
+	 * job. Either way, on the second iteration we acquire a reference on the
+	 * new private hash or loop again because a new replacement has been
+	 * requested.
+	 */
+again:
+	scoped_guard(rcu) {
+		struct futex_private_hash *hb_p;
+
+		hb_p = rcu_dereference(mm->futex_phash);
+		if (!hb_p)
+			return NULL;
+
+		if (rcuref_get(&hb_p->users))
+			return hb_p;
+	}
+	scoped_guard(mutex, &current->mm->futex_hash_lock)
+		futex_assign_new_hash(NULL, mm);
+	goto again;
+}
+
+static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
+{
+	if (!futex_key_is_private(key))
+		return NULL;
+
+	return futex_get_private_hash();
 }
 
 /**
@@ -144,12 +266,12 @@ struct futex_private_hash *futex_get_private_hash(void)
  */
 struct futex_hash_bucket *__futex_hash(union futex_key *key)
 {
-	struct futex_hash_bucket *fhb;
+	struct futex_private_hash *hb_p;
 	u32 hash;
 
-	fhb = current->mm->futex_hash_bucket;
-	if (fhb && futex_key_is_private(key))
-		return futex_hash_private(key, fhb, current->mm->futex_hash_mask);
+	hb_p = futex_get_private_hb(key);
+	if (hb_p)
+		return futex_hash_private(key, hb_p->queues, hb_p->hash_mask);
 
 	hash = jhash2((u32 *)key,
 		      offsetof(typeof(*key), both.offset) / 4,
@@ -159,7 +281,13 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
 
 bool futex_put_private_hash(struct futex_private_hash *hb_p)
 {
-	return false;
+	bool released;
+
+	guard(preempt)();
+	released = rcuref_put_rcusafe(&hb_p->users);
+	if (released)
+		WRITE_ONCE(hb_p->released, true);
+	return released;
 }
 
 /**
@@ -171,9 +299,22 @@ bool futex_put_private_hash(struct futex_private_hash *hb_p)
  */
 void futex_hash_get(struct futex_hash_bucket *hb)
 {
+	struct futex_private_hash *hb_p = hb->hb_p;
+
+	if (!hb_p)
+		return;
+
+	WARN_ON_ONCE(!rcuref_get(&hb_p->users));
 }
 
-void futex_hash_put(struct futex_hash_bucket *hb) { }
+void futex_hash_put(struct futex_hash_bucket *hb)
+{
+	struct futex_private_hash *hb_p = hb->hb_p;
+
+	if (!hb_p)
+		return;
+	futex_put_private_hash(hb_p);
+}
 
 /**
  * futex_setup_timer - set up the sleeping hrtimer.
@@ -615,6 +756,8 @@ int futex_unqueue(struct futex_q *q)
 	spinlock_t *lock_ptr;
 	int ret = 0;
 
+	/* RCU so lock_ptr is not going away during locking. */
+	guard(rcu)();
 	/* In the common case we don't take the spinlock, which is nice. */
 retry:
 	/*
@@ -1013,9 +1156,21 @@ static void compat_exit_robust_list(struct task_struct *curr)
 static void exit_pi_state_list(struct task_struct *curr)
 {
 	struct list_head *next, *head = &curr->pi_state_list;
+	struct futex_private_hash *hb_p;
 	struct futex_pi_state *pi_state;
 	union futex_key key = FUTEX_KEY_INIT;
 
+	/*
+	 * The mutex mm_struct::futex_hash_lock might be acquired.
+	 */
+	might_sleep();
+	/*
+	 * Ensure the hash remains stable (no resize) during the while loop
+	 * below. The hb pointer is acquired under the pi_lock so we can't block
+	 * on the mutex.
+	 */
+	WARN_ON(curr != current);
+	hb_p = futex_get_private_hash();
 	/*
 	 * We are a ZOMBIE and nobody can enqueue itself on
 	 * pi_state_list anymore, but we have to be careful
@@ -1078,6 +1233,8 @@ static void exit_pi_state_list(struct task_struct *curr)
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
 	raw_spin_unlock_irq(&curr->pi_lock);
+	if (hb_p)
+		futex_put_private_hash(hb_p);
 }
 #else
 static inline void exit_pi_state_list(struct task_struct *curr) { }
@@ -1191,8 +1348,10 @@ void futex_exit_release(struct task_struct *tsk)
 	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
 }
 
-static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
+				   struct futex_private_hash *hb_p)
 {
+	fhb->hb_p = hb_p;
 	atomic_set(&fhb->waiters, 0);
 	plist_head_init(&fhb->chain);
 	spin_lock_init(&fhb->lock);
@@ -1200,20 +1359,34 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
 
 void futex_hash_free(struct mm_struct *mm)
 {
-	kvfree(mm->futex_hash_bucket);
+	struct futex_private_hash *hb_p;
+
+	kvfree(mm->futex_phash_new);
+	/*
+	 * The mm_struct belonging to the task is about to be removed so all
+	 * threads that ever accessed the private hash are gone and the
+	 * pointer can be accessed directly (omitting an RCU read section or
+	 * lock).
+	 * Since there cannot be a thread holding a reference to the private
+	 * hash we free it immediately.
+	 */
+	hb_p = rcu_dereference_raw(mm->futex_phash);
+	if (!hb_p)
+		return;
+
+	if (!hb_p->initial_ref_dropped && WARN_ON(!futex_put_private_hash(hb_p)))
+		return;
+
+	kvfree(hb_p);
 }
 
 static int futex_hash_allocate(unsigned int hash_slots)
 {
-	struct futex_hash_bucket *fhb;
+	struct futex_private_hash *hb_p, *hb_tofree = NULL;
+	struct mm_struct *mm = current->mm;
+	size_t alloc_size;
 	int i;
 
-	if (current->mm->futex_hash_bucket)
-		return -EALREADY;
-
-	if (!thread_group_leader(current))
-		return -EINVAL;
-
 	if (hash_slots == 0)
 		hash_slots = 16;
 	if (hash_slots < 2)
@@ -1223,16 +1396,39 @@ static int futex_hash_allocate(unsigned int hash_slots)
 	if (!is_power_of_2(hash_slots))
 		hash_slots = rounddown_pow_of_two(hash_slots);
 
-	fhb = kvmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL_ACCOUNT);
-	if (!fhb)
+	if (unlikely(check_mul_overflow(hash_slots, sizeof(struct futex_hash_bucket),
+					&alloc_size)))
 		return -ENOMEM;
 
-	current->mm->futex_hash_mask = hash_slots - 1;
+	if (unlikely(check_add_overflow(alloc_size, sizeof(struct futex_private_hash),
+					&alloc_size)))
+		return -ENOMEM;
+
+	hb_p = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
+	if (!hb_p)
+		return -ENOMEM;
+
+	rcuref_init(&hb_p->users, 1);
+	hb_p->initial_ref_dropped = false;
+	hb_p->released = false;
+	hb_p->hash_mask = hash_slots - 1;
 
 	for (i = 0; i < hash_slots; i++)
-		futex_hash_bucket_init(&fhb[i]);
+		futex_hash_bucket_init(&hb_p->queues[i], hb_p);
 
-	current->mm->futex_hash_bucket = fhb;
+	scoped_guard(mutex, &mm->futex_hash_lock) {
+		if (mm->futex_phash_new) {
+			if (mm->futex_phash_new->hash_mask <= hb_p->hash_mask) {
+				hb_tofree = mm->futex_phash_new;
+			} else {
+				hb_tofree = hb_p;
+				hb_p = mm->futex_phash_new;
+			}
+			mm->futex_phash_new = NULL;
+		}
+		futex_assign_new_hash(hb_p, mm);
+	}
+	kvfree(hb_tofree);
 	return 0;
 }
 
@@ -1243,8 +1439,12 @@ int futex_hash_allocate_default(void)
 
 static int futex_hash_get_slots(void)
 {
-	if (current->mm->futex_hash_bucket)
-		return current->mm->futex_hash_mask + 1;
+	struct futex_private_hash *hb_p;
+
+	guard(rcu)();
+	hb_p = rcu_dereference(current->mm->futex_phash);
+	if (hb_p)
+		return hb_p->hash_mask + 1;
 	return 0;
 }
 
@@ -1286,7 +1486,7 @@ static int __init futex_init(void)
 	hashsize = 1UL << futex_shift;
 
 	for (i = 0; i < hashsize; i++)
-		futex_hash_bucket_init(&futex_queues[i]);
+		futex_hash_bucket_init(&futex_queues[i], NULL);
 
 	futex_hashmask = hashsize - 1;
 	return 0;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 0a76ee6e7dc10..973efcca2e01b 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -118,6 +118,7 @@ struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
+	struct futex_private_hash *hb_p;
 } ____cacheline_aligned_in_smp;
 
 /*
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b0e64fd454d96..c716a66f86929 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -87,6 +87,11 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 		futex_hb_waiters_inc(hb2);
 		plist_add(&q->list, &hb2->chain);
 		q->lock_ptr = &hb2->lock;
+		/*
+		 * hb1 and hb2 belong to the same futex_private_hash because
+		 * if we managed to get a reference on hb1 then it can't be
+		 * replaced. Therefore we avoid put(hb1)+get(hb2) here.
+		 */
 	}
 	q->key = *key2;
 }
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 14/21] futex: Resize local futex hash table based on number of threads.
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (12 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 13/21] futex: Allow to re-allocate the private local hash Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 15/21] futex: s/hb_p/fph/ Sebastian Andrzej Siewior
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

Automatically size the local hash based on the number of threads,
capped at the number of online CPUs. The logic allocates
4 * number-of-threads buckets, clamped between 16 and futex_hashsize
(the default size of the system-wide hash).
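
Roughly, the resulting computation (mirroring the code added below):

  threads = min_t(unsigned int, get_nr_threads(current), num_online_cpus());
  buckets = roundup_pow_of_two(4 * threads);
  buckets = clamp(buckets, 16, futex_hashmask + 1);

For example, a process with 6 threads on an 8-CPU machine ends up with
roundup_pow_of_two(4 * 6) = 32 buckets.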

On CONFIG_BASE_SMALL configs, the additional members for private hash
resize have been removed in order to save memory in mm_struct and to
avoid any additional memory consumption. To achieve this,
CONFIG_FUTEX_PRIVATE_HASH has been introduced, which depends on
!BASE_SMALL and can be extended later.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/futex.h    | 20 ++++++--------
 include/linux/mm_types.h |  2 +-
 init/Kconfig             |  5 ++++
 kernel/fork.c            |  4 +--
 kernel/futex/core.c      | 57 ++++++++++++++++++++++++++++++++++++----
 kernel/futex/futex.h     |  8 ++++++
 6 files changed, 75 insertions(+), 21 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index bfb38764bac7a..7e14d2e9162d2 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -78,6 +78,8 @@ void futex_exec_release(struct task_struct *tsk);
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3);
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
 void futex_hash_free(struct mm_struct *mm);
 
@@ -87,14 +89,13 @@ static inline void futex_mm_init(struct mm_struct *mm)
 	mutex_init(&mm->futex_hash_lock);
 }
 
-static inline bool futex_hash_requires_allocation(void)
-{
-	if (current->mm->futex_phash)
-		return false;
-	return true;
-}
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate_default(void) { return 0; }
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
-#else
+#else /* !CONFIG_FUTEX */
 static inline void futex_init_task(struct task_struct *tsk) { }
 static inline void futex_exit_recursive(struct task_struct *tsk) { }
 static inline void futex_exit_release(struct task_struct *tsk) { }
@@ -116,11 +117,6 @@ static inline int futex_hash_allocate_default(void)
 static inline void futex_hash_free(struct mm_struct *mm) { }
 static inline void futex_mm_init(struct mm_struct *mm) { }
 
-static inline bool futex_hash_requires_allocation(void)
-{
-	return false;
-}
-
 #endif
 
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 46abaf1ce1c0a..e0e8adbe66bdd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -938,7 +938,7 @@ struct mm_struct {
 		 */
 		seqcount_t mm_lock_seq;
 #endif
-#ifdef CONFIG_FUTEX
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 		struct mutex			futex_hash_lock;
 		struct futex_private_hash	__rcu *futex_phash;
 		struct futex_private_hash	*futex_phash_new;
diff --git a/init/Kconfig b/init/Kconfig
index a0ea04c177842..bb209c12a2bda 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1683,6 +1683,11 @@ config FUTEX_PI
 	depends on FUTEX && RT_MUTEXES
 	default y
 
+config FUTEX_PRIVATE_HASH
+	bool
+	depends on FUTEX && !BASE_SMALL
+	default y
+
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index 440c5808f70a2..69f98d7e85054 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2141,9 +2141,7 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
 {
 	if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
 		return false;
-	if (!thread_group_empty(current))
-		return false;
-	return futex_hash_requires_allocation();
+	return true;
 }
 
 /*
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index c5a9db946b421..229009279ee7d 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -138,6 +138,7 @@ static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
 	return &fhb[hash & hash_mask];
 }
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 static void futex_rehash_current_users(struct futex_private_hash *old,
 				       struct futex_private_hash *new)
 {
@@ -256,6 +257,14 @@ static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
 	return futex_get_private_hash();
 }
 
+#else
+
+static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
+{
+	return NULL;
+}
+#endif
+
 /**
  * futex_hash - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
@@ -279,6 +288,7 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
 	return &futex_queues[hash & futex_hashmask];
 }
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 bool futex_put_private_hash(struct futex_private_hash *hb_p)
 {
 	bool released;
@@ -315,6 +325,7 @@ void futex_hash_put(struct futex_hash_bucket *hb)
 		return;
 	futex_put_private_hash(hb_p);
 }
+#endif
 
 /**
  * futex_setup_timer - set up the sleeping hrtimer.
@@ -1351,12 +1362,15 @@ void futex_exit_release(struct task_struct *tsk)
 static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 				   struct futex_private_hash *hb_p)
 {
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 	fhb->hb_p = hb_p;
+#endif
 	atomic_set(&fhb->waiters, 0);
 	plist_head_init(&fhb->chain);
 	spin_lock_init(&fhb->lock);
 }
 
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 void futex_hash_free(struct mm_struct *mm)
 {
 	struct futex_private_hash *hb_p;
@@ -1389,10 +1403,7 @@ static int futex_hash_allocate(unsigned int hash_slots)
 
 	if (hash_slots == 0)
 		hash_slots = 16;
-	if (hash_slots < 2)
-		hash_slots = 2;
-	if (hash_slots > 131072)
-		hash_slots = 131072;
+	hash_slots = clamp(hash_slots, 2, futex_hashmask + 1);
 	if (!is_power_of_2(hash_slots))
 		hash_slots = rounddown_pow_of_two(hash_slots);
 
@@ -1434,7 +1445,30 @@ static int futex_hash_allocate(unsigned int hash_slots)
 
 int futex_hash_allocate_default(void)
 {
-	return futex_hash_allocate(0);
+	unsigned int threads, buckets, current_buckets = 0;
+	struct futex_private_hash *hb_p;
+
+	if (!current->mm)
+		return 0;
+
+	scoped_guard(rcu) {
+		threads = min_t(unsigned int, get_nr_threads(current), num_online_cpus());
+		hb_p = rcu_dereference(current->mm->futex_phash);
+		if (hb_p)
+			current_buckets = hb_p->hash_mask + 1;
+	}
+
+	/*
+	 * The default allocation will remain within
+	 *   16 <= threads * 4 <= global hash size
+	 */
+	buckets = roundup_pow_of_two(4 * threads);
+	buckets = clamp(buckets, 16, futex_hashmask + 1);
+
+	if (current_buckets >= buckets)
+		return 0;
+
+	return futex_hash_allocate(buckets);
 }
 
 static int futex_hash_get_slots(void)
@@ -1448,6 +1482,19 @@ static int futex_hash_get_slots(void)
 	return 0;
 }
 
+#else
+
+static int futex_hash_allocate(unsigned int hash_slots)
+{
+	return -EINVAL;
+}
+
+static int futex_hash_get_slots(void)
+{
+	return 0;
+}
+#endif
+
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
 {
 	int ret;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 973efcca2e01b..782021feffe2e 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -205,11 +205,19 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns);
 
 extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
 extern struct futex_private_hash *futex_get_private_hash(void);
 extern bool futex_put_private_hash(struct futex_private_hash *hb_p);
 
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
+static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
+static inline struct futex_private_hash *futex_get_private_hash(void) { return NULL; }
+static inline bool futex_put_private_hash(struct futex_private_hash *hb_p) { return false; }
+#endif
+
 DEFINE_CLASS(hb, struct futex_hash_bucket *,
 	     if (_T) futex_hash_put(_T),
 	     __futex_hash(key), union futex_key *key);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 15/21] futex: s/hb_p/fph/
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (13 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 14/21] futex: Resize local futex hash table based on number of threads Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-14 12:36   ` Peter Zijlstra
  2025-03-12 15:16 ` [PATCH v10 16/21] futex: Remove superfluous state Sebastian Andrzej Siewior
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

To me hb_p reads like hash-bucket-private, but these things are
pointers to the private hash table, not to a bucket.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c     | 136 ++++++++++++++++++++--------------------
 kernel/futex/futex.h    |   6 +-
 kernel/futex/waitwake.c |   8 +--
 3 files changed, 75 insertions(+), 75 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 229009279ee7d..9b87c4f128f14 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -175,49 +175,49 @@ static void futex_rehash_current_users(struct futex_private_hash *old,
 	}
 }
 
-static void futex_assign_new_hash(struct futex_private_hash *hb_p_new,
+static void futex_assign_new_hash(struct futex_private_hash *new,
 				  struct mm_struct *mm)
 {
-	bool drop_init_ref = hb_p_new != NULL;
-	struct futex_private_hash *hb_p;
+	bool drop_init_ref = new != NULL;
+	struct futex_private_hash *fph;
 
-	if (!hb_p_new) {
-		hb_p_new = mm->futex_phash_new;
+	if (!new) {
+		new = mm->futex_phash_new;
 		mm->futex_phash_new = NULL;
 	}
 	/* Someone was quicker, the current mask is valid */
-	if (!hb_p_new)
+	if (!new)
 		return;
 
-	hb_p = rcu_dereference_check(mm->futex_phash,
+	fph = rcu_dereference_check(mm->futex_phash,
 				     lockdep_is_held(&mm->futex_hash_lock));
-	if (hb_p) {
-		if (hb_p->hash_mask >= hb_p_new->hash_mask) {
+	if (fph) {
+		if (fph->hash_mask >= new->hash_mask) {
 			/* It was increased again while we were waiting */
-			kvfree(hb_p_new);
+			kvfree(new);
 			return;
 		}
 		/*
 		 * If the caller started the resize then the initial reference
 		 * needs to be dropped. If the object can not be deconstructed
-		 * we save hb_p_new for later and ensure the reference counter
+		 * we save new for later and ensure the reference counter
 		 * is not dropped again.
 		 */
 		if (drop_init_ref &&
-		    (hb_p->initial_ref_dropped || !futex_put_private_hash(hb_p))) {
-			mm->futex_phash_new = hb_p_new;
-			hb_p->initial_ref_dropped = true;
+		    (fph->initial_ref_dropped || !futex_put_private_hash(fph))) {
+			mm->futex_phash_new = new;
+			fph->initial_ref_dropped = true;
 			return;
 		}
-		if (!READ_ONCE(hb_p->released)) {
-			mm->futex_phash_new = hb_p_new;
+		if (!READ_ONCE(fph->released)) {
+			mm->futex_phash_new = new;
 			return;
 		}
 
-		futex_rehash_current_users(hb_p, hb_p_new);
+		futex_rehash_current_users(fph, new);
 	}
-	rcu_assign_pointer(mm->futex_phash, hb_p_new);
-	kvfree_rcu(hb_p, rcu);
+	rcu_assign_pointer(mm->futex_phash, new);
+	kvfree_rcu(fph, rcu);
 }
 
 struct futex_private_hash *futex_get_private_hash(void)
@@ -235,14 +235,14 @@ struct futex_private_hash *futex_get_private_hash(void)
 	 */
 again:
 	scoped_guard(rcu) {
-		struct futex_private_hash *hb_p;
+		struct futex_private_hash *fph;
 
-		hb_p = rcu_dereference(mm->futex_phash);
-		if (!hb_p)
+		fph = rcu_dereference(mm->futex_phash);
+		if (!fph)
 			return NULL;
 
-		if (rcuref_get(&hb_p->users))
-			return hb_p;
+		if (rcuref_get(&fph->users))
+			return fph;
 	}
 	scoped_guard(mutex, &current->mm->futex_hash_lock)
 		futex_assign_new_hash(NULL, mm);
@@ -275,12 +275,12 @@ static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
  */
 struct futex_hash_bucket *__futex_hash(union futex_key *key)
 {
-	struct futex_private_hash *hb_p;
+	struct futex_private_hash *fph;
 	u32 hash;
 
-	hb_p = futex_get_private_hb(key);
-	if (hb_p)
-		return futex_hash_private(key, hb_p->queues, hb_p->hash_mask);
+	fph = futex_get_private_hb(key);
+	if (fph)
+		return futex_hash_private(key, fph->queues, fph->hash_mask);
 
 	hash = jhash2((u32 *)key,
 		      offsetof(typeof(*key), both.offset) / 4,
@@ -289,14 +289,14 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
 }
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
-bool futex_put_private_hash(struct futex_private_hash *hb_p)
+bool futex_put_private_hash(struct futex_private_hash *fph)
 {
 	bool released;
 
 	guard(preempt)();
-	released = rcuref_put_rcusafe(&hb_p->users);
+	released = rcuref_put_rcusafe(&fph->users);
 	if (released)
-		WRITE_ONCE(hb_p->released, true);
+		WRITE_ONCE(fph->released, true);
 	return released;
 }
 
@@ -309,21 +309,21 @@ bool futex_put_private_hash(struct futex_private_hash *hb_p)
  */
 void futex_hash_get(struct futex_hash_bucket *hb)
 {
-	struct futex_private_hash *hb_p = hb->hb_p;
+	struct futex_private_hash *fph = hb->priv;
 
-	if (!hb_p)
+	if (!fph)
 		return;
 
-	WARN_ON_ONCE(!rcuref_get(&hb_p->users));
+	WARN_ON_ONCE(!rcuref_get(&fph->users));
 }
 
 void futex_hash_put(struct futex_hash_bucket *hb)
 {
-	struct futex_private_hash *hb_p = hb->hb_p;
+	struct futex_private_hash *fph = hb->priv;
 
-	if (!hb_p)
+	if (!fph)
 		return;
-	futex_put_private_hash(hb_p);
+	futex_put_private_hash(fph);
 }
 #endif
 
@@ -1167,7 +1167,7 @@ static void compat_exit_robust_list(struct task_struct *curr)
 static void exit_pi_state_list(struct task_struct *curr)
 {
 	struct list_head *next, *head = &curr->pi_state_list;
-	struct futex_private_hash *hb_p;
+	struct futex_private_hash *fph;
 	struct futex_pi_state *pi_state;
 	union futex_key key = FUTEX_KEY_INIT;
 
@@ -1181,7 +1181,7 @@ static void exit_pi_state_list(struct task_struct *curr)
 	 * on the mutex.
 	 */
 	WARN_ON(curr != current);
-	hb_p = futex_get_private_hash();
+	fph = futex_get_private_hash();
 	/*
 	 * We are a ZOMBIE and nobody can enqueue itself on
 	 * pi_state_list anymore, but we have to be careful
@@ -1244,8 +1244,8 @@ static void exit_pi_state_list(struct task_struct *curr)
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
 	raw_spin_unlock_irq(&curr->pi_lock);
-	if (hb_p)
-		futex_put_private_hash(hb_p);
+	if (fph)
+		futex_put_private_hash(fph);
 }
 #else
 static inline void exit_pi_state_list(struct task_struct *curr) { }
@@ -1360,10 +1360,10 @@ void futex_exit_release(struct task_struct *tsk)
 }
 
 static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
-				   struct futex_private_hash *hb_p)
+				   struct futex_private_hash *fph)
 {
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
-	fhb->hb_p = hb_p;
+	fhb->priv = fph;
 #endif
 	atomic_set(&fhb->waiters, 0);
 	plist_head_init(&fhb->chain);
@@ -1373,7 +1373,7 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 void futex_hash_free(struct mm_struct *mm)
 {
-	struct futex_private_hash *hb_p;
+	struct futex_private_hash *fph;
 
 	kvfree(mm->futex_phash_new);
 	/*
@@ -1384,19 +1384,19 @@ void futex_hash_free(struct mm_struct *mm)
 	 * Since there cannot be a thread holding a reference to the private
 	 * hash, we free it immediately.
 	 */
-	hb_p = rcu_dereference_raw(mm->futex_phash);
-	if (!hb_p)
+	fph = rcu_dereference_raw(mm->futex_phash);
+	if (!fph)
 		return;
 
-	if (!hb_p->initial_ref_dropped && WARN_ON(!futex_put_private_hash(hb_p)))
+	if (!fph->initial_ref_dropped && WARN_ON(!futex_put_private_hash(fph)))
 		return;
 
-	kvfree(hb_p);
+	kvfree(fph);
 }
 
 static int futex_hash_allocate(unsigned int hash_slots)
 {
-	struct futex_private_hash *hb_p, *hb_tofree = NULL;
+	struct futex_private_hash *fph, *hb_tofree = NULL;
 	struct mm_struct *mm = current->mm;
 	size_t alloc_size;
 	int i;
@@ -1415,29 +1415,29 @@ static int futex_hash_allocate(unsigned int hash_slots)
 					&alloc_size)))
 		return -ENOMEM;
 
-	hb_p = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
-	if (!hb_p)
+	fph = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
+	if (!fph)
 		return -ENOMEM;
 
-	rcuref_init(&hb_p->users, 1);
-	hb_p->initial_ref_dropped = false;
-	hb_p->released = false;
-	hb_p->hash_mask = hash_slots - 1;
+	rcuref_init(&fph->users, 1);
+	fph->initial_ref_dropped = false;
+	fph->released = false;
+	fph->hash_mask = hash_slots - 1;
 
 	for (i = 0; i < hash_slots; i++)
-		futex_hash_bucket_init(&hb_p->queues[i], hb_p);
+		futex_hash_bucket_init(&fph->queues[i], fph);
 
 	scoped_guard(mutex, &mm->futex_hash_lock) {
 		if (mm->futex_phash_new) {
-			if (mm->futex_phash_new->hash_mask <= hb_p->hash_mask) {
+			if (mm->futex_phash_new->hash_mask <= fph->hash_mask) {
 				hb_tofree = mm->futex_phash_new;
 			} else {
-				hb_tofree = hb_p;
-				hb_p = mm->futex_phash_new;
+				hb_tofree = fph;
+				fph = mm->futex_phash_new;
 			}
 			mm->futex_phash_new = NULL;
 		}
-		futex_assign_new_hash(hb_p, mm);
+		futex_assign_new_hash(fph, mm);
 	}
 	kvfree(hb_tofree);
 	return 0;
@@ -1446,16 +1446,16 @@ static int futex_hash_allocate(unsigned int hash_slots)
 int futex_hash_allocate_default(void)
 {
 	unsigned int threads, buckets, current_buckets = 0;
-	struct futex_private_hash *hb_p;
+	struct futex_private_hash *fph;
 
 	if (!current->mm)
 		return 0;
 
 	scoped_guard(rcu) {
 		threads = min_t(unsigned int, get_nr_threads(current), num_online_cpus());
-		hb_p = rcu_dereference(current->mm->futex_phash);
-		if (hb_p)
-			current_buckets = hb_p->hash_mask + 1;
+		fph = rcu_dereference(current->mm->futex_phash);
+		if (fph)
+			current_buckets = fph->hash_mask + 1;
 	}
 
 	/*
@@ -1473,12 +1473,12 @@ int futex_hash_allocate_default(void)
 
 static int futex_hash_get_slots(void)
 {
-	struct futex_private_hash *hb_p;
+	struct futex_private_hash *fph;
 
 	guard(rcu)();
-	hb_p = rcu_dereference(current->mm->futex_phash);
-	if (hb_p)
-		return hb_p->hash_mask + 1;
+	fph = rcu_dereference(current->mm->futex_phash);
+	if (fph)
+		return fph->hash_mask + 1;
 	return 0;
 }
 
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 782021feffe2e..99218d220e534 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -118,7 +118,7 @@ struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
-	struct futex_private_hash *hb_p;
+	struct futex_private_hash *priv;
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -209,13 +209,13 @@ extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
 extern struct futex_private_hash *futex_get_private_hash(void);
-extern bool futex_put_private_hash(struct futex_private_hash *hb_p);
+extern bool futex_put_private_hash(struct futex_private_hash *fph);
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
 static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
 static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
 static inline struct futex_private_hash *futex_get_private_hash(void) { return NULL; }
-static inline bool futex_put_private_hash(struct futex_private_hash *hb_p) { return false; }
+static inline bool futex_put_private_hash(struct futex_private_hash *fph) { return false; }
 #endif
 
 DEFINE_CLASS(hb, struct futex_hash_bucket *,
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 67eebb5b4b212..0d150453a0b41 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -493,7 +493,7 @@ static int __futex_wait_multiple_setup(struct futex_vector *vs, int count, int *
 
 int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 {
-	struct futex_private_hash *hb_p;
+	struct futex_private_hash *fph;
 	int ret;
 
 	/*
@@ -501,10 +501,10 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 	 * hash to avoid blocking on mm_struct::futex_hash_bucket during rehash
 	 * after changing the task state.
 	 */
-	hb_p = futex_get_private_hash();
+	fph = futex_get_private_hash();
 	ret = __futex_wait_multiple_setup(vs, count, woken);
-	if (hb_p)
-		futex_put_private_hash(hb_p);
+	if (fph)
+		futex_put_private_hash(fph);
 	return ret;
 }
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 16/21] futex: Remove superfluous state
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (14 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 15/21] futex: s/hb_p/fph/ Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 17/21] futex: Untangle and naming Sebastian Andrzej Siewior
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

The whole initial_ref_dropped and released state leads to confusing
code.
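
For reference, the check that replaces both flags (as added to
futex_assign_new_hash() below) boils down to:

  if (!rcuref_is_dead(&fph->users)) {
	/* still referenced; keep the replacement for a later swap */
	mm->futex_phash_new = new;
	return false;
  }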

[bigeasy: use rcuref_is_dead() ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c  | 116 +++++++++++++++++++++----------------------
 kernel/futex/futex.h |   4 +-
 2 files changed, 58 insertions(+), 62 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 9b87c4f128f14..37c3e020f2f03 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -61,8 +61,6 @@ struct futex_private_hash {
 	rcuref_t	users;
 	unsigned int	hash_mask;
 	struct rcu_head	rcu;
-	bool		initial_ref_dropped;
-	bool		released;
 	struct futex_hash_bucket queues[];
 };
 
@@ -175,49 +173,32 @@ static void futex_rehash_current_users(struct futex_private_hash *old,
 	}
 }
 
-static void futex_assign_new_hash(struct futex_private_hash *new,
-				  struct mm_struct *mm)
+static bool futex_assign_new_hash(struct mm_struct *mm,
+				  struct futex_private_hash *new)
 {
-	bool drop_init_ref = new != NULL;
 	struct futex_private_hash *fph;
 
-	if (!new) {
-		new = mm->futex_phash_new;
-		mm->futex_phash_new = NULL;
-	}
-	/* Someone was quicker, the current mask is valid */
-	if (!new)
-		return;
+	WARN_ON_ONCE(mm->futex_phash_new);
 
-	fph = rcu_dereference_check(mm->futex_phash,
-				     lockdep_is_held(&mm->futex_hash_lock));
+	fph = rcu_dereference_protected(mm->futex_phash,
+					lockdep_is_held(&mm->futex_hash_lock));
 	if (fph) {
 		if (fph->hash_mask >= new->hash_mask) {
 			/* It was increased again while we were waiting */
 			kvfree(new);
-			return;
+			return true;
 		}
-		/*
-		 * If the caller started the resize then the initial reference
-		 * needs to be dropped. If the object can not be deconstructed
-		 * we save new for later and ensure the reference counter
-		 * is not dropped again.
-		 */
-		if (drop_init_ref &&
-		    (fph->initial_ref_dropped || !futex_put_private_hash(fph))) {
+
+		if (!rcuref_is_dead(&fph->users)) {
 			mm->futex_phash_new = new;
-			fph->initial_ref_dropped = true;
-			return;
-		}
-		if (!READ_ONCE(fph->released)) {
-			mm->futex_phash_new = new;
-			return;
+			return false;
 		}
 
 		futex_rehash_current_users(fph, new);
 	}
 	rcu_assign_pointer(mm->futex_phash, new);
 	kvfree_rcu(fph, rcu);
+	return true;
 }
 
 struct futex_private_hash *futex_get_private_hash(void)
@@ -244,11 +225,26 @@ struct futex_private_hash *futex_get_private_hash(void)
 		if (rcuref_get(&fph->users))
 			return fph;
 	}
-	scoped_guard(mutex, &current->mm->futex_hash_lock)
-		futex_assign_new_hash(NULL, mm);
+	scoped_guard (mutex, &mm->futex_hash_lock) {
+		struct futex_private_hash *fph;
+
+		fph = mm->futex_phash_new;
+		if (fph) {
+			mm->futex_phash_new = NULL;
+			futex_assign_new_hash(mm, fph);
+		}
+	}
 	goto again;
 }
 
+void futex_put_private_hash(struct futex_private_hash *fph)
+{
+	/* Ignore return value, last put is verified via rcuref_is_dead() */
+	if (rcuref_put(&fph->users)) {
+		;
+	}
+}
+
 static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
 {
 	if (!futex_key_is_private(key))
@@ -289,17 +285,6 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
 }
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
-bool futex_put_private_hash(struct futex_private_hash *fph)
-{
-	bool released;
-
-	guard(preempt)();
-	released = rcuref_put_rcusafe(&fph->users);
-	if (released)
-		WRITE_ONCE(fph->released, true);
-	return released;
-}
-
 /**
  * futex_hash_get - Get an additional reference for the local hash.
  * @hb:		    ptr to the private local hash.
@@ -1376,22 +1361,11 @@ void futex_hash_free(struct mm_struct *mm)
 	struct futex_private_hash *fph;
 
 	kvfree(mm->futex_phash_new);
-	/*
-	 * The mm_struct belonging to the task is about to be removed so all
-	 * threads that ever accessed the private hash are gone and the
-	 * pointer can be accessed directly (omitting an RCU read section or
-	 * lock).
-	 * Since there cannot be a thread holding a reference to the private
-	 * hash, we free it immediately.
-	 */
 	fph = rcu_dereference_raw(mm->futex_phash);
-	if (!fph)
-		return;
-
-	if (!fph->initial_ref_dropped && WARN_ON(!futex_put_private_hash(fph)))
-		return;
-
-	kvfree(fph);
+	if (fph) {
+		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+		kvfree(fph);
+	}
 }
 
 static int futex_hash_allocate(unsigned int hash_slots)
@@ -1420,15 +1394,32 @@ static int futex_hash_allocate(unsigned int hash_slots)
 		return -ENOMEM;
 
 	rcuref_init(&fph->users, 1);
-	fph->initial_ref_dropped = false;
-	fph->released = false;
 	fph->hash_mask = hash_slots - 1;
 
 	for (i = 0; i < hash_slots; i++)
 		futex_hash_bucket_init(&fph->queues[i], fph);
 
 	scoped_guard(mutex, &mm->futex_hash_lock) {
+		if (mm->futex_phash && !mm->futex_phash_new) {
+			/*
+			 * If we have an existing hash, but do not yet have
+			 * allocated a replacement hash, drop the initial
+			 * reference on the existing hash.
+			 *
+			 * Ignore the return value; removal is serialized by
+			 * mm->futex_hash_lock which we currently hold and last
+			 * put is verified via rcuref_is_dead().
+			 */
+			if (rcuref_put(&mm->futex_phash->users)) {
+				;
+			}
+		}
+
 		if (mm->futex_phash_new) {
+			/*
+			 * If we already have a replacement hash pending,
+			 * keep the larger hash.
+			 */
 			if (mm->futex_phash_new->hash_mask <= fph->hash_mask) {
 				hb_tofree = mm->futex_phash_new;
 			} else {
@@ -1437,7 +1428,12 @@ static int futex_hash_allocate(unsigned int hash_slots)
 			}
 			mm->futex_phash_new = NULL;
 		}
-		futex_assign_new_hash(fph, mm);
+
+		/*
+		 * Will set mm->futex_phash_new on failure;
+		 * futex_get_private_hash() will try again.
+		 */
+		futex_assign_new_hash(mm, fph);
 	}
 	kvfree(hb_tofree);
 	return 0;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 99218d220e534..5b6b58e8a7008 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -209,13 +209,13 @@ extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
 extern struct futex_private_hash *futex_get_private_hash(void);
-extern bool futex_put_private_hash(struct futex_private_hash *fph);
+extern void futex_put_private_hash(struct futex_private_hash *fph);
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
 static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
 static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
 static inline struct futex_private_hash *futex_get_private_hash(void) { return NULL; }
-static inline bool futex_put_private_hash(struct futex_private_hash *fph) { return false; }
+static inline void futex_put_private_hash(struct futex_private_hash *fph) { }
 #endif
 
 DEFINE_CLASS(hb, struct futex_hash_bucket *,
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 17/21] futex: Untangle and naming
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (15 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 16/21] futex: Remove superfluous state Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 18/21] futex: Rework SET_SLOTS Sebastian Andrzej Siewior
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

Untangle the futex_private_hash::users increment from finding the hb.

  hb = __futex_hash(key) /* finds the hb */
  hb = futex_hash(key)   /* finds the hb and inc users */

Use __futex_hash() for re-hashing, notably allowing a rehash into the
global hash.

This gets us:

  hb = futex_hash(key) /* gets hb and inc users */
  futex_hash_get(hb)   /* inc users */
  futex_hash_put(hb)   /* dec users */

But then we have:

  fph = futex_get_private_hash()  /* get fph and inc */
  futex_put_private_hash()        /* dec */

Which doesn't match naming, so change to:

  fph = futex_private_hash()    /* get and inc */
  futex_private_hash_get(fph)   /* inc */
  futex_private_hash_put(fph)   /* dec */

Add a CLASS for the private_hash, to clean up some trivial wrappers.
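
With the class in place, a holder such as exit_pi_state_list() below
pins the private hash for its whole scope with a single:

  guard(private_hash)();	/* reference dropped at scope exit */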

Additional random renaming that happened while mucking about with the
code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/core.c     | 170 +++++++++++++++++++++++-----------------
 kernel/futex/futex.h    |  27 +++++--
 kernel/futex/waitwake.c |  25 ++----
 3 files changed, 124 insertions(+), 98 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 37c3e020f2f03..1c00890cc4fb5 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -124,25 +124,34 @@ static inline bool futex_key_is_private(union futex_key *key)
 	return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
 }
 
-static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
-						    struct futex_hash_bucket *fhb,
-						    u32 hash_mask)
+static struct futex_hash_bucket *
+__futex_hash(union futex_key *key, struct futex_private_hash *fph);
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+static struct futex_hash_bucket *
+__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 {
 	u32 hash;
 
+	if (!futex_key_is_private(key))
+		return NULL;
+
+	if (!fph)
+		fph = rcu_dereference(key->private.mm->futex_phash);
+	if (!fph || !fph->hash_mask)
+		return NULL;
+
 	hash = jhash2((void *)&key->private.address,
 		      sizeof(key->private.address) / 4,
 		      key->both.offset);
-	return &fhb[hash & hash_mask];
+	return &fph->queues[hash & fph->hash_mask];
 }
 
-#ifdef CONFIG_FUTEX_PRIVATE_HASH
-static void futex_rehash_current_users(struct futex_private_hash *old,
-				       struct futex_private_hash *new)
+static void futex_rehash_private(struct futex_private_hash *old,
+				 struct futex_private_hash *new)
 {
 	struct futex_hash_bucket *hb_old, *hb_new;
 	unsigned int slots = old->hash_mask + 1;
-	u32 hash_mask = new->hash_mask;
 	unsigned int i;
 
 	for (i = 0; i < slots; i++) {
@@ -158,7 +167,7 @@ static void futex_rehash_current_users(struct futex_private_hash *old,
 
 			WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
 
-			hb_new = futex_hash_private(&this->key, new->queues, hash_mask);
+			hb_new = __futex_hash(&this->key, new);
 			futex_hb_waiters_inc(hb_new);
 			/*
 			 * The new pointer isn't published yet but an already
@@ -173,8 +182,8 @@ static void futex_rehash_current_users(struct futex_private_hash *old,
 	}
 }
 
-static bool futex_assign_new_hash(struct mm_struct *mm,
-				  struct futex_private_hash *new)
+static bool __futex_pivot_hash(struct mm_struct *mm,
+			       struct futex_private_hash *new)
 {
 	struct futex_private_hash *fph;
 
@@ -194,14 +203,27 @@ static bool futex_assign_new_hash(struct mm_struct *mm,
 			return false;
 		}
 
-		futex_rehash_current_users(fph, new);
+		futex_rehash_private(fph, new);
 	}
 	rcu_assign_pointer(mm->futex_phash, new);
 	kvfree_rcu(fph, rcu);
 	return true;
 }
 
-struct futex_private_hash *futex_get_private_hash(void)
+static void futex_pivot_hash(struct mm_struct *mm)
+{
+	scoped_guard (mutex, &mm->futex_hash_lock) {
+		struct futex_private_hash *fph;
+
+		fph = mm->futex_phash_new;
+		if (fph) {
+			mm->futex_phash_new = NULL;
+			__futex_pivot_hash(mm, fph);
+		}
+	}
+}
+
+struct futex_private_hash *futex_private_hash(void)
 {
 	struct mm_struct *mm = current->mm;
 	/*
@@ -225,41 +247,73 @@ struct futex_private_hash *futex_get_private_hash(void)
 		if (rcuref_get(&fph->users))
 			return fph;
 	}
-	scoped_guard (mutex, &mm->futex_hash_lock) {
-		struct futex_private_hash *fph;
-
-		fph = mm->futex_phash_new;
-		if (fph) {
-			mm->futex_phash_new = NULL;
-			futex_assign_new_hash(mm, fph);
-		}
-	}
+	futex_pivot_hash(mm);
 	goto again;
 }
 
-void futex_put_private_hash(struct futex_private_hash *fph)
+bool futex_private_hash_get(struct futex_private_hash *fph)
 {
-	/* Ignore return value, last put is verified via rcuref_is_dead() */
-	if (rcuref_put(&fph->users)) {
-		;
+	return rcuref_get(&fph->users);
+}
+
+void futex_private_hash_put(struct futex_private_hash *fph)
+{
+	/*
+	 * Ignore the result; the DEAD state is picked up
+	 * when rcuref_get() starts failing via rcuref_is_dead().
+	 */
+	bool __maybe_unused ignore = rcuref_put(&fph->users);
+}
+
+struct futex_hash_bucket *futex_hash(union futex_key *key)
+{
+	struct futex_private_hash *fph;
+	struct futex_hash_bucket *hb;
+
+again:
+	scoped_guard (rcu) {
+		hb = __futex_hash(key, NULL);
+		fph = hb->priv;
+
+		if (!fph || futex_private_hash_get(fph))
+			return hb;
 	}
+	futex_pivot_hash(key->private.mm);
+	goto again;
 }
 
-static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
+void futex_hash_get(struct futex_hash_bucket *hb)
 {
-	if (!futex_key_is_private(key))
-		return NULL;
+	struct futex_private_hash *fph = hb->priv;
 
-	return futex_get_private_hash();
+	if (!fph)
+		return;
+	WARN_ON_ONCE(!futex_private_hash_get(fph));
 }
 
-#else
+void futex_hash_put(struct futex_hash_bucket *hb)
+{
+	struct futex_private_hash *fph = hb->priv;
 
-static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
+	if (!fph)
+		return;
+	futex_private_hash_put(fph);
+}
+
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+
+static inline struct futex_hash_bucket *
+__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 {
 	return NULL;
 }
-#endif
+
+struct futex_hash_bucket *futex_hash(union futex_key *key)
+{
+	return __futex_hash(key, NULL);
+}
+
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
 /**
  * futex_hash - Return the hash bucket in the global hash
@@ -269,14 +323,15 @@ static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
  * corresponding hash bucket in the global hash. If the FUTEX is private and
 * a local hash table is provided then this one is used.
  */
-struct futex_hash_bucket *__futex_hash(union futex_key *key)
+static struct futex_hash_bucket *
+__futex_hash(union futex_key *key, struct futex_private_hash *fph)
 {
-	struct futex_private_hash *fph;
+	struct futex_hash_bucket *hb;
 	u32 hash;
 
-	fph = futex_get_private_hb(key);
-	if (fph)
-		return futex_hash_private(key, fph->queues, fph->hash_mask);
+	hb = __futex_hash_private(key, fph);
+	if (hb)
+		return hb;
 
 	hash = jhash2((u32 *)key,
 		      offsetof(typeof(*key), both.offset) / 4,
@@ -284,34 +339,6 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
 	return &futex_queues[hash & futex_hashmask];
 }
 
-#ifdef CONFIG_FUTEX_PRIVATE_HASH
-/**
- * futex_hash_get - Get an additional reference for the local hash.
- * @hb:		    ptr to the private local hash.
- *
- * Obtain an additional reference for the already obtained hash bucket. The
- * caller must already own an reference.
- */
-void futex_hash_get(struct futex_hash_bucket *hb)
-{
-	struct futex_private_hash *fph = hb->priv;
-
-	if (!fph)
-		return;
-
-	WARN_ON_ONCE(!rcuref_get(&fph->users));
-}
-
-void futex_hash_put(struct futex_hash_bucket *hb)
-{
-	struct futex_private_hash *fph = hb->priv;
-
-	if (!fph)
-		return;
-	futex_put_private_hash(fph);
-}
-#endif
-
 /**
  * futex_setup_timer - set up the sleeping hrtimer.
  * @time:	ptr to the given timeout value
@@ -1152,7 +1179,6 @@ static void compat_exit_robust_list(struct task_struct *curr)
 static void exit_pi_state_list(struct task_struct *curr)
 {
 	struct list_head *next, *head = &curr->pi_state_list;
-	struct futex_private_hash *fph;
 	struct futex_pi_state *pi_state;
 	union futex_key key = FUTEX_KEY_INIT;
 
@@ -1166,7 +1192,7 @@ static void exit_pi_state_list(struct task_struct *curr)
 	 * on the mutex.
 	 */
 	WARN_ON(curr != current);
-	fph = futex_get_private_hash();
+	guard(private_hash)();
 	/*
 	 * We are a ZOMBIE and nobody can enqueue itself on
 	 * pi_state_list anymore, but we have to be careful
@@ -1229,8 +1255,6 @@ static void exit_pi_state_list(struct task_struct *curr)
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
 	raw_spin_unlock_irq(&curr->pi_lock);
-	if (fph)
-		futex_put_private_hash(fph);
 }
 #else
 static inline void exit_pi_state_list(struct task_struct *curr) { }
@@ -1410,9 +1434,7 @@ static int futex_hash_allocate(unsigned int hash_slots)
 			 * mm->futex_hash_lock which we currently hold and last
 			 * put is verified via rcuref_is_dead().
 			 */
-			if (rcuref_put(&mm->futex_phash->users)) {
-				;
-			}
+			futex_private_hash_put(mm->futex_phash);
 		}
 
 		if (mm->futex_phash_new) {
@@ -1433,7 +1455,7 @@ static int futex_hash_allocate(unsigned int hash_slots)
 		 * Will set mm->futex_phash_new on failure;
 		 * futex_get_private_hash() will try again.
 		 */
-		futex_assign_new_hash(mm, fph);
+		__futex_pivot_hash(mm, fph);
 	}
 	kvfree(hb_tofree);
 	return 0;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 5b6b58e8a7008..8eba9982bcae1 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -204,23 +204,38 @@ extern struct hrtimer_sleeper *
 futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 		  int flags, u64 range_ns);
 
-extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
+extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
-extern struct futex_private_hash *futex_get_private_hash(void);
-extern void futex_put_private_hash(struct futex_private_hash *fph);
+
+extern struct futex_private_hash *futex_private_hash(void);
+extern bool futex_private_hash_get(struct futex_private_hash *fph);
+extern void futex_private_hash_put(struct futex_private_hash *fph);
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
 static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
 static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
-static inline struct futex_private_hash *futex_get_private_hash(void) { return NULL; }
-static inline void futex_put_private_hash(struct futex_private_hash *fph) { }
+
+static inline struct futex_private_hash *futex_private_hash(void)
+{
+	return NULL;
+}
+static inline bool futex_private_hash_get(struct futex_private_hash *fph)
+{
+	return false;
+}
+static inline void futex_private_hash_put(struct futex_private_hash *fph) { }
 #endif
 
 DEFINE_CLASS(hb, struct futex_hash_bucket *,
 	     if (_T) futex_hash_put(_T),
-	     __futex_hash(key), union futex_key *key);
+	     futex_hash(key), union futex_key *key);
+
+DEFINE_CLASS(private_hash, struct futex_private_hash *,
+	     if (_T) futex_private_hash_put(_T),
+	     futex_private_hash(), void);
 
 /**
  * futex_match - Check whether two futex keys are equal
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 0d150453a0b41..74647f6bf75de 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -400,12 +400,18 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
  *  -  0 - Success
  *  - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
  */
-static int __futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
+int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 {
 	bool retry = false;
 	int ret, i;
 	u32 uval;
 
+	/*
+	 * Make sure to have a reference on the private_hash such that we
+	 * don't block on rehash after changing the task state below.
+	 */
+	guard(private_hash)();
+
 	/*
 	 * Enqueuing multiple futexes is tricky, because we need to enqueue
 	 * each futex on the list before dealing with the next one to avoid
@@ -491,23 +497,6 @@ static int __futex_wait_multiple_setup(struct futex_vector *vs, int count, int *
 	return 0;
 }
 
-int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
-{
-	struct futex_private_hash *fph;
-	int ret;
-
-	/*
-	 * Assume to have a private futex and acquire a reference on the private
-	 * hash to avoid blocking on mm_struct::futex_hash_bucket during rehash
-	 * after changing the task state.
-	 */
-	fph = futex_get_private_hash();
-	ret = __futex_wait_multiple_setup(vs, count, woken);
-	if (fph)
-		futex_put_private_hash(fph);
-	return ret;
-}
-
 /**
  * futex_sleep_multiple - Check sleeping conditions and sleep
  * @vs:    List of futexes to wait for
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 18/21] futex: Rework SET_SLOTS
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (16 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 17/21] futex: Untangle and naming Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-26 15:37   ` Sebastian Andrzej Siewior
  2025-03-12 15:16 ` [PATCH v10 19/21] mm: Add vmalloc_huge_node() Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

Let SET_SLOTS have precedence over default scaling; once the user sets
a size, stick with it.

Notably, doing SET_SLOTS 0 will cause fph->hash_mask to be 0, which
makes __futex_hash() return global hash buckets. Once in this state it
is impossible to recover, so any further SET_SLOTS request is rejected
with -EBUSY.

Also, let prctl() users wait-retry the rehash, such that the return of
prctl() means the new size is in effect.
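
A userspace sketch of the resulting semantics (assuming the prctl
command is exposed as PR_FUTEX_HASH; the command name itself is not
part of this patch):

  prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 128);	/* pin 128 buckets */
  prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);	/* returns 128 */
  prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 0);	/* global hash, final */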

[bigeasy: make private hash depend on MMU]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 init/Kconfig        |   2 +-
 kernel/futex/core.c | 183 +++++++++++++++++++++++++++++---------------
 2 files changed, 123 insertions(+), 62 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index bb209c12a2bda..b0a448608446d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1685,7 +1685,7 @@ config FUTEX_PI
 
 config FUTEX_PRIVATE_HASH
 	bool
-	depends on FUTEX && !BASE_SMALL
+	depends on FUTEX && !BASE_SMALL && MMU
 	default y
 
 config EPOLL
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 1c00890cc4fb5..bc7451287b2ce 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -61,6 +61,8 @@ struct futex_private_hash {
 	rcuref_t	users;
 	unsigned int	hash_mask;
 	struct rcu_head	rcu;
+	void		*mm;
+	bool		custom;
 	struct futex_hash_bucket queues[];
 };
 
@@ -192,12 +194,6 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
 	fph = rcu_dereference_protected(mm->futex_phash,
 					lockdep_is_held(&mm->futex_hash_lock));
 	if (fph) {
-		if (fph->hash_mask >= new->hash_mask) {
-			/* It was increased again while we were waiting */
-			kvfree(new);
-			return true;
-		}
-
 		if (!rcuref_is_dead(&fph->users)) {
 			mm->futex_phash_new = new;
 			return false;
@@ -207,6 +203,7 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
 	}
 	rcu_assign_pointer(mm->futex_phash, new);
 	kvfree_rcu(fph, rcu);
+	wake_up_var(mm);
 	return true;
 }
 
@@ -262,7 +259,8 @@ void futex_private_hash_put(struct futex_private_hash *fph)
 	 * Ignore the result; the DEAD state is picked up
 	 * when rcuref_get() starts failing via rcuref_is_dead().
 	 */
-	bool __maybe_unused ignore = rcuref_put(&fph->users);
+	if (rcuref_put(&fph->users))
+		wake_up_var(fph->mm);
 }
 
 struct futex_hash_bucket *futex_hash(union futex_key *key)
@@ -1392,72 +1390,128 @@ void futex_hash_free(struct mm_struct *mm)
 	}
 }
 
-static int futex_hash_allocate(unsigned int hash_slots)
+static bool futex_pivot_pending(struct mm_struct *mm)
+{
+	struct futex_private_hash *fph;
+
+	guard(rcu)();
+
+	if (!mm->futex_phash_new)
+		return false;
+
+	fph = rcu_dereference(mm->futex_phash);
+	return !rcuref_read(&fph->users);
+}
+
+static bool futex_hash_less(struct futex_private_hash *a,
+			    struct futex_private_hash *b)
+{
+	/* user provided always wins */
+	if (!a->custom && b->custom)
+		return true;
+	if (a->custom && !b->custom)
+		return false;
+
+	/* zero-sized hash wins */
+	if (!b->hash_mask)
+		return true;
+	if (!a->hash_mask)
+		return false;
+
+	/* keep the biggest */
+	if (a->hash_mask < b->hash_mask)
+		return true;
+	if (a->hash_mask > b->hash_mask)
+		return false;
+
+	return false; /* equal */
+}
+
+static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 {
-	struct futex_private_hash *fph, *hb_tofree = NULL;
 	struct mm_struct *mm = current->mm;
-	size_t alloc_size;
+	struct futex_private_hash *fph;
 	int i;
 
-	if (hash_slots == 0)
-		hash_slots = 16;
-	hash_slots = clamp(hash_slots, 2, futex_hashmask + 1);
-	if (!is_power_of_2(hash_slots))
-		hash_slots = rounddown_pow_of_two(hash_slots);
+	if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
+		return -EINVAL;
 
-	if (unlikely(check_mul_overflow(hash_slots, sizeof(struct futex_hash_bucket),
-					&alloc_size)))
-		return -ENOMEM;
+	/*
+	 * Once the private hash has been replaced by the zero-sized one
+	 * (i.e. futexes use the global hash again) there is no way back.
+	 */
+	scoped_guard (rcu) {
+		fph = rcu_dereference(mm->futex_phash);
+		if (fph && !fph->hash_mask) {
+			if (custom)
+				return -EBUSY;
+			return 0;
+		}
+	}
 
-	if (unlikely(check_add_overflow(alloc_size, sizeof(struct futex_private_hash),
-					&alloc_size)))
-		return -ENOMEM;
-
-	fph = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
+	fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT);
 	if (!fph)
 		return -ENOMEM;
 
 	rcuref_init(&fph->users, 1);
-	fph->hash_mask = hash_slots - 1;
+	fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
+	fph->custom = custom;
+	fph->mm = mm;
 
 	for (i = 0; i < hash_slots; i++)
 		futex_hash_bucket_init(&fph->queues[i], fph);
 
-	scoped_guard(mutex, &mm->futex_hash_lock) {
-		if (mm->futex_phash && !mm->futex_phash_new) {
-			/*
-			 * If we have an existing hash, but do not yet have
-			 * allocated a replacement hash, drop the initial
-			 * reference on the existing hash.
-			 *
-			 * Ignore the return value; removal is serialized by
-			 * mm->futex_hash_lock which we currently hold and last
-			 * put is verified via rcuref_is_dead().
-			 */
-			futex_private_hash_put(mm->futex_phash);
-		}
-
-		if (mm->futex_phash_new) {
-			/*
-			 * If we already have a replacement hash pending,
-			 * keep the larger hash.
-			 */
-			if (mm->futex_phash_new->hash_mask <= fph->hash_mask) {
-				hb_tofree = mm->futex_phash_new;
-			} else {
-				hb_tofree = fph;
-				fph = mm->futex_phash_new;
-			}
-			mm->futex_phash_new = NULL;
-		}
-
+	if (custom) {
 		/*
-		 * Will set mm->futex_phash_new on failure;
-		 * futex_get_private_hash() will try again.
+		 * Only let prctl() wait / retry; don't unduly delay clone().
 		 */
-		__futex_pivot_hash(mm, fph);
+again:
+		wait_var_event(mm, futex_pivot_pending(mm));
+	}
+
+	scoped_guard(mutex, &mm->futex_hash_lock) {
+		struct futex_private_hash *free __free(kvfree) = NULL;
+		struct futex_private_hash *cur, *new;
+
+		cur = rcu_dereference_protected(mm->futex_phash,
+						lockdep_is_held(&mm->futex_hash_lock));
+		new = mm->futex_phash_new;
+		mm->futex_phash_new = NULL;
+
+		if (fph) {
+			if (cur && !new) {
+				/*
+				 * If we have an existing hash, but do not yet have
+				 * allocated a replacement hash, drop the initial
+				 * reference on the existing hash.
+				 */
+				futex_private_hash_put(cur);
+			}
+
+			if (new) {
+				/*
+				 * Two updates raced; throw out the lesser one.
+				 */
+				if (futex_hash_less(new, fph)) {
+					free = new;
+					new = fph;
+				} else {
+					free = fph;
+				}
+			} else {
+				new = fph;
+			}
+			fph = NULL;
+		}
+
+		if (new) {
+			/*
+			 * Will set mm->futex_phash_new on failure;
+			 * futex_private_hash() will try again.
+			 */
+			if (!__futex_pivot_hash(mm, new) && custom)
+				goto again;
+		}
 	}
-	kvfree(hb_tofree);
 	return 0;
 }
 
@@ -1470,10 +1524,17 @@ int futex_hash_allocate_default(void)
 		return 0;
 
 	scoped_guard(rcu) {
-		threads = min_t(unsigned int, get_nr_threads(current), num_online_cpus());
+		threads = min_t(unsigned int,
+				get_nr_threads(current),
+				num_online_cpus());
+
 		fph = rcu_dereference(current->mm->futex_phash);
-		if (fph)
+		if (fph) {
+			if (fph->custom)
+				return 0;
+
 			current_buckets = fph->hash_mask + 1;
+		}
 	}
 
 	/*
@@ -1486,7 +1547,7 @@ int futex_hash_allocate_default(void)
 	if (current_buckets >= buckets)
 		return 0;
 
-	return futex_hash_allocate(buckets);
+	return futex_hash_allocate(buckets, false);
 }
 
 static int futex_hash_get_slots(void)
@@ -1502,7 +1563,7 @@ static int futex_hash_get_slots(void)
 
 #else
 
-static int futex_hash_allocate(unsigned int hash_slots)
+static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 {
 	return -EINVAL;
 }
@@ -1519,7 +1580,7 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
 
 	switch (arg2) {
 	case PR_FUTEX_HASH_SET_SLOTS:
-		ret = futex_hash_allocate(arg3);
+		ret = futex_hash_allocate(arg3, true);
 		break;
 
 	case PR_FUTEX_HASH_GET_SLOTS:
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 19/21] mm: Add vmalloc_huge_node()
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (17 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 18/21] futex: Rework SET_SLOTS Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 22:02   ` Andrew Morton
  2025-03-12 15:16 ` [PATCH v10 20/21] futex: Implement FUTEX2_NUMA Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
	linux-mm, Christoph Hellwig, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

To enable node-specific hash tables.
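
A minimal usage sketch (hypothetical caller; the size and node are made
up for illustration):

  /* node-local allocation that may be backed by huge mappings */
  table = vmalloc_huge_node(SZ_2M, GFP_KERNEL, node);
  if (!table)
	return -ENOMEM;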

[bigeasy: use __vmalloc_node_range_noprof(), add nommu bits]

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-mm@kvack.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/vmalloc.h | 3 +++
 mm/nommu.c              | 5 +++++
 mm/vmalloc.c            | 7 +++++++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e39..09c3e3e33f1f8 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -171,6 +171,9 @@ void *__vmalloc_node_noprof(unsigned long size, unsigned long align, gfp_t gfp_m
 void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
 #define vmalloc_huge(...)	alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
 
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
+#define vmalloc_huge_node(...)	alloc_hooks(vmalloc_huge_node_noprof(__VA_ARGS__))
+
 extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 #define __vmalloc_array(...)	alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
 
diff --git a/mm/nommu.c b/mm/nommu.c
index baa79abdaf037..d04e601a8f4d7 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -209,6 +209,11 @@ EXPORT_SYMBOL(vmalloc_noprof);
 
 void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
 
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+	return vmalloc_huge_noprof(size, gfp_mask);
+}
+
 /*
  *	vzalloc - allocate virtually contiguous memory with zero fill
  *
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a6e7acebe9adf..69247b46413ca 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3966,6 +3966,13 @@ void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
 
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
+					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+					   node, __builtin_return_address(0));
+}
+
 /**
  * vzalloc - allocate virtually contiguous memory with zero fill
  * @size:    allocation size
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 20/21] futex: Implement FUTEX2_NUMA
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (18 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 19/21] mm: Add vmalloc_huge_node() Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-25 19:52   ` Shrikanth Hegde
  2025-03-12 15:16 ` [PATCH v10 21/21] futex: Implement FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

Extend the futex2 interface to be NUMA-aware.

When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know; the second is the node on which to place this futex.

  struct futex_numa_32 {
	u32 val;
	u32 node;
  };

When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.
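
For illustration, a hypothetical userspace setup (not part of this
patch; FUTEX_NO_NODE is introduced below and stores as ~0 in the u32):

	struct futex_numa_32 f = {
		.val  = 0,
		.node = FUTEX_NO_NODE,	/* let WAIT fill in the node */
	};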

When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.

This avoids the need for a separate !numa hash-table.

[bigeasy: ensure a hashsize of at least 4 in futex_init(), add
pr_info() for size and allocation information.]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/futex.h      |   3 ++
 include/uapi/linux/futex.h |   8 +++
 kernel/futex/core.c        | 100 ++++++++++++++++++++++++++++++-------
 kernel/futex/futex.h       |  33 ++++++++++--
 4 files changed, 124 insertions(+), 20 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 7e14d2e9162d2..19c37afa0432a 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -34,6 +34,7 @@ union futex_key {
 		u64 i_seq;
 		unsigned long pgoff;
 		unsigned int offset;
+		/* unsigned int node; */
 	} shared;
 	struct {
 		union {
@@ -42,11 +43,13 @@ union futex_key {
 		};
 		unsigned long address;
 		unsigned int offset;
+		/* unsigned int node; */
 	} private;
 	struct {
 		u64 ptr;
 		unsigned long word;
 		unsigned int offset;
+		unsigned int node;	/* NOT hashed! */
 	} both;
 };
 
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index d2ee625ea1890..0435025beaae8 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -74,6 +74,14 @@
 /* do not use */
 #define FUTEX_32		FUTEX2_SIZE_U32 /* historical accident :-( */
 
+
+/*
+ * When FUTEX2_NUMA doubles the futex word, the second word is a node value.
+ * The special value -1 indicates no-node. This is the same value as
+ * NUMA_NO_NODE, except that value is not ABI, this is.
+ */
+#define FUTEX_NO_NODE		(-1)
+
 /*
  * Max numbers of elements in a futex_waitv array
  */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index bc7451287b2ce..b9da7dc6a900a 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -36,6 +36,8 @@
 #include <linux/pagemap.h>
 #include <linux/debugfs.h>
 #include <linux/plist.h>
+#include <linux/gfp.h>
+#include <linux/vmalloc.h>
 #include <linux/memblock.h>
 #include <linux/fault-inject.h>
 #include <linux/slab.h>
@@ -51,11 +53,14 @@
  * reside in the same cacheline.
  */
 static struct {
-	struct futex_hash_bucket *queues;
 	unsigned long            hashmask;
+	unsigned int		 hashshift;
+	struct futex_hash_bucket *queues[MAX_NUMNODES];
 } __futex_data __read_mostly __aligned(2*sizeof(long));
-#define futex_queues   (__futex_data.queues)
-#define futex_hashmask (__futex_data.hashmask)
+
+#define futex_hashmask	(__futex_data.hashmask)
+#define futex_hashshift	(__futex_data.hashshift)
+#define futex_queues	(__futex_data.queues)
 
 struct futex_private_hash {
 	rcuref_t	users;
@@ -326,15 +331,35 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
 {
 	struct futex_hash_bucket *hb;
 	u32 hash;
+	int node;
 
 	hb = __futex_hash_private(key, fph);
 	if (hb)
 		return hb;
 
 	hash = jhash2((u32 *)key,
-		      offsetof(typeof(*key), both.offset) / 4,
+		      offsetof(typeof(*key), both.offset) / sizeof(u32),
 		      key->both.offset);
-	return &futex_queues[hash & futex_hashmask];
+	node = key->both.node;
+
+	if (node == FUTEX_NO_NODE) {
+		/*
+		 * In case of !FLAGS_NUMA, use some unused hash bits to pick a
+		 * node -- this ensures regular futexes are interleaved across
+		 * the nodes and avoids having to allocate multiple
+		 * hash-tables.
+		 *
+		 * NOTE: this isn't perfectly uniform, but it is fast and
+		 * handles sparse node masks.
+		 */
+		node = (hash >> futex_hashshift) % nr_node_ids;
+		if (!node_possible(node)) {
+			node = find_next_bit_wrap(node_possible_map.bits,
+						  nr_node_ids, node);
+		}
+	}
+
+	return &futex_queues[node][hash & futex_hashmask];
 }
 
 /**
@@ -441,25 +466,49 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
 	struct page *page;
 	struct folio *folio;
 	struct address_space *mapping;
-	int err, ro = 0;
+	int node, err, size, ro = 0;
 	bool fshared;
 
 	fshared = flags & FLAGS_SHARED;
+	size = futex_size(flags);
+	if (flags & FLAGS_NUMA)
+		size *= 2;
 
 	/*
 	 * The futex address must be "naturally" aligned.
 	 */
 	key->both.offset = address % PAGE_SIZE;
-	if (unlikely((address % sizeof(u32)) != 0))
+	if (unlikely((address % size) != 0))
 		return -EINVAL;
 	address -= key->both.offset;
 
-	if (unlikely(!access_ok(uaddr, sizeof(u32))))
+	if (unlikely(!access_ok(uaddr, size)))
 		return -EFAULT;
 
 	if (unlikely(should_fail_futex(fshared)))
 		return -EFAULT;
 
+	if (flags & FLAGS_NUMA) {
+		u32 __user *naddr = uaddr + size / 2;
+
+		if (futex_get_value(&node, naddr))
+			return -EFAULT;
+
+		if (node == FUTEX_NO_NODE) {
+			node = numa_node_id();
+			if (futex_put_value(node, naddr))
+				return -EFAULT;
+
+		} else if (node >= MAX_NUMNODES || !node_possible(node)) {
+			return -EINVAL;
+		}
+
+		key->both.node = node;
+
+	} else {
+		key->both.node = FUTEX_NO_NODE;
+	}
+
 	/*
 	 * PROCESS_PRIVATE futexes are fast.
 	 * As the mm cannot disappear under us and the 'key' only needs
@@ -1597,24 +1646,41 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
 static int __init futex_init(void)
 {
 	unsigned long hashsize, i;
-	unsigned int futex_shift;
+	unsigned int order, n;
+	unsigned long size;
 
 #ifdef CONFIG_BASE_SMALL
 	hashsize = 16;
 #else
-	hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+	hashsize = 256 * num_possible_cpus();
+	hashsize /= num_possible_nodes();
+	hashsize = max(4, hashsize);
+	hashsize = roundup_pow_of_two(hashsize);
 #endif
+	futex_hashshift = ilog2(hashsize);
+	size = sizeof(struct futex_hash_bucket) * hashsize;
+	order = get_order(size);
 
-	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
-					       hashsize, 0, 0,
-					       &futex_shift, NULL,
-					       hashsize, hashsize);
-	hashsize = 1UL << futex_shift;
+	for_each_node(n) {
+		struct futex_hash_bucket *table;
 
-	for (i = 0; i < hashsize; i++)
-		futex_hash_bucket_init(&futex_queues[i], NULL);
+		if (order > MAX_PAGE_ORDER)
+			table = vmalloc_huge_node(size, GFP_KERNEL, n);
+		else
+			table = alloc_pages_exact_nid(n, size, GFP_KERNEL);
+
+		BUG_ON(!table);
+
+		for (i = 0; i < hashsize; i++)
+			futex_hash_bucket_init(&table[i], NULL);
+
+		futex_queues[n] = table;
+	}
 
 	futex_hashmask = hashsize - 1;
+	pr_info("futex hash table entries: %lu (%lu bytes on %d NUMA nodes, total %lu KiB, %s).\n",
+		hashsize, size, num_possible_nodes(), size * num_possible_nodes() / 1024,
+		order > MAX_PAGE_ORDER ? "vmalloc" : "linear");
 	return 0;
 }
 core_initcall(futex_init);
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 8eba9982bcae1..11c870a92b5d0 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -54,7 +54,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
 	return flags;
 }
 
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
 
 /* FUTEX2_ to FLAGS_ */
 static inline unsigned int futex2_to_flags(unsigned int flags2)
@@ -87,6 +87,19 @@ static inline bool futex_flags_valid(unsigned int flags)
 	if ((flags & FLAGS_SIZE_MASK) != FLAGS_SIZE_32)
 		return false;
 
+	/*
+	 * Must be able to represent both FUTEX_NO_NODE and every valid nodeid
+	 * in a futex word.
+	 */
+	if (flags & FLAGS_NUMA) {
+		int bits = 8 * futex_size(flags);
+		u64 max = ~0ULL;
+
+		max >>= 64 - bits;
+		if (nr_node_ids >= max)
+			return false;
+	}
+
 	return true;
 }
 
@@ -290,7 +303,7 @@ static inline int futex_cmpxchg_value_locked(u32 *curval, u32 __user *uaddr, u32
  * This looks a bit overkill, but generally just results in a couple
  * of instructions.
  */
-static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from)
+static __always_inline int futex_get_value(u32 *dest, u32 __user *from)
 {
 	u32 val;
 
@@ -307,12 +320,26 @@ static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from)
 	return -EFAULT;
 }
 
+static __always_inline int futex_put_value(u32 val, u32 __user *to)
+{
+	if (can_do_masked_user_access())
+		to = masked_user_access_begin(to);
+	else if (!user_read_access_begin(to, sizeof(*to)))
+		return -EFAULT;
+	unsafe_put_user(val, to, Efault);
+	user_read_access_end();
+	return 0;
+Efault:
+	user_read_access_end();
+	return -EFAULT;
+}
+
 static inline int futex_get_value_locked(u32 *dest, u32 __user *from)
 {
 	int ret;
 
 	pagefault_disable();
-	ret = futex_read_inatomic(dest, from);
+	ret = futex_get_value(dest, from);
 	pagefault_enable();
 
 	return ret;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v10 21/21] futex: Implement FUTEX2_MPOL
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (19 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 20/21] futex: Implement FUTEX2_NUMA Sebastian Andrzej Siewior
@ 2025-03-12 15:16 ` Sebastian Andrzej Siewior
  2025-03-12 15:18 ` [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
  2025-03-18 13:24 ` Shrikanth Hegde
  22 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

Extend the futex2 interface to be aware of mempolicy.

When FUTEX2_MPOL is specified and there is an MPOL_PREFERRED or
home_node specified covering the futex address, use that node's hash-map.

Notably, in this case the futex will go to the global node hashtable,
even if it is a PRIVATE futex.

When FUTEX2_NUMA|FUTEX2_MPOL is specified and the user specified node
value is FUTEX_NO_NODE, the MPOL lookup (as described above) is tried
first before falling back to the local node.

[bigeasy: add CONFIG_FUTEX_MPOL]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/mmap_lock.h  |   4 ++
 include/uapi/linux/futex.h |   2 +-
 init/Kconfig               |   5 ++
 kernel/futex/core.c        | 112 +++++++++++++++++++++++++++++++------
 kernel/futex/futex.h       |   4 ++
 5 files changed, 108 insertions(+), 19 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 45a21faa3ff62..89fb032545e0d 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -7,6 +7,7 @@
 #include <linux/rwsem.h>
 #include <linux/tracepoint-defs.h>
 #include <linux/types.h>
+#include <linux/cleanup.h>
 
 #define MMAP_LOCK_INITIALIZER(name) \
 	.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
@@ -217,6 +218,9 @@ static inline void mmap_read_unlock(struct mm_struct *mm)
 	up_read(&mm->mmap_lock);
 }
 
+DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
+	     mmap_read_lock(_T), mmap_read_unlock(_T))
+
 static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
 {
 	__mmap_lock_trace_released(mm, false);
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0435025beaae8..247c425e175ef 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -63,7 +63,7 @@
 #define FUTEX2_SIZE_U32		0x02
 #define FUTEX2_SIZE_U64		0x03
 #define FUTEX2_NUMA		0x04
-			/*	0x08 */
+#define FUTEX2_MPOL		0x08
 			/*	0x10 */
 			/*	0x20 */
 			/*	0x40 */
diff --git a/init/Kconfig b/init/Kconfig
index b0a448608446d..a4502a9077e03 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1688,6 +1688,11 @@ config FUTEX_PRIVATE_HASH
 	depends on FUTEX && !BASE_SMALL && MMU
 	default y
 
+config FUTEX_MPOL
+	bool
+	depends on FUTEX && NUMA
+	default y
+
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b9da7dc6a900a..65523f3cfe32e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -43,6 +43,8 @@
 #include <linux/slab.h>
 #include <linux/prctl.h>
 #include <linux/rcuref.h>
+#include <linux/mempolicy.h>
+#include <linux/mmap_lock.h>
 
 #include "futex.h"
 #include "../locking/rtmutex_common.h"
@@ -318,6 +320,73 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
 
 #endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
+#ifdef CONFIG_FUTEX_MPOL
+static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma = vma_lookup(mm, addr);
+	struct mempolicy *mpol;
+	int node = FUTEX_NO_NODE;
+
+	if (!vma)
+		return FUTEX_NO_NODE;
+
+	mpol = vma_policy(vma);
+	if (!mpol)
+		return FUTEX_NO_NODE;
+
+	switch (mpol->mode) {
+	case MPOL_PREFERRED:
+		node = first_node(mpol->nodes);
+		break;
+	case MPOL_PREFERRED_MANY:
+	case MPOL_BIND:
+		if (mpol->home_node != NUMA_NO_NODE)
+			node = mpol->home_node;
+		break;
+	default:
+		break;
+	}
+
+	return node;
+}
+
+static int futex_key_to_node_opt(struct mm_struct *mm, unsigned long addr)
+{
+	int seq, node;
+
+	guard(rcu)();
+
+	if (!mmap_lock_speculate_try_begin(mm, &seq))
+		return -EBUSY;
+
+	node = __futex_key_to_node(mm, addr);
+
+	if (mmap_lock_speculate_retry(mm, seq))
+		return -EAGAIN;
+
+	return node;
+}
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+	int node;
+
+	node = futex_key_to_node_opt(mm, addr);
+	if (node >= FUTEX_NO_NODE)
+		return node;
+
+	guard(mmap_read_lock)(mm);
+	return __futex_key_to_node(mm, addr);
+}
+#else /* !CONFIG_FUTEX_MPOL */
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+	return FUTEX_NO_NODE;
+}
+
+#endif /* CONFIG_FUTEX_MPOL */
+
 /**
  * futex_hash - Return the hash bucket in the global hash
  * @key:	Pointer to the futex key for which the hash is calculated
@@ -329,18 +398,20 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
 static struct futex_hash_bucket *
 __futex_hash(union futex_key *key, struct futex_private_hash *fph)
 {
-	struct futex_hash_bucket *hb;
+	int node = key->both.node;
 	u32 hash;
-	int node;
 
-	hb = __futex_hash_private(key, fph);
-	if (hb)
-		return hb;
+	if (node == FUTEX_NO_NODE) {
+		struct futex_hash_bucket *hb;
+
+		hb = __futex_hash_private(key, fph);
+		if (hb)
+			return hb;
+	}
 
 	hash = jhash2((u32 *)key,
 		      offsetof(typeof(*key), both.offset) / sizeof(u32),
 		      key->both.offset);
-	node = key->both.node;
 
 	if (node == FUTEX_NO_NODE) {
 		/*
@@ -488,27 +559,32 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
 	if (unlikely(should_fail_futex(fshared)))
 		return -EFAULT;
 
+	node = FUTEX_NO_NODE;
+
 	if (flags & FLAGS_NUMA) {
 		u32 __user *naddr = uaddr + size / 2;
 
 		if (futex_get_value(&node, naddr))
 			return -EFAULT;
 
-		if (node == FUTEX_NO_NODE) {
-			node = numa_node_id();
-			if (futex_put_value(node, naddr))
-				return -EFAULT;
-
-		} else if (node >= MAX_NUMNODES || !node_possible(node)) {
+		if (node >= MAX_NUMNODES || !node_possible(node))
 			return -EINVAL;
-		}
-
-		key->both.node = node;
-
-	} else {
-		key->both.node = FUTEX_NO_NODE;
 	}
 
+	if (node == FUTEX_NO_NODE && (flags & FLAGS_MPOL))
+		node = futex_mpol(mm, address);
+
+	if (flags & FLAGS_NUMA) {
+		u32 __user *naddr = uaddr + size / 2;
+
+		if (node == FUTEX_NO_NODE)
+			node = numa_node_id();
+		if (futex_put_value(node, naddr))
+			return -EFAULT;
+	}
+
+	key->both.node = node;
+
 	/*
 	 * PROCESS_PRIVATE futexes are fast.
 	 * As the mm cannot disappear under us and the 'key' only needs
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 11c870a92b5d0..52e9c0c4b6c87 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -39,6 +39,7 @@
 #define FLAGS_HAS_TIMEOUT	0x0040
 #define FLAGS_NUMA		0x0080
 #define FLAGS_STRICT		0x0100
+#define FLAGS_MPOL		0x0200
 
 /* FUTEX_ to FLAGS_ */
 static inline unsigned int futex_to_flags(unsigned int op)
@@ -67,6 +68,9 @@ static inline unsigned int futex2_to_flags(unsigned int flags2)
 	if (flags2 & FUTEX2_NUMA)
 		flags |= FLAGS_NUMA;
 
+	if (flags2 & FUTEX2_MPOL)
+		flags |= FLAGS_MPOL;
+
 	return flags;
 }
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (20 preceding siblings ...)
  2025-03-12 15:16 ` [PATCH v10 21/21] futex: Implement FUTEX2_MPOL Sebastian Andrzej Siewior
@ 2025-03-12 15:18 ` Sebastian Andrzej Siewior
  2025-03-14 10:42   ` Peter Zijlstra
  2025-03-14 10:58   ` Peter Zijlstra
  2025-03-18 13:24 ` Shrikanth Hegde
  22 siblings, 2 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-12 15:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-03-12 16:16:13 [+0100], To linux-kernel@vger.kernel.org wrote:
> The complete tree is at
> 	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
> 	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10
> 
> v9…v10: https://lore.kernel.org/all/20250225170914.289358-1-bigeasy@linutronix.de/
The exact diff vs peterz/locking/futex:

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 0cdd5882e89c1..19c37afa0432a 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -82,12 +82,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3);
 
-#ifdef CONFIG_BASE_SMALL
-static inline int futex_hash_allocate_default(void) { return 0; }
-static inline void futex_hash_free(struct mm_struct *mm) { }
-static inline void futex_mm_init(struct mm_struct *mm) { }
-#else /* !CONFIG_BASE_SMALL */
-
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
 void futex_hash_free(struct mm_struct *mm);
 
@@ -97,7 +92,11 @@ static inline void futex_mm_init(struct mm_struct *mm)
 	mutex_init(&mm->futex_hash_lock);
 }
 
-#endif /* CONFIG_BASE_SMALL */
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate_default(void) { return 0; }
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
 #else /* !CONFIG_FUTEX */
 static inline void futex_init_task(struct task_struct *tsk) { }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9399ee7d40201..e0e8adbe66bdd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -938,7 +938,7 @@ struct mm_struct {
 		 */
 		seqcount_t mm_lock_seq;
 #endif
-#if defined(CONFIG_FUTEX) && !defined(CONFIG_BASE_SMALL)
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 		struct mutex			futex_hash_lock;
 		struct futex_private_hash	__rcu *futex_phash;
 		struct futex_private_hash	*futex_phash_new;
diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c1c6b42..2fb2af6d98249 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
  * rcuref_read - Read the number of held reference counts of a rcuref
  * @ref:	Pointer to the reference count
  *
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuref_is_dead().
  */
 static inline unsigned int rcuref_read(rcuref_t *ref)
 {
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
 	return c >= RCUREF_RELEASED ? 0 : c + 1;
 }
 
+/**
+ * rcuref_is_dead -	Check if the rcuref has been already marked dead
+ * @ref:		Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+	unsigned int c = atomic_read(&ref->refcnt);
+
+	return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
 extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
 
 /**
diff --git a/init/Kconfig b/init/Kconfig
index a0ea04c177842..a4502a9077e03 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1683,6 +1683,16 @@ config FUTEX_PI
 	depends on FUTEX && RT_MUTEXES
 	default y
 
+config FUTEX_PRIVATE_HASH
+	bool
+	depends on FUTEX && !BASE_SMALL && MMU
+	default y
+
+config FUTEX_MPOL
+	bool
+	depends on FUTEX && NUMA
+	default y
+
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 976a487bf3ad5..65523f3cfe32e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -136,7 +136,7 @@ static inline bool futex_key_is_private(union futex_key *key)
 static struct futex_hash_bucket *
 __futex_hash(union futex_key *key, struct futex_private_hash *fph);
 
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 static struct futex_hash_bucket *
 __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 {
@@ -196,12 +196,12 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
 {
 	struct futex_private_hash *fph;
 
-	lockdep_assert_held(&mm->futex_hash_lock);
 	WARN_ON_ONCE(mm->futex_phash_new);
 
-	fph = mm->futex_phash;
+	fph = rcu_dereference_protected(mm->futex_phash,
+					lockdep_is_held(&mm->futex_hash_lock));
 	if (fph) {
-		if (rcuref_read(&fph->users) != 0) {
+		if (!rcuref_is_dead(&fph->users)) {
 			mm->futex_phash_new = new;
 			return false;
 		}
@@ -262,6 +262,10 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
 
 void futex_private_hash_put(struct futex_private_hash *fph)
 {
+	/*
+	 * Ignore the result; the DEAD state is picked up
+	 * when rcuref_get() starts failing via rcuref_is_dead().
+	 */
 	if (rcuref_put(&fph->users))
 		wake_up_var(fph->mm);
 }
@@ -301,7 +305,7 @@ void futex_hash_put(struct futex_hash_bucket *hb)
 	futex_private_hash_put(fph);
 }
 
-#else
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
 
 static inline struct futex_hash_bucket *
 __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
@@ -314,8 +318,9 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
 	return __futex_hash(key, NULL);
 }
 
-#endif /* CONFIG_BASE_SMALL */
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
+#ifdef CONFIG_FUTEX_MPOL
 static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
 {
 	struct vm_area_struct *vma = vma_lookup(mm, addr);
@@ -325,7 +330,7 @@ static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
 	if (!vma)
 		return FUTEX_NO_NODE;
 
-	mpol = vma->vm_policy;
+	mpol = vma_policy(vma);
 	if (!mpol)
 		return FUTEX_NO_NODE;
 
@@ -373,6 +378,14 @@ static int futex_mpol(struct mm_struct *mm, unsigned long addr)
 	guard(mmap_read_lock)(mm);
 	return __futex_key_to_node(mm, addr);
 }
+#else /* !CONFIG_FUTEX_MPOL */
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+	return FUTEX_NO_NODE;
+}
+
+#endif /* CONFIG_FUTEX_MPOL */
 
 /**
  * futex_hash - Return the hash bucket in the global hash
@@ -420,7 +433,6 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
 	return &futex_queues[node][hash & futex_hashmask];
 }
 
-
 /**
  * futex_setup_timer - set up the sleeping hrtimer.
  * @time:	ptr to the given timeout value
@@ -932,9 +944,6 @@ int futex_unqueue(struct futex_q *q)
 
 void futex_q_lockptr_lock(struct futex_q *q)
 {
-#if 0
-	struct futex_hash_bucket *hb;
-#endif
 	spinlock_t *lock_ptr;
 
 	/*
@@ -949,18 +958,6 @@ void futex_q_lockptr_lock(struct futex_q *q)
 		spin_unlock(lock_ptr);
 		goto retry;
 	}
-#if 0
-	hb = container_of(lock_ptr, struct futex_hash_bucket, lock);
-	/*
-	 * The caller needs to either hold a reference on the hash (to ensure
-	 * that the hash is not resized) _or_ be enqueued on the hash. This
-	 * ensures that futex_q::lock_ptr is updated while moved to the new
-	 * hash during resize.
-	 * Once the hash bucket is locked the resize operation, which might be
-	 * in progress, will block on the lock.
-	 */
-	return hb;
-#endif
 }
 
 /*
@@ -1497,7 +1494,7 @@ void futex_exit_release(struct task_struct *tsk)
 static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 				   struct futex_private_hash *fph)
 {
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 	fhb->priv = fph;
 #endif
 	atomic_set(&fhb->waiters, 0);
@@ -1505,21 +1502,30 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 	spin_lock_init(&fhb->lock);
 }
 
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 void futex_hash_free(struct mm_struct *mm)
 {
+	struct futex_private_hash *fph;
+
 	kvfree(mm->futex_phash_new);
-	kvfree(mm->futex_phash);
+	fph = rcu_dereference_raw(mm->futex_phash);
+	if (fph) {
+		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+		kvfree(fph);
+	}
 }
 
 static bool futex_pivot_pending(struct mm_struct *mm)
 {
+	struct futex_private_hash *fph;
+
 	guard(rcu)();
 
 	if (!mm->futex_phash_new)
 		return false;
 
-	return !rcuref_read(&mm->futex_phash->users);
+	fph = rcu_dereference(mm->futex_phash);
+	return !rcuref_read(&fph->users);
 }
 
 static bool futex_hash_less(struct futex_private_hash *a,
@@ -1560,7 +1566,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 	 */
 	scoped_guard (rcu) {
 		fph = rcu_dereference(mm->futex_phash);
-		if (fph && !mm->futex_phash->hash_mask) {
+		if (fph && !fph->hash_mask) {
 			if (custom)
 				return -EBUSY;
 			return 0;
@@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 		struct futex_private_hash *free __free(kvfree) = NULL;
 		struct futex_private_hash *cur, *new;
 
-		cur = mm->futex_phash;
+		cur = rcu_dereference_protected(mm->futex_phash,
+						lockdep_is_held(&mm->futex_hash_lock));
 		new = mm->futex_phash_new;
 		mm->futex_phash_new = NULL;
 
@@ -1602,7 +1609,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 				 * allocated a replacement hash, drop the initial
 				 * reference on the existing hash.
 				 */
-				futex_private_hash_put(mm->futex_phash);
+				futex_private_hash_put(cur);
 			}
 
 			if (new) {
@@ -1683,7 +1690,7 @@ static int futex_hash_get_slots(void)
 
 static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 {
-	return 0;
+	return -EINVAL;
 }
 
 static int futex_hash_get_slots(void)
@@ -1723,6 +1730,7 @@ static int __init futex_init(void)
 #else
 	hashsize = 256 * num_possible_cpus();
 	hashsize /= num_possible_nodes();
+	hashsize = max(4, hashsize);
 	hashsize = roundup_pow_of_two(hashsize);
 #endif
 	futex_hashshift = ilog2(hashsize);
@@ -1740,12 +1748,15 @@ static int __init futex_init(void)
 		BUG_ON(!table);
 
 		for (i = 0; i < hashsize; i++)
-			futex_hash_bucket_init(&table[i], 0);
+			futex_hash_bucket_init(&table[i], NULL);
 
 		futex_queues[n] = table;
 	}
 
 	futex_hashmask = hashsize - 1;
+	pr_info("futex hash table entries: %lu (%lu bytes on %d NUMA nodes, total %lu KiB, %s).\n",
+		hashsize, size, num_possible_nodes(), size * num_possible_nodes() / 1024,
+		order > MAX_PAGE_ORDER ? "vmalloc" : "linear");
 	return 0;
 }
 core_initcall(futex_init);
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 40f06523a3565..52e9c0c4b6c87 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -223,14 +223,15 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
 
 extern struct futex_hash_bucket *futex_hash(union futex_key *key);
 
-#ifndef CONFIG_BASE_SMALL
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
 extern void futex_hash_get(struct futex_hash_bucket *hb);
 extern void futex_hash_put(struct futex_hash_bucket *hb);
 
 extern struct futex_private_hash *futex_private_hash(void);
 extern bool futex_private_hash_get(struct futex_private_hash *fph);
 extern void futex_private_hash_put(struct futex_private_hash *fph);
-#else
+
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
 static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
 static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
 
diff --git a/mm/nommu.c b/mm/nommu.c
index baa79abdaf037..d04e601a8f4d7 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -209,6 +209,11 @@ EXPORT_SYMBOL(vmalloc_noprof);
 
 void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
 
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+	return vmalloc_huge_noprof(size, gfp_mask);
+}
+
 /*
  *	vzalloc - allocate virtually contiguous memory with zero fill
  *
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 39fe43183a64f..69247b46413ca 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3968,9 +3968,9 @@ EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
 
 void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
 {
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
-				    node, __builtin_return_address(0));
+	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
+					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+					   node, __builtin_return_address(0));
 }
 
 /**


Sebastian

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 19/21] mm: Add vmalloc_huge_node()
  2025-03-12 15:16 ` [PATCH v10 19/21] mm: Add vmalloc_huge_node() Sebastian Andrzej Siewior
@ 2025-03-12 22:02   ` Andrew Morton
  2025-03-13  7:59     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2025-03-12 22:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Waiman Long, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Christoph Hellwig

On Wed, 12 Mar 2025 16:16:32 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> To enable node specific hash-tables.

"... using huge pages if possible"?

> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3966,6 +3966,13 @@ void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
>  
> +void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
> +{
> +	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
> +					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> +					   node, __builtin_return_address(0));
> +}
> +

kerneldoc please?


I suppose we can now simplify vmalloc_huge_noprof() to use this:

static inline void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
{
	return vmalloc_huge_node_noprof(size, gfp_mask, NUMA_NO_NODE);
}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 01/21] rcuref: Provide rcuref_is_dead().
  2025-03-12 15:16 ` [PATCH v10 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
@ 2025-03-13  4:23   ` Joel Fernandes
  2025-03-13  7:55     ` Sebastian Andrzej Siewior
  2025-03-14 10:36   ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Joel Fernandes @ 2025-03-13  4:23 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Waiman Long

On Wed, Mar 12, 2025 at 04:16:14PM +0100, Sebastian Andrzej Siewior wrote:
> rcuref_read() returns the number of references that are currently held.
> If 0 is returned then it is not safe to assume that the object ca be
> scheduled for deconstruction because it is marked DEAD. This happens if
> the return value of rcuref_put() is ignored and assumptions are made.
> 
> If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
> If rcuref_put() did not return to the caller then the counter did not
> yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
> is still a chance that the counter will transition from
> RCUREF_NOREF to 0 meaning it is still valid and must not be
> deconstructed. In this brief window rcuref_read() will return 0.
> 
> Provide rcuref_is_dead() to determine if the counter is marked as
> RCUREF_DEAD.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/linux/rcuref.h | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
> index 6322d8c1c6b42..2fb2af6d98249 100644
> --- a/include/linux/rcuref.h
> +++ b/include/linux/rcuref.h
> @@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
>   * rcuref_read - Read the number of held reference counts of a rcuref
>   * @ref:	Pointer to the reference count
>   *
> - * Return: The number of held references (0 ... N)
> + * Return: The number of held references (0 ... N). The value 0 does not
> + * indicate that it is safe to schedule the object, protected by this reference
> + * counter, for deconstruction.
> + * If you want to know if the reference counter has been marked DEAD (as
> + * signaled by rcuref_put()) please use rcuref_is_dead().
>   */
>  static inline unsigned int rcuref_read(rcuref_t *ref)
>  {
> @@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
>  	return c >= RCUREF_RELEASED ? 0 : c + 1;
>  }
>  
> +/**
> + * rcuref_is_dead -	Check if the rcuref has been already marked dead
> + * @ref:		Pointer to the reference count
> + *
> + * Return: True if the object has been marked DEAD. This signals that a previous
> + * invocation of rcuref_put() returned true on this reference counter meaning
> + * the protected object can safely be scheduled for deconstruction.
> + * Otherwise, returns false.
> + */
> +static inline bool rcuref_is_dead(rcuref_t *ref)
> +{
> +	unsigned int c = atomic_read(&ref->refcnt);
> +
> +	return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
> +}
> +
>  extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
>  

This makes sense to me. Another way to determine if it is dead, I guess,
is to actually do a get() and see if it fails? Though that would be more
expensive and silly.

FWIW for this patch,
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 01/21] rcuref: Provide rcuref_is_dead().
  2025-03-13  4:23   ` Joel Fernandes
@ 2025-03-13  7:55     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-13  7:55 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Waiman Long

On 2025-03-13 00:23:11 [-0400], Joel Fernandes wrote:
> This makes sense to me. Another way to determine if it is dead, I guess,
> is to actually do a get() and see if it fails? Though that would be more
> expensive and silly.

good :)

> FWIW for this patch,
> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
> 
> thanks,
> 
>  - Joel

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 19/21] mm: Add vmalloc_huge_node()
  2025-03-12 22:02   ` Andrew Morton
@ 2025-03-13  7:59     ` Sebastian Andrzej Siewior
  2025-03-13 22:08       ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-13  7:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Waiman Long, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Christoph Hellwig

On 2025-03-12 15:02:06 [-0700], Andrew Morton wrote:
> On Wed, 12 Mar 2025 16:16:32 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > To enable node specific hash-tables.
> 
> "... using huge pages if possible"?
> 
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3966,6 +3966,13 @@ void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
> >  }
> >  EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
> >  
> > +void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
> > +{
> > +	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
> > +					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> > +					   node, __builtin_return_address(0));
> > +}
> > +
> 
> kerneldoc please?

Okay.

> 
> I suppose we can now simplify vmalloc_huge_noprof() to use this:
> 
> static inline void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
> {
> 	return vmalloc_huge_node_noprof(size, gfp_mask, NUMA_NO_NODE);
> }

Do you want me to stash this into this one or as a follow up?

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 19/21] mm: Add vmalloc_huge_node()
  2025-03-13  7:59     ` Sebastian Andrzej Siewior
@ 2025-03-13 22:08       ` Andrew Morton
  2025-03-14  9:59         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2025-03-13 22:08 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Waiman Long, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Christoph Hellwig

On Thu, 13 Mar 2025 08:59:24 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> > 
> > I suppose we can now simplify vmalloc_huge_noprof() to use this:
> > 
> > static inline void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
> > {
> > 	return vmalloc_huge_node_noprof(size, gfp_mask, NUMA_NO_NODE);
> > }
> 
> Do you want me to stash this into this one or as a follow up?

That would be nice, if you think it makes sense.  There is some
duplication here.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 19/21] mm: Add vmalloc_huge_node()
  2025-03-13 22:08       ` Andrew Morton
@ 2025-03-14  9:59         ` Sebastian Andrzej Siewior
  2025-03-14 10:34           ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-14  9:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Waiman Long, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Christoph Hellwig

On 2025-03-13 15:08:14 [-0700], Andrew Morton wrote:
> That would be nice, if you think it makes sense.  There is some
> duplication here.

As you wish. That would be the following patch below. This is now
somehow unique compared to the other interfaces (like vmalloc() vs
vmalloc_node()).

-------------------->8-------------
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 14 Jul 2023 12:45:01 +0200
Subject: [PATCH] mm: Add vmalloc_huge_node()

To enable node specific hash-tables using huge pages if possible.

[bigeasy: use __vmalloc_node_range_noprof(), add nommu bits, inline
vmalloc_huge]

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-mm@kvack.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/vmalloc.h |  9 +++++++--
 mm/nommu.c              | 18 +++++++++++++++++-
 mm/vmalloc.c            | 11 ++++++-----
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e39..de95794777ad6 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -168,8 +168,13 @@ void *__vmalloc_node_noprof(unsigned long size, unsigned long align, gfp_t gfp_m
 		int node, const void *caller) __alloc_size(1);
 #define __vmalloc_node(...)	alloc_hooks(__vmalloc_node_noprof(__VA_ARGS__))
 
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
-#define vmalloc_huge(...)	alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
+#define vmalloc_huge_node(...)	alloc_hooks(vmalloc_huge_node_noprof(__VA_ARGS__))
+
+static inline void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+{
+	return vmalloc_huge_node(size, gfp_mask, NUMA_NO_NODE);
+}
 
 extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 #define __vmalloc_array(...)	alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
diff --git a/mm/nommu.c b/mm/nommu.c
index baa79abdaf037..aed58ea7398db 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -207,7 +207,23 @@ void *vmalloc_noprof(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc_noprof);
 
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
+/*
+ *	vmalloc_huge_node  -  allocate virtually contiguous memory, on a node
+ *
+ *	@size:		allocation size
+ *	@gfp_mask:	flags for the page level allocator
+ *	@node:          node to use for allocation or NUMA_NO_NODE
+ *
+ *	Allocate enough pages to cover @size from the page level
+ *	allocator and map them into contiguous kernel virtual space.
+ *
+ *	Due to NOMMU implications the node argument and HUGE page attribute are
+ *	ignored.
+ */
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+	return __vmalloc_noprof(size, gfp_mask);
+}
 
 /*
  *	vzalloc - allocate virtually contiguous memory with zero fill
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a6e7acebe9adf..0e2c49aaf84f1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3947,9 +3947,10 @@ void *vmalloc_noprof(unsigned long size)
 EXPORT_SYMBOL(vmalloc_noprof);
 
 /**
- * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
+ * vmalloc_huge_node - allocate virtually contiguous memory, allow huge pages
  * @size:      allocation size
  * @gfp_mask:  flags for the page level allocator
+ * @node:	    node to use for allocation or NUMA_NO_NODE
  *
  * Allocate enough pages to cover @size from the page level
  * allocator and map them into contiguous kernel virtual space.
@@ -3958,13 +3959,13 @@ EXPORT_SYMBOL(vmalloc_noprof);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
 {
 	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
-				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
-				    NUMA_NO_NODE, __builtin_return_address(0));
+					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+					   node, __builtin_return_address(0));
 }
-EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
+EXPORT_SYMBOL_GPL(vmalloc_huge_node_noprof);
 
 /**
  * vzalloc - allocate virtually contiguous memory with zero fill
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 19/21] mm: Add vmalloc_huge_node()
  2025-03-14  9:59         ` Sebastian Andrzej Siewior
@ 2025-03-14 10:34           ` Andrew Morton
  0 siblings, 0 replies; 58+ messages in thread
From: Andrew Morton @ 2025-03-14 10:34 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
	Valentin Schneider, Waiman Long, Uladzislau Rezki,
	Christoph Hellwig, linux-mm, Christoph Hellwig

On Fri, 14 Mar 2025 10:59:31 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> On 2025-03-13 15:08:14 [-0700], Andrew Morton wrote:
> > That would be nice, if you think it makes sense.  There is some
> > duplication here.
> 
> As you wish. That would be the following patch below.

Looks OK.

> This is now
> somehow unique compared to the other interfaces (like vmalloc() vs
> vmalloc_node()).

I'm not sure what this means?

I kinda struggle with the name "vmalloc_huge_node".  But
"vmalloc_node_maybe_huge" is too long!


> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -207,7 +207,23 @@ void *vmalloc_noprof(unsigned long size)
>  }
>  EXPORT_SYMBOL(vmalloc_noprof);
>  
> -void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
> +/*
> + *	vmalloc_huge_node  -  allocate virtually contiguous memory, on a node
> + *
> + *	@size:		allocation size
> + *	@gfp_mask:	flags for the page level allocator
> + *	@node:          node to use for allocation or NUMA_NO_NODE
> + *
> + *	Allocate enough pages to cover @size from the page level
> + *	allocator and map them into contiguous kernel virtual space.
> + *
> + *	Due to NOMMU implications the node argument and HUGE page attribute are
> + *	ignored.
> + */
> +void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
> +{
> +	return __vmalloc_noprof(size, gfp_mask);
> +}

Please check, I think this wants to be EXPORTed to modules.
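
Something like this, mirroring the MMU version below:

	EXPORT_SYMBOL_GPL(vmalloc_huge_node_noprof);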

> -void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
> +void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
>  {
>  	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
> -				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> -				    NUMA_NO_NODE, __builtin_return_address(0));
> +					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> +					   node, __builtin_return_address(0));
>  }
> -EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
> +EXPORT_SYMBOL_GPL(vmalloc_huge_node_noprof);

Like the NOMMU=n version.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 01/21] rcuref: Provide rcuref_is_dead().
  2025-03-12 15:16 ` [PATCH v10 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
  2025-03-13  4:23   ` Joel Fernandes
@ 2025-03-14 10:36   ` Peter Zijlstra
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-14 10:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On Wed, Mar 12, 2025 at 04:16:14PM +0100, Sebastian Andrzej Siewior wrote:

> +/**
> + * rcuref_is_dead -	Check if the rcuref has been already marked dead
> + * @ref:		Pointer to the reference count
> + *
> + * Return: True if the object has been marked DEAD. This signals that a previous
> + * invocation of rcuref_put() returned true on this reference counter meaning
> + * the protected object can safely be scheduled for deconstruction.
> + * Otherwise, returns false.
> + */
> +static inline bool rcuref_is_dead(rcuref_t *ref)
> +{
> +	unsigned int c = atomic_read(&ref->refcnt);
> +
> +	return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
> +}

I had to check, but yes, the compiler generates sane code for this.
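
Concretely, the two unsigned compares fold into a single range check,
roughly:

	return (c - RCUREF_RELEASED) < (RCUREF_NOREF - RCUREF_RELEASED);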

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-12 15:18 ` [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
@ 2025-03-14 10:42   ` Peter Zijlstra
  2025-03-14 10:58   ` Peter Zijlstra
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-14 10:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> @@ -196,12 +196,12 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
>  {
>  	struct futex_private_hash *fph;
>  
> -	lockdep_assert_held(&mm->futex_hash_lock);
>  	WARN_ON_ONCE(mm->futex_phash_new);
>  
> -	fph = mm->futex_phash;
> +	fph = rcu_dereference_protected(mm->futex_phash,
> +					lockdep_is_held(&mm->futex_hash_lock));

I are confused... this makes no sense. Why ?!

We only ever write that variable while holding this lock, we hold the
lock, we don't need RCU to read the variable.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-12 15:18 ` [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
  2025-03-14 10:42   ` Peter Zijlstra
@ 2025-03-14 10:58   ` Peter Zijlstra
  2025-03-14 11:28     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-14 10:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:

> @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
>  		struct futex_private_hash *free __free(kvfree) = NULL;
>  		struct futex_private_hash *cur, *new;
>  
> -		cur = mm->futex_phash;
> +		cur = rcu_dereference_protected(mm->futex_phash,
> +						lockdep_is_held(&mm->futex_hash_lock));
>  		new = mm->futex_phash_new;
>  		mm->futex_phash_new = NULL;
>  

Same thing again, this makes no sense.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-14 10:58   ` Peter Zijlstra
@ 2025-03-14 11:28     ` Sebastian Andrzej Siewior
  2025-03-14 11:41       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-14 11:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> 
> > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> >  		struct futex_private_hash *free __free(kvfree) = NULL;
> >  		struct futex_private_hash *cur, *new;
> >  
> > -		cur = mm->futex_phash;
> > +		cur = rcu_dereference_protected(mm->futex_phash,
> > +						lockdep_is_held(&mm->futex_hash_lock));
> >  		new = mm->futex_phash_new;
> >  		mm->futex_phash_new = NULL;
> >  
> 
> Same thing again, this makes no sense.

With "mm->futex_phash" sparse complains about direct RCU access. This
makes it obvious that you can access it, it won't change as long as you
have the lock.

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-14 11:28     ` Sebastian Andrzej Siewior
@ 2025-03-14 11:41       ` Peter Zijlstra
  2025-03-14 12:00         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-14 11:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> > 
> > > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> > >  		struct futex_private_hash *free __free(kvfree) = NULL;
> > >  		struct futex_private_hash *cur, *new;
> > >  
> > > -		cur = mm->futex_phash;
> > > +		cur = rcu_dereference_protected(mm->futex_phash,
> > > +						lockdep_is_held(&mm->futex_hash_lock));
> > >  		new = mm->futex_phash_new;
> > >  		mm->futex_phash_new = NULL;
> > >  
> > 
> > Same thing again, this makes no sense.
> 
> With "mm->futex_phash" sparse complains about direct RCU access.

Yeah, but sparse is stupid.

> This makes it obvious that you can access it, it won't change as long
> as you have the lock.

It's just plain confusing. rcu_dereference() says you care about the
load being single copy atomic and the data dependency, we don't.

If we just want to shut up sparse; can't we write it like:

	cur = unrcu_pointer(mm->futex_phash);

?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-14 11:41       ` Peter Zijlstra
@ 2025-03-14 12:00         ` Sebastian Andrzej Siewior
  2025-03-14 12:30           ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-14 12:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> > > 
> > > > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> > > >  		struct futex_private_hash *free __free(kvfree) = NULL;
> > > >  		struct futex_private_hash *cur, *new;
> > > >  
> > > > -		cur = mm->futex_phash;
> > > > +		cur = rcu_dereference_protected(mm->futex_phash,
> > > > +						lockdep_is_held(&mm->futex_hash_lock));
> > > >  		new = mm->futex_phash_new;
> > > >  		mm->futex_phash_new = NULL;
> > > >  
> > > 
> > > Same thing again, this makes no sense.
> > 
> > With "mm->futex_phash" sparse complains about direct RCU access.
> 
> Yeah, but sparse is stupid.

I thought we liked sparse.

> > This makes it obvious that you can access it, it won't change as long
> > as you have the lock.
> 
> It's just plain confusing. rcu_dereference() says you care about the
> load being single copy atomic and the data dependency, we don't.
> 
> If we just want to shut up sparse; can't we write it like:
> 
> 	cur = unrcu_pointer(mm->futex_phash);
> 
> ?

But isn't rcu_dereference_protected() doing exactly this? It only
verifies that lockdep_is_held() thingy and it performs a plain read, no
READ_ONCE() or anything. And the reader understands why it is safe to
access the pointer as-is.

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-14 12:00         ` Sebastian Andrzej Siewior
@ 2025-03-14 12:30           ` Peter Zijlstra
  2025-03-14 13:30             ` Sebastian Andrzej Siewior
  2025-03-14 14:40             ` Paul E. McKenney
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-14 12:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Paul McKenney

On Fri, Mar 14, 2025 at 01:00:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> > On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:
> > > > 
> > > > > @@ -1591,7 +1597,8 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> > > > >  		struct futex_private_hash *free __free(kvfree) = NULL;
> > > > >  		struct futex_private_hash *cur, *new;
> > > > >  
> > > > > -		cur = mm->futex_phash;
> > > > > +		cur = rcu_dereference_protected(mm->futex_phash,
> > > > > +						lockdep_is_held(&mm->futex_hash_lock));
> > > > >  		new = mm->futex_phash_new;
> > > > >  		mm->futex_phash_new = NULL;
> > > > >  
> > > > 
> > > > Same thing again, this makes no sense.
> > > 
> > > With "mm->futex_phash" sparse complains about direct RCU access.
> > 
> > Yeah, but sparse is stupid.
> 
> I thought we liked sparse.

I always ignore it, too much noise.

> > > This makes it obvious that you can access it, it won't change as long
> > > as you have the lock.
> > 
> > It's just plain confusing. rcu_dereference() says you care about the
> > load being single copy atomic and the data dependency, we don't.
> > 
> > If we just want to shut up sparse; can't we write it like:
> > 
> > 	cur = unrcu_pointer(mm->futex_phash);
> > 
> > ?
> 
> But isn't rcu_dereference_protected() doing exactly this? It only
> verifies that lockdep_is_held() thingy and it performs a plain read, no
> READ_ONCE() or anything. And the reader understands why it is safe to
> access the pointer as-is.

Urgh, so we have a rcu_dereference_*() function that does not in fact
imply rcu_dereference() ? WTF kind of insane naming is that?

But yes, it appears you are correct :-(

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 15/21] futex: s/hb_p/fph/
  2025-03-12 15:16 ` [PATCH v10 15/21] futex: s/hb_p/fph/ Sebastian Andrzej Siewior
@ 2025-03-14 12:36   ` Peter Zijlstra
  2025-03-14 13:10     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-14 12:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On Wed, Mar 12, 2025 at 04:16:28PM +0100, Sebastian Andrzej Siewior wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> To me hb_p reads like hash-bucket-private, but these things are
> pointers to private hash table, not bucket.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Hum, do we want to fold this back instead? It seems a bit daft to
introduce all this code and then go and rename it all again.

But whatever.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 15/21] futex: s/hb_p/fph/
  2025-03-14 12:36   ` Peter Zijlstra
@ 2025-03-14 13:10     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-14 13:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-03-14 13:36:50 [+0100], Peter Zijlstra wrote:
> On Wed, Mar 12, 2025 at 04:16:28PM +0100, Sebastian Andrzej Siewior wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > To me hb_p reads like hash-bucket-private, but these things are
> > pointers to private hash table, not bucket.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> Hum, do we want to fold this back instead? It seems a bit daft to
> introduce all this code and then go and rename it all again.
> 
> But whatever.

I kept it separate. I can merge it the way you want. Your call.

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-14 12:30           ` Peter Zijlstra
@ 2025-03-14 13:30             ` Sebastian Andrzej Siewior
  2025-03-14 14:18               ` Peter Zijlstra
  2025-03-14 14:40             ` Paul E. McKenney
  1 sibling, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-14 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Paul McKenney

On 2025-03-14 13:30:58 [+0100], Peter Zijlstra wrote:
> > > Yeah, but sparse is stupid.
> > 
> > I thought we liked sparse.
> 
> I always ignore it, too much noise.

What do you suggest? Cleaning up that noise, or moving the RCU
checking towards other tooling which is restricted to RCU only?
Knowing you, you have already a plan in your cupboard but not the time :)

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-14 13:30             ` Sebastian Andrzej Siewior
@ 2025-03-14 14:18               ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-14 14:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long, Paul McKenney

On Fri, Mar 14, 2025 at 02:30:39PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-14 13:30:58 [+0100], Peter Zijlstra wrote:
> > > > Yeah, but sparse is stupid.
> > > 
> > > I thought we liked sparse.
> > 
> > I always ignore it, too much noise.
> 
> What do you suggest? Cleaning up that noise, or moving the RCU
> checking towards other tooling which is restricted to RCU only?
> Knowing you, you have already a plan in your cupboard but not the time :)

No real plan. It's been so long since I ran sparse that I can't even remember
the shape of the noise, just that there was a _lot_ of it.

But IIRC the toolchains are starting to introduce address spaces:

  https://lkml.kernel.org/r/20250127160709.80604-1-ubizjak@gmail.com

so perhaps the __rcu thing can go that way.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-14 12:30           ` Peter Zijlstra
  2025-03-14 13:30             ` Sebastian Andrzej Siewior
@ 2025-03-14 14:40             ` Paul E. McKenney
  1 sibling, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2025-03-14 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-kernel, André Almeida,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli,
	Thomas Gleixner, Valentin Schneider, Waiman Long

On Fri, Mar 14, 2025 at 01:30:58PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 14, 2025 at 01:00:57PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-14 12:41:02 [+0100], Peter Zijlstra wrote:
> > > On Fri, Mar 14, 2025 at 12:28:08PM +0100, Sebastian Andrzej Siewior wrote:
> > > > On 2025-03-14 11:58:56 [+0100], Peter Zijlstra wrote:
> > > > > On Wed, Mar 12, 2025 at 04:18:48PM +0100, Sebastian Andrzej Siewior wrote:

[ . . . ]

> > > > This makes it obvious that you can access it, it won't change as long
> > > > as you have the lock.
> > > 
> > > It's just plain confusing. rcu_dereference() says you care about the
> > > load being single copy atomic and the data dependency, we don't.
> > > 
> > > If we just want to shut up sparse; can't we write it like:
> > > 
> > > 	cur = unrcu_pointer(mm->futex_phash);
> > > 
> > > ?
> > 
> > But isn't rcu_dereference_protected() doing exactly this? It only
> > verifies that lockdep_is_held() thingy and it performs a plain read, no
> > READ_ONCE() or anything. And the reader understands why it is safe to
> > access the pointer as-is.
> 
> Urgh, so we have a rcu_dereference_*() function that does not in fact
> imply rcu_dereference() ? WTF kind of insane naming is that?

My kind of insane naming!  ;-)

The rationale is that "_protected" means "protected from updates".

							Thanx, Paul

------------------------------------------------------------------------

/**
 * rcu_dereference_protected() - fetch RCU pointer when updates prevented
 * @p: The pointer to read, prior to dereferencing
 * @c: The conditions under which the dereference will take place
 *
 * Return the value of the specified RCU-protected pointer, but omit
 * the READ_ONCE().  This is useful in cases where update-side locks
 * prevent the value of the pointer from changing.  Please note that this
 * primitive does *not* prevent the compiler from repeating this reference
 * or combining it with other references, so it should not be used without
 * protection of appropriate locks.
 *
 * This function is only for update-side use.  Using this function
 * when protected only by rcu_read_lock() will result in infrequent
 * but very ugly failures.
 */
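
For illustration only (this is not code from the series), the two access
patterns being contrasted above look roughly like this; the field and lock
names are taken from the thread, the functions themselves are made up:

static void reader(struct mm_struct *mm)
{
	struct futex_private_hash *fph;

	guard(rcu)();
	/*
	 * Lock-free read side: needs the single-copy-atomic load and the
	 * address dependency, hence rcu_dereference().
	 */
	fph = rcu_dereference(mm->futex_phash);
	/* ... use fph until the RCU guard drops ... */
}

static void updater(struct mm_struct *mm)
{
	struct futex_private_hash *fph;

	mutex_lock(&mm->futex_hash_lock);
	/*
	 * Update side: the lock prevents the pointer from changing, so a
	 * plain load suffices; rcu_dereference_protected() only documents
	 * and (via lockdep) verifies that the protection is really held.
	 */
	fph = rcu_dereference_protected(mm->futex_phash,
					lockdep_is_held(&mm->futex_hash_lock));
	/* ... replace/free fph under the lock ... */
	mutex_unlock(&mm->futex_hash_lock);
}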

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
                   ` (21 preceding siblings ...)
  2025-03-12 15:18 ` [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
@ 2025-03-18 13:24 ` Shrikanth Hegde
  2025-03-18 16:12   ` Davidlohr Bueso
                     ` (3 more replies)
  22 siblings, 4 replies; 58+ messages in thread
From: Shrikanth Hegde @ 2025-03-18 13:24 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel



On 3/12/25 20:46, Sebastian Andrzej Siewior wrote:
> Hi,
> 
> this is a follow up on
>          https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
> 
> and adds support for task local futex_hash_bucket.
> 
> This is the local hash map series based on v9 extended with PeterZ
> FUTEX2_NUMA and FUTEX2_MPOL plus a few fixes on top.
> 
> The complete tree is at
> 	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
> 	https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10
> 

Hi Sebastian. Thanks for working on this (along with bringing back FUTEX2 NUMA) which
might help large systems with many futexes.

I tried this on one of our systems (single NUMA node, 80 CPUs), and I see a significant reduction in futex/hash.
Maybe I am missing some config or doing something stupid w.r.t. benchmarking.
I am trying to understand this stuff.

I ran "perf bench futex all" as is. No change has been made to perf.
=========================================
Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
# Running futex/hash benchmark...
Run summary [PID 45758]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 1556023 operations/sec (+- 0.08%), total secs = 10   <<--- 1.5M

=========================================
With the Series: I had to make PR_FUTEX_HASH=78 since 77 is used for TIMERs.

# Running futex/hash benchmark...
Run summary [PID 8644]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 150382 operations/sec (+- 0.42%), total secs = 10   <<-- 0.15M, close to 10x down.

=========================================

I did try a git bisect based on the futex/hash numbers. It narrowed it down to this one:
first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex: Allow automatic allocation of process wide futex hash.

Is this expected given the complexity of the hash function change?


Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA? I would like to
try it on a multi-NUMA system to see the benefit.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-18 13:24 ` Shrikanth Hegde
@ 2025-03-18 16:12   ` Davidlohr Bueso
  2025-03-25 19:04   ` Shrikanth Hegde
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 58+ messages in thread
From: Davidlohr Bueso @ 2025-03-18 16:12 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Sebastian Andrzej Siewior, André Almeida,
	Darren Hart, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Thomas Gleixner, Valentin Schneider, Waiman Long, linux-kernel

On Tue, 18 Mar 2025, Shrikanth Hegde wrote:

>Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA? I would like to
>try it on a multi-NUMA system to see the benefit.

It would be good to integrate futex2 into 'perf bench futex'.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-18 13:24 ` Shrikanth Hegde
  2025-03-18 16:12   ` Davidlohr Bueso
@ 2025-03-25 19:04   ` Shrikanth Hegde
  2025-03-26  9:31     ` Sebastian Andrzej Siewior
  2025-03-26  8:49   ` Sebastian Andrzej Siewior
  2025-04-07 16:15   ` Sebastian Andrzej Siewior
  3 siblings, 1 reply; 58+ messages in thread
From: Shrikanth Hegde @ 2025-03-25 19:04 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel, Nysal Jan K.A.

Hi Sebastian.

On 3/18/25 18:54, Shrikanth Hegde wrote:
> 
>> The complete tree is at
>>     https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
>>     https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10
>>
> 
> Hi Sebastian. Thanks for working on this (along with bringing back 
> FUTEX2 NUMA) which
> might help large systems with many futexes.
> 
> I tried this on one of our systems (single NUMA node, 80 CPUs), and I see
> a significant reduction in futex/hash.
> Maybe I am missing some config or doing something stupid w.r.t.
> benchmarking.
> I am trying to understand this stuff.
> 
> I ran "perf bench futex all" as is. No change has been made to perf.
> =========================================
> Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
> # Running futex/hash benchmark...
> Run summary [PID 45758]: 80 threads, each operating on 1024 [private] 
> futexes for 10 secs.
> Averaged 1556023 operations/sec (+- 0.08%), total secs = 10   <<--- 1.5M
> 
> =========================================
> With the Series: I had to make PR_FUTEX_HASH=78 since 77 is used for 
> TIMERs.
> 
> # Running futex/hash benchmark...
> Run summary [PID 8644]: 80 threads, each operating on 1024 [private] 
> futexes for 10 secs.
> Averaged 150382 operations/sec (+- 0.42%), total secs = 10   <<-- 0.15M, 
> close to 10x down.
> 
> =========================================
> 
> I did try a git bisect based on the futex/hash numbers. It narrowed it
> down to this one:
> first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex:
> Allow automatic allocation of process wide futex hash.
> 
> Is this expected given the complexity of the hash function change?

So, I did some more benchmarking using the same perf futex hash benchmark.
I see that perf creates N threads and binds each thread to a CPU, and then
calls futex_wait() such that it never blocks; it always returns EWOULDBLOCK.
Only futex_hash() is exercised.
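
A rough sketch of what that per-thread loop amounts to (my reconstruction
for illustration, not the actual perf source; the expected value is
deliberately wrong so the wait never sleeps):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static unsigned int futex_words[1024];	/* stay 0, nobody wakes/waits */

static void *worker(void *arg)
{
	unsigned long *ops = arg;

	for (;;) {	/* the benchmark stops the threads after 10 secs */
		for (int i = 0; i < 1024; i++) {
			/*
			 * Expected value 1234 != 0, so the kernel only
			 * hashes the key, locks and unlocks the bucket and
			 * returns -EWOULDBLOCK immediately.
			 */
			syscall(SYS_futex, &futex_words[i],
				FUTEX_WAIT_PRIVATE, 1234, NULL);
			(*ops)++;
		}
	}
	return NULL;
}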

Numbers with different threads. (private futexes)
threads		baseline		with series    (ratio)
1		3386265			3266560		0.96	
10		1972069			 821565		0.41
40		1580497			 277900		0.17
80		1555482			 150450		0.096


With Shared Futex: (-s option)
Threads		baseline		with series    (ratio)
80		590144			 585067		0.99

After looking into the code, and after some hacking, I could get the
performance back with the change below. It is likely not functionally
correct; the reasons for the change are:

1. perf report showed significant time in futex_private_hash_put(),
    so I removed the rcuref usage for users. That brought some improvement,
    from 150k to 300k. Is there a better way to do this users protection?

2. Since the number of buckets is lower by default, this would cause hb
    collisions. This was seen as time in queued_spin_lock_slowpath(). I
    increased the hash bucket count to what it was before the series. That
    brought the numbers back to 1.5M. This could be achieved with prctl in
    perf/bench/futex-hash.c, I guess.

Note: just increasing the hash bucket size without point 1 didn't matter much.

-------
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 363a7692909d..7d01bf8caa13 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -65,7 +65,7 @@ static struct {
  #define futex_queues	(__futex_data.queues)
  
  struct futex_private_hash {
-	rcuref_t	users;
+	int	users;
  	unsigned int	hash_mask;
  	struct rcu_head	rcu;
  	void		*mm;
@@ -200,7 +200,7 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
  	fph = rcu_dereference_protected(mm->futex_phash,
  					lockdep_is_held(&mm->futex_hash_lock));
  	if (fph) {
-		if (!rcuref_is_dead(&fph->users)) {
+		if (!(fph->users)) {
  			mm->futex_phash_new = new;
  			return false;
  		}
@@ -247,7 +247,7 @@ struct futex_private_hash *futex_private_hash(void)
  		if (!fph)
  			return NULL;
  
-		if (rcuref_get(&fph->users))
+		if ((fph->users))
  			return fph;
  	}
  	futex_pivot_hash(mm);
@@ -256,7 +256,7 @@ struct futex_private_hash *futex_private_hash(void)
  
  bool futex_private_hash_get(struct futex_private_hash *fph)
  {
-	return rcuref_get(&fph->users);
+	return !!(fph->users);
  }
  
  void futex_private_hash_put(struct futex_private_hash *fph)
@@ -265,7 +265,7 @@ void futex_private_hash_put(struct futex_private_hash *fph)
  	 * Ignore the result; the DEAD state is picked up
  	 * when rcuref_get() starts failing via rcuref_is_dead().
  	 */
-	if (rcuref_put(&fph->users))
+	if ((fph->users))
  		wake_up_var(fph->mm);
  }
  
@@ -1509,7 +1509,7 @@ void futex_hash_free(struct mm_struct *mm)
  	kvfree(mm->futex_phash_new);
  	fph = rcu_dereference_raw(mm->futex_phash);
  	if (fph) {
-		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+		WARN_ON_ONCE((fph->users) > 1);
  		kvfree(fph);
  	}
  }
@@ -1524,7 +1524,7 @@ static bool futex_pivot_pending(struct mm_struct *mm)
  		return false;
  
  	fph = rcu_dereference(mm->futex_phash);
-	return !rcuref_read(&fph->users);
+	return !!(fph->users);
  }
  
  static bool futex_hash_less(struct futex_private_hash *a,
@@ -1576,7 +1576,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
  	if (!fph)
  		return -ENOMEM;
  
-	rcuref_init(&fph->users, 1);
+	fph->users = 1;
  	fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
  	fph->custom = custom;
  	fph->mm = mm;
@@ -1671,6 +1671,8 @@ int futex_hash_allocate_default(void)
  	if (current_buckets >= buckets)
  		return 0;
  
+	buckets = 32768;
+
  	return futex_hash_allocate(buckets, false);
  }
  
@@ -1732,6 +1734,8 @@ static int __init futex_init(void)
  	hashsize = max(4, hashsize);
  	hashsize = roundup_pow_of_two(hashsize);
  #endif
+	hashsize = 32768;
+
  	futex_hashshift = ilog2(hashsize);
  	size = sizeof(struct futex_hash_bucket) * hashsize;
  	order = get_order(size);


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 20/21] futex: Implement FUTEX2_NUMA
  2025-03-12 15:16 ` [PATCH v10 20/21] futex: Implement FUTEX2_NUMA Sebastian Andrzej Siewior
@ 2025-03-25 19:52   ` Shrikanth Hegde
  2025-03-25 22:52     ` Peter Zijlstra
                       ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Shrikanth Hegde @ 2025-03-25 19:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel



On 3/12/25 20:46, Sebastian Andrzej Siewior wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Extend the futex2 interface to be numa aware.
> 
> When FUTEX2_NUMA is specified for a futex, the user value is extended
> to two words (of the same size). The first is the user value we all
> know, the second one will be the node to place this futex on.
> 
>    struct futex_numa_32 {
> 	u32 val;
> 	u32 node;
>    };
> 
> When node is set to ~0, WAIT will set it to the current node_id such
> that WAKE knows where to find it. If userspace corrupts the node value
> between WAIT and WAKE, the futex will not be found and no wakeup will
> happen.
> 
> When FUTEX2_NUMA is not set, the node is simply an extension of the
> hash, such that traditional futexes are still interleaved over the
> nodes.
> 
> This is done to avoid having to have a separate !numa hash-table.
> 
> [bigeasy: ensure to have at least hashsize of 4 in futex_init(), add
> pr_info() for size and allocation information.]
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>   include/linux/futex.h      |   3 ++
>   include/uapi/linux/futex.h |   8 +++
>   kernel/futex/core.c        | 100 ++++++++++++++++++++++++++++++-------
>   kernel/futex/futex.h       |  33 ++++++++++--
>   4 files changed, 124 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 7e14d2e9162d2..19c37afa0432a 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -34,6 +34,7 @@ union futex_key {
>   		u64 i_seq;
>   		unsigned long pgoff;
>   		unsigned int offset;
> +		/* unsigned int node; */
>   	} shared;
>   	struct {
>   		union {
> @@ -42,11 +43,13 @@ union futex_key {
>   		};
>   		unsigned long address;
>   		unsigned int offset;
> +		/* unsigned int node; */
>   	} private;
>   	struct {
>   		u64 ptr;
>   		unsigned long word;
>   		unsigned int offset;
> +		unsigned int node;	/* NOT hashed! */
>   	} both;
>   };
>   
> diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
> index d2ee625ea1890..0435025beaae8 100644
> --- a/include/uapi/linux/futex.h
> +++ b/include/uapi/linux/futex.h
> @@ -74,6 +74,14 @@
>   /* do not use */
>   #define FUTEX_32		FUTEX2_SIZE_U32 /* historical accident :-( */
>   
> +
> +/*
> + * When FUTEX2_NUMA doubles the futex word, the second word is a node value.
> + * The special value -1 indicates no-node. This is the same value as
> + * NUMA_NO_NODE, except that value is not ABI, this is.
> + */
> +#define FUTEX_NO_NODE		(-1)
> +
>   /*
>    * Max numbers of elements in a futex_waitv array
>    */
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index bc7451287b2ce..b9da7dc6a900a 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -36,6 +36,8 @@
>   #include <linux/pagemap.h>
>   #include <linux/debugfs.h>
>   #include <linux/plist.h>
> +#include <linux/gfp.h>
> +#include <linux/vmalloc.h>
>   #include <linux/memblock.h>
>   #include <linux/fault-inject.h>
>   #include <linux/slab.h>
> @@ -51,11 +53,14 @@
>    * reside in the same cacheline.
>    */
>   static struct {
> -	struct futex_hash_bucket *queues;
>   	unsigned long            hashmask;
> +	unsigned int		 hashshift;
> +	struct futex_hash_bucket *queues[MAX_NUMNODES];
>   } __futex_data __read_mostly __aligned(2*sizeof(long));
> -#define futex_queues   (__futex_data.queues)
> -#define futex_hashmask (__futex_data.hashmask)
> +
> +#define futex_hashmask	(__futex_data.hashmask)
> +#define futex_hashshift	(__futex_data.hashshift)
> +#define futex_queues	(__futex_data.queues)
>   
>   struct futex_private_hash {
>   	rcuref_t	users;
> @@ -326,15 +331,35 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
>   {
>   	struct futex_hash_bucket *hb;
>   	u32 hash;
> +	int node;
>   
>   	hb = __futex_hash_private(key, fph);
>   	if (hb)
>   		return hb;
>   
>   	hash = jhash2((u32 *)key,
> -		      offsetof(typeof(*key), both.offset) / 4,
> +		      offsetof(typeof(*key), both.offset) / sizeof(u32),
>   		      key->both.offset);
> -	return &futex_queues[hash & futex_hashmask];
> +	node = key->both.node;
> +
> +	if (node == FUTEX_NO_NODE) {
> +		/*
> +		 * In case of !FLAGS_NUMA, use some unused hash bits to pick a
> +		 * node -- this ensures regular futexes are interleaved across
> +		 * the nodes and avoids having to allocate multiple
> +		 * hash-tables.
> +		 *
> +		 * NOTE: this isn't perfectly uniform, but it is fast and
> +		 * handles sparse node masks.
> +		 */
> +		node = (hash >> futex_hashshift) % nr_node_ids;
> +		if (!node_possible(node)) {
> +			node = find_next_bit_wrap(node_possible_map.bits,
> +						  nr_node_ids, node);
> +		}
> +	}
> +
> +	return &futex_queues[node][hash & futex_hashmask];

IIUC, when one specifies the NUMA node it can't be a private futex anymore?


>   }
>   
>   /**
> @@ -441,25 +466,49 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
>   	struct page *page;
>   	struct folio *folio;
>   	struct address_space *mapping;
> -	int err, ro = 0;
> +	int node, err, size, ro = 0;
>   	bool fshared;
>   
>   	fshared = flags & FLAGS_SHARED;
> +	size = futex_size(flags);
> +	if (flags & FLAGS_NUMA)
> +		size *= 2;
>   
>   	/*
>   	 * The futex address must be "naturally" aligned.
>   	 */
>   	key->both.offset = address % PAGE_SIZE;
> -	if (unlikely((address % sizeof(u32)) != 0))
> +	if (unlikely((address % size) != 0))
>   		return -EINVAL;
>   	address -= key->both.offset;
>   
> -	if (unlikely(!access_ok(uaddr, sizeof(u32))))
> +	if (unlikely(!access_ok(uaddr, size)))
>   		return -EFAULT;
>   
>   	if (unlikely(should_fail_futex(fshared)))
>   		return -EFAULT;
>   
> +	if (flags & FLAGS_NUMA) {
> +		u32 __user *naddr = uaddr + size / 2;
> +
> +		if (futex_get_value(&node, naddr))
> +			return -EFAULT;
> +
> +		if (node == FUTEX_NO_NODE) {
> +			node = numa_node_id();
> +			if (futex_put_value(node, naddr))
> +				return -EFAULT;
> +
> +		} else if (node >= MAX_NUMNODES || !node_possible(node)) {
> +			return -EINVAL;
> +		}
> +
> +		key->both.node = node;
> +
> +	} else {
> +		key->both.node = FUTEX_NO_NODE;
> +	}
> +
>   	/*
>   	 * PROCESS_PRIVATE futexes are fast.
>   	 * As the mm cannot disappear under us and the 'key' only needs
> @@ -1597,24 +1646,41 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
>   static int __init futex_init(void)
>   {
>   	unsigned long hashsize, i;
> -	unsigned int futex_shift;
> +	unsigned int order, n;
> +	unsigned long size;
>   
>   #ifdef CONFIG_BASE_SMALL
>   	hashsize = 16;
>   #else
> -	hashsize = roundup_pow_of_two(256 * num_possible_cpus());
> +	hashsize = 256 * num_possible_cpus();
> +	hashsize /= num_possible_nodes();

Wouldn't it be better to use num_online_nodes? Each node may get a bigger
hash table, which means fewer collisions, no?


> +	hashsize = max(4, hashsize);
> +	hashsize = roundup_pow_of_two(hashsize);
>   #endif
> +	futex_hashshift = ilog2(hashsize);
> +	size = sizeof(struct futex_hash_bucket) * hashsize;
> +	order = get_order(size);
>   
> -	futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
> -					       hashsize, 0, 0,
> -					       &futex_shift, NULL,
> -					       hashsize, hashsize);
> -	hashsize = 1UL << futex_shift;
> +	for_each_node(n) {
> +		struct futex_hash_bucket *table;
>   
> -	for (i = 0; i < hashsize; i++)
> -		futex_hash_bucket_init(&futex_queues[i], NULL);
> +		if (order > MAX_PAGE_ORDER)
> +			table = vmalloc_huge_node(size, GFP_KERNEL, n);
> +		else
> +			table = alloc_pages_exact_nid(n, size, GFP_KERNEL);
> +
> +		BUG_ON(!table);
> +
> +		for (i = 0; i < hashsize; i++)
> +			futex_hash_bucket_init(&table[i], NULL);
> +
> +		futex_queues[n] = table;
> +	}
>   
>   	futex_hashmask = hashsize - 1;
> +	pr_info("futex hash table entries: %lu (%lu bytes on %d NUMA nodes, total %lu KiB, %s).\n",
> +		hashsize, size, num_possible_nodes(), size * num_possible_nodes() / 1024,
> +		order > MAX_PAGE_ORDER ? "vmalloc" : "linear");
>   	return 0;
>   }
>   core_initcall(futex_init);
> diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
> index 8eba9982bcae1..11c870a92b5d0 100644
> --- a/kernel/futex/futex.h
> +++ b/kernel/futex/futex.h
> @@ -54,7 +54,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
>   	return flags;
>   }
>   
> -#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE)
> +#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
>   
>   /* FUTEX2_ to FLAGS_ */
>   static inline unsigned int futex2_to_flags(unsigned int flags2)
> @@ -87,6 +87,19 @@ static inline bool futex_flags_valid(unsigned int flags)
>   	if ((flags & FLAGS_SIZE_MASK) != FLAGS_SIZE_32)
>   		return false;
>   
> +	/*
> +	 * Must be able to represent both FUTEX_NO_NODE and every valid nodeid
> +	 * in a futex word.
> +	 */
> +	if (flags & FLAGS_NUMA) {
> +		int bits = 8 * futex_size(flags);
> +		u64 max = ~0ULL;
> +
> +		max >>= 64 - bits;
> +		if (nr_node_ids >= max)
> +			return false;
> +	}
> +
>   	return true;
>   }
>   
> @@ -290,7 +303,7 @@ static inline int futex_cmpxchg_value_locked(u32 *curval, u32 __user *uaddr, u32
>    * This looks a bit overkill, but generally just results in a couple
>    * of instructions.
>    */
> -static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from)
> +static __always_inline int futex_get_value(u32 *dest, u32 __user *from)
>   {
>   	u32 val;
>   
> @@ -307,12 +320,26 @@ static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from)
>   	return -EFAULT;
>   }
>   
> +static __always_inline int futex_put_value(u32 val, u32 __user *to)
> +{
> +	if (can_do_masked_user_access())
> +		to = masked_user_access_begin(to);
> +	else if (!user_read_access_begin(to, sizeof(*to)))
> +		return -EFAULT;
> +	unsafe_put_user(val, to, Efault);
> +	user_read_access_end();
> +	return 0;
> +Efault:
> +	user_read_access_end();
> +	return -EFAULT;
> +}
> +
>   static inline int futex_get_value_locked(u32 *dest, u32 __user *from)
>   {
>   	int ret;
>   
>   	pagefault_disable();
> -	ret = futex_read_inatomic(dest, from);
> +	ret = futex_get_value(dest, from);
>   	pagefault_enable();
>   
>   	return ret;


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 20/21] futex: Implement FUTEX2_NUMA
  2025-03-25 19:52   ` Shrikanth Hegde
@ 2025-03-25 22:52     ` Peter Zijlstra
  2025-03-25 22:56     ` Peter Zijlstra
  2025-03-26  8:03     ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-25 22:52 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Sebastian Andrzej Siewior, André Almeida, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Juri Lelli, Thomas Gleixner,
	Valentin Schneider, Waiman Long, linux-kernel

On Wed, Mar 26, 2025 at 01:22:19AM +0530, Shrikanth Hegde wrote:

> > +	if (node == FUTEX_NO_NODE) {
> > +		/*
> > +		 * In case of !FLAGS_NUMA, use some unused hash bits to pick a
> > +		 * node -- this ensures regular futexes are interleaved across
> > +		 * the nodes and avoids having to allocate multiple
> > +		 * hash-tables.
> > +		 *
> > +		 * NOTE: this isn't perfectly uniform, but it is fast and
> > +		 * handles sparse node masks.
> > +		 */
> > +		node = (hash >> futex_hashshift) % nr_node_ids;
> > +		if (!node_possible(node)) {
> > +			node = find_next_bit_wrap(node_possible_map.bits,
> > +						  nr_node_ids, node);
> > +		}
> > +	}
> > +
> > +	return &futex_queues[node][hash & futex_hashmask];
> 
> IIUC, when one specifies the NUMA node it can't be a private futex anymore?

The futex can be private just fine. It just won't end up in the process
private hash, since that is a single mm wide hash.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 20/21] futex: Implement FUTEX2_NUMA
  2025-03-25 19:52   ` Shrikanth Hegde
  2025-03-25 22:52     ` Peter Zijlstra
@ 2025-03-25 22:56     ` Peter Zijlstra
  2025-03-26 12:57       ` Shrikanth Hegde
  2025-03-26  8:03     ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-25 22:56 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Sebastian Andrzej Siewior, André Almeida, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Juri Lelli, Thomas Gleixner,
	Valentin Schneider, Waiman Long, linux-kernel

On Wed, Mar 26, 2025 at 01:22:19AM +0530, Shrikanth Hegde wrote:

> > +	return &futex_queues[node][hash & futex_hashmask];

                            ^^^^^^^

> > +	hashsize = 256 * num_possible_cpus();
> > +	hashsize /= num_possible_nodes();
> 
> > Wouldn't it be better to use num_online_nodes? Each node may get a bigger
> > hash table, which means fewer collisions, no?

No. There are two problems with num_online_nodes, and both are evident
above.

Consider the case of a sparse set.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 20/21] futex: Implement FUTEX2_NUMA
  2025-03-25 19:52   ` Shrikanth Hegde
  2025-03-25 22:52     ` Peter Zijlstra
  2025-03-25 22:56     ` Peter Zijlstra
@ 2025-03-26  8:03     ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-26  8:03 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Peter Zijlstra, André Almeida, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel

On 2025-03-26 01:22:19 [+0530], Shrikanth Hegde wrote:
> > diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> > index bc7451287b2ce..b9da7dc6a900a 100644
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -1597,24 +1646,41 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
> >   static int __init futex_init(void)
> >   {
> >   	unsigned long hashsize, i;
> > -	unsigned int futex_shift;
> > +	unsigned int order, n;
> > +	unsigned long size;
> >   #ifdef CONFIG_BASE_SMALL
> >   	hashsize = 16;
> >   #else
> > -	hashsize = roundup_pow_of_two(256 * num_possible_cpus());
> > +	hashsize = 256 * num_possible_cpus();
> > +	hashsize /= num_possible_nodes();
> 
> Wouldn't it be better to use num_online_nodes? Each node may get a bigger
> hash table, which means fewer collisions, no?

Ideally at this point you should have online_nodes == possible_nodes.
Due to hotplug you could have more possible than online nodes. However,
in this case you would have more, smaller, per-node tables, which are
used even if the node is offline. Assuming the hash function is perfect,
you would utilize just the same number of buckets overall.

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-18 13:24 ` Shrikanth Hegde
  2025-03-18 16:12   ` Davidlohr Bueso
  2025-03-25 19:04   ` Shrikanth Hegde
@ 2025-03-26  8:49   ` Sebastian Andrzej Siewior
  2025-04-07 16:15   ` Sebastian Andrzej Siewior
  3 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-26  8:49 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel

On 2025-03-18 18:54:22 [+0530], Shrikanth Hegde wrote:
> I tried this on one of our systems (single NUMA node, 80 CPUs), and I see a significant reduction in futex/hash.
> Maybe I am missing some config or doing something stupid w.r.t. benchmarking.
> I am trying to understand this stuff.
> 
> I ran "perf bench futex all" as is. No change has been made to perf.
> =========================================
> Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
> # Running futex/hash benchmark...
> Run summary [PID 45758]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
> Averaged 1556023 operations/sec (+- 0.08%), total secs = 10   <<--- 1.5M
> 
> =========================================
> With the Series: I had to make PR_FUTEX_HASH=78 since 77 is used for TIMERs.
> 
> # Running futex/hash benchmark...
> Run summary [PID 8644]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
> Averaged 150382 operations/sec (+- 0.42%), total secs = 10   <<-- 0.15M, close to 10x down.
> 
> =========================================
> 
> I did try a git bisect based on the futex/hash numbers. It narrowed it down to this one:
> first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex: Allow automatic allocation of process wide futex hash.
> 
> Is this expected given the complexity of the hash function change?

So with 80 CPUs/threads you should end up with roundup_pow_of_two(80 *
4) = 512 buckets. Before the series you would have had
roundup_pow_of_two(80 * 256) = 32768 buckets. This is also printed at
boot.
_Now_ you have fewer buckets so a hash collision is more likely to
happen. To get to the old numbers you would have to increase the buckets
and you would get the same results. I benchmarked a few things at
	https://lore.kernel.org/all/20241101110810.R3AnEqdu@linutronix.de/

This looks like the series makes it worse. But then those buckets are
per-task so you won't collide with a different task. This in turn should
relax the situation as a whole because different tasks can't block each
other. If two threads block on the same bucket then they might use the
same `uaddr'. 

The benchmark measures how many hash operations can be performed per
second. This means hash + lock + unlock. In reality you would also
queue, wait and wake. It is not very use-case driven.
The only things it measures are hash quality in terms of distribution
and the time spent performing the hashing operation. If you want to
improve either of the two then this is the micro-benchmark for it.

> Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA? I would like to
> try it on a multi-NUMA system to see the benefit.

Let me try to add that up to the test tool.

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-25 19:04   ` Shrikanth Hegde
@ 2025-03-26  9:31     ` Sebastian Andrzej Siewior
  2025-03-26 12:54       ` Shrikanth Hegde
  0 siblings, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-26  9:31 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel, Nysal Jan K.A.

On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:
> Hi Sebastian.
Hi Shrikanth,

> So, I did some more benchmarking using the same perf futex hash benchmark.
> I see that perf creates N threads and binds each thread to a CPU, and then
> calls futex_wait() such that it never blocks; it always returns EWOULDBLOCK.
> Only futex_hash() is exercised.

It also does spin_lock() + unlock on the hash bucket. Without the
locking, you would have constant numbers.

> Numbers with different threads. (private futexes)
> threads	baseline		with series    (ratio)
> 1		3386265			3266560		0.96	
> 10		1972069			 821565		0.41
> 40		1580497			 277900		0.17
> 80		1555482			 150450		0.096
> 
> 
> With Shared Futex: (-s option)
> Threads	baseline		with series    (ratio)
> 80		590144			 585067		0.99

The shared numbers are equal since the code path there is unchanged.

> After looking into the code, and after some hacking, I could get the
> performance back with the change below. It is likely not functionally
> correct; the reasons for the change are:
> 
> 1. perf report showed significant time in futex_private_hash_put(),
>    so I removed the rcuref usage for users. That brought some improvement,
>    from 150k to 300k. Is there a better way to do this users protection?

This is likely from the atomic dec operation itself. Then there is also
the preemption counter operation. The inc should also be visible but
might be inlined into the hash operation.
This is _just_ the atomic inc/dec that doubled the "throughput", but you
don't have anything from the regular path.
Anyway. To avoid the atomic part we would need to have a per-CPU counter
instead of a global one and a more expensive slow path for the resize
since you have to sum up all the per-CPU counters and so on. Not sure it
is worth it.
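
Just to spell out that trade-off, the per-CPU variant would be something
along these lines (purely a sketch, not proposed code):

struct fph_ref {
	unsigned int __percpu *cnt;
};

/* Fast path: CPU-local increment/decrement, no cross-CPU atomics. */
static inline void fph_ref_get(struct fph_ref *r)
{
	this_cpu_inc(*r->cnt);
}

static inline void fph_ref_put(struct fph_ref *r)
{
	this_cpu_dec(*r->cnt);
}

/*
 * Slow path for the resize: sum up all per-CPU counters. This is
 * O(nr_cpus) and needs care vs. counters still moving underneath us,
 * which is exactly the added complexity mentioned above.
 */
static unsigned int fph_ref_sum(struct fph_ref *r)
{
	unsigned int sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(r->cnt, cpu);
	return sum;
}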

> 2. Since the number of buckets is lower by default, this would cause hb
>    collisions. This was seen as time in queued_spin_lock_slowpath(). I
>    increased the hash bucket count to what it was before the series. That
>    brought the numbers back to 1.5M. This could be achieved with prctl in
>    perf/bench/futex-hash.c, I guess.

Yes. The idea is to avoid a resize at runtime by setting it to something
you know is best. You can also use it now to disable the private hash and
stick with the global one.
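
That is, something like this in the application before it spawns its
threads (PR_FUTEX_HASH_SET_SLOTS per the series; the constant values are
assumptions, use whatever the final uapi headers define):

#include <sys/prctl.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH		78	/* value Shrikanth used above */
#define PR_FUTEX_HASH_SET_SLOTS	1	/* assumed sub-command value */
#endif

	/*
	 * Pin the private hash to 1024 buckets so it is never resized at
	 * runtime; 0 slots would drop the private hash and use the global
	 * one instead.
	 */
	prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 1024, 0, 0);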

> Note: just increasing the hash bucket size without point 1 didn't matter much.

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-26  9:31     ` Sebastian Andrzej Siewior
@ 2025-03-26 12:54       ` Shrikanth Hegde
  2025-03-26 14:01         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 58+ messages in thread
From: Shrikanth Hegde @ 2025-03-26 12:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel, Nysal Jan K.A.



On 3/26/25 15:01, Sebastian Andrzej Siewior wrote:
> On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:
>> Hi Sebastian.
> Hi Shrikanth,
> 

Hi.

>> So, I did some more benchmarking using the same perf futex hash benchmark.
>> I see that perf creates N threads and binds each thread to a CPU, and then
>> calls futex_wait() such that it never blocks; it always returns EWOULDBLOCK.
>> Only futex_hash() is exercised.
> 
> It also does spin_lock() + unlock on the hash bucket. Without the
> locking, you would have constant numbers.
> 
Thanks for the explanations.

Plus, the way perf does it, it would cause all the SMT threads to be up, and the 1-thread case
probably gets the benefit of SMT folding. So beyond 40 threads, the numbers don't change from the baseline.

>> Numbers with different threads. (private futexes)
>> threads	baseline		with series    (ratio)
>> 1		3386265			3266560		0.96	
>> 10		1972069			 821565		0.41
>> 40		1580497			 277900		0.17
>> 80		1555482			 150450		0.096
>>
>>
>> With Shared Futex: (-s option)
>> Threads	baseline		with series    (ratio)
>> 80		590144			 585067		0.99
> 
> The shared numbers are equal since the code path there is unchanged.
> 
>> After looking into the code, and after some hacking, I could get the
>> performance back with the change below. It is likely not functionally
>> correct; the reasons for the change are:
>>
>> 1. perf report showed significant time in futex_private_hash_put(),
>>     so I removed the rcuref usage for users. That brought some improvement,
>>     from 150k to 300k. Is there a better way to do this users protection?
> 
> This is likely from the atomic dec operation itself. Then there is also
> the preemption counter operation. The inc should also be visible but
> might be inlined into the hash operation.
> This is _just_ the atomic inc/dec that doubled the "throughput", but you
> don't have anything from the regular path.
> Anyway. To avoid the atomic part we would need to have a per-CPU counter
> instead of a global one and a more expensive slow path for the resize
> since you have to sum up all the per-CPU counters and so on. Not sure it
> is worth it.
> 

A resize would happen when one does prctl, right? Or
can it happen automatically too?

fph is going to be on the thread leader's CPU, and using atomics on
fph->users would likely cause cacheline bouncing, no?

Not sure if this happens only due to this benchmark, which doesn't actually block.
Maybe in a real-life use case this doesn't matter.

>> 2. Since the number of buckets is lower by default, this would cause hb
>>     collisions. This was seen as time in queued_spin_lock_slowpath(). I
>>     increased the hash bucket count to what it was before the series. That
>>     brought the numbers back to 1.5M. This could be achieved with prctl in
>>     perf/bench/futex-hash.c, I guess.
> 
> Yes. The idea is to avoid a resize at runtime by setting it to something
> you know is best. You can also use it now to disable the private hash and
> stick with the global one.

Yes. SET_SLOTS would take care of it.

> 
>> Note: just increasing the hash bucket size without point 1 didn't matter much.
> 
> Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 20/21] futex: Implement FUTEX2_NUMA
  2025-03-25 22:56     ` Peter Zijlstra
@ 2025-03-26 12:57       ` Shrikanth Hegde
  2025-03-26 13:37         ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Shrikanth Hegde @ 2025-03-26 12:57 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Thomas Gleixner, Valentin Schneider, Waiman Long,
	linux-kernel



On 3/26/25 04:26, Peter Zijlstra wrote:
> On Wed, Mar 26, 2025 at 01:22:19AM +0530, Shrikanth Hegde wrote:
> 
>>> +	return &futex_queues[node][hash & futex_hashmask];
> 
>                              ^^^^^^^
> 
>>> +	hashsize = 256 * num_possible_cpus();
>>> +	hashsize /= num_possible_nodes();
>>
>> Wouldn't it be better to use num_online_nodes? Each node may get a bigger
>> hash table, which means fewer collisions, no?
> 
> No. There are two problems with num_online_nodes, and both are evident
> above.
> 
> Consider the case of a sparse set.

I am sorry, I didn't understand. Could you please explain?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 20/21] futex: Implement FUTEX2_NUMA
  2025-03-26 12:57       ` Shrikanth Hegde
@ 2025-03-26 13:37         ` Peter Zijlstra
  2025-03-26 15:06           ` Shrikanth Hegde
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-03-26 13:37 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Sebastian Andrzej Siewior, André Almeida, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Juri Lelli, Thomas Gleixner,
	Valentin Schneider, Waiman Long, linux-kernel

On Wed, Mar 26, 2025 at 06:27:20PM +0530, Shrikanth Hegde wrote:
> 
> 
> On 3/26/25 04:26, Peter Zijlstra wrote:
> > On Wed, Mar 26, 2025 at 01:22:19AM +0530, Shrikanth Hegde wrote:
> > 
> > > > +	return &futex_queues[node][hash & futex_hashmask];
> > 
> >                              ^^^^^^^
> > 
> > > > +	hashsize = 256 * num_possible_cpus();
> > > > +	hashsize /= num_possible_nodes();
> > > 
> > > Wouldn't it be better to use num_online_nodes? Each node may get a bigger
> > > hash table, which means fewer collisions, no?
> > 
> > No. There are two problems with num_online_nodes, and both are evident
> > above.
> > 
> > Consider the case of a sparse set.
> 
> I am sorry, I didn't understand. Could you please explain?

I was confused; I should've just gone to sleep :-)

The futex_queues[] array is sized MAX_NUMNODES, such that every possible
node_id has a spot. I thought we did dynamic sizing, but not so.

Anyway, using online here would lead to having to deal with hotplug,
which in turn either leads to more over-all hash buckets in the system,
or having to resize and rehash everything.

Neither are really attractive options.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-26 12:54       ` Shrikanth Hegde
@ 2025-03-26 14:01         ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-26 14:01 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel, Nysal Jan K.A.

On 2025-03-26 18:24:37 [+0530], Shrikanth Hegde wrote:
> > Anyway. To avoid the atomic part we would need to have a per-CPU counter
> > instead of a global one and a more expensive slow path for the resize
> > since you have to sum up all the per-CPU counters and so on. Not sure it
> > is worth it.
> > 
> 
> resize would happen when one does prctl right? or
> it can happen automatically too?

If prctl has been used once, then only then. Without prctl it will start
with 16 buckets once the first thread is created (so you have two threads
in total).
After that it will only increase the buckets if the current number of
buckets is less than 4 * threads; see futex_hash_allocate_default().
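
Roughly, the default sizing described here amounts to (a sketch based on
this description, not a verbatim copy of futex_hash_allocate_default()):

static unsigned int default_slots(unsigned int threads)
{
	/* ~4 buckets per thread, rounded up to a power of two. */
	return roundup_pow_of_two(4 * threads);
}

which is where the roundup_pow_of_two(80 * 4) = 512 number from earlier
in the thread comes from.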

> fph is going to be on the thread leader's CPU, and using atomics on
> fph->users would likely cause cacheline bouncing, no?

Yes, this can happen. And since the user can even resize after using
prctl, we can't avoid the inc/dec even if we switch to custom mode.

> Not sure if this happens only due to this benchmark, which doesn't actually block.
> Maybe in a real-life use case this doesn't matter.

That is what I assume. You go into the kernel if the futex is occupied.
If multiple threads do this at once then the cacheline bouncing is
unfortunate.

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 20/21] futex: Implement FUTEX2_NUMA
  2025-03-26 13:37         ` Peter Zijlstra
@ 2025-03-26 15:06           ` Shrikanth Hegde
  0 siblings, 0 replies; 58+ messages in thread
From: Shrikanth Hegde @ 2025-03-26 15:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, André Almeida, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Juri Lelli, Thomas Gleixner,
	Valentin Schneider, Waiman Long, linux-kernel



On 3/26/25 19:07, Peter Zijlstra wrote:
> On Wed, Mar 26, 2025 at 06:27:20PM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 3/26/25 04:26, Peter Zijlstra wrote:
>>> On Wed, Mar 26, 2025 at 01:22:19AM +0530, Shrikanth Hegde wrote:
>>>
>>>>> +	return &futex_queues[node][hash & futex_hashmask];
>>>
>>>                               ^^^^^^^
>>>
>>>>> +	hashsize = 256 * num_possible_cpus();
>>>>> +	hashsize /= num_possible_nodes();
>>>>
>>>> Wouldn't it be better to use num_online_nodes? Each node may get a bigger
>>>> hash table, which means fewer collisions, no?
>>>
>>> No. There are two problems with num_online_nodes, and both are evident
>>> above.
>>>
>>> Consider the case of a sparse set.
>>
>> I am sorry, I didn't understand. Could you please explain?
> 
> I was confused; I should've just gone to sleep :-)
> 
> The futex_queues[] array is sized MAX_NUMNODES, such that every possible
> node_id has a spot. I thought we did dynamic sizing, but not so.
> 
> Anyway, using online here would lead to having to deal with hotplug,
> which in turn either leads to more over-all hash buckets in the system,
> or having to resize and rehash everything.
> 
> Neither are really attractive options.
> 

OK, got it. Thanks.
Keeping with possible nodes seems simpler.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 18/21] futex: Rework SET_SLOTS
  2025-03-12 15:16 ` [PATCH v10 18/21] futex: Rework SET_SLOTS Sebastian Andrzej Siewior
@ 2025-03-26 15:37   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-26 15:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long

On 2025-03-12 16:16:31 [+0100], To linux-kernel@vger.kernel.org wrote:

I am folding and testing and
…
> +static bool futex_pivot_pending(struct mm_struct *mm)
> +{
> +	struct futex_private_hash *fph;
> +
> +	guard(rcu)();
> +
> +	if (!mm->futex_phash_new)
> +		return false;
> +
> +	fph = rcu_dereference(mm->futex_phash);
> +	return !rcuref_read(&fph->users);
> +}
> +static int futex_hash_allocate(unsigned int hash_slots, bool custom)
>  		/*
> -		 * Will set mm->futex_phash_new on failure;
> -		 * futex_get_private_hash() will try again.
> +		 * Only let prctl() wait / retry; don't unduly delay clone().
>  		 */
> -		__futex_pivot_hash(mm, fph);
> +again:
> +		wait_var_event(mm, futex_pivot_pending(mm));

This wait condition should be !futex_pivot_pending(). Otherwise it
blocks. We want to wait until the current futex_phash_new assignment is
gone and the ::users counter is >0.

This brings me to the wake condition of which we have two:
> @@ -207,6 +203,7 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
>  	}
>  	rcu_assign_pointer(mm->futex_phash, new);
>  	kvfree_rcu(fph, rcu);
> +	wake_up_var(mm);
>  	return true;
>  }
>  
> @@ -262,7 +259,8 @@ void futex_private_hash_put(struct futex_private_hash *fph)
>  	 * Ignore the result; the DEAD state is picked up
>  	 * when rcuref_get() starts failing via rcuref_is_dead().
>  	 */
> -	bool __maybe_unused ignore = rcuref_put(&fph->users);
> +	if (rcuref_put(&fph->users))
> +		wake_up_var(fph->mm);
>  }

The one in __futex_pivot_hash() makes sense because ::futex_phash_new is
NULL and the users counter is set to one.
The wake in futex_private_hash_put() doesn't make sense. At this point
we have ::futex_phash_new set and rcuref_read() returns 0. So we
schedule again after the wake.
Therefore we could remove the wake from futex_private_hash_put().
However, if there is no futex operation (unlikely) then we are stuck in
wait_var_event() forever. Therefore I would suggest to:

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 65523f3cfe32e..64c7be8df955c 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -210,7 +210,6 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
 	}
 	rcu_assign_pointer(mm->futex_phash, new);
 	kvfree_rcu(fph, rcu);
-	wake_up_var(mm);
 	return true;
 }
 
@@ -1522,10 +1521,10 @@ static bool futex_pivot_pending(struct mm_struct *mm)
 	guard(rcu)();
 
 	if (!mm->futex_phash_new)
-		return false;
+		return true;
 
 	fph = rcu_dereference(mm->futex_phash);
-	return !rcuref_read(&fph->users);
+	return rcuref_is_dead(&fph->users);
 }
 
 static bool futex_hash_less(struct futex_private_hash *a,

-> Attempt to replace if there is no replacement pending (futex_phash_new == NULL).
-> If there is a replacement (futex_phash_new != NULL) then wait until the
   current private hash is DEAD. This happens once the last user is gone
   and gives the wakeup.

Sebastian

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
  2025-03-18 13:24 ` Shrikanth Hegde
                     ` (2 preceding siblings ...)
  2025-03-26  8:49   ` Sebastian Andrzej Siewior
@ 2025-04-07 16:15   ` Sebastian Andrzej Siewior
  3 siblings, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-07 16:15 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
	Waiman Long, linux-kernel

On 2025-03-18 18:54:22 [+0530], Shrikanth Hegde wrote:
> I ran "perf bench futex all" as is. No change has been made to perf.
…

I just posted v11
	https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de/

this series extends "perf bench futex" with
- a -b switch to specify the number of buckets to use. Default is auto-scale,
  0 is the global hash, everything else is the number of buckets.
- The used buckets are displayed after the run.
- a -I switch which freezes the used buckets so the get/put can be avoided.
  This brings the invocations/sec back to where it was.

If you use the "all" instead of "hash" then the arguments are skipped.

I did not wire up the MPOL part. IMHO, in order to make sense, the memory
allocation should be based on the NUMA node and then the OP itself could be
based on the NUMA node.

Sebastian

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2025-04-07 16:15 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-12 15:16 [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
2025-03-13  4:23   ` Joel Fernandes
2025-03-13  7:55     ` Sebastian Andrzej Siewior
2025-03-14 10:36   ` Peter Zijlstra
2025-03-12 15:16 ` [PATCH v10 02/21] futex: Move futex_queue() into futex_wait_setup() Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 03/21] futex: Pull futex_hash() out of futex_q_lock() Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 04/21] futex: Create hb scopes Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 05/21] futex: Create futex_hash() get/put class Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 06/21] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 07/21] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 08/21] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 09/21] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 10/21] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 11/21] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 12/21] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 13/21] futex: Allow to re-allocate the private local hash Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 14/21] futex: Resize local futex hash table based on number of threads Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 15/21] futex: s/hb_p/fph/ Sebastian Andrzej Siewior
2025-03-14 12:36   ` Peter Zijlstra
2025-03-14 13:10     ` Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 16/21] futex: Remove superfluous state Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 17/21] futex: Untangle and naming Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 18/21] futex: Rework SET_SLOTS Sebastian Andrzej Siewior
2025-03-26 15:37   ` Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 19/21] mm: Add vmalloc_huge_node() Sebastian Andrzej Siewior
2025-03-12 22:02   ` Andrew Morton
2025-03-13  7:59     ` Sebastian Andrzej Siewior
2025-03-13 22:08       ` Andrew Morton
2025-03-14  9:59         ` Sebastian Andrzej Siewior
2025-03-14 10:34           ` Andrew Morton
2025-03-12 15:16 ` [PATCH v10 20/21] futex: Implement FUTEX2_NUMA Sebastian Andrzej Siewior
2025-03-25 19:52   ` Shrikanth Hegde
2025-03-25 22:52     ` Peter Zijlstra
2025-03-25 22:56     ` Peter Zijlstra
2025-03-26 12:57       ` Shrikanth Hegde
2025-03-26 13:37         ` Peter Zijlstra
2025-03-26 15:06           ` Shrikanth Hegde
2025-03-26  8:03     ` Sebastian Andrzej Siewior
2025-03-12 15:16 ` [PATCH v10 21/21] futex: Implement FUTEX2_MPOL Sebastian Andrzej Siewior
2025-03-12 15:18 ` [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
2025-03-14 10:42   ` Peter Zijlstra
2025-03-14 10:58   ` Peter Zijlstra
2025-03-14 11:28     ` Sebastian Andrzej Siewior
2025-03-14 11:41       ` Peter Zijlstra
2025-03-14 12:00         ` Sebastian Andrzej Siewior
2025-03-14 12:30           ` Peter Zijlstra
2025-03-14 13:30             ` Sebastian Andrzej Siewior
2025-03-14 14:18               ` Peter Zijlstra
2025-03-14 14:40             ` Paul E. McKenney
2025-03-18 13:24 ` Shrikanth Hegde
2025-03-18 16:12   ` Davidlohr Bueso
2025-03-25 19:04   ` Shrikanth Hegde
2025-03-26  9:31     ` Sebastian Andrzej Siewior
2025-03-26 12:54       ` Shrikanth Hegde
2025-03-26 14:01         ` Sebastian Andrzej Siewior
2025-03-26  8:49   ` Sebastian Andrzej Siewior
2025-04-07 16:15   ` Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox