* [PATCH v12 00/21] futex: Add support for task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
@ 2025-04-16 16:29 Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
` (21 more replies)
0 siblings, 22 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
this is a follow-up on
https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
and adds support for a task local futex_hash_bucket.
This is the local hash map series with PeterZ's FUTEX2_NUMA and
FUTEX2_MPOL patches. This went through some testing now with the selftests…
The complete tree is at
https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v12
https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v12
v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
- Moved futex_hash_put() in futex_lock_pi() before
rt_mutex_pre_schedule() for obvious reasons.
- Use __GFP_NOWARN while allocating the local hash to suppress warnings
about failures, especially if huge values were used and vmalloc
refuses.
- The "immutable" mode is its own patch. The basic infrastructure patch
enforces a "0" for prctl()'s arg4. The "immutable mode" allows only 0
(disabled) or 1 (enabled) as argument.
The "perf bench" bench adds the "bucket" and "immutable" support.
- The position of the node member after uaddr is computed in units of
u32. Added a cast to (void *) to get the math right.
- Added FUTEX2_MPOL to FUTEX2_VALID_MASK assuming that we want to expose
it. However the mpol part does not seem to work here, but it is likely
that my setup is not proper.
- If the user specified FUTEX_NO_NODE as the node then the node is
updated to a valid node number. The node value is only written back to
the user if it has been changed.
While this only avoids the unnecessary write back if the user supplied
a valid node number, the whole interface is slightly racy if
FUTEX_NO_NODE is supplied: if two futex_wait() invocations run in
parallel, the first invocation can set the node to 0 and the second
to 1. The following callers will stick to node 1 but the first one
will remain waiting on the wrong node.
- Added selftests for private hash and the NUMA bits.
v10…v11: https://lore.kernel.org/all/20250312151634.2183278-1-bigeasy@linutronix.de
- PeterZ's fixups and changes to the local hash series have been folded
into the earlier patches so things are not added first and then
renamed or functionally changed later.
- vmalloc_huge() has been implemented on top of vmalloc_huge_node()
and the NOMMU bits have been adjusted. akpm asked for this.
- wake_up_var() has been removed from __futex_pivot_hash(). It is
enough to wake the userspace waiter after the final put so it can
perform the resize itself.
- Changed the logic in futex_pivot_pending() so it does not block for
the user. It waits until __futex_pivot_hash() has completed.
- Updated kernel doc for __futex_hash().
- Patches 17+ are new:
- Wire up PR_FUTEX_HASH_SET_SLOTS in "perf bench futex"
- Add "immutable" mode to PR_FUTEX_HASH_SET_SLOTS to avoid resizing
the local hash any further. This avoids rcuref usage which is
noticeable in "perf bench futex hash"
Peter Zijlstra (8):
mm: Add vmalloc_huge_node()
futex: Move futex_queue() into futex_wait_setup()
futex: Pull futex_hash() out of futex_q_lock()
futex: Create hb scopes
futex: Create futex_hash() get/put class
futex: Create private_hash() get/put class
futex: Implement FUTEX2_NUMA
futex: Implement FUTEX2_MPOL
Sebastian Andrzej Siewior (13):
rcuref: Provide rcuref_is_dead()
futex: Acquire a hash reference in futex_wait_multiple_setup()
futex: Decrease the waiter count before the unlock operation
futex: Introduce futex_q_lockptr_lock()
futex: Create helper function to initialize a hash slot
futex: Add basic infrastructure for local task local hash
futex: Allow automatic allocation of process wide futex hash
futex: Allow to resize the private local hash
futex: Allow to make the private hash immutable
tools headers: Synchronize prctl.h ABI header
tools/perf: Allow to select the number of hash buckets
selftests/futex: Add futex_priv_hash
selftests/futex: Add futex_numa_mpol
include/linux/futex.h | 36 +-
include/linux/mm_types.h | 7 +-
include/linux/mmap_lock.h | 4 +
include/linux/rcuref.h | 22 +-
include/linux/vmalloc.h | 9 +-
include/uapi/linux/futex.h | 10 +-
include/uapi/linux/prctl.h | 6 +
init/Kconfig | 10 +
io_uring/futex.c | 4 +-
kernel/fork.c | 24 +
kernel/futex/core.c | 802 ++++++++++++++++--
kernel/futex/futex.h | 73 +-
kernel/futex/pi.c | 306 ++++---
kernel/futex/requeue.c | 480 +++++------
kernel/futex/waitwake.c | 201 +++--
kernel/sys.c | 4 +
mm/nommu.c | 18 +-
mm/vmalloc.c | 11 +-
tools/include/uapi/linux/prctl.h | 44 +-
tools/perf/bench/Build | 1 +
tools/perf/bench/futex-hash.c | 7 +
tools/perf/bench/futex-lock-pi.c | 5 +
tools/perf/bench/futex-requeue.c | 6 +
tools/perf/bench/futex-wake-parallel.c | 9 +-
tools/perf/bench/futex-wake.c | 4 +
tools/perf/bench/futex.c | 65 ++
tools/perf/bench/futex.h | 5 +
.../selftests/futex/functional/.gitignore | 6 +-
.../selftests/futex/functional/Makefile | 4 +-
.../futex/functional/futex_numa_mpol.c | 232 +++++
.../futex/functional/futex_priv_hash.c | 315 +++++++
.../testing/selftests/futex/functional/run.sh | 7 +
.../selftests/futex/include/futex2test.h | 34 +
33 files changed, 2199 insertions(+), 572 deletions(-)
create mode 100644 tools/perf/bench/futex.c
create mode 100644 tools/testing/selftests/futex/functional/futex_numa_mpol.c
create mode 100644 tools/testing/selftests/futex/functional/futex_priv_hash.c
--
2.49.0
^ permalink raw reply [flat|nested] 109+ messages in thread
* [PATCH v12 01/21] rcuref: Provide rcuref_is_dead()
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-05 21:09 ` André Almeida
2025-05-08 10:34 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 02/21] mm: Add vmalloc_huge_node() Sebastian Andrzej Siewior
` (20 subsequent siblings)
21 siblings, 2 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
rcuref_read() returns the number of references that are currently held.
If 0 is returned then it is not safe to assume that the object can be
scheduled for deconstruction because it is marked DEAD. This happens if
the return value of rcuref_put() is ignored and assumptions are made.
If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
If rcuref_put() did not return to the caller then the counter did not
yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
is still a chance that the counter will transition from RCUREF_NOREF to
0 meaning it is still valid and must not be deconstructed. In this brief
window rcuref_read() will return 0.
Provide rcuref_is_dead() to determine if the counter is marked as
RCUREF_DEAD.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/rcuref.h | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c1c6b42..2fb2af6d98249 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
* rcuref_read - Read the number of held reference counts of a rcuref
* @ref: Pointer to the reference count
*
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuref_is_dead().
*/
static inline unsigned int rcuref_read(rcuref_t *ref)
{
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
return c >= RCUREF_RELEASED ? 0 : c + 1;
}
+/**
+ * rcuref_is_dead - Check if the rcuref has been already marked dead
+ * @ref: Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+ unsigned int c = atomic_read(&ref->refcnt);
+
+ return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
/**
--
2.49.0
* [PATCH v12 02/21] mm: Add vmalloc_huge_node()
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup() Sebastian Andrzej Siewior
` (19 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
linux-mm, Christoph Hellwig, Sebastian Andrzej Siewior
From: Peter Zijlstra <peterz@infradead.org>
To enable node specific hash-tables using huge pages if possible.
[bigeasy: use __vmalloc_node_range_noprof(), add nommu bits, inline
vmalloc_huge]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-mm@kvack.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/vmalloc.h | 9 +++++++--
mm/nommu.c | 18 +++++++++++++++++-
mm/vmalloc.c | 11 ++++++-----
3 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e39..de95794777ad6 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -168,8 +168,13 @@ void *__vmalloc_node_noprof(unsigned long size, unsigned long align, gfp_t gfp_m
int node, const void *caller) __alloc_size(1);
#define __vmalloc_node(...) alloc_hooks(__vmalloc_node_noprof(__VA_ARGS__))
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
-#define vmalloc_huge(...) alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
+#define vmalloc_huge_node(...) alloc_hooks(vmalloc_huge_node_noprof(__VA_ARGS__))
+
+static inline void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+{
+ return vmalloc_huge_node(size, gfp_mask, NUMA_NO_NODE);
+}
extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
#define __vmalloc_array(...) alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
diff --git a/mm/nommu.c b/mm/nommu.c
index 617e7ba8022f5..70f92f9a7fab3 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -200,7 +200,23 @@ void *vmalloc_noprof(unsigned long size)
}
EXPORT_SYMBOL(vmalloc_noprof);
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
+/*
+ * vmalloc_huge_node - allocate virtually contiguous memory, on a node
+ *
+ * @size: allocation size
+ * @gfp_mask: flags for the page level allocator
+ * @node: node to use for allocation or NUMA_NO_NODE
+ *
+ * Allocate enough pages to cover @size from the page level
+ * allocator and map them into contiguous kernel virtual space.
+ *
+ * Due to NOMMU implications the node argument and HUGE page attribute are
+ * ignored.
+ */
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+ return __vmalloc_noprof(size, gfp_mask);
+}
/*
* vzalloc - allocate virtually contiguous memory with zero fill
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3ed720a787ecd..8b9f6d3c099dd 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3943,9 +3943,10 @@ void *vmalloc_noprof(unsigned long size)
EXPORT_SYMBOL(vmalloc_noprof);
/**
- * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
+ * vmalloc_huge_node - allocate virtually contiguous memory, allow huge pages
* @size: allocation size
* @gfp_mask: flags for the page level allocator
+ * @node: node to use for allocation or NUMA_NO_NODE
*
* Allocate enough pages to cover @size from the page level
* allocator and map them into contiguous kernel virtual space.
@@ -3954,13 +3955,13 @@ EXPORT_SYMBOL(vmalloc_noprof);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
{
return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
- gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
- NUMA_NO_NODE, __builtin_return_address(0));
+ gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+ node, __builtin_return_address(0));
}
-EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
+EXPORT_SYMBOL_GPL(vmalloc_huge_node_noprof);
/**
* vzalloc - allocate virtually contiguous memory with zero fill
--
2.49.0
* [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup()
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 02/21] mm: Add vmalloc_huge_node() Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-05 21:43 ` André Almeida
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 04/21] futex: Pull futex_hash() out of futex_q_lock() Sebastian Andrzej Siewior
` (18 subsequent siblings)
21 siblings, 2 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
From: Peter Zijlstra <peterz@infradead.org>
futex_wait_setup() has a weird calling convention in order to return
hb to use as an argument to futex_queue().
Mostly such that requeue can have an extra test in between.
Reorder code a little to get rid of this and keep the hb usage inside
futex_wait_setup().
[bigeasy: fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
io_uring/futex.c | 4 +---
kernel/futex/futex.h | 6 +++---
kernel/futex/requeue.c | 28 ++++++++++--------------
kernel/futex/waitwake.c | 47 +++++++++++++++++++++++------------------
4 files changed, 42 insertions(+), 43 deletions(-)
diff --git a/io_uring/futex.c b/io_uring/futex.c
index 0ea4820cd8ff8..e89c0897117ae 100644
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -273,7 +273,6 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
struct io_ring_ctx *ctx = req->ctx;
struct io_futex_data *ifd = NULL;
- struct futex_hash_bucket *hb;
int ret;
if (!iof->futex_mask) {
@@ -295,12 +294,11 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
ifd->req = req;
ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
- &ifd->q, &hb);
+ &ifd->q, NULL, NULL);
if (!ret) {
hlist_add_head(&req->hash_node, &ctx->futex_list);
io_ring_submit_unlock(ctx, issue_flags);
- futex_queue(&ifd->q, hb, NULL);
return IOU_ISSUE_SKIP_COMPLETE;
}
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 6b2f4c7eb720f..16aafd0113442 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -219,9 +219,9 @@ static inline int futex_match(union futex_key *key1, union futex_key *key2)
}
extern int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
- struct futex_q *q, struct futex_hash_bucket **hb);
-extern void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
- struct hrtimer_sleeper *timeout);
+ struct futex_q *q, union futex_key *key2,
+ struct task_struct *task);
+extern void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout);
extern bool __futex_wake_mark(struct futex_q *q);
extern void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q);
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b47bb764b3520..0e55975af515c 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -769,7 +769,6 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
{
struct hrtimer_sleeper timeout, *to;
struct rt_mutex_waiter rt_waiter;
- struct futex_hash_bucket *hb;
union futex_key key2 = FUTEX_KEY_INIT;
struct futex_q q = futex_q_init;
struct rt_mutex_base *pi_mutex;
@@ -805,29 +804,24 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
- ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+ ret = futex_wait_setup(uaddr, val, flags, &q, &key2, current);
if (ret)
goto out;
- /*
- * The check above which compares uaddrs is not sufficient for
- * shared futexes. We need to compare the keys:
- */
- if (futex_match(&q.key, &key2)) {
- futex_q_unlock(hb);
- ret = -EINVAL;
- goto out;
- }
-
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
- futex_wait_queue(hb, &q, to);
+ futex_do_wait(&q, to);
switch (futex_requeue_pi_wakeup_sync(&q)) {
case Q_REQUEUE_PI_IGNORE:
- /* The waiter is still on uaddr1 */
- spin_lock(&hb->lock);
- ret = handle_early_requeue_pi_wakeup(hb, &q, to);
- spin_unlock(&hb->lock);
+ {
+ struct futex_hash_bucket *hb;
+
+ hb = futex_hash(&q.key);
+ /* The waiter is still on uaddr1 */
+ spin_lock(&hb->lock);
+ ret = handle_early_requeue_pi_wakeup(hb, &q, to);
+ spin_unlock(&hb->lock);
+ }
break;
case Q_REQUEUE_PI_LOCKED:
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 25877d4f2f8f3..6cf10701294b4 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -339,18 +339,8 @@ static long futex_wait_restart(struct restart_block *restart);
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
-void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
- struct hrtimer_sleeper *timeout)
+void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout)
{
- /*
- * The task state is guaranteed to be set before another task can
- * wake it. set_current_state() is implemented using smp_store_mb() and
- * futex_queue() calls spin_unlock() upon completion, both serializing
- * access to the hash list and forcing another memory barrier.
- */
- set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
- futex_queue(q, hb, current);
-
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
@@ -578,7 +568,8 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
* @val: the expected value
* @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
- * @hb: storage for hash_bucket pointer to be returned to caller
+ * @key2: the second futex_key if used for requeue PI
+ * @task: Task queueing this futex
*
* Setup the futex_q and locate the hash_bucket. Get the futex value and
* compare it with the expected value. Handle atomic faults internally.
@@ -589,8 +580,10 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
* - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
- struct futex_q *q, struct futex_hash_bucket **hb)
+ struct futex_q *q, union futex_key *key2,
+ struct task_struct *task)
{
+ struct futex_hash_bucket *hb;
u32 uval;
int ret;
@@ -618,12 +611,12 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
return ret;
retry_private:
- *hb = futex_q_lock(q);
+ hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
- futex_q_unlock(*hb);
+ futex_q_unlock(hb);
ret = get_user(uval, uaddr);
if (ret)
@@ -636,10 +629,25 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
}
if (uval != val) {
- futex_q_unlock(*hb);
- ret = -EWOULDBLOCK;
+ futex_q_unlock(hb);
+ return -EWOULDBLOCK;
}
+ if (key2 && futex_match(&q->key, key2)) {
+ futex_q_unlock(hb);
+ return -EINVAL;
+ }
+
+ /*
+ * The task state is guaranteed to be set before another task can
+ * wake it. set_current_state() is implemented using smp_store_mb() and
+ * futex_queue() calls spin_unlock() upon completion, both serializing
+ * access to the hash list and forcing another memory barrier.
+ */
+ if (task == current)
+ set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+ futex_queue(q, hb, task);
+
return ret;
}
@@ -647,7 +655,6 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
- struct futex_hash_bucket *hb;
int ret;
if (!bitset)
@@ -660,12 +667,12 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
- ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+ ret = futex_wait_setup(uaddr, val, flags, &q, NULL, current);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
- futex_wait_queue(hb, &q, to);
+ futex_do_wait(&q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
--
2.49.0
* [PATCH v12 04/21] futex: Pull futex_hash() out of futex_q_lock()
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup() Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 05/21] futex: Create hb scopes Sebastian Andrzej Siewior
` (17 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 7 +------
kernel/futex/futex.h | 2 +-
kernel/futex/pi.c | 3 ++-
kernel/futex/waitwake.c | 6 ++++--
4 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cca15859a50be..7adc914878933 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -502,13 +502,9 @@ void __futex_unqueue(struct futex_q *q)
}
/* The key must be already stored in q->key. */
-struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
+void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb)
__acquires(&hb->lock)
{
- struct futex_hash_bucket *hb;
-
- hb = futex_hash(&q->key);
-
/*
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
@@ -522,7 +518,6 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
- return hb;
}
void futex_q_unlock(struct futex_hash_bucket *hb)
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 16aafd0113442..a219903e52084 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -354,7 +354,7 @@ static inline int futex_hb_waiters_pending(struct futex_hash_bucket *hb)
#endif
}
-extern struct futex_hash_bucket *futex_q_lock(struct futex_q *q);
+extern void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb);
extern void futex_q_unlock(struct futex_hash_bucket *hb);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 7a941845f7eee..3bf942e9400ac 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -939,7 +939,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
goto out;
retry_private:
- hb = futex_q_lock(&q);
+ hb = futex_hash(&q.key);
+ futex_q_lock(&q, hb);
ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
&exiting, 0);
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 6cf10701294b4..1108f373fd315 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -441,7 +441,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
struct futex_q *q = &vs[i].q;
u32 val = vs[i].w.val;
- hb = futex_q_lock(q);
+ hb = futex_hash(&q->key);
+ futex_q_lock(q, hb);
ret = futex_get_value_locked(&uval, uaddr);
if (!ret && uval == val) {
@@ -611,7 +612,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
return ret;
retry_private:
- hb = futex_q_lock(q);
+ hb = futex_hash(&q->key);
+ futex_q_lock(q, hb);
ret = futex_get_value_locked(&uval, uaddr);
--
2.49.0
* [PATCH v12 05/21] futex: Create hb scopes
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (3 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 04/21] futex: Pull futex_hash() out of futex_q_lock() Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-06 23:45 ` André Almeida
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 06/21] futex: Create futex_hash() get/put class Sebastian Andrzej Siewior
` (16 subsequent siblings)
21 siblings, 2 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
From: Peter Zijlstra <peterz@infradead.org>
Create explicit scopes for hb variables; almost pure re-indent.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 81 ++++----
kernel/futex/pi.c | 282 +++++++++++++-------------
kernel/futex/requeue.c | 433 ++++++++++++++++++++--------------------
kernel/futex/waitwake.c | 193 +++++++++---------
4 files changed, 504 insertions(+), 485 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 7adc914878933..e4cb5ce9785b1 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -944,7 +944,6 @@ static void exit_pi_state_list(struct task_struct *curr)
{
struct list_head *next, *head = &curr->pi_state_list;
struct futex_pi_state *pi_state;
- struct futex_hash_bucket *hb;
union futex_key key = FUTEX_KEY_INIT;
/*
@@ -957,50 +956,54 @@ static void exit_pi_state_list(struct task_struct *curr)
next = head->next;
pi_state = list_entry(next, struct futex_pi_state, list);
key = pi_state->key;
- hb = futex_hash(&key);
+ if (1) {
+ struct futex_hash_bucket *hb;
- /*
- * We can race against put_pi_state() removing itself from the
- * list (a waiter going away). put_pi_state() will first
- * decrement the reference count and then modify the list, so
- * its possible to see the list entry but fail this reference
- * acquire.
- *
- * In that case; drop the locks to let put_pi_state() make
- * progress and retry the loop.
- */
- if (!refcount_inc_not_zero(&pi_state->refcount)) {
+ hb = futex_hash(&key);
+
+ /*
+ * We can race against put_pi_state() removing itself from the
+ * list (a waiter going away). put_pi_state() will first
+ * decrement the reference count and then modify the list, so
+ * its possible to see the list entry but fail this reference
+ * acquire.
+ *
+ * In that case; drop the locks to let put_pi_state() make
+ * progress and retry the loop.
+ */
+ if (!refcount_inc_not_zero(&pi_state->refcount)) {
+ raw_spin_unlock_irq(&curr->pi_lock);
+ cpu_relax();
+ raw_spin_lock_irq(&curr->pi_lock);
+ continue;
+ }
raw_spin_unlock_irq(&curr->pi_lock);
- cpu_relax();
- raw_spin_lock_irq(&curr->pi_lock);
- continue;
- }
- raw_spin_unlock_irq(&curr->pi_lock);
- spin_lock(&hb->lock);
- raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
- raw_spin_lock(&curr->pi_lock);
- /*
- * We dropped the pi-lock, so re-check whether this
- * task still owns the PI-state:
- */
- if (head->next != next) {
- /* retain curr->pi_lock for the loop invariant */
- raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+ spin_lock(&hb->lock);
+ raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+ raw_spin_lock(&curr->pi_lock);
+ /*
+ * We dropped the pi-lock, so re-check whether this
+ * task still owns the PI-state:
+ */
+ if (head->next != next) {
+ /* retain curr->pi_lock for the loop invariant */
+ raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+ spin_unlock(&hb->lock);
+ put_pi_state(pi_state);
+ continue;
+ }
+
+ WARN_ON(pi_state->owner != curr);
+ WARN_ON(list_empty(&pi_state->list));
+ list_del_init(&pi_state->list);
+ pi_state->owner = NULL;
+
+ raw_spin_unlock(&curr->pi_lock);
+ raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
- put_pi_state(pi_state);
- continue;
}
- WARN_ON(pi_state->owner != curr);
- WARN_ON(list_empty(&pi_state->list));
- list_del_init(&pi_state->list);
- pi_state->owner = NULL;
-
- raw_spin_unlock(&curr->pi_lock);
- raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
- spin_unlock(&hb->lock);
-
rt_mutex_futex_unlock(&pi_state->pi_mutex);
put_pi_state(pi_state);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 3bf942e9400ac..a56f28fda58dd 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -920,7 +920,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
struct hrtimer_sleeper timeout, *to;
struct task_struct *exiting = NULL;
struct rt_mutex_waiter rt_waiter;
- struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
DEFINE_WAKE_Q(wake_q);
int res, ret;
@@ -939,152 +938,169 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
goto out;
retry_private:
- hb = futex_hash(&q.key);
- futex_q_lock(&q, hb);
+ if (1) {
+ struct futex_hash_bucket *hb;
- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
- &exiting, 0);
- if (unlikely(ret)) {
- /*
- * Atomic work succeeded and we got the lock,
- * or failed. Either way, we do _not_ block.
- */
- switch (ret) {
- case 1:
- /* We got the lock. */
- ret = 0;
- goto out_unlock_put_key;
- case -EFAULT:
- goto uaddr_faulted;
- case -EBUSY:
- case -EAGAIN:
+ hb = futex_hash(&q.key);
+ futex_q_lock(&q, hb);
+
+ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
+ &exiting, 0);
+ if (unlikely(ret)) {
/*
- * Two reasons for this:
- * - EBUSY: Task is exiting and we just wait for the
- * exit to complete.
- * - EAGAIN: The user space value changed.
+ * Atomic work succeeded and we got the lock,
+ * or failed. Either way, we do _not_ block.
*/
- futex_q_unlock(hb);
- /*
- * Handle the case where the owner is in the middle of
- * exiting. Wait for the exit to complete otherwise
- * this task might loop forever, aka. live lock.
- */
- wait_for_owner_exiting(ret, exiting);
- cond_resched();
- goto retry;
- default:
- goto out_unlock_put_key;
+ switch (ret) {
+ case 1:
+ /* We got the lock. */
+ ret = 0;
+ goto out_unlock_put_key;
+ case -EFAULT:
+ goto uaddr_faulted;
+ case -EBUSY:
+ case -EAGAIN:
+ /*
+ * Two reasons for this:
+ * - EBUSY: Task is exiting and we just wait for the
+ * exit to complete.
+ * - EAGAIN: The user space value changed.
+ */
+ futex_q_unlock(hb);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock_put_key;
+ }
}
- }
- WARN_ON(!q.pi_state);
+ WARN_ON(!q.pi_state);
- /*
- * Only actually queue now that the atomic ops are done:
- */
- __futex_queue(&q, hb, current);
+ /*
+ * Only actually queue now that the atomic ops are done:
+ */
+ __futex_queue(&q, hb, current);
- if (trylock) {
- ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
- /* Fixup the trylock return value: */
- ret = ret ? 0 : -EWOULDBLOCK;
- goto no_block;
- }
+ if (trylock) {
+ ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
+ /* Fixup the trylock return value: */
+ ret = ret ? 0 : -EWOULDBLOCK;
+ goto no_block;
+ }
- /*
- * Must be done before we enqueue the waiter, here is unfortunately
- * under the hb lock, but that *should* work because it does nothing.
- */
- rt_mutex_pre_schedule();
+ /*
+ * Must be done before we enqueue the waiter, here is unfortunately
+ * under the hb lock, but that *should* work because it does nothing.
+ */
+ rt_mutex_pre_schedule();
- rt_mutex_init_waiter(&rt_waiter);
+ rt_mutex_init_waiter(&rt_waiter);
- /*
- * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
- * hold it while doing rt_mutex_start_proxy(), because then it will
- * include hb->lock in the blocking chain, even through we'll not in
- * fact hold it while blocking. This will lead it to report -EDEADLK
- * and BUG when futex_unlock_pi() interleaves with this.
- *
- * Therefore acquire wait_lock while holding hb->lock, but drop the
- * latter before calling __rt_mutex_start_proxy_lock(). This
- * interleaves with futex_unlock_pi() -- which does a similar lock
- * handoff -- such that the latter can observe the futex_q::pi_state
- * before __rt_mutex_start_proxy_lock() is done.
- */
- raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
- spin_unlock(q.lock_ptr);
- /*
- * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
- * such that futex_unlock_pi() is guaranteed to observe the waiter when
- * it sees the futex_q::pi_state.
- */
- ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
- raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q);
+ /*
+ * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
+ * hold it while doing rt_mutex_start_proxy(), because then it will
+ * include hb->lock in the blocking chain, even though we'll not in
+ * fact hold it while blocking. This will lead it to report -EDEADLK
+ * and BUG when futex_unlock_pi() interleaves with this.
+ *
+ * Therefore acquire wait_lock while holding hb->lock, but drop the
+ * latter before calling __rt_mutex_start_proxy_lock(). This
+ * interleaves with futex_unlock_pi() -- which does a similar lock
+ * handoff -- such that the latter can observe the futex_q::pi_state
+ * before __rt_mutex_start_proxy_lock() is done.
+ */
+ raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
+ spin_unlock(q.lock_ptr);
+ /*
+ * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+ * such that futex_unlock_pi() is guaranteed to observe the waiter when
+ * it sees the futex_q::pi_state.
+ */
+ ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
+ raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q);
- if (ret) {
- if (ret == 1)
- ret = 0;
- goto cleanup;
- }
+ if (ret) {
+ if (ret == 1)
+ ret = 0;
+ goto cleanup;
+ }
- if (unlikely(to))
- hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
+ if (unlikely(to))
+ hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
- ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+ ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
cleanup:
- /*
- * If we failed to acquire the lock (deadlock/signal/timeout), we must
- * must unwind the above, however we canont lock hb->lock because
- * rt_mutex already has a waiter enqueued and hb->lock can itself try
- * and enqueue an rt_waiter through rtlock.
- *
- * Doing the cleanup without holding hb->lock can cause inconsistent
- * state between hb and pi_state, but only in the direction of not
- * seeing a waiter that is leaving.
- *
- * See futex_unlock_pi(), it deals with this inconsistency.
- *
- * There be dragons here, since we must deal with the inconsistency on
- * the way out (here), it is impossible to detect/warn about the race
- * the other way around (missing an incoming waiter).
- *
- * What could possibly go wrong...
- */
- if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
- ret = 0;
+ /*
+ * If we failed to acquire the lock (deadlock/signal/timeout), we must
+ * unwind the above, however we cannot lock hb->lock because
+ * rt_mutex already has a waiter enqueued and hb->lock can itself try
+ * and enqueue an rt_waiter through rtlock.
+ *
+ * Doing the cleanup without holding hb->lock can cause inconsistent
+ * state between hb and pi_state, but only in the direction of not
+ * seeing a waiter that is leaving.
+ *
+ * See futex_unlock_pi(), it deals with this inconsistency.
+ *
+ * There be dragons here, since we must deal with the inconsistency on
+ * the way out (here), it is impossible to detect/warn about the race
+ * the other way around (missing an incoming waiter).
+ *
+ * What could possibly go wrong...
+ */
+ if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
+ ret = 0;
- /*
- * Now that the rt_waiter has been dequeued, it is safe to use
- * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
- * the
- */
- spin_lock(q.lock_ptr);
- /*
- * Waiter is unqueued.
- */
- rt_mutex_post_schedule();
+ /*
+ * Now that the rt_waiter has been dequeued, it is safe to use
+ * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
+ * the
+ */
+ spin_lock(q.lock_ptr);
+ /*
+ * Waiter is unqueued.
+ */
+ rt_mutex_post_schedule();
no_block:
- /*
- * Fixup the pi_state owner and possibly acquire the lock if we
- * haven't already.
- */
- res = fixup_pi_owner(uaddr, &q, !ret);
- /*
- * If fixup_pi_owner() returned an error, propagate that. If it acquired
- * the lock, clear our -ETIMEDOUT or -EINTR.
- */
- if (res)
- ret = (res < 0) ? res : 0;
+ /*
+ * Fixup the pi_state owner and possibly acquire the lock if we
+ * haven't already.
+ */
+ res = fixup_pi_owner(uaddr, &q, !ret);
+ /*
+ * If fixup_pi_owner() returned an error, propagate that. If it acquired
+ * the lock, clear our -ETIMEDOUT or -EINTR.
+ */
+ if (res)
+ ret = (res < 0) ? res : 0;
- futex_unqueue_pi(&q);
- spin_unlock(q.lock_ptr);
- goto out;
+ futex_unqueue_pi(&q);
+ spin_unlock(q.lock_ptr);
+ goto out;
out_unlock_put_key:
- futex_q_unlock(hb);
+ futex_q_unlock(hb);
+ goto out;
+
+uaddr_faulted:
+ futex_q_unlock(hb);
+
+ ret = fault_in_user_writeable(uaddr);
+ if (ret)
+ goto out;
+
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
+
+ goto retry;
+ }
out:
if (to) {
@@ -1092,18 +1108,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
destroy_hrtimer_on_stack(&to->timer);
}
return ret != -EINTR ? ret : -ERESTARTNOINTR;
-
-uaddr_faulted:
- futex_q_unlock(hb);
-
- ret = fault_in_user_writeable(uaddr);
- if (ret)
- goto out;
-
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
-
- goto retry;
}
/*
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 0e55975af515c..209794cad6f2f 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -371,7 +371,6 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
int task_count = 0, ret;
struct futex_pi_state *pi_state = NULL;
- struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
DEFINE_WAKE_Q(wake_q);
@@ -443,240 +442,244 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
if (requeue_pi && futex_match(&key1, &key2))
return -EINVAL;
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
-
retry_private:
- futex_hb_waiters_inc(hb2);
- double_lock_hb(hb1, hb2);
+ if (1) {
+ struct futex_hash_bucket *hb1, *hb2;
- if (likely(cmpval != NULL)) {
- u32 curval;
+ hb1 = futex_hash(&key1);
+ hb2 = futex_hash(&key2);
- ret = futex_get_value_locked(&curval, uaddr1);
+ futex_hb_waiters_inc(hb2);
+ double_lock_hb(hb1, hb2);
- if (unlikely(ret)) {
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
+ if (likely(cmpval != NULL)) {
+ u32 curval;
- ret = get_user(curval, uaddr1);
- if (ret)
- return ret;
+ ret = futex_get_value_locked(&curval, uaddr1);
- if (!(flags1 & FLAGS_SHARED))
- goto retry_private;
+ if (unlikely(ret)) {
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
- goto retry;
- }
- if (curval != *cmpval) {
- ret = -EAGAIN;
- goto out_unlock;
- }
- }
+ ret = get_user(curval, uaddr1);
+ if (ret)
+ return ret;
- if (requeue_pi) {
- struct task_struct *exiting = NULL;
+ if (!(flags1 & FLAGS_SHARED))
+ goto retry_private;
- /*
- * Attempt to acquire uaddr2 and wake the top waiter. If we
- * intend to requeue waiters, force setting the FUTEX_WAITERS
- * bit. We force this here where we are able to easily handle
- * faults rather in the requeue loop below.
- *
- * Updates topwaiter::requeue_state if a top waiter exists.
- */
- ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
- &key2, &pi_state,
- &exiting, nr_requeue);
-
- /*
- * At this point the top_waiter has either taken uaddr2 or
- * is waiting on it. In both cases pi_state has been
- * established and an initial refcount on it. In case of an
- * error there's nothing.
- *
- * The top waiter's requeue_state is up to date:
- *
- * - If the lock was acquired atomically (ret == 1), then
- * the state is Q_REQUEUE_PI_LOCKED.
- *
- * The top waiter has been dequeued and woken up and can
- * return to user space immediately. The kernel/user
- * space state is consistent. In case that there must be
- * more waiters requeued the WAITERS bit in the user
- * space futex is set so the top waiter task has to go
- * into the syscall slowpath to unlock the futex. This
- * will block until this requeue operation has been
- * completed and the hash bucket locks have been
- * dropped.
- *
- * - If the trylock failed with an error (ret < 0) then
- * the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
- * happened", or Q_REQUEUE_PI_IGNORE when there was an
- * interleaved early wakeup.
- *
- * - If the trylock did not succeed (ret == 0) then the
- * state is either Q_REQUEUE_PI_IN_PROGRESS or
- * Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
- * This will be cleaned up in the loop below, which
- * cannot fail because futex_proxy_trylock_atomic() did
- * the same sanity checks for requeue_pi as the loop
- * below does.
- */
- switch (ret) {
- case 0:
- /* We hold a reference on the pi state. */
- break;
-
- case 1:
- /*
- * futex_proxy_trylock_atomic() acquired the user space
- * futex. Adjust task_count.
- */
- task_count++;
- ret = 0;
- break;
-
- /*
- * If the above failed, then pi_state is NULL and
- * waiter::requeue_state is correct.
- */
- case -EFAULT:
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
- ret = fault_in_user_writeable(uaddr2);
- if (!ret)
goto retry;
- return ret;
- case -EBUSY:
- case -EAGAIN:
- /*
- * Two reasons for this:
- * - EBUSY: Owner is exiting and we just wait for the
- * exit to complete.
- * - EAGAIN: The user space value changed.
- */
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
- /*
- * Handle the case where the owner is in the middle of
- * exiting. Wait for the exit to complete otherwise
- * this task might loop forever, aka. live lock.
- */
- wait_for_owner_exiting(ret, exiting);
- cond_resched();
- goto retry;
- default:
- goto out_unlock;
- }
- }
-
- plist_for_each_entry_safe(this, next, &hb1->chain, list) {
- if (task_count - nr_wake >= nr_requeue)
- break;
-
- if (!futex_match(&this->key, &key1))
- continue;
-
- /*
- * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
- * be paired with each other and no other futex ops.
- *
- * We should never be requeueing a futex_q with a pi_state,
- * which is awaiting a futex_unlock_pi().
- */
- if ((requeue_pi && !this->rt_waiter) ||
- (!requeue_pi && this->rt_waiter) ||
- this->pi_state) {
- ret = -EINVAL;
- break;
+ }
+ if (curval != *cmpval) {
+ ret = -EAGAIN;
+ goto out_unlock;
+ }
}
- /* Plain futexes just wake or requeue and are done */
- if (!requeue_pi) {
- if (++task_count <= nr_wake)
- this->wake(&wake_q, this);
- else
+ if (requeue_pi) {
+ struct task_struct *exiting = NULL;
+
+ /*
+ * Attempt to acquire uaddr2 and wake the top waiter. If we
+ * intend to requeue waiters, force setting the FUTEX_WAITERS
+ * bit. We force this here where we are able to easily handle
+ * faults rather than in the requeue loop below.
+ *
+ * Updates topwaiter::requeue_state if a top waiter exists.
+ */
+ ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
+ &key2, &pi_state,
+ &exiting, nr_requeue);
+
+ /*
+ * At this point the top_waiter has either taken uaddr2 or
+ * is waiting on it. In both cases pi_state has been
+ * established and an initial refcount on it. In case of an
+ * error there's nothing.
+ *
+ * The top waiter's requeue_state is up to date:
+ *
+ * - If the lock was acquired atomically (ret == 1), then
+ * the state is Q_REQUEUE_PI_LOCKED.
+ *
+ * The top waiter has been dequeued and woken up and can
+ * return to user space immediately. The kernel/user
+ * space state is consistent. In case that there must be
+ * more waiters requeued the WAITERS bit in the user
+ * space futex is set so the top waiter task has to go
+ * into the syscall slowpath to unlock the futex. This
+ * will block until this requeue operation has been
+ * completed and the hash bucket locks have been
+ * dropped.
+ *
+ * - If the trylock failed with an error (ret < 0) then
+ * the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
+ * happened", or Q_REQUEUE_PI_IGNORE when there was an
+ * interleaved early wakeup.
+ *
+ * - If the trylock did not succeed (ret == 0) then the
+ * state is either Q_REQUEUE_PI_IN_PROGRESS or
+ * Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
+ * This will be cleaned up in the loop below, which
+ * cannot fail because futex_proxy_trylock_atomic() did
+ * the same sanity checks for requeue_pi as the loop
+ * below does.
+ */
+ switch (ret) {
+ case 0:
+ /* We hold a reference on the pi state. */
+ break;
+
+ case 1:
+ /*
+ * futex_proxy_trylock_atomic() acquired the user space
+ * futex. Adjust task_count.
+ */
+ task_count++;
+ ret = 0;
+ break;
+
+ /*
+ * If the above failed, then pi_state is NULL and
+ * waiter::requeue_state is correct.
+ */
+ case -EFAULT:
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ ret = fault_in_user_writeable(uaddr2);
+ if (!ret)
+ goto retry;
+ return ret;
+ case -EBUSY:
+ case -EAGAIN:
+ /*
+ * Two reasons for this:
+ * - EBUSY: Owner is exiting and we just wait for the
+ * exit to complete.
+ * - EAGAIN: The user space value changed.
+ */
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock;
+ }
+ }
+
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
+ if (task_count - nr_wake >= nr_requeue)
+ break;
+
+ if (!futex_match(&this->key, &key1))
+ continue;
+
+ /*
+ * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
+ * be paired with each other and no other futex ops.
+ *
+ * We should never be requeueing a futex_q with a pi_state,
+ * which is awaiting a futex_unlock_pi().
+ */
+ if ((requeue_pi && !this->rt_waiter) ||
+ (!requeue_pi && this->rt_waiter) ||
+ this->pi_state) {
+ ret = -EINVAL;
+ break;
+ }
+
+ /* Plain futexes just wake or requeue and are done */
+ if (!requeue_pi) {
+ if (++task_count <= nr_wake)
+ this->wake(&wake_q, this);
+ else
+ requeue_futex(this, hb1, hb2, &key2);
+ continue;
+ }
+
+ /* Ensure we requeue to the expected futex for requeue_pi. */
+ if (!futex_match(this->requeue_pi_key, &key2)) {
+ ret = -EINVAL;
+ break;
+ }
+
+ /*
+ * Requeue nr_requeue waiters and possibly one more in the case
+ * of requeue_pi if we couldn't acquire the lock atomically.
+ *
+ * Prepare the waiter to take the rt_mutex. Take a refcount
+ * on the pi_state and store the pointer in the futex_q
+ * object of the waiter.
+ */
+ get_pi_state(pi_state);
+
+ /* Don't requeue when the waiter is already on the way out. */
+ if (!futex_requeue_pi_prepare(this, pi_state)) {
+ /*
+ * Early woken waiter signaled that it is on the
+ * way out. Drop the pi_state reference and try the
+ * next waiter. @this->pi_state is still NULL.
+ */
+ put_pi_state(pi_state);
+ continue;
+ }
+
+ ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
+ this->rt_waiter,
+ this->task);
+
+ if (ret == 1) {
+ /*
+ * We got the lock. We do neither drop the refcount
+ * on pi_state nor clear this->pi_state because the
+ * waiter needs the pi_state for cleaning up the
+ * user space value. It will drop the refcount
+ * after doing so. this::requeue_state is updated
+ * in the wakeup as well.
+ */
+ requeue_pi_wake_futex(this, &key2, hb2);
+ task_count++;
+ } else if (!ret) {
+ /* Waiter is queued, move it to hb2 */
requeue_futex(this, hb1, hb2, &key2);
- continue;
- }
-
- /* Ensure we requeue to the expected futex for requeue_pi. */
- if (!futex_match(this->requeue_pi_key, &key2)) {
- ret = -EINVAL;
- break;
+ futex_requeue_pi_complete(this, 0);
+ task_count++;
+ } else {
+ /*
+ * rt_mutex_start_proxy_lock() detected a potential
+ * deadlock when we tried to queue that waiter.
+ * Drop the pi_state reference which we took above
+ * and remove the pointer to the state from the
+ * waiters futex_q object.
+ */
+ this->pi_state = NULL;
+ put_pi_state(pi_state);
+ futex_requeue_pi_complete(this, ret);
+ /*
+ * We stop queueing more waiters and let user space
+ * deal with the mess.
+ */
+ break;
+ }
}
/*
- * Requeue nr_requeue waiters and possibly one more in the case
- * of requeue_pi if we couldn't acquire the lock atomically.
- *
- * Prepare the waiter to take the rt_mutex. Take a refcount
- * on the pi_state and store the pointer in the futex_q
- * object of the waiter.
+ * We took an extra initial reference to the pi_state in
+ * futex_proxy_trylock_atomic(). We need to drop it here again.
*/
- get_pi_state(pi_state);
-
- /* Don't requeue when the waiter is already on the way out. */
- if (!futex_requeue_pi_prepare(this, pi_state)) {
- /*
- * Early woken waiter signaled that it is on the
- * way out. Drop the pi_state reference and try the
- * next waiter. @this->pi_state is still NULL.
- */
- put_pi_state(pi_state);
- continue;
- }
-
- ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
- this->rt_waiter,
- this->task);
-
- if (ret == 1) {
- /*
- * We got the lock. We do neither drop the refcount
- * on pi_state nor clear this->pi_state because the
- * waiter needs the pi_state for cleaning up the
- * user space value. It will drop the refcount
- * after doing so. this::requeue_state is updated
- * in the wakeup as well.
- */
- requeue_pi_wake_futex(this, &key2, hb2);
- task_count++;
- } else if (!ret) {
- /* Waiter is queued, move it to hb2 */
- requeue_futex(this, hb1, hb2, &key2);
- futex_requeue_pi_complete(this, 0);
- task_count++;
- } else {
- /*
- * rt_mutex_start_proxy_lock() detected a potential
- * deadlock when we tried to queue that waiter.
- * Drop the pi_state reference which we took above
- * and remove the pointer to the state from the
- * waiters futex_q object.
- */
- this->pi_state = NULL;
- put_pi_state(pi_state);
- futex_requeue_pi_complete(this, ret);
- /*
- * We stop queueing more waiters and let user space
- * deal with the mess.
- */
- break;
- }
- }
-
- /*
- * We took an extra initial reference to the pi_state in
- * futex_proxy_trylock_atomic(). We need to drop it here again.
- */
- put_pi_state(pi_state);
+ put_pi_state(pi_state);
out_unlock:
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ }
wake_up_q(&wake_q);
- futex_hb_waiters_dec(hb2);
return ret ? ret : task_count;
}
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 1108f373fd315..7dc35be09e436 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -253,7 +253,6 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int nr_wake, int nr_wake2, int op)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
- struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
int ret, op_ret;
DEFINE_WAKE_Q(wake_q);
@@ -266,67 +265,71 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
if (unlikely(ret != 0))
return ret;
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
-
retry_private:
- double_lock_hb(hb1, hb2);
- op_ret = futex_atomic_op_inuser(op, uaddr2);
- if (unlikely(op_ret < 0)) {
- double_unlock_hb(hb1, hb2);
+ if (1) {
+ struct futex_hash_bucket *hb1, *hb2;
- if (!IS_ENABLED(CONFIG_MMU) ||
- unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
- /*
- * we don't get EFAULT from MMU faults if we don't have
- * an MMU, but we might get them from range checking
- */
- ret = op_ret;
- return ret;
- }
+ hb1 = futex_hash(&key1);
+ hb2 = futex_hash(&key2);
- if (op_ret == -EFAULT) {
- ret = fault_in_user_writeable(uaddr2);
- if (ret)
+ double_lock_hb(hb1, hb2);
+ op_ret = futex_atomic_op_inuser(op, uaddr2);
+ if (unlikely(op_ret < 0)) {
+ double_unlock_hb(hb1, hb2);
+
+ if (!IS_ENABLED(CONFIG_MMU) ||
+ unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
+ /*
+ * we don't get EFAULT from MMU faults if we don't have
+ * an MMU, but we might get them from range checking
+ */
+ ret = op_ret;
return ret;
- }
-
- cond_resched();
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
- goto retry;
- }
-
- plist_for_each_entry_safe(this, next, &hb1->chain, list) {
- if (futex_match (&this->key, &key1)) {
- if (this->pi_state || this->rt_waiter) {
- ret = -EINVAL;
- goto out_unlock;
}
- this->wake(&wake_q, this);
- if (++ret >= nr_wake)
- break;
- }
- }
- if (op_ret > 0) {
- op_ret = 0;
- plist_for_each_entry_safe(this, next, &hb2->chain, list) {
- if (futex_match (&this->key, &key2)) {
+ if (op_ret == -EFAULT) {
+ ret = fault_in_user_writeable(uaddr2);
+ if (ret)
+ return ret;
+ }
+
+ cond_resched();
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
+ goto retry;
+ }
+
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
+ if (futex_match(&this->key, &key1)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
goto out_unlock;
}
this->wake(&wake_q, this);
- if (++op_ret >= nr_wake2)
+ if (++ret >= nr_wake)
break;
}
}
- ret += op_ret;
- }
+
+ if (op_ret > 0) {
+ op_ret = 0;
+ plist_for_each_entry_safe(this, next, &hb2->chain, list) {
+ if (futex_match(&this->key, &key2)) {
+ if (this->pi_state || this->rt_waiter) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+ this->wake(&wake_q, this);
+ if (++op_ret >= nr_wake2)
+ break;
+ }
+ }
+ ret += op_ret;
+ }
out_unlock:
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb(hb1, hb2);
+ }
wake_up_q(&wake_q);
return ret;
}
@@ -402,7 +405,6 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
*/
int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
{
- struct futex_hash_bucket *hb;
bool retry = false;
int ret, i;
u32 uval;
@@ -441,21 +443,25 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
struct futex_q *q = &vs[i].q;
u32 val = vs[i].w.val;
- hb = futex_hash(&q->key);
- futex_q_lock(q, hb);
- ret = futex_get_value_locked(&uval, uaddr);
+ if (1) {
+ struct futex_hash_bucket *hb;
- if (!ret && uval == val) {
- /*
- * The bucket lock can't be held while dealing with the
- * next futex. Queue each futex at this moment so hb can
- * be unlocked.
- */
- futex_queue(q, hb, current);
- continue;
+ hb = futex_hash(&q->key);
+ futex_q_lock(q, hb);
+ ret = futex_get_value_locked(&uval, uaddr);
+
+ if (!ret && uval == val) {
+ /*
+ * The bucket lock can't be held while dealing with the
+ * next futex. Queue each futex at this moment so hb can
+ * be unlocked.
+ */
+ futex_queue(q, hb, current);
+ continue;
+ }
+
+ futex_q_unlock(hb);
}
-
- futex_q_unlock(hb);
__set_current_state(TASK_RUNNING);
/*
@@ -584,7 +590,6 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, union futex_key *key2,
struct task_struct *task)
{
- struct futex_hash_bucket *hb;
u32 uval;
int ret;
@@ -612,44 +617,48 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
return ret;
retry_private:
- hb = futex_hash(&q->key);
- futex_q_lock(q, hb);
+ if (1) {
+ struct futex_hash_bucket *hb;
- ret = futex_get_value_locked(&uval, uaddr);
+ hb = futex_hash(&q->key);
+ futex_q_lock(q, hb);
- if (ret) {
- futex_q_unlock(hb);
+ ret = futex_get_value_locked(&uval, uaddr);
- ret = get_user(uval, uaddr);
- if (ret)
- return ret;
+ if (ret) {
+ futex_q_unlock(hb);
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
+ ret = get_user(uval, uaddr);
+ if (ret)
+ return ret;
- goto retry;
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
+
+ goto retry;
+ }
+
+ if (uval != val) {
+ futex_q_unlock(hb);
+ return -EWOULDBLOCK;
+ }
+
+ if (key2 && futex_match(&q->key, key2)) {
+ futex_q_unlock(hb);
+ return -EINVAL;
+ }
+
+ /*
+ * The task state is guaranteed to be set before another task can
+ * wake it. set_current_state() is implemented using smp_store_mb() and
+ * futex_queue() calls spin_unlock() upon completion, both serializing
+ * access to the hash list and forcing another memory barrier.
+ */
+ if (task == current)
+ set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+ futex_queue(q, hb, task);
}
- if (uval != val) {
- futex_q_unlock(hb);
- return -EWOULDBLOCK;
- }
-
- if (key2 && futex_match(&q->key, key2)) {
- futex_q_unlock(hb);
- return -EINVAL;
- }
-
- /*
- * The task state is guaranteed to be set before another task can
- * wake it. set_current_state() is implemented using smp_store_mb() and
- * futex_queue() calls spin_unlock() upon completion, both serializing
- * access to the hash list and forcing another memory barrier.
- */
- if (task == current)
- set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
- futex_queue(q, hb, task);
-
return ret;
}
--
2.49.0
* [PATCH v12 06/21] futex: Create futex_hash() get/put class
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (4 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 05/21] futex: Create hb scopes Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 07/21] futex: Create private_hash() " Sebastian Andrzej Siewior
` (15 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
From: Peter Zijlstra <peterz@infradead.org>
This gets us:
hb = futex_hash(key) /* gets hb and inc users */
futex_hash_get(hb) /* inc users */
futex_hash_put(hb) /* dec users */
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 6 +++---
kernel/futex/futex.h | 7 +++++++
kernel/futex/pi.c | 16 ++++++++++++----
kernel/futex/requeue.c | 10 +++-------
kernel/futex/waitwake.c | 15 +++++----------
5 files changed, 30 insertions(+), 24 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index e4cb5ce9785b1..56a5653e450cb 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -122,6 +122,8 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
return &futex_queues[hash & futex_hashmask];
}
+void futex_hash_get(struct futex_hash_bucket *hb) { }
+void futex_hash_put(struct futex_hash_bucket *hb) { }
/**
* futex_setup_timer - set up the sleeping hrtimer.
@@ -957,9 +959,7 @@ static void exit_pi_state_list(struct task_struct *curr)
pi_state = list_entry(next, struct futex_pi_state, list);
key = pi_state->key;
if (1) {
- struct futex_hash_bucket *hb;
-
- hb = futex_hash(&key);
+ CLASS(hb, hb)(&key);
/*
* We can race against put_pi_state() removing itself from the
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index a219903e52084..77d9b3509f75c 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -7,6 +7,7 @@
#include <linux/sched/wake_q.h>
#include <linux/compat.h>
#include <linux/uaccess.h>
+#include <linux/cleanup.h>
#ifdef CONFIG_PREEMPT_RT
#include <linux/rcuwait.h>
@@ -202,6 +203,12 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+extern void futex_hash_get(struct futex_hash_bucket *hb);
+extern void futex_hash_put(struct futex_hash_bucket *hb);
+
+DEFINE_CLASS(hb, struct futex_hash_bucket *,
+ if (_T) futex_hash_put(_T),
+ futex_hash(key), union futex_key *key);
/**
* futex_match - Check whether two futex keys are equal
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index a56f28fda58dd..e52f540e81b6a 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -939,9 +939,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
retry_private:
if (1) {
- struct futex_hash_bucket *hb;
+ CLASS(hb, hb)(&q.key);
- hb = futex_hash(&q.key);
futex_q_lock(&q, hb);
ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
@@ -994,6 +993,16 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
goto no_block;
}
+ /*
+ * Caution; releasing @hb in-scope. The hb->lock is still locked
+ * while the reference is dropped. The reference cannot be dropped
+ * after the unlock because if a user-initiated resize is in progress
+ * then we might need to wake it. This cannot be done after the
+ * rt_mutex_pre_schedule() invocation. The hb will remain valid because
+ * the thread, performing resize, will block on hb->lock during
+ * the requeue.
+ */
+ futex_hash_put(no_free_ptr(hb));
/*
* Must be done before we enqueue the waiter, here is unfortunately
* under the hb lock, but that *should* work because it does nothing.
@@ -1119,7 +1128,6 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
u32 curval, uval, vpid = task_pid_vnr(current);
union futex_key key = FUTEX_KEY_INIT;
- struct futex_hash_bucket *hb;
struct futex_q *top_waiter;
int ret;
@@ -1139,7 +1147,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
if (ret)
return ret;
- hb = futex_hash(&key);
+ CLASS(hb, hb)(&key);
spin_lock(&hb->lock);
retry_hb:
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 209794cad6f2f..992e3ce005c6f 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -444,10 +444,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
retry_private:
if (1) {
- struct futex_hash_bucket *hb1, *hb2;
-
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
+ CLASS(hb, hb1)(&key1);
+ CLASS(hb, hb2)(&key2);
futex_hb_waiters_inc(hb2);
double_lock_hb(hb1, hb2);
@@ -817,9 +815,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
switch (futex_requeue_pi_wakeup_sync(&q)) {
case Q_REQUEUE_PI_IGNORE:
{
- struct futex_hash_bucket *hb;
-
- hb = futex_hash(&q.key);
+ CLASS(hb, hb)(&q.key);
/* The waiter is still on uaddr1 */
spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, to);
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 7dc35be09e436..d52541bcc07e9 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -154,7 +154,6 @@ void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q)
*/
int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
- struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
DEFINE_WAKE_Q(wake_q);
@@ -170,7 +169,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if ((flags & FLAGS_STRICT) && !nr_wake)
return 0;
- hb = futex_hash(&key);
+ CLASS(hb, hb)(&key);
/* Make sure we really have tasks to wakeup */
if (!futex_hb_waiters_pending(hb))
@@ -267,10 +266,8 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
retry_private:
if (1) {
- struct futex_hash_bucket *hb1, *hb2;
-
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
+ CLASS(hb, hb1)(&key1);
+ CLASS(hb, hb2)(&key2);
double_lock_hb(hb1, hb2);
op_ret = futex_atomic_op_inuser(op, uaddr2);
@@ -444,9 +441,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
u32 val = vs[i].w.val;
if (1) {
- struct futex_hash_bucket *hb;
+ CLASS(hb, hb)(&q->key);
- hb = futex_hash(&q->key);
futex_q_lock(q, hb);
ret = futex_get_value_locked(&uval, uaddr);
@@ -618,9 +614,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
retry_private:
if (1) {
- struct futex_hash_bucket *hb;
+ CLASS(hb, hb)(&q->key);
- hb = futex_hash(&q->key);
futex_q_lock(q, hb);
ret = futex_get_value_locked(&uval, uaddr);
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [PATCH v12 07/21] futex: Create private_hash() get/put class
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (5 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 06/21] futex: Create futex_hash() get/put class Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 08/21] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
` (14 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
From: Peter Zijlstra <peterz@infradead.org>
This gets us:
fph = futex_private_hash(key) /* gets fph and inc users */
futex_private_hash_get(fph) /* inc users */
futex_private_hash_put(fph) /* dec users */
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 12 ++++++++++++
kernel/futex/futex.h | 8 ++++++++
2 files changed, 20 insertions(+)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 56a5653e450cb..6a1d6b14277f4 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -107,6 +107,18 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
+struct futex_private_hash *futex_private_hash(void)
+{
+ return NULL;
+}
+
+bool futex_private_hash_get(struct futex_private_hash *fph)
+{
+ return false;
+}
+
+void futex_private_hash_put(struct futex_private_hash *fph) { }
+
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 77d9b3509f75c..bc76e366f9a77 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -206,10 +206,18 @@ extern struct futex_hash_bucket *futex_hash(union futex_key *key);
extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
+extern struct futex_private_hash *futex_private_hash(void);
+extern bool futex_private_hash_get(struct futex_private_hash *fph);
+extern void futex_private_hash_put(struct futex_private_hash *fph);
+
DEFINE_CLASS(hb, struct futex_hash_bucket *,
if (_T) futex_hash_put(_T),
futex_hash(key), union futex_key *key);
+DEFINE_CLASS(private_hash, struct futex_private_hash *,
+ if (_T) futex_private_hash_put(_T),
+ futex_private_hash(), void);
+
/**
* futex_match - Check whether two futex keys are equal
* @key1: Pointer to key1
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [PATCH v12 08/21] futex: Acquire a hash reference in futex_wait_multiple_setup()
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (6 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 07/21] futex: Create private_hash() " Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 09/21] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
` (13 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_wait_multiple_setup() changes task_struct::__state to
!TASK_RUNNING and then enqueues on multiple futexes. Every
futex_q_lock() acquires a reference on the global hash which is dropped
later.
If a rehash is in progress then the loop will block on
mm_struct::futex_hash_bucket until the rehash completes, and this will
lose the previously set task_struct::__state.
Acquire a reference on the local hash to avoid blocking on
mm_struct::futex_hash_bucket.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/waitwake.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index d52541bcc07e9..bd8fef0f8d180 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -406,6 +406,12 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
int ret, i;
u32 uval;
+ /*
+ * Make sure to have a reference on the private_hash such that we
+ * don't block on rehash after changing the task state below.
+ */
+ guard(private_hash)();
+
/*
* Enqueuing multiple futexes is tricky, because we need to enqueue
* each futex on the list before dealing with the next one to avoid
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [PATCH v12 09/21] futex: Decrease the waiter count before the unlock operation
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (7 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 08/21] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
` (12 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
To support runtime resizing of the process private hash, the obtained
hash bucket must not be used once its reference count has been dropped.
The reference is dropped after the hash bucket is unlocked.
The waiter count is decremented after the unlock operation. There
is no requirement that this has to happen after the unlock. The
increment happens before acquiring the lock to signal early that there
will be a waiter, so that a wake operation can avoid blocking on the
lock if it is known that there are no waiters.
There is no difference in terms of ordering whether the decrement
happens before or after the unlock.
Decrease the waiter count before the unlock operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 2 +-
kernel/futex/requeue.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 6a1d6b14277f4..5e70cb8eb2507 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -537,8 +537,8 @@ void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb)
void futex_q_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
- spin_unlock(&hb->lock);
futex_hb_waiters_dec(hb);
+ spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb,
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 992e3ce005c6f..023c028d2fce3 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -456,8 +456,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
ret = futex_get_value_locked(&curval, uaddr1);
if (unlikely(ret)) {
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
ret = get_user(curval, uaddr1);
if (ret)
@@ -542,8 +542,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
* waiter::requeue_state is correct.
*/
case -EFAULT:
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -556,8 +556,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
* exit to complete.
* - EAGAIN: The user space value changed.
*/
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -674,8 +674,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
put_pi_state(pi_state);
out_unlock:
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
}
wake_up_q(&wake_q);
return ret ? ret : task_count;
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock()
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (8 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 09/21] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-05-08 19:06 ` [PATCH v12 10/21] " André Almeida
2025-04-16 16:29 ` [PATCH v12 11/21] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
` (11 subsequent siblings)
21 siblings, 2 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_lock_pi() and __fixup_pi_state_owner() acquire the
futex_q::lock_ptr without holding a reference, assuming that the
previously obtained hash bucket and the assigned lock_ptr are still
valid. This assumption no longer holds once the private hash can be
resized: the bucket becomes invalid after the reference is dropped.
Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in
futex_q::lock_ptr. The lock pointer is read in an RCU section to ensure
that it does not go away if the hash bucket has been replaced while the
old pointer is still being observed. After locking, the pointer needs to
be compared to check whether it changed. If so, the hash bucket has been
replaced, the waiter has been moved to the new one and lock_ptr has been
updated; the lock operation needs to be redone in this case.
The locked hash bucket is not returned.
A special case is an early return in futex_lock_pi() (due to signal or
timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
futex_q::lock_ptr (and its matching hash bucket) is expected, but since
the waiter has been removed from the hash this can no longer be
guaranteed. Therefore a reference is acquired before the waiter is
removed and later dropped by the waiter, which prevents a resize in the
meantime.
Add futex_q_lockptr_lock() and use it.
Acquire an additional reference in requeue_pi_wake_futex() and
futex_unlock_pi() while the futex_q is removed, denote this extra
reference in futex_q::drop_hb_ref and let the waiter drop the reference
in this case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 25 +++++++++++++++++++++++++
kernel/futex/futex.h | 3 ++-
kernel/futex/pi.c | 15 +++++++++++++--
kernel/futex/requeue.c | 16 +++++++++++++---
4 files changed, 53 insertions(+), 6 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 5e70cb8eb2507..1443a98dfa7fa 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -134,6 +134,13 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
return &futex_queues[hash & futex_hashmask];
}
+/**
+ * futex_hash_get - Get an additional reference for the local hash.
+ * @hb: ptr to the private local hash.
+ *
+ * Obtain an additional reference for the already obtained hash bucket. The
+ * caller must already own a reference.
+ */
void futex_hash_get(struct futex_hash_bucket *hb) { }
void futex_hash_put(struct futex_hash_bucket *hb) { }
@@ -615,6 +622,24 @@ int futex_unqueue(struct futex_q *q)
return ret;
}
+void futex_q_lockptr_lock(struct futex_q *q)
+{
+ spinlock_t *lock_ptr;
+
+ /*
+ * See futex_unqueue() why lock_ptr can change.
+ */
+ guard(rcu)();
+retry:
+ lock_ptr = READ_ONCE(q->lock_ptr);
+ spin_lock(lock_ptr);
+
+ if (unlikely(lock_ptr != q->lock_ptr)) {
+ spin_unlock(lock_ptr);
+ goto retry;
+ }
+}
+
/*
* PI futexes can not be requeued and must remove themselves from the hash
* bucket. The hash bucket lock (i.e. lock_ptr) is held.
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index bc76e366f9a77..26e69333cb745 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -183,6 +183,7 @@ struct futex_q {
union futex_key *requeue_pi_key;
u32 bitset;
atomic_t requeue_state;
+ bool drop_hb_ref;
#ifdef CONFIG_PREEMPT_RT
struct rcuwait requeue_wait;
#endif
@@ -197,7 +198,7 @@ enum futex_access {
extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
enum futex_access rw);
-
+extern void futex_q_lockptr_lock(struct futex_q *q);
extern struct hrtimer_sleeper *
futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index e52f540e81b6a..dacb2330f1fbc 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -806,7 +806,7 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
break;
}
- spin_lock(q->lock_ptr);
+ futex_q_lockptr_lock(q);
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
/*
@@ -1072,7 +1072,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
* spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
* the
*/
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
/*
* Waiter is unqueued.
*/
@@ -1092,6 +1092,11 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
futex_unqueue_pi(&q);
spin_unlock(q.lock_ptr);
+ if (q.drop_hb_ref) {
+ CLASS(hb, hb)(&q.key);
+ /* Additional reference from futex_unlock_pi() */
+ futex_hash_put(hb);
+ }
goto out;
out_unlock_put_key:
@@ -1200,6 +1205,12 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
if (!rt_waiter) {
+ /*
+ * Acquire a reference for the leaving waiter to ensure
+ * valid futex_q::lock_ptr.
+ */
+ futex_hash_get(hb);
+ top_waiter->drop_hb_ref = true;
__futex_unqueue(top_waiter);
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
goto retry_hb;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 023c028d2fce3..b0e64fd454d96 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -231,7 +231,12 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
WARN_ON(!q->rt_waiter);
q->rt_waiter = NULL;
-
+ /*
+ * Acquire a reference for the waiter to ensure valid
+ * futex_q::lock_ptr.
+ */
+ futex_hash_get(hb);
+ q->drop_hb_ref = true;
q->lock_ptr = &hb->lock;
/* Signal locked state to the waiter */
@@ -826,7 +831,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
case Q_REQUEUE_PI_LOCKED:
/* The requeue acquired the lock */
if (q.pi_state && (q.pi_state->owner != current)) {
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
ret = fixup_pi_owner(uaddr2, &q, true);
/*
* Drop the reference to the pi state which the
@@ -853,7 +858,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
ret = 0;
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
@@ -885,6 +890,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
default:
BUG();
}
+ if (q.drop_hb_ref) {
+ CLASS(hb, hb)(&q.key);
+ /* Additional reference from requeue_pi_wake_futex() */
+ futex_hash_put(hb);
+ }
out:
if (to) {
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [PATCH v12 11/21] futex: Create helper function to initialize a hash slot
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (9 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 12/21] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
` (10 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Factor out the futex_hash_bucket initialisation into a helper function.
The helper function will be used in a follow-up patch implementing
process-private hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 1443a98dfa7fa..afc66780f84fc 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1160,6 +1160,13 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+{
+ atomic_set(&fhb->waiters, 0);
+ plist_head_init(&fhb->chain);
+ spin_lock_init(&fhb->lock);
+}
+
static int __init futex_init(void)
{
unsigned long hashsize, i;
@@ -1177,11 +1184,8 @@ static int __init futex_init(void)
hashsize, hashsize);
hashsize = 1UL << futex_shift;
- for (i = 0; i < hashsize; i++) {
- atomic_set(&futex_queues[i].waiters, 0);
- plist_head_init(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
+ for (i = 0; i < hashsize; i++)
+ futex_hash_bucket_init(&futex_queues[i]);
futex_hashmask = hashsize - 1;
return 0;
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [PATCH v12 12/21] futex: Add basic infrastructure for local task local hash
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (10 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 11/21] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 13/21] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
` (9 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The futex hash is system wide and shared by all tasks. Each slot
is hashed based on the futex address and the VMA of the thread. Due to
randomized VMAs (and memory allocations) the same logical lock (pointer)
can end up in a different hash bucket on each invocation of the
application. This in turn means that different applications may share a
hash bucket on the first invocation but not on the second, and it is not
always clear which applications will be involved. This can result in
high latencies when acquiring the futex_hash_bucket::lock, especially if
the lock owner is limited to a CPU and cannot be effectively PI-boosted.
Introduce basic infrastructure for a process local hash which is shared
by all threads of a process. This hash will only be used for
PROCESS_PRIVATE FUTEX operations.
The hashmap can be allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num)
A `num' of 0 means that the global hash is used instead of a private
hash.
Other values for `num' specify the number of slots for the hash; the
number must be a power of two, starting at two.
The prctl() returns zero on success. It can only be used before
additional threads are created.
The current status for the private hash can be queried via
num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which returns the current number of slots. The value 0 means that the
global hash is used. Values greater than 0 indicate the number of slots
that are used. A negative number indicates an error.
As an optimisation, jhash2() for the private hash uses only two inputs,
the address and the offset. This omits the VMA, which is always the same.
[peterz: Use 0 for global hash. A bit shuffling and renaming. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 26 ++++-
include/linux/mm_types.h | 5 +-
include/uapi/linux/prctl.h | 5 +
init/Kconfig | 5 +
kernel/fork.c | 2 +
kernel/futex/core.c | 208 +++++++++++++++++++++++++++++++++----
kernel/futex/futex.h | 10 ++
kernel/sys.c | 4 +
8 files changed, 244 insertions(+), 21 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85c..8f1be08bef18d 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -4,11 +4,11 @@
#include <linux/sched.h>
#include <linux/ktime.h>
+#include <linux/mm_types.h>
#include <uapi/linux/futex.h>
struct inode;
-struct mm_struct;
struct task_struct;
/*
@@ -77,7 +77,22 @@ void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
-#else
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+void futex_hash_free(struct mm_struct *mm);
+
+static inline void futex_mm_init(struct mm_struct *mm)
+{
+ mm->futex_phash = NULL;
+}
+
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+
+#else /* !CONFIG_FUTEX */
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
static inline void futex_exit_release(struct task_struct *tsk) { }
@@ -88,6 +103,13 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
{
return -EINVAL;
}
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
+{
+ return -EINVAL;
+}
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+
#endif
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f91..a4b5661e41770 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -31,6 +31,7 @@
#define INIT_PASID 0
struct address_space;
+struct futex_private_hash;
struct mem_cgroup;
/*
@@ -1031,7 +1032,9 @@ struct mm_struct {
*/
seqcount_t mm_lock_seq;
#endif
-
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+ struct futex_private_hash *futex_phash;
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11a..3b93fb906e3c5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,9 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+/* FUTEX hash management */
+#define PR_FUTEX_HASH 78
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+
#endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index dd2ea3b9a7992..b308b98d79347 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1699,6 +1699,11 @@ config FUTEX_PI
depends on FUTEX && RT_MUTEXES
default y
+config FUTEX_PRIVATE_HASH
+ bool
+ depends on FUTEX && !BASE_SMALL && MMU
+ default y
+
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b8..831dfec450544 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1305,6 +1305,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
+ futex_mm_init(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
mm->pmd_huge_pte = NULL;
#endif
@@ -1387,6 +1388,7 @@ static inline void __mmput(struct mm_struct *mm)
if (mm->binfmt)
module_put(mm->binfmt->module);
lru_gen_del_mm(mm);
+ futex_hash_free(mm);
mmdrop(mm);
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index afc66780f84fc..818df7420a1a9 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -39,6 +39,7 @@
#include <linux/memblock.h>
#include <linux/fault-inject.h>
#include <linux/slab.h>
+#include <linux/prctl.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -55,6 +56,12 @@ static struct {
#define futex_queues (__futex_data.queues)
#define futex_hashmask (__futex_data.hashmask)
+struct futex_private_hash {
+ unsigned int hash_mask;
+ void *mm;
+ bool custom;
+ struct futex_hash_bucket queues[];
+};
/*
* Fault injections for futexes.
@@ -107,9 +114,17 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
-struct futex_private_hash *futex_private_hash(void)
+static struct futex_hash_bucket *
+__futex_hash(union futex_key *key, struct futex_private_hash *fph);
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+static inline bool futex_key_is_private(union futex_key *key)
{
- return NULL;
+ /*
+ * Relies on get_futex_key() to set either bit for shared
+ * futexes -- see comment with union futex_key.
+ */
+ return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
}
bool futex_private_hash_get(struct futex_private_hash *fph)
@@ -117,21 +132,8 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
return false;
}
-void futex_private_hash_put(struct futex_private_hash *fph) { }
-
-/**
- * futex_hash - Return the hash bucket in the global hash
- * @key: Pointer to the futex key for which the hash is calculated
- *
- * We hash on the keys returned from get_futex_key (see below) and return the
- * corresponding hash bucket in the global hash.
- */
-struct futex_hash_bucket *futex_hash(union futex_key *key)
+void futex_private_hash_put(struct futex_private_hash *fph)
{
- u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
- key->both.offset);
-
- return &futex_queues[hash & futex_hashmask];
}
/**
@@ -144,6 +146,84 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
void futex_hash_get(struct futex_hash_bucket *hb) { }
void futex_hash_put(struct futex_hash_bucket *hb) { }
+static struct futex_hash_bucket *
+__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
+{
+ u32 hash;
+
+ if (!futex_key_is_private(key))
+ return NULL;
+
+ if (!fph)
+ fph = key->private.mm->futex_phash;
+ if (!fph || !fph->hash_mask)
+ return NULL;
+
+ hash = jhash2((void *)&key->private.address,
+ sizeof(key->private.address) / 4,
+ key->both.offset);
+ return &fph->queues[hash & fph->hash_mask];
+}
+
+struct futex_private_hash *futex_private_hash(void)
+{
+ struct mm_struct *mm = current->mm;
+ struct futex_private_hash *fph;
+
+ fph = mm->futex_phash;
+ return fph;
+}
+
+struct futex_hash_bucket *futex_hash(union futex_key *key)
+{
+ struct futex_hash_bucket *hb;
+
+ hb = __futex_hash(key, NULL);
+ return hb;
+}
+
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+
+static struct futex_hash_bucket *
+__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
+{
+ return NULL;
+}
+
+struct futex_hash_bucket *futex_hash(union futex_key *key)
+{
+ return __futex_hash(key, NULL);
+}
+
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+
+/**
+ * __futex_hash - Return the hash bucket
+ * @key: Pointer to the futex key for which the hash is calculated
+ * @fph: Pointer to private hash if known
+ *
+ * We hash on the keys returned from get_futex_key (see below) and return the
+ * corresponding hash bucket.
+ * If the FUTEX is PROCESS_PRIVATE then a per-process hash bucket (from the
+ * private hash) is returned if existing. Otherwise a hash bucket from the
+ * global hash is returned.
+ */
+static struct futex_hash_bucket *
+__futex_hash(union futex_key *key, struct futex_private_hash *fph)
+{
+ struct futex_hash_bucket *hb;
+ u32 hash;
+
+ hb = __futex_hash_private(key, fph);
+ if (hb)
+ return hb;
+
+ hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / 4,
+ key->both.offset);
+ return &futex_queues[hash & futex_hashmask];
+}
+
/**
* futex_setup_timer - set up the sleeping hrtimer.
* @time: ptr to the given timeout value
@@ -985,6 +1065,13 @@ static void exit_pi_state_list(struct task_struct *curr)
struct futex_pi_state *pi_state;
union futex_key key = FUTEX_KEY_INIT;
+ /*
+ * Ensure the hash remains stable (no resize) during the while loop
+ * below. The hb pointer is acquired under the pi_lock so we can't block
+ * on the mutex.
+ */
+ WARN_ON(curr != current);
+ guard(private_hash)();
/*
* We are a ZOMBIE and nobody can enqueue itself on
* pi_state_list anymore, but we have to be careful
@@ -1160,13 +1247,98 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
-static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
+ struct futex_private_hash *fph)
{
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+ fhb->priv = fph;
+#endif
atomic_set(&fhb->waiters, 0);
plist_head_init(&fhb->chain);
spin_lock_init(&fhb->lock);
}
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+void futex_hash_free(struct mm_struct *mm)
+{
+ kvfree(mm->futex_phash);
+}
+
+static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+{
+ struct mm_struct *mm = current->mm;
+ struct futex_private_hash *fph;
+ int i;
+
+ if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
+ return -EINVAL;
+
+ if (mm->futex_phash)
+ return -EALREADY;
+
+ if (!thread_group_empty(current))
+ return -EINVAL;
+
+ fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+ if (!fph)
+ return -ENOMEM;
+
+ fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
+ fph->custom = custom;
+ fph->mm = mm;
+
+ for (i = 0; i < hash_slots; i++)
+ futex_hash_bucket_init(&fph->queues[i], fph);
+
+ mm->futex_phash = fph;
+ return 0;
+}
+
+static int futex_hash_get_slots(void)
+{
+ struct futex_private_hash *fph;
+
+ fph = current->mm->futex_phash;
+ if (fph && fph->hash_mask)
+ return fph->hash_mask + 1;
+ return 0;
+}
+
+#else
+
+static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+{
+ return -EINVAL;
+}
+
+static int futex_hash_get_slots(void)
+{
+ return 0;
+}
+#endif
+
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
+{
+ int ret;
+
+ switch (arg2) {
+ case PR_FUTEX_HASH_SET_SLOTS:
+ if (arg4 != 0)
+ return -EINVAL;
+ ret = futex_hash_allocate(arg3, true);
+ break;
+
+ case PR_FUTEX_HASH_GET_SLOTS:
+ ret = futex_hash_get_slots();
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
static int __init futex_init(void)
{
unsigned long hashsize, i;
@@ -1185,7 +1357,7 @@ static int __init futex_init(void)
hashsize = 1UL << futex_shift;
for (i = 0; i < hashsize; i++)
- futex_hash_bucket_init(&futex_queues[i]);
+ futex_hash_bucket_init(&futex_queues[i], NULL);
futex_hashmask = hashsize - 1;
return 0;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 26e69333cb745..899aed5acde12 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -118,6 +118,7 @@ struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
+ struct futex_private_hash *priv;
} ____cacheline_aligned_in_smp;
/*
@@ -204,6 +205,7 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
@@ -211,6 +213,14 @@ extern struct futex_private_hash *futex_private_hash(void);
extern bool futex_private_hash_get(struct futex_private_hash *fph);
extern void futex_private_hash_put(struct futex_private_hash *fph);
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
+static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
+static inline struct futex_private_hash *futex_private_hash(void) { return NULL; }
+static inline bool futex_private_hash_get(struct futex_private_hash *fph) { return false; }
+static inline void futex_private_hash_put(struct futex_private_hash *fph) { }
+#endif
+
DEFINE_CLASS(hb, struct futex_hash_bucket *,
if (_T) futex_hash_put(_T),
futex_hash(key), union futex_key *key);
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5dd..adc0de0aa364a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
#include <linux/user_namespace.h>
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
+#include <linux/futex.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2820,6 +2821,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = posixtimer_create_prctl(arg2);
break;
+ case PR_FUTEX_HASH:
+ error = futex_hash_prctl(arg2, arg3, arg4);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
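The prctl() interface added above can be exercised from userspace roughly as follows. This is only a sketch: the PR_FUTEX_HASH constant values are copied from the patch and defined as fallbacks for headers that lack them, and on kernels without this series the call fails with EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Values from this patch; fallbacks for an older <linux/prctl.h>. */
#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH            78
#define PR_FUTEX_HASH_SET_SLOTS   1
#define PR_FUTEX_HASH_GET_SLOTS   2
#endif

/*
 * Ask the kernel for a private futex hash with @slots buckets.
 * @slots must be 0 or a power of two >= 2; arg4 must be 0 in this
 * patch. Returns 0 on success, -errno on failure (-EINVAL on kernels
 * without this series, or for an invalid slot count).
 */
static int set_futex_hash_slots(unsigned long slots)
{
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/*
 * Query the current number of private hash slots; 0 means the process
 * still uses the global hash.
 */
static long get_futex_hash_slots(void)
{
	long ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

Note that a slot count of 3 is rejected either way: kernels without the series reject the unknown prctl option, and kernels with it reject the non-power-of-two size.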
* [PATCH v12 13/21] futex: Allow automatic allocation of process wide futex hash
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (11 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 12/21] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Sebastian Andrzej Siewior
` (8 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Allocate a private futex hash with 16 slots when a task creates its
first thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 6 ++++++
kernel/fork.c | 22 ++++++++++++++++++++++
kernel/futex/core.c | 11 +++++++++++
3 files changed, 39 insertions(+)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 8f1be08bef18d..1d3f7555825ec 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -80,6 +80,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
#ifdef CONFIG_FUTEX_PRIVATE_HASH
+int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
@@ -88,6 +89,7 @@ static inline void futex_mm_init(struct mm_struct *mm)
}
#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate_default(void) { return 0; }
static inline void futex_hash_free(struct mm_struct *mm) { }
static inline void futex_mm_init(struct mm_struct *mm) { }
#endif /* CONFIG_FUTEX_PRIVATE_HASH */
@@ -107,6 +109,10 @@ static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsig
{
return -EINVAL;
}
+static inline int futex_hash_allocate_default(void)
+{
+ return 0;
+}
static inline void futex_hash_free(struct mm_struct *mm) { }
static inline void futex_mm_init(struct mm_struct *mm) { }
diff --git a/kernel/fork.c b/kernel/fork.c
index 831dfec450544..1f5d8083eeb25 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2164,6 +2164,13 @@ static void rv_task_fork(struct task_struct *p)
#define rv_task_fork(p) do {} while (0)
#endif
+static bool need_futex_hash_allocate_default(u64 clone_flags)
+{
+ if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
+ return false;
+ return true;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2544,6 +2551,21 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cancel_cgroup;
+ /*
+ * Allocate a default futex hash for the user process once the first
+ * thread spawns.
+ */
+ if (need_futex_hash_allocate_default(clone_flags)) {
+ retval = futex_hash_allocate_default();
+ if (retval)
+ goto bad_fork_core_free;
+ /*
+ * If we fail beyond this point we don't free the allocated
+ * futex hash map. We assume that another thread will be created
+ * and makes use of it. The hash map will be freed once the main
+ * thread terminates.
+ */
+ }
/*
* From this point on we must avoid any synchronous user-space
* communication until we take the tasklist-lock. In particular, we do
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 818df7420a1a9..53b3a00a92539 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1294,6 +1294,17 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
return 0;
}
+int futex_hash_allocate_default(void)
+{
+ if (!current->mm)
+ return 0;
+
+ if (current->mm->futex_phash)
+ return 0;
+
+ return futex_hash_allocate(16, false);
+}
+
static int futex_hash_get_slots(void)
{
struct futex_private_hash *fph;
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
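The clone-flag condition guarding the default allocation can be restated in userspace form. This is only an illustrative mirror of need_futex_hash_allocate_default() from the patch, with the CLONE_* values copied from the kernel UAPI: the default hash is set up only for a new thread that shares the mm, i.e. both CLONE_THREAD and CLONE_VM must be set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Clone flag values as defined in <linux/sched.h>. */
#define CLONE_VM     0x00000100
#define CLONE_THREAD 0x00010000

/*
 * Mirrors need_futex_hash_allocate_default(): true only when the
 * child is a thread sharing the address space of the caller.
 */
static bool need_futex_hash_allocate_default(uint64_t clone_flags)
{
	return (clone_flags & (CLONE_THREAD | CLONE_VM)) ==
	       (CLONE_THREAD | CLONE_VM);
}
```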
* [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (12 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 13/21] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
` (3 more replies)
2025-04-16 16:29 ` [PATCH v12 15/21] futex: Allow to make the private hash immutable Sebastian Andrzej Siewior
` (7 subsequent siblings)
21 siblings, 4 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
replacement. The futex_hash_allocate()/PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime to resize an already existing
internal private futex_hash_bucket to another size.
The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it enqueued itself on the hash bucket. No
reference is held while the task is scheduled out waiting for the
wake up.
The resize process allocates a new struct futex_private_hash and drops
the initial reference. Synchronized with mm_struct::futex_hash_lock it
is checked if the reference counter for the currently used
mm_struct::futex_phash is marked as DEAD. If so, then all users enqueued
on the current private hash are requeued on the new private hash and the
new private hash is set to mm_struct::futex_phash. Otherwise the newly
allocated private hash is saved as mm_struct::futex_phash_new and the
rehashing and reassigning is delayed to the futex_hash() caller once the
reference counter is marked DEAD.
The replacement is not performed at rcuref_put() time because certain
callers, such as futex_wait_queue(), drop their reference after changing
the task state. That task-state change would be destroyed once the
futex_hash_lock mutex is acquired.
The user can change the number of slots with PR_FUTEX_HASH_SET_SLOTS
multiple times. Both increasing and decreasing the size is allowed; the
request blocks until the assignment is done.
The private hash allocated at thread creation is changed from 16 to
16 <= 4 * number_of_threads <= global_hash_size
where number_of_threads cannot exceed the number of online CPUs. Should
the user issue PR_FUTEX_HASH_SET_SLOTS then the auto scaling is disabled.
[peterz: reorganize the code to avoid state tracking and simplify new
object handling, block the user until changes are in effect, allow
increase and decrease of the hash].
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 3 +-
include/linux/mm_types.h | 4 +-
kernel/futex/core.c | 290 ++++++++++++++++++++++++++++++++++++---
kernel/futex/requeue.c | 5 +
4 files changed, 281 insertions(+), 21 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 1d3f7555825ec..40bc778b2bb45 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -85,7 +85,8 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
- mm->futex_phash = NULL;
+ rcu_assign_pointer(mm->futex_phash, NULL);
+ mutex_init(&mm->futex_hash_lock);
}
#else /* !CONFIG_FUTEX_PRIVATE_HASH */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a4b5661e41770..32ba5126e2214 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1033,7 +1033,9 @@ struct mm_struct {
seqcount_t mm_lock_seq;
#endif
#ifdef CONFIG_FUTEX_PRIVATE_HASH
- struct futex_private_hash *futex_phash;
+ struct mutex futex_hash_lock;
+ struct futex_private_hash __rcu *futex_phash;
+ struct futex_private_hash *futex_phash_new;
#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 53b3a00a92539..9e7dad52abea8 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
+#include <linux/rcuref.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -57,7 +58,9 @@ static struct {
#define futex_hashmask (__futex_data.hashmask)
struct futex_private_hash {
+ rcuref_t users;
unsigned int hash_mask;
+ struct rcu_head rcu;
void *mm;
bool custom;
struct futex_hash_bucket queues[];
@@ -129,11 +132,14 @@ static inline bool futex_key_is_private(union futex_key *key)
bool futex_private_hash_get(struct futex_private_hash *fph)
{
- return false;
+ return rcuref_get(&fph->users);
}
void futex_private_hash_put(struct futex_private_hash *fph)
{
+ /* Ignore return value, last put is verified via rcuref_is_dead() */
+ if (rcuref_put(&fph->users))
+ wake_up_var(fph->mm);
}
/**
@@ -143,8 +149,23 @@ void futex_private_hash_put(struct futex_private_hash *fph)
* Obtain an additional reference for the already obtained hash bucket. The
* caller must already own an reference.
*/
-void futex_hash_get(struct futex_hash_bucket *hb) { }
-void futex_hash_put(struct futex_hash_bucket *hb) { }
+void futex_hash_get(struct futex_hash_bucket *hb)
+{
+ struct futex_private_hash *fph = hb->priv;
+
+ if (!fph)
+ return;
+ WARN_ON_ONCE(!futex_private_hash_get(fph));
+}
+
+void futex_hash_put(struct futex_hash_bucket *hb)
+{
+ struct futex_private_hash *fph = hb->priv;
+
+ if (!fph)
+ return;
+ futex_private_hash_put(fph);
+}
static struct futex_hash_bucket *
__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
@@ -155,7 +176,7 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
return NULL;
if (!fph)
- fph = key->private.mm->futex_phash;
+ fph = rcu_dereference(key->private.mm->futex_phash);
if (!fph || !fph->hash_mask)
return NULL;
@@ -165,21 +186,119 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
return &fph->queues[hash & fph->hash_mask];
}
+static void futex_rehash_private(struct futex_private_hash *old,
+ struct futex_private_hash *new)
+{
+ struct futex_hash_bucket *hb_old, *hb_new;
+ unsigned int slots = old->hash_mask + 1;
+ unsigned int i;
+
+ for (i = 0; i < slots; i++) {
+ struct futex_q *this, *tmp;
+
+ hb_old = &old->queues[i];
+
+ spin_lock(&hb_old->lock);
+ plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
+
+ plist_del(&this->list, &hb_old->chain);
+ futex_hb_waiters_dec(hb_old);
+
+ WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
+
+ hb_new = __futex_hash(&this->key, new);
+ futex_hb_waiters_inc(hb_new);
+ /*
+ * The new pointer isn't published yet but an already
+ * moved user can be unqueued due to timeout or signal.
+ */
+ spin_lock_nested(&hb_new->lock, SINGLE_DEPTH_NESTING);
+ plist_add(&this->list, &hb_new->chain);
+ this->lock_ptr = &hb_new->lock;
+ spin_unlock(&hb_new->lock);
+ }
+ spin_unlock(&hb_old->lock);
+ }
+}
+
+static bool __futex_pivot_hash(struct mm_struct *mm,
+ struct futex_private_hash *new)
+{
+ struct futex_private_hash *fph;
+
+ WARN_ON_ONCE(mm->futex_phash_new);
+
+ fph = rcu_dereference_protected(mm->futex_phash,
+ lockdep_is_held(&mm->futex_hash_lock));
+ if (fph) {
+ if (!rcuref_is_dead(&fph->users)) {
+ mm->futex_phash_new = new;
+ return false;
+ }
+
+ futex_rehash_private(fph, new);
+ }
+ rcu_assign_pointer(mm->futex_phash, new);
+ kvfree_rcu(fph, rcu);
+ return true;
+}
+
+static void futex_pivot_hash(struct mm_struct *mm)
+{
+ scoped_guard(mutex, &mm->futex_hash_lock) {
+ struct futex_private_hash *fph;
+
+ fph = mm->futex_phash_new;
+ if (fph) {
+ mm->futex_phash_new = NULL;
+ __futex_pivot_hash(mm, fph);
+ }
+ }
+}
+
struct futex_private_hash *futex_private_hash(void)
{
struct mm_struct *mm = current->mm;
- struct futex_private_hash *fph;
+ /*
+ * Ideally we don't loop. If there is a replacement in progress
+ * then a new private hash is already prepared and a reference can't be
+ * obtained once the last user dropped its reference.
+ * In that case we block on mm_struct::futex_hash_lock and either have
+ * to perform the replacement or wait while someone else is doing the
+ * job. Either way, on the second iteration we acquire a reference on the
+ * new private hash or loop again because a new replacement has been
+ * requested.
+ */
+again:
+ scoped_guard(rcu) {
+ struct futex_private_hash *fph;
- fph = mm->futex_phash;
- return fph;
+ fph = rcu_dereference(mm->futex_phash);
+ if (!fph)
+ return NULL;
+
+ if (rcuref_get(&fph->users))
+ return fph;
+ }
+ futex_pivot_hash(mm);
+ goto again;
}
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
+ struct futex_private_hash *fph;
struct futex_hash_bucket *hb;
- hb = __futex_hash(key, NULL);
- return hb;
+again:
+ scoped_guard(rcu) {
+ hb = __futex_hash(key, NULL);
+ fph = hb->priv;
+
+ if (!fph || futex_private_hash_get(fph))
+ return hb;
+ }
+ futex_pivot_hash(key->private.mm);
+ goto again;
}
#else /* !CONFIG_FUTEX_PRIVATE_HASH */
@@ -664,6 +783,8 @@ int futex_unqueue(struct futex_q *q)
spinlock_t *lock_ptr;
int ret = 0;
+ /* RCU so lock_ptr is not going away during locking. */
+ guard(rcu)();
/* In the common case we don't take the spinlock, which is nice. */
retry:
/*
@@ -1065,6 +1186,10 @@ static void exit_pi_state_list(struct task_struct *curr)
struct futex_pi_state *pi_state;
union futex_key key = FUTEX_KEY_INIT;
+ /*
+ * The mutex mm_struct::futex_hash_lock might be acquired.
+ */
+ might_sleep();
/*
* Ensure the hash remains stable (no resize) during the while loop
* below. The hb pointer is acquired under the pi_lock so we can't block
@@ -1261,7 +1386,51 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
#ifdef CONFIG_FUTEX_PRIVATE_HASH
void futex_hash_free(struct mm_struct *mm)
{
- kvfree(mm->futex_phash);
+ struct futex_private_hash *fph;
+
+ kvfree(mm->futex_phash_new);
+ fph = rcu_dereference_raw(mm->futex_phash);
+ if (fph) {
+ WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+ kvfree(fph);
+ }
+}
+
+static bool futex_pivot_pending(struct mm_struct *mm)
+{
+ struct futex_private_hash *fph;
+
+ guard(rcu)();
+
+ if (!mm->futex_phash_new)
+ return true;
+
+ fph = rcu_dereference(mm->futex_phash);
+ return rcuref_is_dead(&fph->users);
+}
+
+static bool futex_hash_less(struct futex_private_hash *a,
+ struct futex_private_hash *b)
+{
+ /* user provided always wins */
+ if (!a->custom && b->custom)
+ return true;
+ if (a->custom && !b->custom)
+ return false;
+
+ /* zero-sized hash wins */
+ if (!b->hash_mask)
+ return true;
+ if (!a->hash_mask)
+ return false;
+
+ /* keep the biggest */
+ if (a->hash_mask < b->hash_mask)
+ return true;
+ if (a->hash_mask > b->hash_mask)
+ return false;
+
+ return false; /* equal */
}
static int futex_hash_allocate(unsigned int hash_slots, bool custom)
@@ -1273,16 +1442,23 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
return -EINVAL;
- if (mm->futex_phash)
- return -EALREADY;
-
- if (!thread_group_empty(current))
- return -EINVAL;
+ /*
+ * Once we've disabled the global hash there is no way back.
+ */
+ scoped_guard(rcu) {
+ fph = rcu_dereference(mm->futex_phash);
+ if (fph && !fph->hash_mask) {
+ if (custom)
+ return -EBUSY;
+ return 0;
+ }
+ }
fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!fph)
return -ENOMEM;
+ rcuref_init(&fph->users, 1);
fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
fph->custom = custom;
fph->mm = mm;
@@ -1290,26 +1466,102 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
for (i = 0; i < hash_slots; i++)
futex_hash_bucket_init(&fph->queues[i], fph);
- mm->futex_phash = fph;
+ if (custom) {
+ /*
+ * Only let prctl() wait / retry; don't unduly delay clone().
+ */
+again:
+ wait_var_event(mm, futex_pivot_pending(mm));
+ }
+
+ scoped_guard(mutex, &mm->futex_hash_lock) {
+ struct futex_private_hash *free __free(kvfree) = NULL;
+ struct futex_private_hash *cur, *new;
+
+ cur = rcu_dereference_protected(mm->futex_phash,
+ lockdep_is_held(&mm->futex_hash_lock));
+ new = mm->futex_phash_new;
+ mm->futex_phash_new = NULL;
+
+ if (fph) {
+ if (cur && !new) {
+ /*
+ * If we have an existing hash, but do not yet have
+ * allocated a replacement hash, drop the initial
+ * reference on the existing hash.
+ */
+ futex_private_hash_put(cur);
+ }
+
+ if (new) {
+ /*
+ * Two updates raced; throw out the lesser one.
+ */
+ if (futex_hash_less(new, fph)) {
+ free = new;
+ new = fph;
+ } else {
+ free = fph;
+ }
+ } else {
+ new = fph;
+ }
+ fph = NULL;
+ }
+
+ if (new) {
+ /*
+ * Will set mm->futex_phash_new on failure;
+ * futex_private_hash_get() will try again.
+ */
+ if (!__futex_pivot_hash(mm, new) && custom)
+ goto again;
+ }
+ }
return 0;
}
int futex_hash_allocate_default(void)
{
+ unsigned int threads, buckets, current_buckets = 0;
+ struct futex_private_hash *fph;
+
if (!current->mm)
return 0;
- if (current->mm->futex_phash)
+ scoped_guard(rcu) {
+ threads = min_t(unsigned int,
+ get_nr_threads(current),
+ num_online_cpus());
+
+ fph = rcu_dereference(current->mm->futex_phash);
+ if (fph) {
+ if (fph->custom)
+ return 0;
+
+ current_buckets = fph->hash_mask + 1;
+ }
+ }
+
+ /*
+ * The default allocation will remain within
+ * 16 <= threads * 4 <= global hash size
+ */
+ buckets = roundup_pow_of_two(4 * threads);
+ buckets = clamp(buckets, 16, futex_hashmask + 1);
+
+ if (current_buckets >= buckets)
return 0;
- return futex_hash_allocate(16, false);
+ return futex_hash_allocate(buckets, false);
}
static int futex_hash_get_slots(void)
{
struct futex_private_hash *fph;
- fph = current->mm->futex_phash;
+ guard(rcu)();
+ fph = rcu_dereference(current->mm->futex_phash);
if (fph && fph->hash_mask)
return fph->hash_mask + 1;
return 0;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b0e64fd454d96..c716a66f86929 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -87,6 +87,11 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
futex_hb_waiters_inc(hb2);
plist_add(&q->list, &hb2->chain);
q->lock_ptr = &hb2->lock;
+ /*
+ * hb1 and hb2 belong to the same futex_private_hash
+ * because if we managed to get a reference on hb1 then it can't be
+ * replaced. Therefore we avoid put(hb1)+get(hb2) here.
+ */
}
q->key = *key2;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
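The auto-scaling bound described in the changelog (16 <= 4 * number_of_threads <= global hash size) can be sketched as a standalone helper. This is an illustrative userspace restatement of the sizing logic in futex_hash_allocate_default(); roundup_pow_of_two_u() is a hypothetical stand-in for the kernel's roundup_pow_of_two(), and the caller is assumed to have already capped the thread count at the number of online CPUs.

```c
#include <assert.h>

/* Round up to the next power of two (for n >= 1). */
static unsigned int roundup_pow_of_two_u(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Mirrors the auto scaling in futex_hash_allocate_default():
 * clamp(roundup_pow_of_two(4 * threads), 16, global_size).
 */
static unsigned int default_futex_buckets(unsigned int threads,
					  unsigned int global_size)
{
	unsigned int buckets = roundup_pow_of_two_u(4 * threads);

	if (buckets < 16)
		buckets = 16;
	if (buckets > global_size)
		buckets = global_size;
	return buckets;
}
```

A small process thus stays at the 16-slot minimum, while a heavily threaded one grows until it is capped by the global hash size.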
* [PATCH v12 15/21] futex: Allow to make the private hash immutable
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (13 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-02 18:01 ` Peter Zijlstra
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 16/21] futex: Implement FUTEX2_NUMA Sebastian Andrzej Siewior
` (6 subsequent siblings)
21 siblings, 2 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior, Shrikanth Hegde
My initial testing showed that
perf bench futex hash
reported fewer operations/sec with the private hash. After using the same
number of buckets in the private hash as used by the global hash, the
operations/sec were about the same.
This changed once the private hash became resizable. This feature added
an RCU section and reference counting via an atomic inc+dec operation into
the hot path.
The reference counting can be avoided if the private hash is made
immutable.
Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes whether
the private hash should be made immutable. Once set (to true), a further
resize is not allowed (the same applies if set to the global hash).
Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash cannot
be changed.
Update "perf bench" suite.
For comparison, results of "perf bench futex hash -s":
- Xeon CPU E5-2650, 2 NUMA nodes, total 32 CPUs:
- Before introducing the task local hash
shared Averaged 1.487.148 operations/sec (+- 0,53%), total secs = 10
private Averaged 2.192.405 operations/sec (+- 0,07%), total secs = 10
- With the series
shared Averaged 1.326.342 operations/sec (+- 0,41%), total secs = 10
-b128 Averaged 141.394 operations/sec (+- 1,15%), total secs = 10
-Ib128 Averaged 851.490 operations/sec (+- 0,67%), total secs = 10
-b8192 Averaged 131.321 operations/sec (+- 2,13%), total secs = 10
-Ib8192 Averaged 1.923.077 operations/sec (+- 0,61%), total secs = 10
128 is the default allocation of hash buckets.
8192 was the previous amount of allocated hash buckets.
- Xeon(R) CPU E7-8890 v3, 4 NUMA nodes, total 144 CPUs:
- Before introducing the task local hash
shared Averaged 1.810.936 operations/sec (+- 0,26%), total secs = 20
private Averaged 2.505.801 operations/sec (+- 0,05%), total secs = 20
- With the series
shared Averaged 1.589.002 operations/sec (+- 0,25%), total secs = 20
-b1024 Averaged 42.410 operations/sec (+- 0,20%), total secs = 20
-Ib1024 Averaged 740.638 operations/sec (+- 1,51%), total secs = 20
-b65536 Averaged 48.811 operations/sec (+- 1,35%), total secs = 20
-Ib65536 Averaged 1.963.165 operations/sec (+- 0,18%), total secs = 20
1024 is the default allocation of hash buckets.
65536 was the previous amount of allocated hash buckets.
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/uapi/linux/prctl.h | 1 +
kernel/futex/core.c | 44 ++++++++++++++++++++++++++++++++------
2 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 3b93fb906e3c5..21f30b3ded74b 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -368,5 +368,6 @@ struct prctl_mm_map {
#define PR_FUTEX_HASH 78
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
+# define PR_FUTEX_HASH_GET_IMMUTABLE 3
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 9e7dad52abea8..81c5705b6af5e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -63,6 +63,7 @@ struct futex_private_hash {
struct rcu_head rcu;
void *mm;
bool custom;
+ bool immutable;
struct futex_hash_bucket queues[];
};
@@ -132,12 +133,16 @@ static inline bool futex_key_is_private(union futex_key *key)
bool futex_private_hash_get(struct futex_private_hash *fph)
{
+ if (fph->immutable)
+ return true;
return rcuref_get(&fph->users);
}
void futex_private_hash_put(struct futex_private_hash *fph)
{
/* Ignore return value, last put is verified via rcuref_is_dead() */
+ if (fph->immutable)
+ return;
if (rcuref_put(&fph->users))
wake_up_var(fph->mm);
}
@@ -277,6 +282,8 @@ struct futex_private_hash *futex_private_hash(void)
if (!fph)
return NULL;
+ if (fph->immutable)
+ return fph;
if (rcuref_get(&fph->users))
return fph;
}
@@ -1433,7 +1440,7 @@ static bool futex_hash_less(struct futex_private_hash *a,
return false; /* equal */
}
-static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable, bool custom)
{
struct mm_struct *mm = current->mm;
struct futex_private_hash *fph;
@@ -1441,13 +1448,15 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
return -EINVAL;
+ if (immutable > 1)
+ return -EINVAL;
/*
* Once we've disabled the global hash there is no way back.
*/
scoped_guard(rcu) {
fph = rcu_dereference(mm->futex_phash);
- if (fph && !fph->hash_mask) {
+ if (fph && (!fph->hash_mask || fph->immutable)) {
if (custom)
return -EBUSY;
return 0;
@@ -1461,6 +1470,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
rcuref_init(&fph->users, 1);
fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
fph->custom = custom;
+ fph->immutable = !!immutable;
fph->mm = mm;
for (i = 0; i < hash_slots; i++)
@@ -1553,7 +1563,7 @@ int futex_hash_allocate_default(void)
if (current_buckets >= buckets)
return 0;
- return futex_hash_allocate(buckets, false);
+ return futex_hash_allocate(buckets, 0, false);
}
static int futex_hash_get_slots(void)
@@ -1567,9 +1577,22 @@ static int futex_hash_get_slots(void)
return 0;
}
+static int futex_hash_get_immutable(void)
+{
+ struct futex_private_hash *fph;
+
+ guard(rcu)();
+ fph = rcu_dereference(current->mm->futex_phash);
+ if (fph && fph->immutable)
+ return 1;
+ if (fph && !fph->hash_mask)
+ return 1;
+ return 0;
+}
+
#else
-static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable, bool custom)
{
return -EINVAL;
}
@@ -1578,6 +1601,11 @@ static int futex_hash_get_slots(void)
{
return 0;
}
+
+static int futex_hash_get_immutable(void)
+{
+ return 0;
+}
#endif
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
@@ -1586,15 +1614,17 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
switch (arg2) {
case PR_FUTEX_HASH_SET_SLOTS:
- if (arg4 != 0)
- return -EINVAL;
- ret = futex_hash_allocate(arg3, true);
+ ret = futex_hash_allocate(arg3, arg4, true);
break;
case PR_FUTEX_HASH_GET_SLOTS:
ret = futex_hash_get_slots();
break;
+ case PR_FUTEX_HASH_GET_IMMUTABLE:
+ ret = futex_hash_get_immutable();
+ break;
+
default:
ret = -EINVAL;
break;
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
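The extended prctl() interface from this patch can be driven from userspace roughly as follows. This is only a sketch: the constant values are copied from the series and defined as fallbacks for headers that lack them, and on kernels without the series the calls fail with EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Values from this series; fallbacks for an older <linux/prctl.h>. */
#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH            78
#define PR_FUTEX_HASH_SET_SLOTS   1
#endif
#ifndef PR_FUTEX_HASH_GET_IMMUTABLE
#define PR_FUTEX_HASH_GET_IMMUTABLE 3
#endif

/*
 * Request an immutable private hash: arg3 = slot count, arg4 = 1.
 * Returns 0 on success, -errno otherwise (-EINVAL on kernels without
 * this series, -EBUSY if the hash was already made immutable).
 */
static int set_futex_hash_immutable(unsigned long slots)
{
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, 1UL, 0UL) == -1)
		return -errno;
	return 0;
}

/* Returns 1 if the hash can no longer be changed, 0 otherwise. */
static long get_futex_hash_immutable(void)
{
	long ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE,
			 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

Once set_futex_hash_immutable() succeeds, get_futex_hash_immutable() reports 1 and all further resize attempts fail with -EBUSY, which is what removes the reference-count hot path shown in the benchmark numbers above.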
* [PATCH v12 16/21] futex: Implement FUTEX2_NUMA
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (14 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 15/21] futex: Allow to make the private hash immutable Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 17/21] futex: Implement FUTEX2_MPOL Sebastian Andrzej Siewior
` (5 subsequent siblings)
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
From: Peter Zijlstra <peterz@infradead.org>
Extend the futex2 interface to be numa aware.
When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.
struct futex_numa_32 {
u32 val;
u32 node;
};
When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.
When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.
This is done to avoid having to have a separate !numa hash-table.
[bigeasy: ensure a hashsize of at least 4 in futex_init(), add
pr_info() for size and allocation information. Cast the naddr math to
void *]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 3 ++
include/uapi/linux/futex.h | 8 +++
kernel/futex/core.c | 100 ++++++++++++++++++++++++++++++-------
kernel/futex/futex.h | 33 ++++++++++--
4 files changed, 124 insertions(+), 20 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 40bc778b2bb45..eccc99751bd94 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -34,6 +34,7 @@ union futex_key {
u64 i_seq;
unsigned long pgoff;
unsigned int offset;
+ /* unsigned int node; */
} shared;
struct {
union {
@@ -42,11 +43,13 @@ union futex_key {
};
unsigned long address;
unsigned int offset;
+ /* unsigned int node; */
} private;
struct {
u64 ptr;
unsigned long word;
unsigned int offset;
+ unsigned int node; /* NOT hashed! */
} both;
};
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index d2ee625ea1890..0435025beaae8 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -74,6 +74,14 @@
/* do not use */
#define FUTEX_32 FUTEX2_SIZE_U32 /* historical accident :-( */
+
+/*
+ * When FUTEX2_NUMA doubles the futex word, the second word is a node value.
+ * The special value -1 indicates no-node. This is the same value as
+ * NUMA_NO_NODE, except that value is not ABI, this is.
+ */
+#define FUTEX_NO_NODE (-1)
+
/*
* Max numbers of elements in a futex_waitv array
*/
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 81c5705b6af5e..b5be2d4a34a53 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -36,6 +36,8 @@
#include <linux/pagemap.h>
#include <linux/debugfs.h>
#include <linux/plist.h>
+#include <linux/gfp.h>
+#include <linux/vmalloc.h>
#include <linux/memblock.h>
#include <linux/fault-inject.h>
#include <linux/slab.h>
@@ -51,11 +53,14 @@
* reside in the same cacheline.
*/
static struct {
- struct futex_hash_bucket *queues;
unsigned long hashmask;
+ unsigned int hashshift;
+ struct futex_hash_bucket *queues[MAX_NUMNODES];
} __futex_data __read_mostly __aligned(2*sizeof(long));
-#define futex_queues (__futex_data.queues)
-#define futex_hashmask (__futex_data.hashmask)
+
+#define futex_hashmask (__futex_data.hashmask)
+#define futex_hashshift (__futex_data.hashshift)
+#define futex_queues (__futex_data.queues)
struct futex_private_hash {
rcuref_t users;
@@ -339,15 +344,35 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
{
struct futex_hash_bucket *hb;
u32 hash;
+ int node;
hb = __futex_hash_private(key, fph);
if (hb)
return hb;
hash = jhash2((u32 *)key,
- offsetof(typeof(*key), both.offset) / 4,
+ offsetof(typeof(*key), both.offset) / sizeof(u32),
key->both.offset);
- return &futex_queues[hash & futex_hashmask];
+ node = key->both.node;
+
+ if (node == FUTEX_NO_NODE) {
+ /*
+ * In case of !FLAGS_NUMA, use some unused hash bits to pick a
+ * node -- this ensures regular futexes are interleaved across
+ * the nodes and avoids having to allocate multiple
+ * hash-tables.
+ *
+ * NOTE: this isn't perfectly uniform, but it is fast and
+ * handles sparse node masks.
+ */
+ node = (hash >> futex_hashshift) % nr_node_ids;
+ if (!node_possible(node)) {
+ node = find_next_bit_wrap(node_possible_map.bits,
+ nr_node_ids, node);
+ }
+ }
+
+ return &futex_queues[node][hash & futex_hashmask];
}
/**
@@ -454,25 +479,49 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
struct page *page;
struct folio *folio;
struct address_space *mapping;
- int err, ro = 0;
+ int node, err, size, ro = 0;
bool fshared;
fshared = flags & FLAGS_SHARED;
+ size = futex_size(flags);
+ if (flags & FLAGS_NUMA)
+ size *= 2;
/*
* The futex address must be "naturally" aligned.
*/
key->both.offset = address % PAGE_SIZE;
- if (unlikely((address % sizeof(u32)) != 0))
+ if (unlikely((address % size) != 0))
return -EINVAL;
address -= key->both.offset;
- if (unlikely(!access_ok(uaddr, sizeof(u32))))
+ if (unlikely(!access_ok(uaddr, size)))
return -EFAULT;
if (unlikely(should_fail_futex(fshared)))
return -EFAULT;
+ if (flags & FLAGS_NUMA) {
+ u32 __user *naddr = (void *)uaddr + size / 2;
+
+ if (futex_get_value(&node, naddr))
+ return -EFAULT;
+
+ if (node == FUTEX_NO_NODE) {
+ node = numa_node_id();
+ if (futex_put_value(node, naddr))
+ return -EFAULT;
+
+ } else if (node >= MAX_NUMNODES || !node_possible(node)) {
+ return -EINVAL;
+ }
+
+ key->both.node = node;
+
+ } else {
+ key->both.node = FUTEX_NO_NODE;
+ }
+
/*
* PROCESS_PRIVATE futexes are fast.
* As the mm cannot disappear under us and the 'key' only needs
@@ -1635,24 +1684,41 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
static int __init futex_init(void)
{
unsigned long hashsize, i;
- unsigned int futex_shift;
+ unsigned int order, n;
+ unsigned long size;
#ifdef CONFIG_BASE_SMALL
hashsize = 16;
#else
- hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+ hashsize = 256 * num_possible_cpus();
+ hashsize /= num_possible_nodes();
+ hashsize = max(4, hashsize);
+ hashsize = roundup_pow_of_two(hashsize);
#endif
+ futex_hashshift = ilog2(hashsize);
+ size = sizeof(struct futex_hash_bucket) * hashsize;
+ order = get_order(size);
- futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
- hashsize, 0, 0,
- &futex_shift, NULL,
- hashsize, hashsize);
- hashsize = 1UL << futex_shift;
+ for_each_node(n) {
+ struct futex_hash_bucket *table;
- for (i = 0; i < hashsize; i++)
- futex_hash_bucket_init(&futex_queues[i], NULL);
+ if (order > MAX_PAGE_ORDER)
+ table = vmalloc_huge_node(size, GFP_KERNEL, n);
+ else
+ table = alloc_pages_exact_nid(n, size, GFP_KERNEL);
+
+ BUG_ON(!table);
+
+ for (i = 0; i < hashsize; i++)
+ futex_hash_bucket_init(&table[i], NULL);
+
+ futex_queues[n] = table;
+ }
futex_hashmask = hashsize - 1;
+ pr_info("futex hash table entries: %lu (%lu bytes on %d NUMA nodes, total %lu KiB, %s).\n",
+ hashsize, size, num_possible_nodes(), size * num_possible_nodes() / 1024,
+ order > MAX_PAGE_ORDER ? "vmalloc" : "linear");
return 0;
}
core_initcall(futex_init);
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 899aed5acde12..acc7953678898 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -54,7 +54,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
return flags;
}
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
/* FUTEX2_ to FLAGS_ */
static inline unsigned int futex2_to_flags(unsigned int flags2)
@@ -87,6 +87,19 @@ static inline bool futex_flags_valid(unsigned int flags)
if ((flags & FLAGS_SIZE_MASK) != FLAGS_SIZE_32)
return false;
+ /*
+ * Must be able to represent both FUTEX_NO_NODE and every valid nodeid
+ * in a futex word.
+ */
+ if (flags & FLAGS_NUMA) {
+ int bits = 8 * futex_size(flags);
+ u64 max = ~0ULL;
+
+ max >>= 64 - bits;
+ if (nr_node_ids >= max)
+ return false;
+ }
+
return true;
}
@@ -282,7 +295,7 @@ static inline int futex_cmpxchg_value_locked(u32 *curval, u32 __user *uaddr, u32
* This looks a bit overkill, but generally just results in a couple
* of instructions.
*/
-static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from)
+static __always_inline int futex_get_value(u32 *dest, u32 __user *from)
{
u32 val;
@@ -299,12 +312,26 @@ static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from)
return -EFAULT;
}
+static __always_inline int futex_put_value(u32 val, u32 __user *to)
+{
+ if (can_do_masked_user_access())
+ to = masked_user_access_begin(to);
+ else if (!user_read_access_begin(to, sizeof(*to)))
+ return -EFAULT;
+ unsafe_put_user(val, to, Efault);
+ user_read_access_end();
+ return 0;
+Efault:
+ user_read_access_end();
+ return -EFAULT;
+}
+
static inline int futex_get_value_locked(u32 *dest, u32 __user *from)
{
int ret;
pagefault_disable();
- ret = futex_read_inatomic(dest, from);
+ ret = futex_get_value(dest, from);
pagefault_enable();
return ret;
--
2.49.0
* [PATCH v12 17/21] futex: Implement FUTEX2_MPOL
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
From: Peter Zijlstra <peterz@infradead.org>
Extend the futex2 interface to be aware of mempolicy.
When FUTEX2_MPOL is specified and there is a MPOL_PREFERRED or
home_node specified covering the futex address, use that hash-map.
Notably, in this case the futex will go to the global node hashtable,
even if it is a PRIVATE futex.
When FUTEX2_NUMA|FUTEX2_MPOL is specified and the user specified node
value is FUTEX_NO_NODE, the MPOL lookup (as described above) will be
tried first before reverting to setting node to the local node.
[bigeasy: add CONFIG_FUTEX_MPOL, add MPOL to FUTEX2_VALID_MASK, write
the node only to user if FUTEX_NO_NODE was supplied]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/mmap_lock.h | 4 ++
include/uapi/linux/futex.h | 2 +-
init/Kconfig | 5 ++
kernel/futex/core.c | 114 +++++++++++++++++++++++++++++++------
kernel/futex/futex.h | 6 +-
5 files changed, 113 insertions(+), 18 deletions(-)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 4706c67699027..e0eddfd306ef3 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -7,6 +7,7 @@
#include <linux/rwsem.h>
#include <linux/tracepoint-defs.h>
#include <linux/types.h>
+#include <linux/cleanup.h>
#define MMAP_LOCK_INITIALIZER(name) \
.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
@@ -211,6 +212,9 @@ static inline void mmap_read_unlock(struct mm_struct *mm)
up_read(&mm->mmap_lock);
}
+DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
+ mmap_read_lock(_T), mmap_read_unlock(_T))
+
static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
{
__mmap_lock_trace_released(mm, false);
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0435025beaae8..247c425e175ef 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -63,7 +63,7 @@
#define FUTEX2_SIZE_U32 0x02
#define FUTEX2_SIZE_U64 0x03
#define FUTEX2_NUMA 0x04
- /* 0x08 */
+#define FUTEX2_MPOL 0x08
/* 0x10 */
/* 0x20 */
/* 0x40 */
diff --git a/init/Kconfig b/init/Kconfig
index b308b98d79347..174633bc9810b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1704,6 +1704,11 @@ config FUTEX_PRIVATE_HASH
depends on FUTEX && !BASE_SMALL && MMU
default y
+config FUTEX_MPOL
+ bool
+ depends on FUTEX && NUMA
+ default y
+
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b5be2d4a34a53..ee1d7182ce0c0 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -43,6 +43,8 @@
#include <linux/slab.h>
#include <linux/prctl.h>
#include <linux/rcuref.h>
+#include <linux/mempolicy.h>
+#include <linux/mmap_lock.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -328,6 +330,73 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+#ifdef CONFIG_FUTEX_MPOL
+static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
+{
+ struct vm_area_struct *vma = vma_lookup(mm, addr);
+ struct mempolicy *mpol;
+ int node = FUTEX_NO_NODE;
+
+ if (!vma)
+ return FUTEX_NO_NODE;
+
+ mpol = vma_policy(vma);
+ if (!mpol)
+ return FUTEX_NO_NODE;
+
+ switch (mpol->mode) {
+ case MPOL_PREFERRED:
+ node = first_node(mpol->nodes);
+ break;
+ case MPOL_PREFERRED_MANY:
+ case MPOL_BIND:
+ if (mpol->home_node != NUMA_NO_NODE)
+ node = mpol->home_node;
+ break;
+ default:
+ break;
+ }
+
+ return node;
+}
+
+static int futex_key_to_node_opt(struct mm_struct *mm, unsigned long addr)
+{
+ int seq, node;
+
+ guard(rcu)();
+
+ if (!mmap_lock_speculate_try_begin(mm, &seq))
+ return -EBUSY;
+
+ node = __futex_key_to_node(mm, addr);
+
+ if (mmap_lock_speculate_retry(mm, seq))
+ return -EAGAIN;
+
+ return node;
+}
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+ int node;
+
+ node = futex_key_to_node_opt(mm, addr);
+ if (node >= FUTEX_NO_NODE)
+ return node;
+
+ guard(mmap_read_lock)(mm);
+ return __futex_key_to_node(mm, addr);
+}
+#else /* !CONFIG_FUTEX_MPOL */
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+ return FUTEX_NO_NODE;
+}
+
+#endif /* CONFIG_FUTEX_MPOL */
+
/**
* __futex_hash - Return the hash bucket
* @key: Pointer to the futex key for which the hash is calculated
@@ -342,18 +411,20 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
static struct futex_hash_bucket *
__futex_hash(union futex_key *key, struct futex_private_hash *fph)
{
- struct futex_hash_bucket *hb;
+ int node = key->both.node;
u32 hash;
- int node;
- hb = __futex_hash_private(key, fph);
- if (hb)
- return hb;
+ if (node == FUTEX_NO_NODE) {
+ struct futex_hash_bucket *hb;
+
+ hb = __futex_hash_private(key, fph);
+ if (hb)
+ return hb;
+ }
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / sizeof(u32),
key->both.offset);
- node = key->both.node;
if (node == FUTEX_NO_NODE) {
/*
@@ -480,6 +551,7 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
struct folio *folio;
struct address_space *mapping;
int node, err, size, ro = 0;
+ bool node_updated = false;
bool fshared;
fshared = flags & FLAGS_SHARED;
@@ -501,27 +573,37 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
if (unlikely(should_fail_futex(fshared)))
return -EFAULT;
+ node = FUTEX_NO_NODE;
+
if (flags & FLAGS_NUMA) {
u32 __user *naddr = (void *)uaddr + size / 2;
if (futex_get_value(&node, naddr))
return -EFAULT;
+ if (node != FUTEX_NO_NODE &&
+ (node >= MAX_NUMNODES || !node_possible(node)))
+ return -EINVAL;
+ }
+
+ if (node == FUTEX_NO_NODE && (flags & FLAGS_MPOL)) {
+ node = futex_mpol(mm, address);
+ node_updated = true;
+ }
+
+ if (flags & FLAGS_NUMA) {
+ u32 __user *naddr = (void *)uaddr + size / 2;
+
if (node == FUTEX_NO_NODE) {
node = numa_node_id();
- if (futex_put_value(node, naddr))
- return -EFAULT;
-
- } else if (node >= MAX_NUMNODES || !node_possible(node)) {
- return -EINVAL;
+ node_updated = true;
}
-
- key->both.node = node;
-
- } else {
- key->both.node = FUTEX_NO_NODE;
+ if (node_updated && futex_put_value(node, naddr))
+ return -EFAULT;
}
+ key->both.node = node;
+
/*
* PROCESS_PRIVATE futexes are fast.
* As the mm cannot disappear under us and the 'key' only needs
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index acc7953678898..069fc2a83080d 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -39,6 +39,7 @@
#define FLAGS_HAS_TIMEOUT 0x0040
#define FLAGS_NUMA 0x0080
#define FLAGS_STRICT 0x0100
+#define FLAGS_MPOL 0x0200
/* FUTEX_ to FLAGS_ */
static inline unsigned int futex_to_flags(unsigned int op)
@@ -54,7 +55,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
return flags;
}
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_MPOL | FUTEX2_PRIVATE)
/* FUTEX2_ to FLAGS_ */
static inline unsigned int futex2_to_flags(unsigned int flags2)
@@ -67,6 +68,9 @@ static inline unsigned int futex2_to_flags(unsigned int flags2)
if (flags2 & FUTEX2_NUMA)
flags |= FLAGS_NUMA;
+ if (flags2 & FUTEX2_MPOL)
+ flags |= FLAGS_MPOL;
+
return flags;
}
--
2.49.0
* [PATCH v12 18/21] tools headers: Synchronize prctl.h ABI header
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior, Liang, Kan, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
Jiri Olsa, Mark Rutland, Namhyung Kim, linux-perf-users
Synchronize prctl.h with current uapi version after adding
PR_FUTEX_HASH.
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
tools/include/uapi/linux/prctl.h | 44 +++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 35791791a879b..21f30b3ded74b 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -230,7 +230,7 @@ struct prctl_mm_map {
# define PR_PAC_APDBKEY (1UL << 3)
# define PR_PAC_APGAKEY (1UL << 4)
-/* Tagged user address controls for arm64 */
+/* Tagged user address controls for arm64 and RISC-V */
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_GET_TAGGED_ADDR_CTRL 56
# define PR_TAGGED_ADDR_ENABLE (1UL << 0)
@@ -244,6 +244,9 @@ struct prctl_mm_map {
# define PR_MTE_TAG_MASK (0xffffUL << PR_MTE_TAG_SHIFT)
/* Unused; kept only for source compatibility */
# define PR_MTE_TCF_SHIFT 1
+/* RISC-V pointer masking tag length */
+# define PR_PMLEN_SHIFT 24
+# define PR_PMLEN_MASK (0x7fUL << PR_PMLEN_SHIFT)
/* Control reclaim behavior when allocating memory */
#define PR_SET_IO_FLUSHER 57
@@ -328,4 +331,43 @@ struct prctl_mm_map {
# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
# define PR_PPC_DEXCR_CTRL_MASK 0x1f
+/*
+ * Get the current shadow stack configuration for the current thread,
+ * this will be the value configured via PR_SET_SHADOW_STACK_STATUS.
+ */
+#define PR_GET_SHADOW_STACK_STATUS 74
+
+/*
+ * Set the current shadow stack configuration. Enabling the shadow
+ * stack will cause a shadow stack to be allocated for the thread.
+ */
+#define PR_SET_SHADOW_STACK_STATUS 75
+# define PR_SHADOW_STACK_ENABLE (1UL << 0)
+# define PR_SHADOW_STACK_WRITE (1UL << 1)
+# define PR_SHADOW_STACK_PUSH (1UL << 2)
+
+/*
+ * Prevent further changes to the specified shadow stack
+ * configuration. All bits may be locked via this call, including
+ * undefined bits.
+ */
+#define PR_LOCK_SHADOW_STACK_STATUS 76
+
+/*
+ * Controls the mode of timer_create() for CRIU restore operations.
+ * Enabling this allows CRIU to restore timers with explicit IDs.
+ *
+ * Don't use for normal operations as the result might be undefined.
+ */
+#define PR_TIMER_CREATE_RESTORE_IDS 77
+# define PR_TIMER_CREATE_RESTORE_IDS_OFF 0
+# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
+# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+
+/* FUTEX hash management */
+#define PR_FUTEX_HASH 78
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+# define PR_FUTEX_HASH_GET_IMMUTABLE 3
+
#endif /* _LINUX_PRCTL_H */
--
2.49.0
* [PATCH v12 19/21] tools/perf: Allow to select the number of hash buckets
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior, Liang, Kan, Adrian Hunter,
Alexander Shishkin, Arnaldo Carvalho de Melo, Ian Rogers,
Jiri Olsa, Mark Rutland, Namhyung Kim, linux-perf-users
Add the -b/--buckets argument to specify the number of hash buckets for
the private futex hash. This is passed directly to
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, buckets, immutable)
which must succeed if buckets were specified. The `immutable' argument
defaults to 0 and can be set to 1 via the -I/--immutable argument.
The size of the private hash is verified with PR_FUTEX_HASH_GET_SLOTS.
If PR_FUTEX_HASH_GET_SLOTS fails, it is assumed that an older kernel
without this support is running and the global hash is in use.
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
tools/perf/bench/Build | 1 +
tools/perf/bench/futex-hash.c | 7 +++
tools/perf/bench/futex-lock-pi.c | 5 ++
tools/perf/bench/futex-requeue.c | 6 +++
tools/perf/bench/futex-wake-parallel.c | 9 +++-
tools/perf/bench/futex-wake.c | 4 ++
tools/perf/bench/futex.c | 65 ++++++++++++++++++++++++++
tools/perf/bench/futex.h | 5 ++
8 files changed, 101 insertions(+), 1 deletion(-)
create mode 100644 tools/perf/bench/futex.c
diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 279ab2ab4abe4..b558ab98719f9 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -3,6 +3,7 @@ perf-bench-y += sched-pipe.o
perf-bench-y += sched-seccomp-notify.o
perf-bench-y += syscall.o
perf-bench-y += mem-functions.o
+perf-bench-y += futex.o
perf-bench-y += futex-hash.o
perf-bench-y += futex-wake.o
perf-bench-y += futex-wake-parallel.o
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index b472eded521b1..fdf133c9520f7 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -18,9 +18,11 @@
#include <stdlib.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
+#include <linux/prctl.h>
#include <linux/zalloc.h>
#include <sys/time.h>
#include <sys/mman.h>
+#include <sys/prctl.h>
#include <perf/cpumap.h>
#include "../util/mutex.h"
@@ -50,9 +52,12 @@ struct worker {
static struct bench_futex_parameters params = {
.nfutexes = 1024,
.runtime = 10,
+ .nbuckets = -1,
};
static const struct option options[] = {
OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
OPT_UINTEGER('r', "runtime", &params.runtime, "Specify runtime (in seconds)"),
OPT_UINTEGER('f', "futexes", &params.nfutexes, "Specify amount of futexes per threads"),
@@ -118,6 +123,7 @@ static void print_summary(void)
printf("%sAveraged %ld operations/sec (+- %.2f%%), total secs = %d\n",
!params.silent ? "\n" : "", avg, rel_stddev_stats(stddev, avg),
(int)bench__runtime.tv_sec);
+ futex_print_nbuckets(&params);
}
int bench_futex_hash(int argc, const char **argv)
@@ -161,6 +167,7 @@ int bench_futex_hash(int argc, const char **argv)
if (!params.fshared)
futex_flag = FUTEX_PRIVATE_FLAG;
+ futex_set_nbuckets_param(&params);
printf("Run summary [PID %d]: %d threads, each operating on %d [%s] futexes for %d secs.\n\n",
getpid(), params.nthreads, params.nfutexes, params.fshared ? "shared":"private", params.runtime);
diff --git a/tools/perf/bench/futex-lock-pi.c b/tools/perf/bench/futex-lock-pi.c
index 0416120c091b2..5144a158512cc 100644
--- a/tools/perf/bench/futex-lock-pi.c
+++ b/tools/perf/bench/futex-lock-pi.c
@@ -41,10 +41,13 @@ static struct stats throughput_stats;
static struct cond thread_parent, thread_worker;
static struct bench_futex_parameters params = {
+ .nbuckets = -1,
.runtime = 10,
};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
OPT_UINTEGER('r', "runtime", &params.runtime, "Specify runtime (in seconds)"),
OPT_BOOLEAN( 'M', "multi", &params.multi, "Use multiple futexes"),
@@ -67,6 +70,7 @@ static void print_summary(void)
printf("%sAveraged %ld operations/sec (+- %.2f%%), total secs = %d\n",
!params.silent ? "\n" : "", avg, rel_stddev_stats(stddev, avg),
(int)bench__runtime.tv_sec);
+ futex_print_nbuckets(&params);
}
static void toggle_done(int sig __maybe_unused,
@@ -203,6 +207,7 @@ int bench_futex_lock_pi(int argc, const char **argv)
mutex_init(&thread_lock);
cond_init(&thread_parent);
cond_init(&thread_worker);
+ futex_set_nbuckets_param(&params);
threads_starting = params.nthreads;
gettimeofday(&bench__start, NULL);
diff --git a/tools/perf/bench/futex-requeue.c b/tools/perf/bench/futex-requeue.c
index aad5bfc4fe188..a2f91ee1950b3 100644
--- a/tools/perf/bench/futex-requeue.c
+++ b/tools/perf/bench/futex-requeue.c
@@ -42,6 +42,7 @@ static unsigned int threads_starting;
static int futex_flag = 0;
static struct bench_futex_parameters params = {
+ .nbuckets = -1,
/*
* How many tasks to requeue at a time.
* Default to 1 in order to make the kernel work more.
@@ -50,6 +51,8 @@ static struct bench_futex_parameters params = {
};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
OPT_UINTEGER('q', "nrequeue", &params.nrequeue, "Specify amount of threads to requeue at once"),
OPT_BOOLEAN( 's', "silent", &params.silent, "Silent mode: do not display data/details"),
@@ -77,6 +80,7 @@ static void print_summary(void)
params.nthreads,
requeuetime_avg / USEC_PER_MSEC,
rel_stddev_stats(requeuetime_stddev, requeuetime_avg));
+ futex_print_nbuckets(&params);
}
static void *workerfn(void *arg __maybe_unused)
@@ -204,6 +208,8 @@ int bench_futex_requeue(int argc, const char **argv)
if (params.broadcast)
params.nrequeue = params.nthreads;
+ futex_set_nbuckets_param(&params);
+
printf("Run summary [PID %d]: Requeuing %d threads (from [%s] %p to %s%p), "
"%d at a time.\n\n", getpid(), params.nthreads,
params.fshared ? "shared":"private", &futex1,
diff --git a/tools/perf/bench/futex-wake-parallel.c b/tools/perf/bench/futex-wake-parallel.c
index 4352e318631e9..ee66482c29fd1 100644
--- a/tools/perf/bench/futex-wake-parallel.c
+++ b/tools/perf/bench/futex-wake-parallel.c
@@ -57,9 +57,13 @@ static struct stats waketime_stats, wakeup_stats;
static unsigned int threads_starting;
static int futex_flag = 0;
-static struct bench_futex_parameters params;
+static struct bench_futex_parameters params = {
+ .nbuckets = -1,
+};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
OPT_UINTEGER('w', "nwakers", &params.nwakes, "Specify amount of waking threads"),
OPT_BOOLEAN( 's', "silent", &params.silent, "Silent mode: do not display data/details"),
@@ -218,6 +222,7 @@ static void print_summary(void)
params.nthreads,
waketime_avg / USEC_PER_MSEC,
rel_stddev_stats(waketime_stddev, waketime_avg));
+ futex_print_nbuckets(&params);
}
@@ -291,6 +296,8 @@ int bench_futex_wake_parallel(int argc, const char **argv)
if (!params.fshared)
futex_flag = FUTEX_PRIVATE_FLAG;
+ futex_set_nbuckets_param(&params);
+
printf("Run summary [PID %d]: blocking on %d threads (at [%s] "
"futex %p), %d threads waking up %d at a time.\n\n",
getpid(), params.nthreads, params.fshared ? "shared":"private",
diff --git a/tools/perf/bench/futex-wake.c b/tools/perf/bench/futex-wake.c
index 49b3c89b0b35d..8d6107f7cd941 100644
--- a/tools/perf/bench/futex-wake.c
+++ b/tools/perf/bench/futex-wake.c
@@ -42,6 +42,7 @@ static unsigned int threads_starting;
static int futex_flag = 0;
static struct bench_futex_parameters params = {
+ .nbuckets = -1,
/*
* How many wakeups to do at a time.
* Default to 1 in order to make the kernel work more.
@@ -50,6 +51,8 @@ static struct bench_futex_parameters params = {
};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
OPT_UINTEGER('w', "nwakes", &params.nwakes, "Specify amount of threads to wake at once"),
OPT_BOOLEAN( 's', "silent", &params.silent, "Silent mode: do not display data/details"),
@@ -93,6 +96,7 @@ static void print_summary(void)
params.nthreads,
waketime_avg / USEC_PER_MSEC,
rel_stddev_stats(waketime_stddev, waketime_avg));
+ futex_print_nbuckets(&params);
}
static void block_threads(pthread_t *w, struct perf_cpu_map *cpu)
diff --git a/tools/perf/bench/futex.c b/tools/perf/bench/futex.c
new file mode 100644
index 0000000000000..02ae6c52ba881
--- /dev/null
+++ b/tools/perf/bench/futex.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <err.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+
+#include "futex.h"
+
+void futex_set_nbuckets_param(struct bench_futex_parameters *params)
+{
+ int ret;
+
+ if (params->nbuckets < 0)
+ return;
+
+ ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, params->nbuckets, params->buckets_immutable);
+ if (ret) {
+ printf("Requesting %d hash buckets failed: %d/%m\n",
+ params->nbuckets, ret);
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
+}
+
+void futex_print_nbuckets(struct bench_futex_parameters *params)
+{
+ char *futex_hash_mode;
+ int ret;
+
+ ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
+ if (params->nbuckets >= 0) {
+ if (ret != params->nbuckets) {
+ if (ret < 0) {
+ printf("Can't query number of buckets: %m\n");
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
+ printf("Requested number of hash buckets is not currently used.\n");
+ printf("Requested: %d, in use: %d\n", params->nbuckets, ret);
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
+ if (params->nbuckets == 0) {
+ ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
+ } else {
+ ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+ if (ret < 0) {
+ printf("Can't check if the hash is immutable: %m\n");
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
+ ret = asprintf(&futex_hash_mode, "Futex hashing: %d hash buckets %s",
+ params->nbuckets,
+ ret == 1 ? "(immutable)" : "");
+ }
+ } else {
+ if (ret <= 0) {
+ ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
+ } else {
+ ret = asprintf(&futex_hash_mode, "Futex hashing: auto resized to %d buckets",
+ ret);
+ }
+ }
+ if (ret < 0)
+ err(EXIT_FAILURE, "ENOMEM, futex_hash_mode");
+ printf("%s\n", futex_hash_mode);
+ free(futex_hash_mode);
+}
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
index ebdc2b032afc1..9c9a73f9d865e 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -25,6 +25,8 @@ struct bench_futex_parameters {
unsigned int nfutexes;
unsigned int nwakes;
unsigned int nrequeue;
+ int nbuckets;
+ bool buckets_immutable;
};
/**
@@ -143,4 +145,7 @@ futex_cmp_requeue_pi(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2,
val, opflags);
}
+void futex_set_nbuckets_param(struct bench_futex_parameters *params);
+void futex_print_nbuckets(struct bench_futex_parameters *params);
+
#endif /* _FUTEX_H */
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (18 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 19/21] tools/perf: Allow to select the number of hash buckets Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
` (2 more replies)
2025-04-16 16:29 ` [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol Sebastian Andrzej Siewior
2025-04-16 16:31 ` [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
21 siblings, 3 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Test the basic functionality of the private hash:
- Upon start, with no threads there is no private hash.
- The first thread initializes the private hash.
- More than four threads will increase the size of the private hash if
the system has more than 16 CPUs online.
- Once the user sets the size of private hash, auto scaling is disabled.
- The user may only request power-of-two sizes.
- The user may request the global hash or make the private hash immutable.
- Once the global hash has been set or the hash has been made immutable,
further changes are not allowed.
- Futex operations should work the whole time. It must be possible to
hold a lock, such as a PI-initialized mutex, during the resize operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
.../selftests/futex/functional/.gitignore | 5 +-
.../selftests/futex/functional/Makefile | 1 +
.../futex/functional/futex_priv_hash.c | 315 ++++++++++++++++++
.../testing/selftests/futex/functional/run.sh | 4 +
4 files changed, 323 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/futex/functional/futex_priv_hash.c
diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index fbcbdb6963b3a..d37ae7c6e879e 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -1,11 +1,12 @@
# SPDX-License-Identifier: GPL-2.0-only
+futex_priv_hash
+futex_requeue
futex_requeue_pi
futex_requeue_pi_mismatched_ops
futex_requeue_pi_signal_restart
+futex_wait
futex_wait_private_mapped_file
futex_wait_timeout
futex_wait_uninitialized_heap
futex_wait_wouldblock
-futex_wait
-futex_requeue
futex_waitv
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index f79f9bac7918b..67d9e16d8a1f8 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -17,6 +17,7 @@ TEST_GEN_PROGS := \
futex_wait_private_mapped_file \
futex_wait \
futex_requeue \
+ futex_priv_hash \
futex_waitv
TEST_PROGS := run.sh
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
new file mode 100644
index 0000000000000..4d37650baa192
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+ */
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+
+#include "logging.h"
+
+#define MAX_THREADS 64
+
+static pthread_barrier_t barrier_main;
+static pthread_mutex_t global_lock;
+static pthread_t threads[MAX_THREADS];
+static int counter;
+
+#ifndef PR_FUTEX_HASH
+#define PR_FUTEX_HASH 78
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+# define PR_FUTEX_HASH_GET_IMMUTABLE 3
+#endif
+
+static int futex_hash_slots_set(unsigned int slots, int immutable)
+{
+ return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, immutable);
+}
+
+static int futex_hash_slots_get(void)
+{
+ return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
+}
+
+static int futex_hash_immutable_get(void)
+{
+ return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+}
+
+static void futex_hash_slots_set_verify(int slots)
+{
+ int ret;
+
+ ret = futex_hash_slots_set(slots, 0);
+ if (ret != 0) {
+ error("Failed to set slots to %d\n", errno, slots);
+ exit(1);
+ }
+ ret = futex_hash_slots_get();
+ if (ret != slots) {
+ error("Set %d slots but PR_FUTEX_HASH_GET_SLOTS returns: %d\n",
+ errno, slots, ret);
+ exit(1);
+ }
+}
+
+static void futex_hash_slots_set_must_fail(int slots, int immutable)
+{
+ int ret;
+
+ ret = futex_hash_slots_set(slots, immutable);
+ if (ret < 0)
+ return;
+
+ fail("futex_hash_slots_set(%d, %d) expected to fail but succeeded.\n",
+ slots, immutable);
+ exit(1);
+}
+
+static void *thread_return_fn(void *arg)
+{
+ return NULL;
+}
+
+static void *thread_lock_fn(void *arg)
+{
+ pthread_barrier_wait(&barrier_main);
+
+ pthread_mutex_lock(&global_lock);
+ counter++;
+ usleep(20);
+ pthread_mutex_unlock(&global_lock);
+ return NULL;
+}
+
+static void create_max_threads(void *(*thread_fn)(void *))
+{
+ int i, ret;
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ ret = pthread_create(&threads[i], NULL, thread_fn, NULL);
+ if (ret) {
+ error("pthread_create failed\n", errno);
+ exit(1);
+ }
+ }
+}
+
+static void join_max_threads(void)
+{
+ int i, ret;
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ ret = pthread_join(threads[i], NULL);
+ if (ret) {
+ error("pthread_join failed for thread %d\n", errno, i);
+ exit(1);
+ }
+ }
+}
+
+static void usage(char *prog)
+{
+ printf("Usage: %s\n", prog);
+ printf(" -c Use color\n");
+ printf(" -g Test global hash instead of local immutable hash\n");
+ printf(" -h Display this help message\n");
+ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+ VQUIET, VCRITICAL, VINFO);
+}
+
+int main(int argc, char *argv[])
+{
+ int futex_slots1, futex_slotsn, online_cpus;
+ pthread_mutexattr_t mutex_attr_pi;
+ int use_global_hash = 0;
+ int ret;
+ char c;
+
+ while ((c = getopt(argc, argv, "cghv:")) != -1) {
+ switch (c) {
+ case 'c':
+ log_color(1);
+ break;
+ case 'g':
+ use_global_hash = 1;
+ break;
+ case 'h':
+ usage(basename(argv[0]));
+ exit(0);
+ break;
+ case 'v':
+ log_verbosity(atoi(optarg));
+ break;
+ default:
+ usage(basename(argv[0]));
+ exit(1);
+ }
+ }
+
+
+ ret = pthread_mutexattr_init(&mutex_attr_pi);
+ ret |= pthread_mutexattr_setprotocol(&mutex_attr_pi, PTHREAD_PRIO_INHERIT);
+ ret |= pthread_mutex_init(&global_lock, &mutex_attr_pi);
+ if (ret != 0) {
+ fail("Failed to initialize pthread mutex.\n");
+ return 1;
+ }
+
+ /* First thread, expect to be 0, not yet initialized */
+ ret = futex_hash_slots_get();
+ if (ret != 0) {
+ error("futex_hash_slots_get() failed: %d\n", errno, ret);
+ return 1;
+ }
+ ret = futex_hash_immutable_get();
+ if (ret != 0) {
+ error("futex_hash_immutable_get() failed: %d\n", errno, ret);
+ return 1;
+ }
+
+ ret = pthread_create(&threads[0], NULL, thread_return_fn, NULL);
+ if (ret != 0) {
+ error("pthread_create() failed: %d\n", errno, ret);
+ return 1;
+ }
+ ret = pthread_join(threads[0], NULL);
+ if (ret != 0) {
+ error("pthread_join() failed: %d\n", errno, ret);
+ return 1;
+ }
+ /* First thread has to initialize the private hash */
+ futex_slots1 = futex_hash_slots_get();
+ if (futex_slots1 <= 0) {
+ fail("Expected > 0 hash buckets, got: %d\n", futex_slots1);
+ return 1;
+ }
+
+ online_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+ ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS + 1);
+ if (ret != 0) {
+ error("pthread_barrier_init failed.\n", errno);
+ return 1;
+ }
+
+ ret = pthread_mutex_lock(&global_lock);
+ if (ret != 0) {
+ error("pthread_mutex_lock failed.\n", errno);
+ return 1;
+ }
+
+ counter = 0;
+ create_max_threads(thread_lock_fn);
+ pthread_barrier_wait(&barrier_main);
+
+ /*
+ * The current default size of hash buckets is 16. The auto increase
+ * works only if more than 16 CPUs are available.
+ */
+ if (online_cpus > 16) {
+ futex_slotsn = futex_hash_slots_get();
+ if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) {
+ fail("Expected increase of hash buckets but got: %d -> %d\n",
+ futex_slots1, futex_slotsn);
+ info("Online CPUs: %d\n", online_cpus);
+ return 1;
+ }
+ }
+ ret = pthread_mutex_unlock(&global_lock);
+
+ /* Once the user changes it, it has to be what is set */
+ futex_hash_slots_set_verify(2);
+ futex_hash_slots_set_verify(4);
+ futex_hash_slots_set_verify(8);
+ futex_hash_slots_set_verify(32);
+ futex_hash_slots_set_verify(16);
+
+ ret = futex_hash_slots_set(15, 0);
+ if (ret >= 0) {
+ fail("Expected to fail with 15 slots but succeeded: %d.\n", ret);
+ return 1;
+ }
+ futex_hash_slots_set_verify(2);
+ join_max_threads();
+ if (counter != MAX_THREADS) {
+ fail("Expected thread counter at %d but is %d\n",
+ MAX_THREADS, counter);
+ return 1;
+ }
+ counter = 0;
+ /* Once the user set something, auto resize must be disabled */
+ ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS);
+
+ create_max_threads(thread_lock_fn);
+ join_max_threads();
+
+ ret = futex_hash_slots_get();
+ if (ret != 2) {
+ printf("Expected 2 slots, no auto-resize, got %d\n", ret);
+ return 1;
+ }
+
+ futex_hash_slots_set_must_fail(1 << 29, 0);
+
+ /*
+ * Once the private hash has been made immutable or the global hash has
+ * been requested, this request cannot be undone.
+ */
+ if (use_global_hash) {
+ ret = futex_hash_slots_set(0, 0);
+ if (ret != 0) {
+ printf("Can't request global hash: %m\n");
+ return 1;
+ }
+ } else {
+ ret = futex_hash_slots_set(4, 1);
+ if (ret != 0) {
+ printf("Immutable resize to 4 failed: %m\n");
+ return 1;
+ }
+ }
+
+ futex_hash_slots_set_must_fail(4, 0);
+ futex_hash_slots_set_must_fail(4, 1);
+ futex_hash_slots_set_must_fail(8, 0);
+ futex_hash_slots_set_must_fail(8, 1);
+ futex_hash_slots_set_must_fail(0, 1);
+ futex_hash_slots_set_must_fail(6, 1);
+
+ ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS);
+ if (ret != 0) {
+ error("pthread_barrier_init failed.\n", errno);
+ return 1;
+ }
+ create_max_threads(thread_lock_fn);
+ join_max_threads();
+
+ ret = futex_hash_slots_get();
+ if (use_global_hash) {
+ if (ret != 0) {
+ error("Expected global hash, got %d\n", errno, ret);
+ return 1;
+ }
+ } else {
+ if (ret != 4) {
+ error("Expected 4 slots, no auto-resize, got %d\n", errno, ret);
+ return 1;
+ }
+ }
+
+ ret = futex_hash_immutable_get();
+ if (ret != 1) {
+ fail("Expected immutable private hash, got %d\n", ret);
+ return 1;
+ }
+ return 0;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index 5ccd599da6c30..f0f0d2b683d7e 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -82,3 +82,7 @@ echo
echo
./futex_waitv $COLOR
+
+echo
+./futex_priv_hash $COLOR
+./futex_priv_hash -g $COLOR
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (19 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 20/21] selftests/futex: Add futex_priv_hash Sebastian Andrzej Siewior
@ 2025-04-16 16:29 ` Sebastian Andrzej Siewior
2025-05-02 19:08 ` Peter Zijlstra
` (2 more replies)
2025-04-16 16:31 ` [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
21 siblings, 3 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:29 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Test the basic functionality for the NUMA and MPOL flags:
- FUTEX2_NUMA should take the NUMA node which is after the uaddr
and use it.
- Only update the node if FUTEX_NO_NODE was set by the user
- FUTEX2_MPOL should use the memory based on the policy. I attempted to
set the node with mbind() and then use this with MPOL but this fails
and futex falls back to the default node for the current CPU.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
.../selftests/futex/functional/.gitignore | 1 +
.../selftests/futex/functional/Makefile | 3 +-
.../futex/functional/futex_numa_mpol.c | 232 ++++++++++++++++++
.../testing/selftests/futex/functional/run.sh | 3 +
.../selftests/futex/include/futex2test.h | 34 +++
5 files changed, 272 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/futex/functional/futex_numa_mpol.c
diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index d37ae7c6e879e..7b24ae89594a9 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -1,4 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
+futex_numa_mpol
futex_priv_hash
futex_requeue
futex_requeue_pi
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index 67d9e16d8a1f8..a4881fd2cd540 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
INCLUDES := -I../include -I../../ $(KHDR_INCLUDES)
CFLAGS := $(CFLAGS) -g -O2 -Wall -pthread $(INCLUDES) $(KHDR_INCLUDES)
-LDLIBS := -lpthread -lrt
+LDLIBS := -lpthread -lrt -lnuma
LOCAL_HDRS := \
../include/futextest.h \
@@ -18,6 +18,7 @@ TEST_GEN_PROGS := \
futex_wait \
futex_requeue \
futex_priv_hash \
+ futex_numa_mpol \
futex_waitv
TEST_PROGS := run.sh
diff --git a/tools/testing/selftests/futex/functional/futex_numa_mpol.c b/tools/testing/selftests/futex/functional/futex_numa_mpol.c
new file mode 100644
index 0000000000000..30302691303f0
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_numa_mpol.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+ */
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <numa.h>
+#include <numaif.h>
+
+#include <linux/futex.h>
+#include <sys/mman.h>
+
+#include "logging.h"
+#include "futextest.h"
+#include "futex2test.h"
+
+#define MAX_THREADS 64
+
+static pthread_barrier_t barrier_main;
+static pthread_t threads[MAX_THREADS];
+
+struct thread_args {
+ void *futex_ptr;
+ unsigned int flags;
+ int result;
+};
+
+static struct thread_args thread_args[MAX_THREADS];
+
+#ifndef FUTEX_NO_NODE
+#define FUTEX_NO_NODE (-1)
+#endif
+
+#ifndef FUTEX2_MPOL
+#define FUTEX2_MPOL 0x08
+#endif
+
+static void *thread_lock_fn(void *arg)
+{
+ struct thread_args *args = arg;
+ int ret;
+
+ pthread_barrier_wait(&barrier_main);
+ ret = futex2_wait2(args->futex_ptr, 0, args->flags, NULL, 0);
+ args->result = ret;
+ return NULL;
+}
+
+static void create_max_threads(void *futex_ptr)
+{
+ int i, ret;
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ thread_args[i].futex_ptr = futex_ptr;
+ thread_args[i].flags = FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA;
+ thread_args[i].result = 0;
+ ret = pthread_create(&threads[i], NULL, thread_lock_fn, &thread_args[i]);
+ if (ret) {
+ error("pthread_create failed\n", errno);
+ exit(1);
+ }
+ }
+}
+
+static void join_max_threads(void)
+{
+ int i, ret;
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ ret = pthread_join(threads[i], NULL);
+ if (ret) {
+ error("pthread_join failed for thread %d\n", errno, i);
+ exit(1);
+ }
+ }
+}
+
+static void __test_futex(void *futex_ptr, int must_fail, unsigned int futex_flags)
+{
+ int to_wake, ret, i, need_exit = 0;
+
+ pthread_barrier_init(&barrier_main, NULL, MAX_THREADS + 1);
+ create_max_threads(futex_ptr);
+ pthread_barrier_wait(&barrier_main);
+ to_wake = MAX_THREADS;
+
+ do {
+ ret = futex2_wake(futex_ptr, to_wake, futex_flags);
+ if (must_fail) {
+ if (ret < 0)
+ break;
+ fail("Should fail, but didn't\n");
+ exit(1);
+ }
+ if (ret < 0) {
+ error("Failed futex2_wake(%d)\n", errno, to_wake);
+ exit(1);
+ }
+ if (!ret)
+ usleep(50);
+ to_wake -= ret;
+
+ } while (to_wake);
+ join_max_threads();
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ if (must_fail && thread_args[i].result != -1) {
+ fail("Thread %d should fail but succeeded (%d)\n", i, thread_args[i].result);
+ need_exit = 1;
+ }
+ if (!must_fail && thread_args[i].result != 0) {
+ fail("Thread %d failed (%d)\n", i, thread_args[i].result);
+ need_exit = 1;
+ }
+ }
+ if (need_exit)
+ exit(1);
+}
+
+static void test_futex(void *futex_ptr, int must_fail)
+{
+ __test_futex(futex_ptr, must_fail, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA);
+}
+
+static void test_futex_mpol(void *futex_ptr, int must_fail)
+{
+ __test_futex(futex_ptr, must_fail, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA | FUTEX2_MPOL);
+}
+
+static void usage(char *prog)
+{
+ printf("Usage: %s\n", prog);
+ printf(" -c Use color\n");
+ printf(" -h Display this help message\n");
+ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+ VQUIET, VCRITICAL, VINFO);
+}
+
+int main(int argc, char *argv[])
+{
+ struct futex32_numa *futex_numa;
+ int mem_size, i;
+ void *futex_ptr;
+ char c;
+
+ while ((c = getopt(argc, argv, "chv:")) != -1) {
+ switch (c) {
+ case 'c':
+ log_color(1);
+ break;
+ case 'h':
+ usage(basename(argv[0]));
+ exit(0);
+ break;
+ case 'v':
+ log_verbosity(atoi(optarg));
+ break;
+ default:
+ usage(basename(argv[0]));
+ exit(1);
+ }
+ }
+
+ mem_size = sysconf(_SC_PAGE_SIZE);
+ futex_ptr = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ if (futex_ptr == MAP_FAILED) {
+ error("mmap() for %d bytes failed\n", errno, mem_size);
+ return 1;
+ }
+ futex_numa = futex_ptr;
+
+ info("Regular test\n");
+ futex_numa->futex = 0;
+ futex_numa->numa = FUTEX_NO_NODE;
+ test_futex(futex_ptr, 0);
+
+ if (futex_numa->numa == FUTEX_NO_NODE) {
+ fail("NUMA node is left uninitialized\n");
+ return 1;
+ }
+
+ info("Memory too small\n");
+ test_futex(futex_ptr + mem_size - 4, 1);
+
+ info("Memory out of range\n");
+ test_futex(futex_ptr + mem_size, 1);
+
+ futex_numa->numa = FUTEX_NO_NODE;
+ mprotect(futex_ptr, mem_size, PROT_READ);
+ info("Memory, RO\n");
+ test_futex(futex_ptr, 1);
+
+ mprotect(futex_ptr, mem_size, PROT_NONE);
+ info("Memory, no access\n");
+ test_futex(futex_ptr, 1);
+
+ mprotect(futex_ptr, mem_size, PROT_READ | PROT_WRITE);
+ info("Memory back to RW\n");
+ test_futex(futex_ptr, 0);
+
+ /* MPOL test. Does not work as expected */
+ for (i = 0; i < 4; i++) {
+ unsigned long nodemask;
+ int ret;
+
+ nodemask = 1 << i;
+ ret = mbind(futex_ptr, mem_size, MPOL_BIND, &nodemask,
+ sizeof(nodemask) * 8, 0);
+ if (ret == 0) {
+ info("Node %d test\n", i);
+ futex_numa->futex = 0;
+ futex_numa->numa = FUTEX_NO_NODE;
+
+ ret = futex2_wake(futex_ptr, 0, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA | FUTEX2_MPOL);
+ if (ret < 0)
+ error("Failed to wake 0 with MPOL.\n", errno);
+ if (0)
+ test_futex_mpol(futex_numa, 0);
+ if (futex_numa->numa != i) {
+ fail("Returned NUMA node is %d expected %d\n",
+ futex_numa->numa, i);
+ }
+ }
+ }
+ return 0;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index f0f0d2b683d7e..81739849f2994 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -86,3 +86,6 @@ echo
echo
./futex_priv_hash $COLOR
./futex_priv_hash -g $COLOR
+
+echo
+./futex_numa_mpol $COLOR
diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
index 9d305520e849b..b664e8f92bfd7 100644
--- a/tools/testing/selftests/futex/include/futex2test.h
+++ b/tools/testing/selftests/futex/include/futex2test.h
@@ -8,6 +8,11 @@
#define u64_to_ptr(x) ((void *)(uintptr_t)(x))
+struct futex32_numa {
+ futex_t futex;
+ futex_t numa;
+};
+
/**
* futex_waitv - Wait at multiple futexes, wake on any
* @waiters: Array of waiters
@@ -20,3 +25,32 @@ static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned lon
{
return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
}
+
+static inline int futex2_wait(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
+ unsigned long flags, struct timespec *timo, clockid_t clockid)
+{
+ return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
+}
+
+/*
+ * futex2_wait2() - Block on uaddr with optional timeout
+ * @val: Expected value
+ * @flags: FUTEX2 flags
+ * @timeout: Relative timeout
+ * @clockid: Clock id for the timeout
+ */
+static inline int futex2_wait2(void *uaddr, long val, unsigned int flags,
+ struct timespec *timeout, clockid_t clockid)
+{
+ return syscall(__NR_futex_wait, uaddr, val, 1, flags, timeout, clockid);
+}
+
+/*
+ * futex2_wake() - Wake a number of futexes
+ * @nr: Number of threads to wake at most
+ * @flags: FUTEX2 flags
+ */
+static inline int futex2_wake(void *uaddr, int nr, unsigned int flags)
+{
+ return syscall(__NR_futex_wake, uaddr, 1, nr, flags);
+}
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
` (20 preceding siblings ...)
2025-04-16 16:29 ` [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol Sebastian Andrzej Siewior
@ 2025-04-16 16:31 ` Sebastian Andrzej Siewior
2025-05-02 19:48 ` Peter Zijlstra
21 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-04-16 16:31 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
A diff excluding the tools/testing/ changes:
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 96c7229856d97..eccc99751bd94 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -109,7 +109,7 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
{
return -EINVAL;
}
-static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
{
return -EINVAL;
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 44bb9eeb0a9c1..ee1d7182ce0c0 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -551,6 +551,7 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
struct folio *folio;
struct address_space *mapping;
int node, err, size, ro = 0;
+ bool node_updated = false;
bool fshared;
fshared = flags & FLAGS_SHARED;
@@ -575,24 +576,29 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
node = FUTEX_NO_NODE;
if (flags & FLAGS_NUMA) {
- u32 __user *naddr = uaddr + size / 2;
+ u32 __user *naddr = (void *)uaddr + size / 2;
if (futex_get_value(&node, naddr))
return -EFAULT;
- if (node >= MAX_NUMNODES || !node_possible(node))
+ if (node != FUTEX_NO_NODE &&
+ (node >= MAX_NUMNODES || !node_possible(node)))
return -EINVAL;
}
- if (node == FUTEX_NO_NODE && (flags & FLAGS_MPOL))
+ if (node == FUTEX_NO_NODE && (flags & FLAGS_MPOL)) {
node = futex_mpol(mm, address);
+ node_updated = true;
+ }
if (flags & FLAGS_NUMA) {
- u32 __user *naddr = uaddr + size / 2;
+ u32 __user *naddr = (void *)uaddr + size / 2;
- if (node == FUTEX_NO_NODE)
+ if (node == FUTEX_NO_NODE) {
node = numa_node_id();
- if (futex_put_value(node, naddr))
+ node_updated = true;
+ }
+ if (node_updated && futex_put_value(node, naddr))
return -EFAULT;
}
@@ -1573,6 +1579,8 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable,
if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
return -EINVAL;
+ if (immutable > 2)
+ return -EINVAL;
/*
* Once we've disabled the global hash there is no way back.
@@ -1586,7 +1594,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable,
}
}
- fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT);
+ fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!fph)
return -ENOMEM;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 004e4dbee4f93..069fc2a83080d 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -55,7 +55,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
return flags;
}
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_MPOL | FUTEX2_PRIVATE)
/* FUTEX2_ to FLAGS_ */
static inline unsigned int futex2_to_flags(unsigned int flags2)
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 356e52c17d3c5..dacb2330f1fbc 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -993,6 +993,16 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
goto no_block;
}
+ /*
+ * Caution; releasing @hb in-scope. The hb->lock is still locked
+ * while the reference is dropped. The reference can not be dropped
+ * after the unlock because if a user initiated resize is in progress
+ * then we might need to wake it. This cannot be done after the
+ * rt_mutex_pre_schedule() invocation. The hb will remain valid because
+ * the thread, performing resize, will block on hb->lock during
+ * the requeue.
+ */
+ futex_hash_put(no_free_ptr(hb));
/*
* Must be done before we enqueue the waiter, here is unfortunately
* under the hb lock, but that *should* work because it does nothing.
@@ -1016,10 +1026,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
*/
raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
spin_unlock(q.lock_ptr);
- /*
- * Caution; releasing @hb in-scope.
- */
- futex_hash_put(no_free_ptr(hb));
/*
* __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
* such that futex_unlock_pi() is guaranteed to observe the waiter when
diff --git a/tools/perf/bench/futex.c b/tools/perf/bench/futex.c
index bed3b6e46d109..02ae6c52ba881 100644
--- a/tools/perf/bench/futex.c
+++ b/tools/perf/bench/futex.c
@@ -31,20 +31,25 @@ void futex_print_nbuckets(struct bench_futex_parameters *params)
if (params->nbuckets >= 0) {
if (ret != params->nbuckets) {
if (ret < 0) {
- printf("Can't query number of buckets: %d/%m\n", ret);
+ printf("Can't query number of buckets: %m\n");
err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
}
printf("Requested number of hash buckets is not currently in use.\n");
printf("Requested: %d, in use: %d\n", params->nbuckets, ret);
err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
}
- ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
- if (params->nbuckets == 0)
+ if (params->nbuckets == 0) {
ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
- else
+ } else {
+ ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+ if (ret < 0) {
+ printf("Can't check if the hash is immutable: %m\n");
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
ret = asprintf(&futex_hash_mode, "Futex hashing: %d hash buckets %s",
params->nbuckets,
ret == 1 ? "(immutable)" : "");
+ }
} else {
if (ret <= 0) {
ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
Sebastian
^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [PATCH v12 15/21] futex: Allow to make the private hash immutable
2025-04-16 16:29 ` [PATCH v12 15/21] futex: Allow to make the private hash immutable Sebastian Andrzej Siewior
@ 2025-05-02 18:01 ` Peter Zijlstra
2025-05-05 7:14 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
1 sibling, 1 reply; 109+ messages in thread
From: Peter Zijlstra @ 2025-05-02 18:01 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long, Shrikanth Hegde
On Wed, Apr 16, 2025 at 06:29:15PM +0200, Sebastian Andrzej Siewior wrote:
> My initial testing showed that
> perf bench futex hash
>
> reported less operations/sec with private hash. After using the same
> amount of buckets in the private hash as used by the global hash then
> the operations/sec were about the same.
>
> This changed once the private hash became resizable. This feature added
> a RCU section and reference counting via atomic inc+dec operation into
> the hot path.
> The reference counting can be avoided if the private hash is made
> immutable.
> Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes if the
> private should be made immutable. Once set (to true) the a further
> resize is not allowed (same if set to global hash).
> Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash can not
> be changed.
> Update "perf bench" suite.
Does the below make sense? This changes arg4 into a flags field and uses
bit0 for immutable.
(the point where I got upset is where arg4==2 was accepted :-)
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -367,6 +367,7 @@ struct prctl_mm_map {
/* FUTEX hash management */
#define PR_FUTEX_HASH 78
# define PR_FUTEX_HASH_SET_SLOTS 1
+# define FH_FLAG_IMMUTABLE (1ULL << 0)
# define PR_FUTEX_HASH_GET_SLOTS 2
# define PR_FUTEX_HASH_GET_IMMUTABLE 3
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1440,7 +1440,10 @@ static bool futex_hash_less(struct futex
return false; /* equal */
}
-static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable, bool custom)
+#define FH_CUSTOM 0x01
+#define FH_IMMUTABLE 0x02
+
+static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
{
struct mm_struct *mm = current->mm;
struct futex_private_hash *fph;
@@ -1448,8 +1451,6 @@ static int futex_hash_allocate(unsigned
if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
return -EINVAL;
- if (immutable > 2)
- return -EINVAL;
/*
* Once we've disabled the global hash there is no way back.
@@ -1469,8 +1470,8 @@ static int futex_hash_allocate(unsigned
rcuref_init(&fph->users, 1);
fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
- fph->custom = custom;
- fph->immutable = !!immutable;
+ fph->custom = !!(flags & FH_CUSTOM);
+ fph->immutable = !!(flags & FH_IMMUTABLE);
fph->mm = mm;
for (i = 0; i < hash_slots; i++)
@@ -1563,7 +1564,7 @@ int futex_hash_allocate_default(void)
if (current_buckets >= buckets)
return 0;
- return futex_hash_allocate(buckets, 0, false);
+ return futex_hash_allocate(buckets, 0);
}
static int futex_hash_get_slots(void)
@@ -1592,7 +1593,7 @@ static int futex_hash_get_immutable(void
#else
-static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable, bool custom)
+static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
{
return -EINVAL;
}
@@ -1610,11 +1611,16 @@ static int futex_hash_get_immutable(void
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
{
+ unsigned int flags = FH_CUSTOM;
int ret;
switch (arg2) {
case PR_FUTEX_HASH_SET_SLOTS:
- ret = futex_hash_allocate(arg3, arg4, true);
+ if (arg4 & ~FH_FLAG_IMMUTABLE)
+ return -EINVAL;
+ if (arg4 & FH_FLAG_IMMUTABLE)
+ flags |= FH_IMMUTABLE;
+ ret = futex_hash_allocate(arg3, flags);
break;
case PR_FUTEX_HASH_GET_SLOTS:
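As an aside, the arg4 validation in the hunk above can be mirrored in a tiny userspace check (a sketch only; `FH_FLAG_IMMUTABLE` is taken from the diff, the helper name is made up, and it mirrors the kernel-side `arg4 & ~FH_FLAG_IMMUTABLE` test):

```c
#include <assert.h>
#include <stdint.h>

#define FH_FLAG_IMMUTABLE (1ULL << 0)

/* Mirror of the kernel-side check: any bit outside the known flag
 * mask makes the prctl() fail with -EINVAL, so arg4 == 2 is now
 * rejected rather than silently accepted. */
static int fh_flags_valid(uint64_t arg4)
{
	return (arg4 & ~FH_FLAG_IMMUTABLE) == 0;
}
```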
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH v12 17/21] futex: Implement FUTEX2_MPOL
2025-04-16 16:29 ` [PATCH v12 17/21] futex: Implement FUTEX2_MPOL Sebastian Andrzej Siewior
@ 2025-05-02 18:45 ` Peter Zijlstra
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 109+ messages in thread
From: Peter Zijlstra @ 2025-05-02 18:45 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Wed, Apr 16, 2025 at 06:29:17PM +0200, Sebastian Andrzej Siewior wrote:
> @@ -342,18 +411,20 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
> static struct futex_hash_bucket *
> __futex_hash(union futex_key *key, struct futex_private_hash *fph)
> {
> - struct futex_hash_bucket *hb;
> + int node = key->both.node;
> u32 hash;
> - int node;
>
> - hb = __futex_hash_private(key, fph);
> - if (hb)
> - return hb;
> + if (node == FUTEX_NO_NODE) {
> + struct futex_hash_bucket *hb;
> +
> + hb = __futex_hash_private(key, fph);
> + if (hb)
> + return hb;
> + }
>
> hash = jhash2((u32 *)key,
> offsetof(typeof(*key), both.offset) / sizeof(u32),
> key->both.offset);
> - node = key->both.node;
>
> if (node == FUTEX_NO_NODE) {
> /*
I *think* this hunk should've been in the previous patch; because it
changes the behaviour of FUTEX2_NUMA (to the better).
Anyway, nevermind, just wanted to point it out.
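The behaviour change Peter points at boils down to the following control flow (an illustrative stand-in, not the kernel code; the bucket choice is reduced to a flag): the private hash is only consulted when no node was specified, while an explicit node always goes to the global, per-node hash.

```c
/* Stand-in for the reordered __futex_hash() flow. */
#define FUTEX_NO_NODE (-1)

/* Returns 0 for the task-private bucket, 1 for the global hash. */
static int pick_bucket(int node, int have_private_hb)
{
	if (node == FUTEX_NO_NODE && have_private_hb)
		return 0;	/* private hash bucket */
	return 1;		/* global (per-node) hash */
}
```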
* Re: [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol
2025-04-16 16:29 ` [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol Sebastian Andrzej Siewior
@ 2025-05-02 19:08 ` Peter Zijlstra
2025-05-05 7:33 ` Sebastian Andrzej Siewior
2025-05-02 19:16 ` Peter Zijlstra
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2 siblings, 1 reply; 109+ messages in thread
From: Peter Zijlstra @ 2025-05-02 19:08 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Wed, Apr 16, 2025 at 06:29:21PM +0200, Sebastian Andrzej Siewior wrote:
> diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
> index 9d305520e849b..b664e8f92bfd7 100644
> --- a/tools/testing/selftests/futex/include/futex2test.h
> +++ b/tools/testing/selftests/futex/include/futex2test.h
> @@ -8,6 +8,11 @@
>
> #define u64_to_ptr(x) ((void *)(uintptr_t)(x))
>
> +struct futex32_numa {
> + futex_t futex;
> + futex_t numa;
> +};
> +
> /**
> * futex_waitv - Wait at multiple futexes, wake on any
> * @waiters: Array of waiters
> @@ -20,3 +25,32 @@ static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned lon
> {
> return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
> }
> +
> +static inline int futex2_wait(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
> + unsigned long flags, struct timespec *timo, clockid_t clockid)
> +{
> + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
I'm confused, sure this should be __NR_futex_wait
> +}
> +
> +/*
> + * futex_wait2() - block on uaddr with optional timeout
> + * @val: Expected value
> + * @flags: FUTEX2 flags
> + * @timeout: Relative timeout
> + * @clockid: Clock id for the timeout
> + */
> +static inline int futex2_wait2(void *uaddr, long val, unsigned int flags,
> + struct timespec *timeout, clockid_t clockid)
And this should be futex2_wait().
> +{
> + return syscall(__NR_futex_wait, uaddr, val, 1, flags, timeout, clockid);
> +}
> +
> +/*
> + * futex2_wake() - Wake a number of futexes
> + * @nr: Number of threads to wake at most
> + * @flags: FUTEX2 flags
> + */
> +static inline int futex2_wake(void *uaddr, int nr, unsigned int flags)
> +{
> + return syscall(__NR_futex_wake, uaddr, 1, nr, flags);
> +}
> --
> 2.49.0
>
* Re: [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol
2025-04-16 16:29 ` [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol Sebastian Andrzej Siewior
2025-05-02 19:08 ` Peter Zijlstra
@ 2025-05-02 19:16 ` Peter Zijlstra
2025-05-05 7:36 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2 siblings, 1 reply; 109+ messages in thread
From: Peter Zijlstra @ 2025-05-02 19:16 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Wed, Apr 16, 2025 at 06:29:21PM +0200, Sebastian Andrzej Siewior wrote:
> diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
> index 9d305520e849b..b664e8f92bfd7 100644
> --- a/tools/testing/selftests/futex/include/futex2test.h
> +++ b/tools/testing/selftests/futex/include/futex2test.h
> @@ -8,6 +8,11 @@
>
> #define u64_to_ptr(x) ((void *)(uintptr_t)(x))
>
> +struct futex32_numa {
> + futex_t futex;
> + futex_t numa;
> +};
> +
> /**
> * futex_waitv - Wait at multiple futexes, wake on any
> * @waiters: Array of waiters
> @@ -20,3 +25,32 @@ static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned lon
> {
> return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
> }
> +
> +static inline int futex2_wait(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
> + unsigned long flags, struct timespec *timo, clockid_t clockid)
> +{
> + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
> +}
So this one seemed unused, deleted it and
> +/*
> + * futex_wait2() - block on uaddr with optional timeout
> + * @val: Expected value
> + * @flags: FUTEX2 flags
> + * @timeout: Relative timeout
> + * @clockid: Clock id for the timeout
> + */
> +static inline int futex2_wait2(void *uaddr, long val, unsigned int flags,
> + struct timespec *timeout, clockid_t clockid)
> +{
> + return syscall(__NR_futex_wait, uaddr, val, 1, flags, timeout, clockid);
> +}
renamed this one.
> +/*
> + * futex2_wake() - Wake a number of futexes
> + * @nr: Number of threads to wake at most
> + * @flags: FUTEX2 flags
> + */
> +static inline int futex2_wake(void *uaddr, int nr, unsigned int flags)
> +{
> + return syscall(__NR_futex_wake, uaddr, 1, nr, flags);
> +}
Next question; you're setting bitmask to 1 instead of
FUTEX_BITSET_MATCH_ANY, which is the default value.
I'm going to make it ~0U unless you have a reason for this 1.
* Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
2025-04-16 16:31 ` [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
@ 2025-05-02 19:48 ` Peter Zijlstra
2025-05-03 10:09 ` Peter Zijlstra
0 siblings, 1 reply; 109+ messages in thread
From: Peter Zijlstra @ 2025-05-02 19:48 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Wed, Apr 16, 2025 at 06:31:42PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> > v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
I made a few changes (mostly the stuff I mailed about) and pushed out to
queue/locking/futex.
* Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
2025-05-02 19:48 ` Peter Zijlstra
@ 2025-05-03 10:09 ` Peter Zijlstra
2025-05-05 7:30 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Peter Zijlstra @ 2025-05-03 10:09 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Fri, May 02, 2025 at 09:48:07PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 06:31:42PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> > > v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
>
> I made a few changes (mostly the stuff I mailed about) and pushed out to
> queue/locking/futex.
And again, with hopefully less build errors included :-)
* Re: [PATCH v12 15/21] futex: Allow to make the private hash immutable
2025-05-02 18:01 ` Peter Zijlstra
@ 2025-05-05 7:14 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-05 7:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long, Shrikanth Hegde
On 2025-05-02 20:01:54 [+0200], Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 06:29:15PM +0200, Sebastian Andrzej Siewior wrote:
> > My initial testing showed that
> > perf bench futex hash
> >
> > reported fewer operations/sec with the private hash. After using the
> > same number of buckets in the private hash as in the global hash, the
> > operations/sec were about the same.
> >
> > This changed once the private hash became resizable. This feature added
> > a RCU section and reference counting via atomic inc+dec operation into
> > the hot path.
> > The reference counting can be avoided if the private hash is made
> > immutable.
> > Extend PR_FUTEX_HASH_SET_SLOTS with a fourth argument which denotes
> > whether the private hash should be made immutable. Once set (to true),
> > a further resize is not allowed (likewise if set to the global hash).
> > Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash cannot
> > be changed.
> > Update "perf bench" suite.
>
> Does the below make sense? This changes arg4 into a flags field and uses
> bit0 for immutable.
>
> (the point where I got upset is where arg4==2 was accepted :-)
I see, it makes sense and leaves room for later.
Sebastian
* Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
2025-05-03 10:09 ` Peter Zijlstra
@ 2025-05-05 7:30 ` Sebastian Andrzej Siewior
2025-05-06 7:36 ` Peter Zijlstra
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-05 7:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-05-03 12:09:05 [+0200], Peter Zijlstra wrote:
> On Fri, May 02, 2025 at 09:48:07PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 16, 2025 at 06:31:42PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> > > > v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
> >
> > I made a few changes (mostly the stuff I mailed about) and pushed out to
> > queue/locking/futex.
>
> And again, with hopefully less build errors included :-)
Okay. I guess the NUMA part, where the node id is written back to
userland if FUTEX_NO_NODE was supplied, is not an issue. I was worried
that if you fire multiple threads which end up in sys_futex_wait() at
the same time, waiting on the same addr on two nodes while the "current"
node id is used, then the variable might be written back twice with two
different node ids. The mpol interface should always report the same
one.
Sebastian
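Sketching the race Sebastian describes: if a deterministic write-back were wanted, the node field would have to be published with a compare-and-exchange so that only the first waiter wins (illustrative only, not what the kernel does):

```c
#include <assert.h>
#include <stdatomic.h>

#define FUTEX_NO_NODE (-1)

/* Write the node id back only if userspace still has FUTEX_NO_NODE in
 * place; the first waiter wins and later racers leave it untouched. */
static int publish_node(_Atomic int *node_field, int my_node)
{
	int expected = FUTEX_NO_NODE;

	return atomic_compare_exchange_strong(node_field, &expected, my_node);
}
```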
* Re: [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol
2025-05-02 19:08 ` Peter Zijlstra
@ 2025-05-05 7:33 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-05 7:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-05-02 21:08:38 [+0200], Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 06:29:21PM +0200, Sebastian Andrzej Siewior wrote:
> > diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
> > index 9d305520e849b..b664e8f92bfd7 100644
> > --- a/tools/testing/selftests/futex/include/futex2test.h
> > +++ b/tools/testing/selftests/futex/include/futex2test.h
> > @@ -8,6 +8,11 @@
> >
> > #define u64_to_ptr(x) ((void *)(uintptr_t)(x))
> >
> > +struct futex32_numa {
> > + futex_t futex;
> > + futex_t numa;
> > +};
> > +
> > /**
> > * futex_waitv - Wait at multiple futexes, wake on any
> > * @waiters: Array of waiters
> > @@ -20,3 +25,32 @@ static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned lon
> > {
> > return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
> > }
> > +
> > +static inline int futex2_wait(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
> > + unsigned long flags, struct timespec *timo, clockid_t clockid)
> > +{
> > + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
>
> I'm confused, sure this should be __NR_futex_wait
>
> > +}
> > +
> > +/*
> > + * futex_wait2() - block on uaddr with optional timeout
> > + * @val: Expected value
> > + * @flags: FUTEX2 flags
> > + * @timeout: Relative timeout
> > + * @clockid: Clock id for the timeout
> > + */
> > +static inline int futex2_wait2(void *uaddr, long val, unsigned int flags,
> > + struct timespec *timeout, clockid_t clockid)
>
> And this should be futex2_wait().
Yes. I didn't want to change it right away but yes. There already is a
futex2_wait() which uses the waitv syscall.
>
> > +{
> > + return syscall(__NR_futex_wait, uaddr, val, 1, flags, timeout, clockid);
> > +}
> > +
> > +/*
> > + * futex2_wake() - Wake a number of futexes
> > + * @nr: Number of threads to wake at most
> > + * @flags: FUTEX2 flags
> > + */
> > +static inline int futex2_wake(void *uaddr, int nr, unsigned int flags)
> > +{
> > + return syscall(__NR_futex_wake, uaddr, 1, nr, flags);
> > +}
Sebastian
* Re: [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol
2025-05-02 19:16 ` Peter Zijlstra
@ 2025-05-05 7:36 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-05 7:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
> > @@ -20,3 +25,32 @@ static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned lon
> > {
> > return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
> > }
> > +
> > +static inline int futex2_wait(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
> > + unsigned long flags, struct timespec *timo, clockid_t clockid)
> > +{
> > + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
> > +}
>
> So this one seemed unused, deleted it and
This is odd but okay.
> > +/*
> > + * futex_wait2() - block on uaddr with optional timeout
> > + * @val: Expected value
> > + * @flags: FUTEX2 flags
> > + * @timeout: Relative timeout
> > + * @clockid: Clock id for the timeout
> > + */
> > +static inline int futex2_wait2(void *uaddr, long val, unsigned int flags,
> > + struct timespec *timeout, clockid_t clockid)
> > +{
> > + return syscall(__NR_futex_wait, uaddr, val, 1, flags, timeout, clockid);
> > +}
>
> renamed this one.
perfect.
> > +/*
> > + * futex2_wake() - Wake a number of futexes
> > + * @nr: Number of threads to wake at most
> > + * @flags: FUTEX2 flags
> > + */
> > +static inline int futex2_wake(void *uaddr, int nr, unsigned int flags)
> > +{
> > + return syscall(__NR_futex_wake, uaddr, 1, nr, flags);
> > +}
>
> Next question; you're setting bitmask to 1 instead of
> FUTEX_BITSET_MATCH_ANY, which is the default value.
>
> I'm going to make it ~0U unless you have a reason for this 1.
No special reason. If ~0 is the default, go for it.
Sebastian
* Re: [PATCH v12 01/21] rcuref: Provide rcuref_is_dead()
2025-04-16 16:29 ` [PATCH v12 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
@ 2025-05-05 21:09 ` André Almeida
2025-05-08 10:34 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
1 sibling, 0 replies; 109+ messages in thread
From: André Almeida @ 2025-05-05 21:09 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli,
linux-kernel, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
Em 16/04/2025 13:29, Sebastian Andrzej Siewior escreveu:
> rcuref_read() returns the number of references that are currently held.
> If 0 is returned then it is not safe to assume that the object ca be
the object *can* be
> scheduled for deconstruction because it is marked DEAD. This happens if
> the return value of rcuref_put() is ignored and assumptions are made.
>
> If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
> If rcuref_put() did not return to the caller then the counter did not
> yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
> is still a chance that the counter will transition from RCUREF_NOREF to
> 0 meaning it is still valid and must not be deconstructed. In this brief
> window rcuref_read() will return 0.
>
> Provide rcuref_is_dead() to determine if the counter is marked as
> RCUREF_DEAD.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> include/linux/rcuref.h | 22 +++++++++++++++++++++-
> 1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
> index 6322d8c1c6b42..2fb2af6d98249 100644
> --- a/include/linux/rcuref.h
> +++ b/include/linux/rcuref.h
> @@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
> * rcuref_read - Read the number of held reference counts of a rcuref
> * @ref: Pointer to the reference count
> *
> - * Return: The number of held references (0 ... N)
> + * Return: The number of held references (0 ... N). The value 0 does not
> + * indicate that it is safe to schedule the object, protected by this reference
> + * counter, for deconstruction.
> + * If you want to know if the reference counter has been marked DEAD (as
> > + * signaled by rcuref_put()) please use rcuref_is_dead().
> */
> static inline unsigned int rcuref_read(rcuref_t *ref)
> {
> @@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
Above this line there's a comment "Return 0 if within the DEAD zone."
Perhaps move this comment from the function to the kernel doc as well?
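The window described in the changelog, a counter that already reads as 0 without being DEAD yet, can be mocked in userspace; the MOCK_* values below are illustrative and are not the kernel's real RCUREF_NOREF/RCUREF_DEAD encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative zones: counts at or above MOCK_NOREF have no holders;
 * counts at or above MOCK_DEAD are marked dead. */
#define MOCK_NOREF 0xA0000000u
#define MOCK_DEAD  0xC0000000u

/* rcuref_read()-like: reports 0 anywhere past NOREF, so 0 alone does
 * not mean the object may be scheduled for deconstruction. */
static unsigned int mock_read(uint32_t cnt)
{
	return cnt >= MOCK_NOREF ? 0 : cnt;
}

/* rcuref_is_dead()-like: only the DEAD zone counts as dead. */
static int mock_is_dead(uint32_t cnt)
{
	return cnt >= MOCK_DEAD;
}
```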
* Re: [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup()
2025-04-16 16:29 ` [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup() Sebastian Andrzej Siewior
@ 2025-05-05 21:43 ` André Almeida
2025-05-16 12:53 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
1 sibling, 1 reply; 109+ messages in thread
From: André Almeida @ 2025-05-05 21:43 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Darren Hart, linux-kernel, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
Em 16/04/2025 13:29, Sebastian Andrzej Siewior escreveu:
> From: Peter Zijlstra <peterz@infradead.org>
>
> futex_wait_setup() has a weird calling convention in order to return
> hb to use as an argument to futex_queue().
>
> Mostly such that requeue can have an extra test in between.
>
> Reorder code a little to get rid of this and keep the hb usage inside
> futex_wait_setup().
>
> [bigeasy: fixes]
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[...]
> diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
> index 25877d4f2f8f3..6cf10701294b4 100644
> --- a/kernel/futex/waitwake.c
> +++ b/kernel/futex/waitwake.c
> @@ -339,18 +339,8 @@ static long futex_wait_restart(struct restart_block *restart);
> * @q: the futex_q to queue up on
> * @timeout: the prepared hrtimer_sleeper, or null for no timeout
> */
> -void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
> - struct hrtimer_sleeper *timeout)
> +void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout)
Update the name in the kernel doc comment as well. Also drop from the
comment the part that says "futex_queue() and ..."
> {
> - /*
> - * The task state is guaranteed to be set before another task can
> - * wake it. set_current_state() is implemented using smp_store_mb() and
> - * futex_queue() calls spin_unlock() upon completion, both serializing
> - * access to the hash list and forcing another memory barrier.
> - */
> - set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> - futex_queue(q, hb, current);
> -
> /* Arm the timer */
> if (timeout)
> hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
> @@ -578,7 +568,8 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
> * @val: the expected value
> * @flags: futex flags (FLAGS_SHARED, etc.)
> * @q: the associated futex_q
> - * @hb: storage for hash_bucket pointer to be returned to caller
> + * @key2: the second futex_key if used for requeue PI
> > + * @task: Task queueing this futex
> *
> * Setup the futex_q and locate the hash_bucket. Get the futex value and
> * compare it with the expected value. Handle atomic faults internally.
> @@ -589,8 +580,10 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
> * - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
> */
> int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> - struct futex_q *q, struct futex_hash_bucket **hb)
> + struct futex_q *q, union futex_key *key2,
> + struct task_struct *task)
> {
> + struct futex_hash_bucket *hb;
> u32 uval;
> int ret;
>
> @@ -618,12 +611,12 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> return ret;
>
> retry_private:
> - *hb = futex_q_lock(q);
> + hb = futex_q_lock(q);
>
> ret = futex_get_value_locked(&uval, uaddr);
>
> if (ret) {
> - futex_q_unlock(*hb);
> + futex_q_unlock(hb);
>
> ret = get_user(uval, uaddr);
> if (ret)
> @@ -636,10 +629,25 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> }
>
> if (uval != val) {
> - futex_q_unlock(*hb);
> - ret = -EWOULDBLOCK;
> + futex_q_unlock(hb);
> + return -EWOULDBLOCK;
> }
>
> + if (key2 && futex_match(&q->key, key2)) {
> + futex_q_unlock(hb);
> + return -EINVAL;
Please add this new ret value in the kernel doc too.
* Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
2025-05-05 7:30 ` Sebastian Andrzej Siewior
@ 2025-05-06 7:36 ` Peter Zijlstra
2025-05-09 11:41 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Peter Zijlstra @ 2025-05-06 7:36 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, May 05, 2025 at 09:30:36AM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-05-03 12:09:05 [+0200], Peter Zijlstra wrote:
> > On Fri, May 02, 2025 at 09:48:07PM +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 16, 2025 at 06:31:42PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2025-04-16 18:29:00 [+0200], To linux-kernel@vger.kernel.org wrote:
> > > > > v11…v12: https://lore.kernel.org/all/20250407155742.968816-1-bigeasy@linutronix.de
> > >
> > > I made a few changes (mostly the stuff I mailed about) and pushed out to
> > > queue/locking/futex.
> >
> > And again, with hopefully less build errors included :-)
>
> Okay. I guess the NUMA part, where the node id is written back to
> userland if FUTEX_NO_NODE was supplied, is not an issue. I was worried
> that if you fire multiple threads which end up in sys_futex_wait() at
> the same time, waiting on the same addr on two nodes while the
> "current" node id is used, then the variable might be written back
> twice with two different node ids. The mpol interface should always
> report the same one.
Well, if you do stupid things, you get to keep the pieces or something
along those lines. Same as when userspace goes scribble the node value
while another thread is waiting and all that.
Even with the unconditional write back you're going to have a problem
with concurrent wait on the same futex.
* Re: [PATCH v12 05/21] futex: Create hb scopes
2025-04-16 16:29 ` [PATCH v12 05/21] futex: Create hb scopes Sebastian Andrzej Siewior
@ 2025-05-06 23:45 ` André Almeida
2025-05-16 12:20 ` Sebastian Andrzej Siewior
2025-05-16 13:23 ` Peter Zijlstra
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
1 sibling, 2 replies; 109+ messages in thread
From: André Almeida @ 2025-05-06 23:45 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli,
Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long
Em 16/04/2025 13:29, Sebastian Andrzej Siewior escreveu:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Create explicit scopes for hb variables; almost pure re-indent.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> kernel/futex/core.c | 81 ++++----
> kernel/futex/pi.c | 282 +++++++++++++-------------
> kernel/futex/requeue.c | 433 ++++++++++++++++++++--------------------
> kernel/futex/waitwake.c | 193 +++++++++---------
> 4 files changed, 504 insertions(+), 485 deletions(-)
>
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index 7adc914878933..e4cb5ce9785b1 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -944,7 +944,6 @@ static void exit_pi_state_list(struct task_struct *curr)
> {
> struct list_head *next, *head = &curr->pi_state_list;
> struct futex_pi_state *pi_state;
> - struct futex_hash_bucket *hb;
> union futex_key key = FUTEX_KEY_INIT;
>
> /*
> @@ -957,50 +956,54 @@ static void exit_pi_state_list(struct task_struct *curr)
> next = head->next;
> pi_state = list_entry(next, struct futex_pi_state, list);
> key = pi_state->key;
> - hb = futex_hash(&key);
> + if (1) {
Couldn't those explicit scopes be achieved without the if (1), just with {}?
> + struct futex_hash_bucket *hb;
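For what it's worth, a bare compound statement does give the same scoping in C; the `if (1)` form mainly keeps the subsequent re-indent diff mechanical. A minimal illustration:

```c
#include <assert.h>

/* Both forms introduce a new block scope for 'hb'; neither variable
 * escapes its braces, and the two blocks are fully equivalent. */
static int scoped_sum(void)
{
	int total = 0;

	if (1) {		/* style used in the patch */
		int hb = 2;
		total += hb;
	}

	{			/* equivalent bare scope */
		int hb = 3;
		total += hb;
	}
	return total;
}
```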
* [tip: locking/futex] selftests/futex: Add futex_numa_mpol
2025-04-16 16:29 ` [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol Sebastian Andrzej Siewior
2025-05-02 19:08 ` Peter Zijlstra
2025-05-02 19:16 ` Peter Zijlstra
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
2 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 3163369407baf8331a234fe4817e9ea27ba7ea9c
Gitweb: https://git.kernel.org/tip/3163369407baf8331a234fe4817e9ea27ba7ea9c
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:21 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:10 +02:00
selftests/futex: Add futex_numa_mpol
Test the basic functionality for the NUMA and MPOL flags:
- FUTEX2_NUMA should take the NUMA node which is after the uaddr
and use it.
- Only update the node if FUTEX_NO_NODE was set by the user
- FUTEX2_MPOL should use the memory based on the policy. I attempted to
set the node with mbind() and then use this with MPOL but this fails
and futex falls back to the default node for the current CPU.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-22-bigeasy@linutronix.de
---
tools/testing/selftests/futex/functional/.gitignore | 1 +-
tools/testing/selftests/futex/functional/Makefile | 3 +-
tools/testing/selftests/futex/functional/futex_numa_mpol.c | 232 +++++++-
tools/testing/selftests/futex/functional/run.sh | 3 +-
tools/testing/selftests/futex/include/futex2test.h | 52 ++-
5 files changed, 290 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/futex/functional/futex_numa_mpol.c
diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index d37ae7c..7b24ae8 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -1,4 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
+futex_numa_mpol
futex_priv_hash
futex_requeue
futex_requeue_pi
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index 67d9e16..a4881fd 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
INCLUDES := -I../include -I../../ $(KHDR_INCLUDES)
CFLAGS := $(CFLAGS) -g -O2 -Wall -pthread $(INCLUDES) $(KHDR_INCLUDES)
-LDLIBS := -lpthread -lrt
+LDLIBS := -lpthread -lrt -lnuma
LOCAL_HDRS := \
../include/futextest.h \
@@ -18,6 +18,7 @@ TEST_GEN_PROGS := \
futex_wait \
futex_requeue \
futex_priv_hash \
+ futex_numa_mpol \
futex_waitv
TEST_PROGS := run.sh
diff --git a/tools/testing/selftests/futex/functional/futex_numa_mpol.c b/tools/testing/selftests/futex/functional/futex_numa_mpol.c
new file mode 100644
index 0000000..dd70532
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_numa_mpol.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+ */
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <numa.h>
+#include <numaif.h>
+
+#include <linux/futex.h>
+#include <sys/mman.h>
+
+#include "logging.h"
+#include "futextest.h"
+#include "futex2test.h"
+
+#define MAX_THREADS 64
+
+static pthread_barrier_t barrier_main;
+static pthread_t threads[MAX_THREADS];
+
+struct thread_args {
+ void *futex_ptr;
+ unsigned int flags;
+ int result;
+};
+
+static struct thread_args thread_args[MAX_THREADS];
+
+#ifndef FUTEX_NO_NODE
+#define FUTEX_NO_NODE (-1)
+#endif
+
+#ifndef FUTEX2_MPOL
+#define FUTEX2_MPOL 0x08
+#endif
+
+static void *thread_lock_fn(void *arg)
+{
+ struct thread_args *args = arg;
+ int ret;
+
+ pthread_barrier_wait(&barrier_main);
+ ret = futex2_wait(args->futex_ptr, 0, args->flags, NULL, 0);
+ args->result = ret;
+ return NULL;
+}
+
+static void create_max_threads(void *futex_ptr)
+{
+ int i, ret;
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ thread_args[i].futex_ptr = futex_ptr;
+ thread_args[i].flags = FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA;
+ thread_args[i].result = 0;
+ ret = pthread_create(&threads[i], NULL, thread_lock_fn, &thread_args[i]);
+ if (ret) {
+ error("pthread_create failed\n", errno);
+ exit(1);
+ }
+ }
+}
+
+static void join_max_threads(void)
+{
+ int i, ret;
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ ret = pthread_join(threads[i], NULL);
+ if (ret) {
+ error("pthread_join failed for thread %d\n", errno, i);
+ exit(1);
+ }
+ }
+}
+
+static void __test_futex(void *futex_ptr, int must_fail, unsigned int futex_flags)
+{
+ int to_wake, ret, i, need_exit = 0;
+
+ pthread_barrier_init(&barrier_main, NULL, MAX_THREADS + 1);
+ create_max_threads(futex_ptr);
+ pthread_barrier_wait(&barrier_main);
+ to_wake = MAX_THREADS;
+
+ do {
+ ret = futex2_wake(futex_ptr, to_wake, futex_flags);
+ if (must_fail) {
+ if (ret < 0)
+ break;
+ fail("Should fail, but didn't\n");
+ exit(1);
+ }
+ if (ret < 0) {
+ error("Failed futex2_wake(%d)\n", errno, to_wake);
+ exit(1);
+ }
+ if (!ret)
+ usleep(50);
+ to_wake -= ret;
+
+ } while (to_wake);
+ join_max_threads();
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ if (must_fail && thread_args[i].result != -1) {
+ fail("Thread %d should fail but succeeded (%d)\n", i, thread_args[i].result);
+ need_exit = 1;
+ }
+ if (!must_fail && thread_args[i].result != 0) {
+ fail("Thread %d failed (%d)\n", i, thread_args[i].result);
+ need_exit = 1;
+ }
+ }
+ if (need_exit)
+ exit(1);
+}
+
+static void test_futex(void *futex_ptr, int must_fail)
+{
+ __test_futex(futex_ptr, must_fail, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA);
+}
+
+static void test_futex_mpol(void *futex_ptr, int must_fail)
+{
+ __test_futex(futex_ptr, must_fail, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA | FUTEX2_MPOL);
+}
+
+static void usage(char *prog)
+{
+ printf("Usage: %s\n", prog);
+ printf(" -c Use color\n");
+ printf(" -h Display this help message\n");
+ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+ VQUIET, VCRITICAL, VINFO);
+}
+
+int main(int argc, char *argv[])
+{
+ struct futex32_numa *futex_numa;
+ int mem_size, i;
+ void *futex_ptr;
+ char c;
+
+ while ((c = getopt(argc, argv, "chv:")) != -1) {
+ switch (c) {
+ case 'c':
+ log_color(1);
+ break;
+ case 'h':
+ usage(basename(argv[0]));
+ exit(0);
+ break;
+ case 'v':
+ log_verbosity(atoi(optarg));
+ break;
+ default:
+ usage(basename(argv[0]));
+ exit(1);
+ }
+ }
+
+ mem_size = sysconf(_SC_PAGE_SIZE);
+ futex_ptr = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ if (futex_ptr == MAP_FAILED) {
+ error("mmap() for %d bytes failed\n", errno, mem_size);
+ return 1;
+ }
+ futex_numa = futex_ptr;
+
+ info("Regular test\n");
+ futex_numa->futex = 0;
+ futex_numa->numa = FUTEX_NO_NODE;
+ test_futex(futex_ptr, 0);
+
+ if (futex_numa->numa == FUTEX_NO_NODE) {
+ fail("NUMA node is left uninitialized\n");
+ return 1;
+ }
+
+ info("Memory too small\n");
+ test_futex(futex_ptr + mem_size - 4, 1);
+
+ info("Memory out of range\n");
+ test_futex(futex_ptr + mem_size, 1);
+
+ futex_numa->numa = FUTEX_NO_NODE;
+ mprotect(futex_ptr, mem_size, PROT_READ);
+ info("Memory, RO\n");
+ test_futex(futex_ptr, 1);
+
+ mprotect(futex_ptr, mem_size, PROT_NONE);
+ info("Memory, no access\n");
+ test_futex(futex_ptr, 1);
+
+ mprotect(futex_ptr, mem_size, PROT_READ | PROT_WRITE);
+ info("Memory back to RW\n");
+ test_futex(futex_ptr, 0);
+
+ /* MPOL test. Does not work as expected */
+ for (i = 0; i < 4; i++) {
+ unsigned long nodemask;
+ int ret;
+
+ nodemask = 1 << i;
+ ret = mbind(futex_ptr, mem_size, MPOL_BIND, &nodemask,
+ sizeof(nodemask) * 8, 0);
+ if (ret == 0) {
+ info("Node %d test\n", i);
+ futex_numa->futex = 0;
+ futex_numa->numa = FUTEX_NO_NODE;
+
+ ret = futex2_wake(futex_ptr, 0, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA | FUTEX2_MPOL);
+ if (ret < 0)
+ error("Failed to wake 0 with MPOL.\n", errno);
+ if (0)
+ test_futex_mpol(futex_numa, 0);
+ if (futex_numa->numa != i) {
+ fail("Returned NUMA node is %d expected %d\n",
+ futex_numa->numa, i);
+ }
+ }
+ }
+ return 0;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index f0f0d2b..8173984 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -86,3 +86,6 @@ echo
echo
./futex_priv_hash $COLOR
./futex_priv_hash -g $COLOR
+
+echo
+./futex_numa_mpol $COLOR
diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
index 9ee3592..ea79662 100644
--- a/tools/testing/selftests/futex/include/futex2test.h
+++ b/tools/testing/selftests/futex/include/futex2test.h
@@ -18,14 +18,43 @@ struct futex_waitv {
};
#endif
+#ifndef __NR_futex_wake
+#define __NR_futex_wake 454
+#endif
+
+#ifndef __NR_futex_wait
+#define __NR_futex_wait 455
+#endif
+
#ifndef FUTEX2_SIZE_U32
#define FUTEX2_SIZE_U32 0x02
#endif
+#ifndef FUTEX2_NUMA
+#define FUTEX2_NUMA 0x04
+#endif
+
+#ifndef FUTEX2_MPOL
+#define FUTEX2_MPOL 0x08
+#endif
+
+#ifndef FUTEX2_PRIVATE
+#define FUTEX2_PRIVATE FUTEX_PRIVATE_FLAG
+#endif
+
+#ifndef FUTEX_NO_NODE
+#define FUTEX_NO_NODE (-1)
+#endif
+
#ifndef FUTEX_32
#define FUTEX_32 FUTEX2_SIZE_U32
#endif
+struct futex32_numa {
+ futex_t futex;
+ futex_t numa;
+};
+
/**
* futex_waitv - Wait at multiple futexes, wake on any
* @waiters: Array of waiters
@@ -38,3 +67,26 @@ static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned lon
{
return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
}
+
+/*
+ * futex_wait() - block on uaddr with optional timeout
+ * @val: Expected value
+ * @flags: FUTEX2 flags
+ * @timeout: Relative timeout
+ * @clockid: Clock id for the timeout
+ */
+static inline int futex2_wait(void *uaddr, long val, unsigned int flags,
+ struct timespec *timeout, clockid_t clockid)
+{
+ return syscall(__NR_futex_wait, uaddr, val, ~0U, flags, timeout, clockid);
+}
+
+/*
+ * futex2_wake() - Wake a number of futexes
+ * @nr: Number of threads to wake at most
+ * @flags: FUTEX2 flags
+ */
+static inline int futex2_wake(void *uaddr, int nr, unsigned int flags)
+{
+ return syscall(__NR_futex_wake, uaddr, ~0U, nr, flags);
+}
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] selftests/futex: Add futex_priv_hash
2025-04-16 16:29 ` [PATCH v12 20/21] selftests/futex: Add futex_priv_hash Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
2025-05-09 21:22 ` [PATCH v12 20/21] " André Almeida
2025-05-27 11:28 ` Mark Brown
2 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: cda95faef7bcf26ba3f54c3cddce66d50116d146
Gitweb: https://git.kernel.org/tip/cda95faef7bcf26ba3f54c3cddce66d50116d146
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:20 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:10 +02:00
selftests/futex: Add futex_priv_hash
Test the basic functionality of the private hash:
- Upon start, with no threads there is no private hash.
- The first thread initializes the private hash.
- More than four threads will increase the size of the private hash if
the system has more than 16 CPUs online.
- Once the user sets the size of the private hash, auto scaling is disabled.
- The user may only request sizes that are powers of two.
- The user may request the global hash or make the private hash immutable.
- Once the global hash has been set or the hash has been made immutable,
further changes are not allowed.
- Futex operations should work the whole time. It must be possible to
hold a lock, such as a PI-initialised mutex, during the resize operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-21-bigeasy@linutronix.de
---
tools/testing/selftests/futex/functional/.gitignore | 5 +-
tools/testing/selftests/futex/functional/Makefile | 1 +-
tools/testing/selftests/futex/functional/futex_priv_hash.c | 315 +++++++-
tools/testing/selftests/futex/functional/run.sh | 4 +-
4 files changed, 323 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/futex/functional/futex_priv_hash.c
diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index fbcbdb6..d37ae7c 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -1,11 +1,12 @@
# SPDX-License-Identifier: GPL-2.0-only
+futex_priv_hash
+futex_requeue
futex_requeue_pi
futex_requeue_pi_mismatched_ops
futex_requeue_pi_signal_restart
+futex_wait
futex_wait_private_mapped_file
futex_wait_timeout
futex_wait_uninitialized_heap
futex_wait_wouldblock
-futex_wait
-futex_requeue
futex_waitv
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index f79f9ba..67d9e16 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -17,6 +17,7 @@ TEST_GEN_PROGS := \
futex_wait_private_mapped_file \
futex_wait \
futex_requeue \
+ futex_priv_hash \
futex_waitv
TEST_PROGS := run.sh
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
new file mode 100644
index 0000000..4d37650
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
+ */
+
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+
+#include "logging.h"
+
+#define MAX_THREADS 64
+
+static pthread_barrier_t barrier_main;
+static pthread_mutex_t global_lock;
+static pthread_t threads[MAX_THREADS];
+static int counter;
+
+#ifndef PR_FUTEX_HASH
+#define PR_FUTEX_HASH 78
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+# define PR_FUTEX_HASH_GET_IMMUTABLE 3
+#endif
+
+static int futex_hash_slots_set(unsigned int slots, int immutable)
+{
+ return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, immutable);
+}
+
+static int futex_hash_slots_get(void)
+{
+ return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
+}
+
+static int futex_hash_immutable_get(void)
+{
+ return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+}
+
+static void futex_hash_slots_set_verify(int slots)
+{
+ int ret;
+
+ ret = futex_hash_slots_set(slots, 0);
+ if (ret != 0) {
+ error("Failed to set slots to %d\n", errno, slots);
+ exit(1);
+ }
+ ret = futex_hash_slots_get();
+ if (ret != slots) {
+ error("Set %d slots but PR_FUTEX_HASH_GET_SLOTS returns: %d\n",
+ errno, slots, ret);
+ exit(1);
+ }
+}
+
+static void futex_hash_slots_set_must_fail(int slots, int immutable)
+{
+ int ret;
+
+ ret = futex_hash_slots_set(slots, immutable);
+ if (ret < 0)
+ return;
+
+ fail("futex_hash_slots_set(%d, %d) expected to fail but succeeded.\n",
+ slots, immutable);
+ exit(1);
+}
+
+static void *thread_return_fn(void *arg)
+{
+ return NULL;
+}
+
+static void *thread_lock_fn(void *arg)
+{
+ pthread_barrier_wait(&barrier_main);
+
+ pthread_mutex_lock(&global_lock);
+ counter++;
+ usleep(20);
+ pthread_mutex_unlock(&global_lock);
+ return NULL;
+}
+
+static void create_max_threads(void *(*thread_fn)(void *))
+{
+ int i, ret;
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ ret = pthread_create(&threads[i], NULL, thread_fn, NULL);
+ if (ret) {
+ error("pthread_create failed\n", errno);
+ exit(1);
+ }
+ }
+}
+
+static void join_max_threads(void)
+{
+ int i, ret;
+
+ for (i = 0; i < MAX_THREADS; i++) {
+ ret = pthread_join(threads[i], NULL);
+ if (ret) {
+ error("pthread_join failed for thread %d\n", errno, i);
+ exit(1);
+ }
+ }
+}
+
+static void usage(char *prog)
+{
+ printf("Usage: %s\n", prog);
+ printf(" -c Use color\n");
+ printf(" -g Test the global hash instead of the local immutable hash\n");
+ printf(" -h Display this help message\n");
+ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+ VQUIET, VCRITICAL, VINFO);
+}
+
+int main(int argc, char *argv[])
+{
+ int futex_slots1, futex_slotsn, online_cpus;
+ pthread_mutexattr_t mutex_attr_pi;
+ int use_global_hash = 0;
+ int ret;
+ char c;
+
+ while ((c = getopt(argc, argv, "cghv:")) != -1) {
+ switch (c) {
+ case 'c':
+ log_color(1);
+ break;
+ case 'g':
+ use_global_hash = 1;
+ break;
+ case 'h':
+ usage(basename(argv[0]));
+ exit(0);
+ break;
+ case 'v':
+ log_verbosity(atoi(optarg));
+ break;
+ default:
+ usage(basename(argv[0]));
+ exit(1);
+ }
+ }
+
+
+ ret = pthread_mutexattr_init(&mutex_attr_pi);
+ ret |= pthread_mutexattr_setprotocol(&mutex_attr_pi, PTHREAD_PRIO_INHERIT);
+ ret |= pthread_mutex_init(&global_lock, &mutex_attr_pi);
+ if (ret != 0) {
+ fail("Failed to initialize pthread mutex.\n");
+ return 1;
+ }
+
+ /* First thread, expect to be 0, not yet initialized */
+ ret = futex_hash_slots_get();
+ if (ret != 0) {
+ error("futex_hash_slots_get() failed: %d\n", errno, ret);
+ return 1;
+ }
+ ret = futex_hash_immutable_get();
+ if (ret != 0) {
+ error("futex_hash_immutable_get() failed: %d\n", errno, ret);
+ return 1;
+ }
+
+ ret = pthread_create(&threads[0], NULL, thread_return_fn, NULL);
+ if (ret != 0) {
+ error("pthread_create() failed: %d\n", errno, ret);
+ return 1;
+ }
+ ret = pthread_join(threads[0], NULL);
+ if (ret != 0) {
+ error("pthread_join() failed: %d\n", errno, ret);
+ return 1;
+ }
+ /* First thread has to initialize the private hash */
+ futex_slots1 = futex_hash_slots_get();
+ if (futex_slots1 <= 0) {
+ fail("Expected > 0 hash buckets, got: %d\n", futex_slots1);
+ return 1;
+ }
+
+ online_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+ ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS + 1);
+ if (ret != 0) {
+ error("pthread_barrier_init failed.\n", errno);
+ return 1;
+ }
+
+ ret = pthread_mutex_lock(&global_lock);
+ if (ret != 0) {
+ error("pthread_mutex_lock failed.\n", errno);
+ return 1;
+ }
+
+ counter = 0;
+ create_max_threads(thread_lock_fn);
+ pthread_barrier_wait(&barrier_main);
+
+ /*
+ * The current default size of hash buckets is 16. The auto increase
+ * works only if more than 16 CPUs are available.
+ */
+ if (online_cpus > 16) {
+ futex_slotsn = futex_hash_slots_get();
+ if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) {
+ fail("Expected increase of hash buckets but got: %d -> %d\n",
+ futex_slots1, futex_slotsn);
+ info("Online CPUs: %d\n", online_cpus);
+ return 1;
+ }
+ }
+ ret = pthread_mutex_unlock(&global_lock);
+
+ /* Once the user changes it, it has to be what is set */
+ futex_hash_slots_set_verify(2);
+ futex_hash_slots_set_verify(4);
+ futex_hash_slots_set_verify(8);
+ futex_hash_slots_set_verify(32);
+ futex_hash_slots_set_verify(16);
+
+ ret = futex_hash_slots_set(15, 0);
+ if (ret >= 0) {
+ fail("Expected to fail with 15 slots but succeeded: %d.\n", ret);
+ return 1;
+ }
+ futex_hash_slots_set_verify(2);
+ join_max_threads();
+ if (counter != MAX_THREADS) {
+ fail("Expected thread counter at %d but is %d\n",
+ MAX_THREADS, counter);
+ return 1;
+ }
+ counter = 0;
+ /* Once the user has set a size, auto-resize must be disabled */
+ ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS);
+
+ create_max_threads(thread_lock_fn);
+ join_max_threads();
+
+ ret = futex_hash_slots_get();
+ if (ret != 2) {
+ printf("Expected 2 slots, no auto-resize, got %d\n", ret);
+ return 1;
+ }
+
+ futex_hash_slots_set_must_fail(1 << 29, 0);
+
+ /*
+ * Once the private hash has been made immutable or the global hash has
+ * been requested, the request cannot be undone.
+ */
+ if (use_global_hash) {
+ ret = futex_hash_slots_set(0, 0);
+ if (ret != 0) {
+ printf("Can't request global hash: %m\n");
+ return 1;
+ }
+ } else {
+ ret = futex_hash_slots_set(4, 1);
+ if (ret != 0) {
+ printf("Immutable resize to 4 failed: %m\n");
+ return 1;
+ }
+ }
+
+ futex_hash_slots_set_must_fail(4, 0);
+ futex_hash_slots_set_must_fail(4, 1);
+ futex_hash_slots_set_must_fail(8, 0);
+ futex_hash_slots_set_must_fail(8, 1);
+ futex_hash_slots_set_must_fail(0, 1);
+ futex_hash_slots_set_must_fail(6, 1);
+
+ ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS);
+ if (ret != 0) {
+ error("pthread_barrier_init failed.\n", errno);
+ return 1;
+ }
+ create_max_threads(thread_lock_fn);
+ join_max_threads();
+
+ ret = futex_hash_slots_get();
+ if (use_global_hash) {
+ if (ret != 0) {
+ error("Expected global hash, got %d\n", errno, ret);
+ return 1;
+ }
+ } else {
+ if (ret != 4) {
+ error("Expected 4 slots, no auto-resize, got %d\n", errno, ret);
+ return 1;
+ }
+ }
+
+ ret = futex_hash_immutable_get();
+ if (ret != 1) {
+ fail("Expected immutable private hash, got %d\n", ret);
+ return 1;
+ }
+ return 0;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index 5ccd599..f0f0d2b 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -82,3 +82,7 @@ echo
echo
./futex_waitv $COLOR
+
+echo
+./futex_priv_hash $COLOR
+./futex_priv_hash -g $COLOR
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] tools headers: Synchronize prctl.h ABI header
2025-04-16 16:29 ` [PATCH v12 18/21] tools headers: Synchronize prctl.h ABI header Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: f25051dce97cfd7a945add0c9e273e624e060624
Gitweb: https://git.kernel.org/tip/f25051dce97cfd7a945add0c9e273e624e060624
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:18 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:09 +02:00
tools headers: Synchronize prctl.h ABI header
Synchronize prctl.h with current uapi version after adding
PR_FUTEX_HASH.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-19-bigeasy@linutronix.de
---
tools/include/uapi/linux/prctl.h | 44 ++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 3579179..21f30b3 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -230,7 +230,7 @@ struct prctl_mm_map {
# define PR_PAC_APDBKEY (1UL << 3)
# define PR_PAC_APGAKEY (1UL << 4)
-/* Tagged user address controls for arm64 */
+/* Tagged user address controls for arm64 and RISC-V */
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_GET_TAGGED_ADDR_CTRL 56
# define PR_TAGGED_ADDR_ENABLE (1UL << 0)
@@ -244,6 +244,9 @@ struct prctl_mm_map {
# define PR_MTE_TAG_MASK (0xffffUL << PR_MTE_TAG_SHIFT)
/* Unused; kept only for source compatibility */
# define PR_MTE_TCF_SHIFT 1
+/* RISC-V pointer masking tag length */
+# define PR_PMLEN_SHIFT 24
+# define PR_PMLEN_MASK (0x7fUL << PR_PMLEN_SHIFT)
/* Control reclaim behavior when allocating memory */
#define PR_SET_IO_FLUSHER 57
@@ -328,4 +331,43 @@ struct prctl_mm_map {
# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
# define PR_PPC_DEXCR_CTRL_MASK 0x1f
+/*
+ * Get the current shadow stack configuration for the current thread,
+ * this will be the value configured via PR_SET_SHADOW_STACK_STATUS.
+ */
+#define PR_GET_SHADOW_STACK_STATUS 74
+
+/*
+ * Set the current shadow stack configuration. Enabling the shadow
+ * stack will cause a shadow stack to be allocated for the thread.
+ */
+#define PR_SET_SHADOW_STACK_STATUS 75
+# define PR_SHADOW_STACK_ENABLE (1UL << 0)
+# define PR_SHADOW_STACK_WRITE (1UL << 1)
+# define PR_SHADOW_STACK_PUSH (1UL << 2)
+
+/*
+ * Prevent further changes to the specified shadow stack
+ * configuration. All bits may be locked via this call, including
+ * undefined bits.
+ */
+#define PR_LOCK_SHADOW_STACK_STATUS 76
+
+/*
+ * Controls the mode of timer_create() for CRIU restore operations.
+ * Enabling this allows CRIU to restore timers with explicit IDs.
+ *
+ * Don't use for normal operations as the result might be undefined.
+ */
+#define PR_TIMER_CREATE_RESTORE_IDS 77
+# define PR_TIMER_CREATE_RESTORE_IDS_OFF 0
+# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
+# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+
+/* FUTEX hash management */
+#define PR_FUTEX_HASH 78
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+# define PR_FUTEX_HASH_GET_IMMUTABLE 3
+
#endif /* _LINUX_PRCTL_H */
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] tools/perf: Allow to select the number of hash buckets
2025-04-16 16:29 ` [PATCH v12 19/21] tools/perf: Allow to select the number of hash buckets Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 60035a3981a7f9d965df81a48a07b94e52ccd54f
Gitweb: https://git.kernel.org/tip/60035a3981a7f9d965df81a48a07b94e52ccd54f
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:19 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:10 +02:00
tools/perf: Allow to select the number of hash buckets
Add the -b/--buckets argument to specify the number of hash buckets for
the private futex hash. This is passed directly to
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, buckets, immutable)
which must return without an error if specified. The `immutable' argument
is 0 by default and can be set to 1 via the -I/--immutable argument.
The size of the private hash is verified with PR_FUTEX_HASH_GET_SLOTS.
If PR_FUTEX_HASH_GET_SLOTS fails, it is assumed that an older kernel
without this support is in use and that the global hash is used.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-20-bigeasy@linutronix.de
---
tools/perf/bench/Build | 1 +-
tools/perf/bench/futex-hash.c | 7 +++-
tools/perf/bench/futex-lock-pi.c | 5 ++-
tools/perf/bench/futex-requeue.c | 6 ++-
tools/perf/bench/futex-wake-parallel.c | 9 ++-
tools/perf/bench/futex-wake.c | 4 ++-
tools/perf/bench/futex.c | 65 +++++++++++++++++++++++++-
tools/perf/bench/futex.h | 5 ++-
8 files changed, 101 insertions(+), 1 deletion(-)
create mode 100644 tools/perf/bench/futex.c
diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 279ab2a..b558ab9 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -3,6 +3,7 @@ perf-bench-y += sched-pipe.o
perf-bench-y += sched-seccomp-notify.o
perf-bench-y += syscall.o
perf-bench-y += mem-functions.o
+perf-bench-y += futex.o
perf-bench-y += futex-hash.o
perf-bench-y += futex-wake.o
perf-bench-y += futex-wake-parallel.o
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index b472ede..fdf133c 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -18,9 +18,11 @@
#include <stdlib.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
+#include <linux/prctl.h>
#include <linux/zalloc.h>
#include <sys/time.h>
#include <sys/mman.h>
+#include <sys/prctl.h>
#include <perf/cpumap.h>
#include "../util/mutex.h"
@@ -50,9 +52,12 @@ struct worker {
static struct bench_futex_parameters params = {
.nfutexes = 1024,
.runtime = 10,
+ .nbuckets = -1,
};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", ¶ms.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", ¶ms.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", ¶ms.nthreads, "Specify amount of threads"),
OPT_UINTEGER('r', "runtime", ¶ms.runtime, "Specify runtime (in seconds)"),
OPT_UINTEGER('f', "futexes", ¶ms.nfutexes, "Specify amount of futexes per threads"),
@@ -118,6 +123,7 @@ static void print_summary(void)
printf("%sAveraged %ld operations/sec (+- %.2f%%), total secs = %d\n",
!params.silent ? "\n" : "", avg, rel_stddev_stats(stddev, avg),
(int)bench__runtime.tv_sec);
+ futex_print_nbuckets(¶ms);
}
int bench_futex_hash(int argc, const char **argv)
@@ -161,6 +167,7 @@ int bench_futex_hash(int argc, const char **argv)
if (!params.fshared)
futex_flag = FUTEX_PRIVATE_FLAG;
+ futex_set_nbuckets_param(¶ms);
printf("Run summary [PID %d]: %d threads, each operating on %d [%s] futexes for %d secs.\n\n",
getpid(), params.nthreads, params.nfutexes, params.fshared ? "shared":"private", params.runtime);
diff --git a/tools/perf/bench/futex-lock-pi.c b/tools/perf/bench/futex-lock-pi.c
index 0416120..5144a15 100644
--- a/tools/perf/bench/futex-lock-pi.c
+++ b/tools/perf/bench/futex-lock-pi.c
@@ -41,10 +41,13 @@ static struct stats throughput_stats;
static struct cond thread_parent, thread_worker;
static struct bench_futex_parameters params = {
+ .nbuckets = -1,
.runtime = 10,
};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", ¶ms.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", ¶ms.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", ¶ms.nthreads, "Specify amount of threads"),
OPT_UINTEGER('r', "runtime", ¶ms.runtime, "Specify runtime (in seconds)"),
OPT_BOOLEAN( 'M', "multi", ¶ms.multi, "Use multiple futexes"),
@@ -67,6 +70,7 @@ static void print_summary(void)
printf("%sAveraged %ld operations/sec (+- %.2f%%), total secs = %d\n",
!params.silent ? "\n" : "", avg, rel_stddev_stats(stddev, avg),
(int)bench__runtime.tv_sec);
+ futex_print_nbuckets(¶ms);
}
static void toggle_done(int sig __maybe_unused,
@@ -203,6 +207,7 @@ int bench_futex_lock_pi(int argc, const char **argv)
mutex_init(&thread_lock);
cond_init(&thread_parent);
cond_init(&thread_worker);
+ futex_set_nbuckets_param(¶ms);
threads_starting = params.nthreads;
gettimeofday(&bench__start, NULL);
diff --git a/tools/perf/bench/futex-requeue.c b/tools/perf/bench/futex-requeue.c
index aad5bfc..a2f91ee 100644
--- a/tools/perf/bench/futex-requeue.c
+++ b/tools/perf/bench/futex-requeue.c
@@ -42,6 +42,7 @@ static unsigned int threads_starting;
static int futex_flag = 0;
static struct bench_futex_parameters params = {
+ .nbuckets = -1,
/*
* How many tasks to requeue at a time.
* Default to 1 in order to make the kernel work more.
@@ -50,6 +51,8 @@ static struct bench_futex_parameters params = {
};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", ¶ms.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", ¶ms.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", ¶ms.nthreads, "Specify amount of threads"),
OPT_UINTEGER('q', "nrequeue", ¶ms.nrequeue, "Specify amount of threads to requeue at once"),
OPT_BOOLEAN( 's', "silent", ¶ms.silent, "Silent mode: do not display data/details"),
@@ -77,6 +80,7 @@ static void print_summary(void)
params.nthreads,
requeuetime_avg / USEC_PER_MSEC,
rel_stddev_stats(requeuetime_stddev, requeuetime_avg));
+ futex_print_nbuckets(¶ms);
}
static void *workerfn(void *arg __maybe_unused)
@@ -204,6 +208,8 @@ int bench_futex_requeue(int argc, const char **argv)
if (params.broadcast)
params.nrequeue = params.nthreads;
+ futex_set_nbuckets_param(¶ms);
+
printf("Run summary [PID %d]: Requeuing %d threads (from [%s] %p to %s%p), "
"%d at a time.\n\n", getpid(), params.nthreads,
params.fshared ? "shared":"private", &futex1,
diff --git a/tools/perf/bench/futex-wake-parallel.c b/tools/perf/bench/futex-wake-parallel.c
index 4352e31..ee66482 100644
--- a/tools/perf/bench/futex-wake-parallel.c
+++ b/tools/perf/bench/futex-wake-parallel.c
@@ -57,9 +57,13 @@ static struct stats waketime_stats, wakeup_stats;
static unsigned int threads_starting;
static int futex_flag = 0;
-static struct bench_futex_parameters params;
+static struct bench_futex_parameters params = {
+ .nbuckets = -1,
+};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", ¶ms.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", ¶ms.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", ¶ms.nthreads, "Specify amount of threads"),
OPT_UINTEGER('w', "nwakers", ¶ms.nwakes, "Specify amount of waking threads"),
OPT_BOOLEAN( 's', "silent", ¶ms.silent, "Silent mode: do not display data/details"),
@@ -218,6 +222,7 @@ static void print_summary(void)
params.nthreads,
waketime_avg / USEC_PER_MSEC,
rel_stddev_stats(waketime_stddev, waketime_avg));
+ futex_print_nbuckets(¶ms);
}
@@ -291,6 +296,8 @@ int bench_futex_wake_parallel(int argc, const char **argv)
if (!params.fshared)
futex_flag = FUTEX_PRIVATE_FLAG;
+ futex_set_nbuckets_param(¶ms);
+
printf("Run summary [PID %d]: blocking on %d threads (at [%s] "
"futex %p), %d threads waking up %d at a time.\n\n",
getpid(), params.nthreads, params.fshared ? "shared":"private",
diff --git a/tools/perf/bench/futex-wake.c b/tools/perf/bench/futex-wake.c
index 49b3c89..8d6107f 100644
--- a/tools/perf/bench/futex-wake.c
+++ b/tools/perf/bench/futex-wake.c
@@ -42,6 +42,7 @@ static unsigned int threads_starting;
static int futex_flag = 0;
static struct bench_futex_parameters params = {
+ .nbuckets = -1,
/*
* How many wakeups to do at a time.
* Default to 1 in order to make the kernel work more.
@@ -50,6 +51,8 @@ static struct bench_futex_parameters params = {
};
static const struct option options[] = {
+ OPT_INTEGER( 'b', "buckets", ¶ms.nbuckets, "Specify amount of hash buckets"),
+ OPT_BOOLEAN( 'I', "immutable", ¶ms.buckets_immutable, "Make the hash buckets immutable"),
OPT_UINTEGER('t', "threads", ¶ms.nthreads, "Specify amount of threads"),
OPT_UINTEGER('w', "nwakes", ¶ms.nwakes, "Specify amount of threads to wake at once"),
OPT_BOOLEAN( 's', "silent", ¶ms.silent, "Silent mode: do not display data/details"),
@@ -93,6 +96,7 @@ static void print_summary(void)
params.nthreads,
waketime_avg / USEC_PER_MSEC,
rel_stddev_stats(waketime_stddev, waketime_avg));
+ futex_print_nbuckets(¶ms);
}
static void block_threads(pthread_t *w, struct perf_cpu_map *cpu)
diff --git a/tools/perf/bench/futex.c b/tools/perf/bench/futex.c
new file mode 100644
index 0000000..02ae6c5
--- /dev/null
+++ b/tools/perf/bench/futex.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <err.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <linux/prctl.h>
+#include <sys/prctl.h>
+
+#include "futex.h"
+
+void futex_set_nbuckets_param(struct bench_futex_parameters *params)
+{
+ int ret;
+
+ if (params->nbuckets < 0)
+ return;
+
+ ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, params->nbuckets, params->buckets_immutable);
+ if (ret) {
+ printf("Requesting %d hash buckets failed: %d/%m\n",
+ params->nbuckets, ret);
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
+}
+
+void futex_print_nbuckets(struct bench_futex_parameters *params)
+{
+ char *futex_hash_mode;
+ int ret;
+
+ ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
+ if (params->nbuckets >= 0) {
+ if (ret != params->nbuckets) {
+ if (ret < 0) {
+ printf("Can't query number of buckets: %m\n");
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
+ printf("Requested number of hash buckets is not currently in use.\n");
+ printf("Requested: %d, currently in use: %d\n", params->nbuckets, ret);
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
+ if (params->nbuckets == 0) {
+ ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
+ } else {
+ ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+ if (ret < 0) {
+ printf("Can't check if the hash is immutable: %m\n");
+ err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+ }
+ ret = asprintf(&futex_hash_mode, "Futex hashing: %d hash buckets %s",
+ params->nbuckets,
+ ret == 1 ? "(immutable)" : "");
+ }
+ } else {
+ if (ret <= 0) {
+ ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
+ } else {
+ ret = asprintf(&futex_hash_mode, "Futex hashing: auto resized to %d buckets",
+ ret);
+ }
+ }
+ if (ret < 0)
+ err(EXIT_FAILURE, "ENOMEM, futex_hash_mode");
+ printf("%s\n", futex_hash_mode);
+ free(futex_hash_mode);
+}
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
index ebdc2b0..9c9a73f 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -25,6 +25,8 @@ struct bench_futex_parameters {
unsigned int nfutexes;
unsigned int nwakes;
unsigned int nrequeue;
+ int nbuckets;
+ bool buckets_immutable;
};
/**
@@ -143,4 +145,7 @@ futex_cmp_requeue_pi(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2,
val, opflags);
}
+void futex_set_nbuckets_param(struct bench_futex_parameters *params);
+void futex_print_nbuckets(struct bench_futex_parameters *params);
+
#endif /* _FUTEX_H */
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Implement FUTEX2_MPOL
2025-04-16 16:29 ` [PATCH v12 17/21] futex: Implement FUTEX2_MPOL Sebastian Andrzej Siewior
2025-05-02 18:45 ` Peter Zijlstra
@ 2025-05-08 10:33 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 109+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior, x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: c042c505210dc3453f378df432c10fff3d471bc5
Gitweb: https://git.kernel.org/tip/c042c505210dc3453f378df432c10fff3d471bc5
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 18:29:17 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:09 +02:00
futex: Implement FUTEX2_MPOL
Extend the futex2 interface to be aware of mempolicy.
When FUTEX2_MPOL is specified and there is a MPOL_PREFERRED or
home_node specified covering the futex address, use that hash-map.
Notably, in this case the futex will go to the global node hashtable,
even if it is a PRIVATE futex.
When FUTEX2_NUMA|FUTEX2_MPOL is specified and the user specified node
value is FUTEX_NO_NODE, the MPOL lookup (as described above) will be
tried first before reverting to setting node to the local node.
[bigeasy: add CONFIG_FUTEX_MPOL, add MPOL to FUTEX2_VALID_MASK, write
the node only to user if FUTEX_NO_NODE was supplied]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-18-bigeasy@linutronix.de
---
include/linux/mmap_lock.h | 4 +-
include/uapi/linux/futex.h | 2 +-
init/Kconfig | 5 ++-
kernel/futex/core.c | 116 +++++++++++++++++++++++++++++++-----
kernel/futex/futex.h | 6 +-
5 files changed, 115 insertions(+), 18 deletions(-)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 4706c67..e0eddfd 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -7,6 +7,7 @@
#include <linux/rwsem.h>
#include <linux/tracepoint-defs.h>
#include <linux/types.h>
+#include <linux/cleanup.h>
#define MMAP_LOCK_INITIALIZER(name) \
.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
@@ -211,6 +212,9 @@ static inline void mmap_read_unlock(struct mm_struct *mm)
up_read(&mm->mmap_lock);
}
+DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
+ mmap_read_lock(_T), mmap_read_unlock(_T))
+
static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
{
__mmap_lock_trace_released(mm, false);
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 6b94da4..7e2744e 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -63,7 +63,7 @@
#define FUTEX2_SIZE_U32 0x02
#define FUTEX2_SIZE_U64 0x03
#define FUTEX2_NUMA 0x04
- /* 0x08 */
+#define FUTEX2_MPOL 0x08
/* 0x10 */
/* 0x20 */
/* 0x40 */
diff --git a/init/Kconfig b/init/Kconfig
index 4b84da2..b373267 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1704,6 +1704,11 @@ config FUTEX_PRIVATE_HASH
depends on FUTEX && !BASE_SMALL && MMU
default y
+config FUTEX_MPOL
+ bool
+ depends on FUTEX && NUMA
+ default y
+
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 1490e64..19a2c65 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -43,6 +43,8 @@
#include <linux/slab.h>
#include <linux/prctl.h>
#include <linux/rcuref.h>
+#include <linux/mempolicy.h>
+#include <linux/mmap_lock.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -328,6 +330,75 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+#ifdef CONFIG_FUTEX_MPOL
+
+static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr)
+{
+ struct vm_area_struct *vma = vma_lookup(mm, addr);
+ struct mempolicy *mpol;
+ int node = FUTEX_NO_NODE;
+
+ if (!vma)
+ return FUTEX_NO_NODE;
+
+ mpol = vma_policy(vma);
+ if (!mpol)
+ return FUTEX_NO_NODE;
+
+ switch (mpol->mode) {
+ case MPOL_PREFERRED:
+ node = first_node(mpol->nodes);
+ break;
+ case MPOL_PREFERRED_MANY:
+ case MPOL_BIND:
+ if (mpol->home_node != NUMA_NO_NODE)
+ node = mpol->home_node;
+ break;
+ default:
+ break;
+ }
+
+ return node;
+}
+
+static int futex_key_to_node_opt(struct mm_struct *mm, unsigned long addr)
+{
+ int seq, node;
+
+ guard(rcu)();
+
+ if (!mmap_lock_speculate_try_begin(mm, &seq))
+ return -EBUSY;
+
+ node = __futex_key_to_node(mm, addr);
+
+ if (mmap_lock_speculate_retry(mm, seq))
+ return -EAGAIN;
+
+ return node;
+}
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+ int node;
+
+ node = futex_key_to_node_opt(mm, addr);
+ if (node >= FUTEX_NO_NODE)
+ return node;
+
+ guard(mmap_read_lock)(mm);
+ return __futex_key_to_node(mm, addr);
+}
+
+#else /* !CONFIG_FUTEX_MPOL */
+
+static int futex_mpol(struct mm_struct *mm, unsigned long addr)
+{
+ return FUTEX_NO_NODE;
+}
+
+#endif /* CONFIG_FUTEX_MPOL */
+
/**
* __futex_hash - Return the hash bucket
* @key: Pointer to the futex key for which the hash is calculated
@@ -342,18 +413,20 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
static struct futex_hash_bucket *
__futex_hash(union futex_key *key, struct futex_private_hash *fph)
{
- struct futex_hash_bucket *hb;
+ int node = key->both.node;
u32 hash;
- int node;
- hb = __futex_hash_private(key, fph);
- if (hb)
- return hb;
+ if (node == FUTEX_NO_NODE) {
+ struct futex_hash_bucket *hb;
+
+ hb = __futex_hash_private(key, fph);
+ if (hb)
+ return hb;
+ }
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / sizeof(u32),
key->both.offset);
- node = key->both.node;
if (node == FUTEX_NO_NODE) {
/*
@@ -480,6 +553,7 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
struct folio *folio;
struct address_space *mapping;
int node, err, size, ro = 0;
+ bool node_updated = false;
bool fshared;
fshared = flags & FLAGS_SHARED;
@@ -501,27 +575,37 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
if (unlikely(should_fail_futex(fshared)))
return -EFAULT;
+ node = FUTEX_NO_NODE;
+
if (flags & FLAGS_NUMA) {
u32 __user *naddr = (void *)uaddr + size / 2;
if (futex_get_value(&node, naddr))
return -EFAULT;
- if (node == FUTEX_NO_NODE) {
- node = numa_node_id();
- if (futex_put_value(node, naddr))
- return -EFAULT;
-
- } else if (node >= MAX_NUMNODES || !node_possible(node)) {
+ if (node != FUTEX_NO_NODE &&
+ (node >= MAX_NUMNODES || !node_possible(node)))
return -EINVAL;
- }
+ }
- key->both.node = node;
+ if (node == FUTEX_NO_NODE && (flags & FLAGS_MPOL)) {
+ node = futex_mpol(mm, address);
+ node_updated = true;
+ }
- } else {
- key->both.node = FUTEX_NO_NODE;
+ if (flags & FLAGS_NUMA) {
+ u32 __user *naddr = (void *)uaddr + size / 2;
+
+ if (node == FUTEX_NO_NODE) {
+ node = numa_node_id();
+ node_updated = true;
+ }
+ if (node_updated && futex_put_value(node, naddr))
+ return -EFAULT;
}
+ key->both.node = node;
+
/*
* PROCESS_PRIVATE futexes are fast.
* As the mm cannot disappear under us and the 'key' only needs
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index acc7953..069fc2a 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -39,6 +39,7 @@
#define FLAGS_HAS_TIMEOUT 0x0040
#define FLAGS_NUMA 0x0080
#define FLAGS_STRICT 0x0100
+#define FLAGS_MPOL 0x0200
/* FUTEX_ to FLAGS_ */
static inline unsigned int futex_to_flags(unsigned int op)
@@ -54,7 +55,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
return flags;
}
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_MPOL | FUTEX2_PRIVATE)
/* FUTEX2_ to FLAGS_ */
static inline unsigned int futex2_to_flags(unsigned int flags2)
@@ -67,6 +68,9 @@ static inline unsigned int futex2_to_flags(unsigned int flags2)
if (flags2 & FUTEX2_NUMA)
flags |= FLAGS_NUMA;
+ if (flags2 & FUTEX2_MPOL)
+ flags |= FLAGS_MPOL;
+
return flags;
}
* [tip: locking/futex] futex: Implement FUTEX2_NUMA
2025-04-16 16:29 ` [PATCH v12 16/21] futex: Implement FUTEX2_NUMA Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior, x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: cec199c5e39bde7191a08087cc3d002ccfab31ff
Gitweb: https://git.kernel.org/tip/cec199c5e39bde7191a08087cc3d002ccfab31ff
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 18:29:16 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:09 +02:00
futex: Implement FUTEX2_NUMA
Extend the futex2 interface to be NUMA aware.
When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.
struct futex_numa_32 {
u32 val;
u32 node;
};
When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.
When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.
This is done to avoid having to have a separate !numa hash-table.
[bigeasy: ensure to have at least hashsize of 4 in futex_init(), add
pr_info() for size and allocation information. Cast the naddr math to
void*]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-17-bigeasy@linutronix.de
---
include/linux/futex.h | 3 +-
include/uapi/linux/futex.h | 7 +++-
kernel/futex/core.c | 100 +++++++++++++++++++++++++++++-------
kernel/futex/futex.h | 33 ++++++++++--
4 files changed, 123 insertions(+), 20 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 40bc778..eccc997 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -34,6 +34,7 @@ union futex_key {
u64 i_seq;
unsigned long pgoff;
unsigned int offset;
+ /* unsigned int node; */
} shared;
struct {
union {
@@ -42,11 +43,13 @@ union futex_key {
};
unsigned long address;
unsigned int offset;
+ /* unsigned int node; */
} private;
struct {
u64 ptr;
unsigned long word;
unsigned int offset;
+ unsigned int node; /* NOT hashed! */
} both;
};
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index d2ee625..6b94da4 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -75,6 +75,13 @@
#define FUTEX_32 FUTEX2_SIZE_U32 /* historical accident :-( */
/*
+ * When FUTEX2_NUMA doubles the futex word, the second word is a node value.
+ * The special value -1 indicates no-node. This is the same value as
+ * NUMA_NO_NODE, except that value is not ABI, this is.
+ */
+#define FUTEX_NO_NODE (-1)
+
+/*
* Max numbers of elements in a futex_waitv array
*/
#define FUTEX_WAITV_MAX 128
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 8054fda..1490e64 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -36,6 +36,8 @@
#include <linux/pagemap.h>
#include <linux/debugfs.h>
#include <linux/plist.h>
+#include <linux/gfp.h>
+#include <linux/vmalloc.h>
#include <linux/memblock.h>
#include <linux/fault-inject.h>
#include <linux/slab.h>
@@ -51,11 +53,14 @@
* reside in the same cacheline.
*/
static struct {
- struct futex_hash_bucket *queues;
unsigned long hashmask;
+ unsigned int hashshift;
+ struct futex_hash_bucket *queues[MAX_NUMNODES];
} __futex_data __read_mostly __aligned(2*sizeof(long));
-#define futex_queues (__futex_data.queues)
-#define futex_hashmask (__futex_data.hashmask)
+
+#define futex_hashmask (__futex_data.hashmask)
+#define futex_hashshift (__futex_data.hashshift)
+#define futex_queues (__futex_data.queues)
struct futex_private_hash {
rcuref_t users;
@@ -339,15 +344,35 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
{
struct futex_hash_bucket *hb;
u32 hash;
+ int node;
hb = __futex_hash_private(key, fph);
if (hb)
return hb;
hash = jhash2((u32 *)key,
- offsetof(typeof(*key), both.offset) / 4,
+ offsetof(typeof(*key), both.offset) / sizeof(u32),
key->both.offset);
- return &futex_queues[hash & futex_hashmask];
+ node = key->both.node;
+
+ if (node == FUTEX_NO_NODE) {
+ /*
+ * In case of !FLAGS_NUMA, use some unused hash bits to pick a
+ * node -- this ensures regular futexes are interleaved across
+ * the nodes and avoids having to allocate multiple
+ * hash-tables.
+ *
+ * NOTE: this isn't perfectly uniform, but it is fast and
+ * handles sparse node masks.
+ */
+ node = (hash >> futex_hashshift) % nr_node_ids;
+ if (!node_possible(node)) {
+ node = find_next_bit_wrap(node_possible_map.bits,
+ nr_node_ids, node);
+ }
+ }
+
+ return &futex_queues[node][hash & futex_hashmask];
}
/**
@@ -454,25 +479,49 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
struct page *page;
struct folio *folio;
struct address_space *mapping;
- int err, ro = 0;
+ int node, err, size, ro = 0;
bool fshared;
fshared = flags & FLAGS_SHARED;
+ size = futex_size(flags);
+ if (flags & FLAGS_NUMA)
+ size *= 2;
/*
* The futex address must be "naturally" aligned.
*/
key->both.offset = address % PAGE_SIZE;
- if (unlikely((address % sizeof(u32)) != 0))
+ if (unlikely((address % size) != 0))
return -EINVAL;
address -= key->both.offset;
- if (unlikely(!access_ok(uaddr, sizeof(u32))))
+ if (unlikely(!access_ok(uaddr, size)))
return -EFAULT;
if (unlikely(should_fail_futex(fshared)))
return -EFAULT;
+ if (flags & FLAGS_NUMA) {
+ u32 __user *naddr = (void *)uaddr + size / 2;
+
+ if (futex_get_value(&node, naddr))
+ return -EFAULT;
+
+ if (node == FUTEX_NO_NODE) {
+ node = numa_node_id();
+ if (futex_put_value(node, naddr))
+ return -EFAULT;
+
+ } else if (node >= MAX_NUMNODES || !node_possible(node)) {
+ return -EINVAL;
+ }
+
+ key->both.node = node;
+
+ } else {
+ key->both.node = FUTEX_NO_NODE;
+ }
+
/*
* PROCESS_PRIVATE futexes are fast.
* As the mm cannot disappear under us and the 'key' only needs
@@ -1642,24 +1691,41 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
static int __init futex_init(void)
{
unsigned long hashsize, i;
- unsigned int futex_shift;
+ unsigned int order, n;
+ unsigned long size;
#ifdef CONFIG_BASE_SMALL
hashsize = 16;
#else
- hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+ hashsize = 256 * num_possible_cpus();
+ hashsize /= num_possible_nodes();
+ hashsize = max(4, hashsize);
+ hashsize = roundup_pow_of_two(hashsize);
#endif
+ futex_hashshift = ilog2(hashsize);
+ size = sizeof(struct futex_hash_bucket) * hashsize;
+ order = get_order(size);
- futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
- hashsize, 0, 0,
- &futex_shift, NULL,
- hashsize, hashsize);
- hashsize = 1UL << futex_shift;
+ for_each_node(n) {
+ struct futex_hash_bucket *table;
- for (i = 0; i < hashsize; i++)
- futex_hash_bucket_init(&futex_queues[i], NULL);
+ if (order > MAX_PAGE_ORDER)
+ table = vmalloc_huge_node(size, GFP_KERNEL, n);
+ else
+ table = alloc_pages_exact_nid(n, size, GFP_KERNEL);
+
+ BUG_ON(!table);
+
+ for (i = 0; i < hashsize; i++)
+ futex_hash_bucket_init(&table[i], NULL);
+
+ futex_queues[n] = table;
+ }
futex_hashmask = hashsize - 1;
+ pr_info("futex hash table entries: %lu (%lu bytes on %d NUMA nodes, total %lu KiB, %s).\n",
+ hashsize, size, num_possible_nodes(), size * num_possible_nodes() / 1024,
+ order > MAX_PAGE_ORDER ? "vmalloc" : "linear");
return 0;
}
core_initcall(futex_init);
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 899aed5..acc7953 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -54,7 +54,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
return flags;
}
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE)
/* FUTEX2_ to FLAGS_ */
static inline unsigned int futex2_to_flags(unsigned int flags2)
@@ -87,6 +87,19 @@ static inline bool futex_flags_valid(unsigned int flags)
if ((flags & FLAGS_SIZE_MASK) != FLAGS_SIZE_32)
return false;
+ /*
+ * Must be able to represent both FUTEX_NO_NODE and every valid nodeid
+ * in a futex word.
+ */
+ if (flags & FLAGS_NUMA) {
+ int bits = 8 * futex_size(flags);
+ u64 max = ~0ULL;
+
+ max >>= 64 - bits;
+ if (nr_node_ids >= max)
+ return false;
+ }
+
return true;
}
@@ -282,7 +295,7 @@ static inline int futex_cmpxchg_value_locked(u32 *curval, u32 __user *uaddr, u32
* This looks a bit overkill, but generally just results in a couple
* of instructions.
*/
-static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from)
+static __always_inline int futex_get_value(u32 *dest, u32 __user *from)
{
u32 val;
@@ -299,12 +312,26 @@ Efault:
return -EFAULT;
}
+static __always_inline int futex_put_value(u32 val, u32 __user *to)
+{
+ if (can_do_masked_user_access())
+ to = masked_user_access_begin(to);
+ else if (!user_read_access_begin(to, sizeof(*to)))
+ return -EFAULT;
+ unsafe_put_user(val, to, Efault);
+ user_read_access_end();
+ return 0;
+Efault:
+ user_read_access_end();
+ return -EFAULT;
+}
+
static inline int futex_get_value_locked(u32 *dest, u32 __user *from)
{
int ret;
pagefault_disable();
- ret = futex_read_inatomic(dest, from);
+ ret = futex_get_value(dest, from);
pagefault_enable();
return ret;
* [tip: locking/futex] futex: Allow to make the private hash immutable
2025-04-16 16:29 ` [PATCH v12 15/21] futex: Allow to make the private hash immutable Sebastian Andrzej Siewior
2025-05-02 18:01 ` Peter Zijlstra
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
1 sibling, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel),
Shrikanth Hegde, x86, linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 63e8595c060a1fef421e3eecfc05ad882dafb8ac
Gitweb: https://git.kernel.org/tip/63e8595c060a1fef421e3eecfc05ad882dafb8ac
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:15 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:08 +02:00
futex: Allow to make the private hash immutable
My initial testing showed that:
perf bench futex hash
reported fewer operations/sec with the private hash. After using the same
amount of buckets in the private hash as used by the global hash, the
operations/sec were about the same.
This changed once the private hash became resizable. This feature added
an RCU section and reference counting via atomic inc+dec operations into
the hot path.
The reference counting can be avoided if the private hash is made
immutable.
Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes whether
the private hash should be made immutable. Once set (to true), a further
resize is not allowed (the same applies if set to the global hash).
Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash cannot
be changed.
Update "perf bench" suite.
For comparison, results of "perf bench futex hash -s":
- Xeon CPU E5-2650, 2 NUMA nodes, total 32 CPUs:
- Before the introducing task local hash
shared Averaged 1.487.148 operations/sec (+- 0,53%), total secs = 10
private Averaged 2.192.405 operations/sec (+- 0,07%), total secs = 10
- With the series
shared Averaged 1.326.342 operations/sec (+- 0,41%), total secs = 10
-b128 Averaged 141.394 operations/sec (+- 1,15%), total secs = 10
-Ib128 Averaged 851.490 operations/sec (+- 0,67%), total secs = 10
-b8192 Averaged 131.321 operations/sec (+- 2,13%), total secs = 10
-Ib8192 Averaged 1.923.077 operations/sec (+- 0,61%), total secs = 10
128 is the default allocation of hash buckets.
8192 was the previous amount of allocated hash buckets.
- Xeon(R) CPU E7-8890 v3, 4 NUMA nodes, total 144 CPUs:
- Before the introducing task local hash
shared Averaged 1.810.936 operations/sec (+- 0,26%), total secs = 20
private Averaged 2.505.801 operations/sec (+- 0,05%), total secs = 20
- With the series
shared Averaged 1.589.002 operations/sec (+- 0,25%), total secs = 20
-b1024 Averaged 42.410 operations/sec (+- 0,20%), total secs = 20
-Ib1024 Averaged 740.638 operations/sec (+- 1,51%), total secs = 20
-b65536 Averaged 48.811 operations/sec (+- 1,35%), total secs = 20
-Ib65536 Averaged 1.963.165 operations/sec (+- 0,18%), total secs = 20
1024 is the default allocation of hash buckets.
65536 was the previous amount of allocated hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20250416162921.513656-16-bigeasy@linutronix.de
---
include/uapi/linux/prctl.h | 2 ++-
kernel/futex/core.c | 49 ++++++++++++++++++++++++++++++++-----
2 files changed, 45 insertions(+), 6 deletions(-)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 3b93fb9..43dec6e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -367,6 +367,8 @@ struct prctl_mm_map {
/* FUTEX hash management */
#define PR_FUTEX_HASH 78
# define PR_FUTEX_HASH_SET_SLOTS 1
+# define FH_FLAG_IMMUTABLE (1ULL << 0)
# define PR_FUTEX_HASH_GET_SLOTS 2
+# define PR_FUTEX_HASH_GET_IMMUTABLE 3
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 9e7dad5..8054fda 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -63,6 +63,7 @@ struct futex_private_hash {
struct rcu_head rcu;
void *mm;
bool custom;
+ bool immutable;
struct futex_hash_bucket queues[];
};
@@ -132,12 +133,16 @@ static inline bool futex_key_is_private(union futex_key *key)
bool futex_private_hash_get(struct futex_private_hash *fph)
{
+ if (fph->immutable)
+ return true;
return rcuref_get(&fph->users);
}
void futex_private_hash_put(struct futex_private_hash *fph)
{
/* Ignore return value, last put is verified via rcuref_is_dead() */
+ if (fph->immutable)
+ return;
if (rcuref_put(&fph->users))
wake_up_var(fph->mm);
}
@@ -277,6 +282,8 @@ again:
if (!fph)
return NULL;
+ if (fph->immutable)
+ return fph;
if (rcuref_get(&fph->users))
return fph;
}
@@ -1383,6 +1390,9 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
spin_lock_init(&fhb->lock);
}
+#define FH_CUSTOM 0x01
+#define FH_IMMUTABLE 0x02
+
#ifdef CONFIG_FUTEX_PRIVATE_HASH
void futex_hash_free(struct mm_struct *mm)
{
@@ -1433,10 +1443,11 @@ static bool futex_hash_less(struct futex_private_hash *a,
return false; /* equal */
}
-static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
{
struct mm_struct *mm = current->mm;
struct futex_private_hash *fph;
+ bool custom = flags & FH_CUSTOM;
int i;
if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
@@ -1447,7 +1458,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
*/
scoped_guard(rcu) {
fph = rcu_dereference(mm->futex_phash);
- if (fph && !fph->hash_mask) {
+ if (fph && (!fph->hash_mask || fph->immutable)) {
if (custom)
return -EBUSY;
return 0;
@@ -1461,6 +1472,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
rcuref_init(&fph->users, 1);
fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
fph->custom = custom;
+ fph->immutable = !!(flags & FH_IMMUTABLE);
fph->mm = mm;
for (i = 0; i < hash_slots; i++)
@@ -1553,7 +1565,7 @@ int futex_hash_allocate_default(void)
if (current_buckets >= buckets)
return 0;
- return futex_hash_allocate(buckets, false);
+ return futex_hash_allocate(buckets, 0);
}
static int futex_hash_get_slots(void)
@@ -1567,9 +1579,22 @@ static int futex_hash_get_slots(void)
return 0;
}
+static int futex_hash_get_immutable(void)
+{
+ struct futex_private_hash *fph;
+
+ guard(rcu)();
+ fph = rcu_dereference(current->mm->futex_phash);
+ if (fph && fph->immutable)
+ return 1;
+ if (fph && !fph->hash_mask)
+ return 1;
+ return 0;
+}
+
#else
-static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
{
return -EINVAL;
}
@@ -1578,23 +1603,35 @@ static int futex_hash_get_slots(void)
{
return 0;
}
+
+static int futex_hash_get_immutable(void)
+{
+ return 0;
+}
#endif
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
{
+ unsigned int flags = FH_CUSTOM;
int ret;
switch (arg2) {
case PR_FUTEX_HASH_SET_SLOTS:
- if (arg4 != 0)
+ if (arg4 & ~FH_FLAG_IMMUTABLE)
return -EINVAL;
- ret = futex_hash_allocate(arg3, true);
+ if (arg4 & FH_FLAG_IMMUTABLE)
+ flags |= FH_IMMUTABLE;
+ ret = futex_hash_allocate(arg3, flags);
break;
case PR_FUTEX_HASH_GET_SLOTS:
ret = futex_hash_get_slots();
break;
+ case PR_FUTEX_HASH_GET_IMMUTABLE:
+ ret = futex_hash_get_immutable();
+ break;
+
default:
ret = -EINVAL;
break;
* [tip: locking/futex] futex: Allow to resize the private local hash
2025-04-16 16:29 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
2025-05-08 20:32 ` [PATCH v12 14/21] " André Almeida
` (2 subsequent siblings)
3 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: bd54df5ea7cadac520e346d5f0fe5d58e635b6ba
Gitweb: https://git.kernel.org/tip/bd54df5ea7cadac520e346d5f0fe5d58e635b6ba
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:14 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:08 +02:00
futex: Allow to resize the private local hash
The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
replacement. The futex_hash_allocate()/ PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime and resize an already existing
internal private futex_hash_bucket to another size.
The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it enqueued itself on the hash bucket. There is no
reference held while the task is scheduled out while waiting for the
wake up.
The resize process allocates a new struct futex_private_hash and drops
the initial reference. Synchronized with mm_struct::futex_hash_lock it
is checked if the reference counter for the currently used
mm_struct::futex_phash is marked as DEAD. If so, then all users enqueued
on the current private hash are requeued on the new private hash and the
new private hash is set to mm_struct::futex_phash. Otherwise the newly
allocated private hash is saved as mm_struct::futex_phash_new and the
rehashing and reassigning is delayed to the futex_hash() caller once the
reference counter is marked DEAD.
The replacement is not performed at rcuref_put() time because certain
callers, such as futex_wait_queue(), drop their reference after changing
the task state. This change will be destroyed once the futex_hash_lock
is acquired.
The user can change the number of slots with PR_FUTEX_HASH_SET_SLOTS
multiple times. An increase and a decrease are allowed, and the request
blocks until the assignment is done.
The private hash allocated at thread creation is changed from 16 to
16 <= 4 * number_of_threads <= global_hash_size
where number_of_threads cannot exceed the number of online CPUs. Should
the user issue PR_FUTEX_HASH_SET_SLOTS, the auto scaling is disabled.
[peterz: reorganize the code to avoid state tracking and simplify new
object handling, block the user until changes are in effect, allow
increase and decrease of the hash].
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-15-bigeasy@linutronix.de
---
include/linux/futex.h | 3 +-
include/linux/mm_types.h | 4 +-
kernel/futex/core.c | 290 +++++++++++++++++++++++++++++++++++---
kernel/futex/requeue.c | 5 +-
4 files changed, 281 insertions(+), 21 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 1d3f755..40bc778 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -85,7 +85,8 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
- mm->futex_phash = NULL;
+ rcu_assign_pointer(mm->futex_phash, NULL);
+ mutex_init(&mm->futex_hash_lock);
}
#else /* !CONFIG_FUTEX_PRIVATE_HASH */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a4b5661..32ba512 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1033,7 +1033,9 @@ struct mm_struct {
seqcount_t mm_lock_seq;
#endif
#ifdef CONFIG_FUTEX_PRIVATE_HASH
- struct futex_private_hash *futex_phash;
+ struct mutex futex_hash_lock;
+ struct futex_private_hash __rcu *futex_phash;
+ struct futex_private_hash *futex_phash_new;
#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 53b3a00..9e7dad5 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
+#include <linux/rcuref.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -57,7 +58,9 @@ static struct {
#define futex_hashmask (__futex_data.hashmask)
struct futex_private_hash {
+ rcuref_t users;
unsigned int hash_mask;
+ struct rcu_head rcu;
void *mm;
bool custom;
struct futex_hash_bucket queues[];
@@ -129,11 +132,14 @@ static inline bool futex_key_is_private(union futex_key *key)
bool futex_private_hash_get(struct futex_private_hash *fph)
{
- return false;
+ return rcuref_get(&fph->users);
}
void futex_private_hash_put(struct futex_private_hash *fph)
{
+ /* Ignore return value, last put is verified via rcuref_is_dead() */
+ if (rcuref_put(&fph->users))
+ wake_up_var(fph->mm);
}
/**
@@ -143,8 +149,23 @@ void futex_private_hash_put(struct futex_private_hash *fph)
* Obtain an additional reference for the already obtained hash bucket. The
* caller must already own an reference.
*/
-void futex_hash_get(struct futex_hash_bucket *hb) { }
-void futex_hash_put(struct futex_hash_bucket *hb) { }
+void futex_hash_get(struct futex_hash_bucket *hb)
+{
+ struct futex_private_hash *fph = hb->priv;
+
+ if (!fph)
+ return;
+ WARN_ON_ONCE(!futex_private_hash_get(fph));
+}
+
+void futex_hash_put(struct futex_hash_bucket *hb)
+{
+ struct futex_private_hash *fph = hb->priv;
+
+ if (!fph)
+ return;
+ futex_private_hash_put(fph);
+}
static struct futex_hash_bucket *
__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
@@ -155,7 +176,7 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
return NULL;
if (!fph)
- fph = key->private.mm->futex_phash;
+ fph = rcu_dereference(key->private.mm->futex_phash);
if (!fph || !fph->hash_mask)
return NULL;
@@ -165,21 +186,119 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
return &fph->queues[hash & fph->hash_mask];
}
+static void futex_rehash_private(struct futex_private_hash *old,
+ struct futex_private_hash *new)
+{
+ struct futex_hash_bucket *hb_old, *hb_new;
+ unsigned int slots = old->hash_mask + 1;
+ unsigned int i;
+
+ for (i = 0; i < slots; i++) {
+ struct futex_q *this, *tmp;
+
+ hb_old = &old->queues[i];
+
+ spin_lock(&hb_old->lock);
+ plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
+
+ plist_del(&this->list, &hb_old->chain);
+ futex_hb_waiters_dec(hb_old);
+
+ WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
+
+ hb_new = __futex_hash(&this->key, new);
+ futex_hb_waiters_inc(hb_new);
+ /*
+ * The new pointer isn't published yet but an already
+ * moved user can be unqueued due to timeout or signal.
+ */
+ spin_lock_nested(&hb_new->lock, SINGLE_DEPTH_NESTING);
+ plist_add(&this->list, &hb_new->chain);
+ this->lock_ptr = &hb_new->lock;
+ spin_unlock(&hb_new->lock);
+ }
+ spin_unlock(&hb_old->lock);
+ }
+}
+
+static bool __futex_pivot_hash(struct mm_struct *mm,
+ struct futex_private_hash *new)
+{
+ struct futex_private_hash *fph;
+
+ WARN_ON_ONCE(mm->futex_phash_new);
+
+ fph = rcu_dereference_protected(mm->futex_phash,
+ lockdep_is_held(&mm->futex_hash_lock));
+ if (fph) {
+ if (!rcuref_is_dead(&fph->users)) {
+ mm->futex_phash_new = new;
+ return false;
+ }
+
+ futex_rehash_private(fph, new);
+ }
+ rcu_assign_pointer(mm->futex_phash, new);
+ kvfree_rcu(fph, rcu);
+ return true;
+}
+
+static void futex_pivot_hash(struct mm_struct *mm)
+{
+ scoped_guard(mutex, &mm->futex_hash_lock) {
+ struct futex_private_hash *fph;
+
+ fph = mm->futex_phash_new;
+ if (fph) {
+ mm->futex_phash_new = NULL;
+ __futex_pivot_hash(mm, fph);
+ }
+ }
+}
+
struct futex_private_hash *futex_private_hash(void)
{
struct mm_struct *mm = current->mm;
- struct futex_private_hash *fph;
+ /*
+ * Ideally we don't loop. If there is a replacement in progress
+ * then a new private hash is already prepared and a reference can't be
+ * obtained once the last user dropped it's.
+ * In that case we block on mm_struct::futex_hash_lock and either have
+ * to perform the replacement or wait while someone else is doing the
+ * job. Eitherway, on the second iteration we acquire a reference on the
+ * new private hash or loop again because a new replacement has been
+ * requested.
+ */
+again:
+ scoped_guard(rcu) {
+ struct futex_private_hash *fph;
- fph = mm->futex_phash;
- return fph;
+ fph = rcu_dereference(mm->futex_phash);
+ if (!fph)
+ return NULL;
+
+ if (rcuref_get(&fph->users))
+ return fph;
+ }
+ futex_pivot_hash(mm);
+ goto again;
}
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
+ struct futex_private_hash *fph;
struct futex_hash_bucket *hb;
- hb = __futex_hash(key, NULL);
- return hb;
+again:
+ scoped_guard(rcu) {
+ hb = __futex_hash(key, NULL);
+ fph = hb->priv;
+
+ if (!fph || futex_private_hash_get(fph))
+ return hb;
+ }
+ futex_pivot_hash(key->private.mm);
+ goto again;
}
#else /* !CONFIG_FUTEX_PRIVATE_HASH */
@@ -664,6 +783,8 @@ int futex_unqueue(struct futex_q *q)
spinlock_t *lock_ptr;
int ret = 0;
+ /* RCU so lock_ptr is not going away during locking. */
+ guard(rcu)();
/* In the common case we don't take the spinlock, which is nice. */
retry:
/*
@@ -1066,6 +1187,10 @@ static void exit_pi_state_list(struct task_struct *curr)
union futex_key key = FUTEX_KEY_INIT;
/*
+ * The mutex mm_struct::futex_hash_lock might be acquired.
+ */
+ might_sleep();
+ /*
* Ensure the hash remains stable (no resize) during the while loop
* below. The hb pointer is acquired under the pi_lock so we can't block
* on the mutex.
@@ -1261,7 +1386,51 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
#ifdef CONFIG_FUTEX_PRIVATE_HASH
void futex_hash_free(struct mm_struct *mm)
{
- kvfree(mm->futex_phash);
+ struct futex_private_hash *fph;
+
+ kvfree(mm->futex_phash_new);
+ fph = rcu_dereference_raw(mm->futex_phash);
+ if (fph) {
+ WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+ kvfree(fph);
+ }
+}
+
+static bool futex_pivot_pending(struct mm_struct *mm)
+{
+ struct futex_private_hash *fph;
+
+ guard(rcu)();
+
+ if (!mm->futex_phash_new)
+ return true;
+
+ fph = rcu_dereference(mm->futex_phash);
+ return rcuref_is_dead(&fph->users);
+}
+
+static bool futex_hash_less(struct futex_private_hash *a,
+ struct futex_private_hash *b)
+{
+ /* user provided always wins */
+ if (!a->custom && b->custom)
+ return true;
+ if (a->custom && !b->custom)
+ return false;
+
+ /* zero-sized hash wins */
+ if (!b->hash_mask)
+ return true;
+ if (!a->hash_mask)
+ return false;
+
+ /* keep the biggest */
+ if (a->hash_mask < b->hash_mask)
+ return true;
+ if (a->hash_mask > b->hash_mask)
+ return false;
+
+ return false; /* equal */
}
static int futex_hash_allocate(unsigned int hash_slots, bool custom)
@@ -1273,16 +1442,23 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
return -EINVAL;
- if (mm->futex_phash)
- return -EALREADY;
-
- if (!thread_group_empty(current))
- return -EINVAL;
+ /*
+ * Once we've disabled the global hash there is no way back.
+ */
+ scoped_guard(rcu) {
+ fph = rcu_dereference(mm->futex_phash);
+ if (fph && !fph->hash_mask) {
+ if (custom)
+ return -EBUSY;
+ return 0;
+ }
+ }
fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!fph)
return -ENOMEM;
+ rcuref_init(&fph->users, 1);
fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
fph->custom = custom;
fph->mm = mm;
@@ -1290,26 +1466,102 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
for (i = 0; i < hash_slots; i++)
futex_hash_bucket_init(&fph->queues[i], fph);
- mm->futex_phash = fph;
+ if (custom) {
+ /*
+ * Only let prctl() wait / retry; don't unduly delay clone().
+ */
+again:
+ wait_var_event(mm, futex_pivot_pending(mm));
+ }
+
+ scoped_guard(mutex, &mm->futex_hash_lock) {
+ struct futex_private_hash *free __free(kvfree) = NULL;
+ struct futex_private_hash *cur, *new;
+
+ cur = rcu_dereference_protected(mm->futex_phash,
+ lockdep_is_held(&mm->futex_hash_lock));
+ new = mm->futex_phash_new;
+ mm->futex_phash_new = NULL;
+
+ if (fph) {
+ if (cur && !new) {
+ /*
+ * If we have an existing hash, but do not yet have
+ * allocated a replacement hash, drop the initial
+ * reference on the existing hash.
+ */
+ futex_private_hash_put(cur);
+ }
+
+ if (new) {
+ /*
+ * Two updates raced; throw out the lesser one.
+ */
+ if (futex_hash_less(new, fph)) {
+ free = new;
+ new = fph;
+ } else {
+ free = fph;
+ }
+ } else {
+ new = fph;
+ }
+ fph = NULL;
+ }
+
+ if (new) {
+ /*
+ * Will set mm->futex_phash_new on failure;
+ * futex_private_hash_get() will try again.
+ */
+ if (!__futex_pivot_hash(mm, new) && custom)
+ goto again;
+ }
+ }
return 0;
}
int futex_hash_allocate_default(void)
{
+ unsigned int threads, buckets, current_buckets = 0;
+ struct futex_private_hash *fph;
+
if (!current->mm)
return 0;
- if (current->mm->futex_phash)
+ scoped_guard(rcu) {
+ threads = min_t(unsigned int,
+ get_nr_threads(current),
+ num_online_cpus());
+
+ fph = rcu_dereference(current->mm->futex_phash);
+ if (fph) {
+ if (fph->custom)
+ return 0;
+
+ current_buckets = fph->hash_mask + 1;
+ }
+ }
+
+ /*
+ * The default allocation will remain within
+ * 16 <= threads * 4 <= global hash size
+ */
+ buckets = roundup_pow_of_two(4 * threads);
+ buckets = clamp(buckets, 16, futex_hashmask + 1);
+
+ if (current_buckets >= buckets)
return 0;
- return futex_hash_allocate(16, false);
+ return futex_hash_allocate(buckets, false);
}
static int futex_hash_get_slots(void)
{
struct futex_private_hash *fph;
- fph = current->mm->futex_phash;
+ guard(rcu)();
+ fph = rcu_dereference(current->mm->futex_phash);
if (fph && fph->hash_mask)
return fph->hash_mask + 1;
return 0;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b0e64fd..c716a66 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -87,6 +87,11 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
futex_hb_waiters_inc(hb2);
plist_add(&q->list, &hb2->chain);
q->lock_ptr = &hb2->lock;
+ /*
+ * hb1 and hb2 belong to the same futex_hash_bucket_private
+ * because if we managed get a reference on hb1 then it can't be
+ * replaced. Therefore we avoid put(hb1)+get(hb2) here.
+ */
}
q->key = *key2;
}
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Allow automatic allocation of process wide futex hash
2025-04-16 16:29 ` [PATCH v12 13/21] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 7c4f75a21f636486d2969d9b6680403ea8483539
Gitweb: https://git.kernel.org/tip/7c4f75a21f636486d2969d9b6680403ea8483539
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:13 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:08 +02:00
futex: Allow automatic allocation of process wide futex hash
Allocate a private futex hash with 16 slots if a task forks its first
thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-14-bigeasy@linutronix.de
---
include/linux/futex.h | 6 ++++++
kernel/fork.c | 22 ++++++++++++++++++++++
kernel/futex/core.c | 11 +++++++++++
3 files changed, 39 insertions(+)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 8f1be08..1d3f755 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -80,6 +80,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
#ifdef CONFIG_FUTEX_PRIVATE_HASH
+int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
@@ -88,6 +89,7 @@ static inline void futex_mm_init(struct mm_struct *mm)
}
#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate_default(void) { return 0; }
static inline void futex_hash_free(struct mm_struct *mm) { }
static inline void futex_mm_init(struct mm_struct *mm) { }
#endif /* CONFIG_FUTEX_PRIVATE_HASH */
@@ -107,6 +109,10 @@ static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsig
{
return -EINVAL;
}
+static inline int futex_hash_allocate_default(void)
+{
+ return 0;
+}
static inline void futex_hash_free(struct mm_struct *mm) { }
static inline void futex_mm_init(struct mm_struct *mm) { }
diff --git a/kernel/fork.c b/kernel/fork.c
index 831dfec..1f5d808 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2164,6 +2164,13 @@ static void rv_task_fork(struct task_struct *p)
#define rv_task_fork(p) do {} while (0)
#endif
+static bool need_futex_hash_allocate_default(u64 clone_flags)
+{
+ if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
+ return false;
+ return true;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2545,6 +2552,21 @@ __latent_entropy struct task_struct *copy_process(
goto bad_fork_cancel_cgroup;
/*
+ * Allocate a default futex hash for the user process once the first
+ * thread spawns.
+ */
+ if (need_futex_hash_allocate_default(clone_flags)) {
+ retval = futex_hash_allocate_default();
+ if (retval)
+ goto bad_fork_core_free;
+ /*
+ * If we fail beyond this point we don't free the allocated
+ * futex hash map. We assume that another thread will be created
+ * and makes use of it. The hash map will be freed once the main
+ * thread terminates.
+ */
+ }
+ /*
* From this point on we must avoid any synchronous user-space
* communication until we take the tasklist-lock. In particular, we do
* not want user-space to be able to predict the process start-time by
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 818df74..53b3a00 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1294,6 +1294,17 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
return 0;
}
+int futex_hash_allocate_default(void)
+{
+ if (!current->mm)
+ return 0;
+
+ if (current->mm->futex_phash)
+ return 0;
+
+ return futex_hash_allocate(16, false);
+}
+
static int futex_hash_get_slots(void)
{
struct futex_private_hash *fph;
* [tip: locking/futex] futex: Add basic infrastructure for local task local hash
2025-04-16 16:29 ` [PATCH v12 12/21] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 80367ad01d93ac781b0e1df246edaf006928002f
Gitweb: https://git.kernel.org/tip/80367ad01d93ac781b0e1df246edaf006928002f
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:12 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:07 +02:00
futex: Add basic infrastructure for local task local hash
The futex hash is system wide and shared by all tasks. Each slot
is hashed based on futex address and the VMA of the thread. Due to
randomized VMAs (and memory allocations) the same logical lock (pointer)
can end up in a different hash bucket on each invocation of the
application. This in turn means that different applications may share a
hash bucket on the first invocation but not on the second and it is not
always clear which applications will be involved. This can result in
high latency's to acquire the futex_hash_bucket::lock especially if the
lock owner is limited to a CPU and can not be effectively PI boosted.
Introduce basic infrastructure for a process-local hash which is shared
by all threads of a process. This hash is only used for
PROCESS_PRIVATE FUTEX operations.
The hashmap can be allocated via:
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num);
A `num' of 0 means that the global hash is used instead of a private
hash.
Other values for `num' specify the number of slots for the hash; the
number must be a power of two, starting with two.
The prctl() returns zero on success. It can only be used
before a thread is created.
The current status for the private hash can be queried via:
num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which returns the current number of slots. The value 0 means that the
global hash is used. Values greater than 0 indicate the number of slots
that are used. A negative number indicates an error.
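The prctl() interface above can be exercised along these lines. This is a hedged userspace sketch: the `PR_FUTEX_HASH` constants come from the patch (defined locally in case the installed headers predate it), and `futex_hash_slots_valid()`/`demo_futex_hash_prctl()` are illustrative names, not kernel or libc API.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/prctl.h>

/* From the patch's <linux/prctl.h>; defined here for older headers. */
#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
#define PR_FUTEX_HASH_SET_SLOTS	1
#define PR_FUTEX_HASH_GET_SLOTS	2
#endif

/*
 * Mirrors the validation described above: 0 selects the global hash,
 * otherwise the count must be a power of two of at least 2.
 */
static bool futex_hash_slots_valid(unsigned long num)
{
	if (num == 0)
		return true;
	if (num == 1)
		return false;
	return (num & (num - 1)) == 0;
}

/*
 * Illustrative sketch: request a private hash before spawning threads,
 * then query the current setting. Returns the slot count, 0 for the
 * global hash, or -1 with errno set (e.g. EINVAL on kernels without
 * PR_FUTEX_HASH support).
 */
static long demo_futex_hash_prctl(unsigned long num)
{
	if (!futex_hash_slots_valid(num)) {
		errno = EINVAL;
		return -1;
	}
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num, 0UL, 0UL) != 0)
		return -1;
	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0UL, 0UL, 0UL);
}
```

A caller would invoke `demo_futex_hash_prctl(8)` from the main thread before creating any others, since the prctl() is rejected once the thread group is populated.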
As an optimisation, jhash2() for the private hash uses only two inputs,
the address and the offset; the VMA, which is always the same, is omitted.
[peterz: Use 0 for global hash. A bit shuffling and renaming. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
---
include/linux/futex.h | 26 ++++-
include/linux/mm_types.h | 5 +-
include/uapi/linux/prctl.h | 5 +-
init/Kconfig | 5 +-
kernel/fork.c | 2 +-
kernel/futex/core.c | 208 ++++++++++++++++++++++++++++++++----
kernel/futex/futex.h | 10 ++-
kernel/sys.c | 4 +-
8 files changed, 244 insertions(+), 21 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27..8f1be08 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -4,11 +4,11 @@
#include <linux/sched.h>
#include <linux/ktime.h>
+#include <linux/mm_types.h>
#include <uapi/linux/futex.h>
struct inode;
-struct mm_struct;
struct task_struct;
/*
@@ -77,7 +77,22 @@ void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
-#else
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+void futex_hash_free(struct mm_struct *mm);
+
+static inline void futex_mm_init(struct mm_struct *mm)
+{
+ mm->futex_phash = NULL;
+}
+
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+
+#else /* !CONFIG_FUTEX */
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
static inline void futex_exit_release(struct task_struct *tsk) { }
@@ -88,6 +103,13 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
{
return -EINVAL;
}
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
+{
+ return -EINVAL;
+}
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+
#endif
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07ed..a4b5661 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -31,6 +31,7 @@
#define INIT_PASID 0
struct address_space;
+struct futex_private_hash;
struct mem_cgroup;
/*
@@ -1031,7 +1032,9 @@ struct mm_struct {
*/
seqcount_t mm_lock_seq;
#endif
-
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+ struct futex_private_hash *futex_phash;
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef..3b93fb9 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,9 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+/* FUTEX hash management */
+#define PR_FUTEX_HASH 78
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+
#endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 63f5974..4b84da2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1699,6 +1699,11 @@ config FUTEX_PI
depends on FUTEX && RT_MUTEXES
default y
+config FUTEX_PRIVATE_HASH
+ bool
+ depends on FUTEX && !BASE_SMALL && MMU
+ default y
+
config EPOLL
bool "Enable eventpoll support" if EXPERT
default y
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd..831dfec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1305,6 +1305,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
+ futex_mm_init(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
mm->pmd_huge_pte = NULL;
#endif
@@ -1387,6 +1388,7 @@ static inline void __mmput(struct mm_struct *mm)
if (mm->binfmt)
module_put(mm->binfmt->module);
lru_gen_del_mm(mm);
+ futex_hash_free(mm);
mmdrop(mm);
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index afc6678..818df74 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -39,6 +39,7 @@
#include <linux/memblock.h>
#include <linux/fault-inject.h>
#include <linux/slab.h>
+#include <linux/prctl.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -55,6 +56,12 @@ static struct {
#define futex_queues (__futex_data.queues)
#define futex_hashmask (__futex_data.hashmask)
+struct futex_private_hash {
+ unsigned int hash_mask;
+ void *mm;
+ bool custom;
+ struct futex_hash_bucket queues[];
+};
/*
* Fault injections for futexes.
@@ -107,9 +114,17 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
-struct futex_private_hash *futex_private_hash(void)
+static struct futex_hash_bucket *
+__futex_hash(union futex_key *key, struct futex_private_hash *fph);
+
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+static inline bool futex_key_is_private(union futex_key *key)
{
- return NULL;
+ /*
+ * Relies on get_futex_key() to set either bit for shared
+ * futexes -- see comment with union futex_key.
+ */
+ return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
}
bool futex_private_hash_get(struct futex_private_hash *fph)
@@ -117,21 +132,8 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
return false;
}
-void futex_private_hash_put(struct futex_private_hash *fph) { }
-
-/**
- * futex_hash - Return the hash bucket in the global hash
- * @key: Pointer to the futex key for which the hash is calculated
- *
- * We hash on the keys returned from get_futex_key (see below) and return the
- * corresponding hash bucket in the global hash.
- */
-struct futex_hash_bucket *futex_hash(union futex_key *key)
+void futex_private_hash_put(struct futex_private_hash *fph)
{
- u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
- key->both.offset);
-
- return &futex_queues[hash & futex_hashmask];
}
/**
@@ -144,6 +146,84 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
void futex_hash_get(struct futex_hash_bucket *hb) { }
void futex_hash_put(struct futex_hash_bucket *hb) { }
+static struct futex_hash_bucket *
+__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
+{
+ u32 hash;
+
+ if (!futex_key_is_private(key))
+ return NULL;
+
+ if (!fph)
+ fph = key->private.mm->futex_phash;
+ if (!fph || !fph->hash_mask)
+ return NULL;
+
+ hash = jhash2((void *)&key->private.address,
+ sizeof(key->private.address) / 4,
+ key->both.offset);
+ return &fph->queues[hash & fph->hash_mask];
+}
+
+struct futex_private_hash *futex_private_hash(void)
+{
+ struct mm_struct *mm = current->mm;
+ struct futex_private_hash *fph;
+
+ fph = mm->futex_phash;
+ return fph;
+}
+
+struct futex_hash_bucket *futex_hash(union futex_key *key)
+{
+ struct futex_hash_bucket *hb;
+
+ hb = __futex_hash(key, NULL);
+ return hb;
+}
+
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+
+static struct futex_hash_bucket *
+__futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
+{
+ return NULL;
+}
+
+struct futex_hash_bucket *futex_hash(union futex_key *key)
+{
+ return __futex_hash(key, NULL);
+}
+
+#endif /* CONFIG_FUTEX_PRIVATE_HASH */
+
+/**
+ * __futex_hash - Return the hash bucket
+ * @key: Pointer to the futex key for which the hash is calculated
+ * @fph: Pointer to private hash if known
+ *
+ * We hash on the keys returned from get_futex_key (see below) and return the
+ * corresponding hash bucket.
+ * If the FUTEX is PROCESS_PRIVATE then a per-process hash bucket (from the
+ * private hash) is returned if existing. Otherwise a hash bucket from the
+ * global hash is returned.
+ */
+static struct futex_hash_bucket *
+__futex_hash(union futex_key *key, struct futex_private_hash *fph)
+{
+ struct futex_hash_bucket *hb;
+ u32 hash;
+
+ hb = __futex_hash_private(key, fph);
+ if (hb)
+ return hb;
+
+ hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / 4,
+ key->both.offset);
+ return &futex_queues[hash & futex_hashmask];
+}
+
/**
* futex_setup_timer - set up the sleeping hrtimer.
* @time: ptr to the given timeout value
@@ -986,6 +1066,13 @@ static void exit_pi_state_list(struct task_struct *curr)
union futex_key key = FUTEX_KEY_INIT;
/*
+ * Ensure the hash remains stable (no resize) during the while loop
+ * below. The hb pointer is acquired under the pi_lock so we can't block
+ * on the mutex.
+ */
+ WARN_ON(curr != current);
+ guard(private_hash)();
+ /*
* We are a ZOMBIE and nobody can enqueue itself on
* pi_state_list anymore, but we have to be careful
* versus waiters unqueueing themselves:
@@ -1160,13 +1247,98 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
-static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
+ struct futex_private_hash *fph)
{
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+ fhb->priv = fph;
+#endif
atomic_set(&fhb->waiters, 0);
plist_head_init(&fhb->chain);
spin_lock_init(&fhb->lock);
}
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
+void futex_hash_free(struct mm_struct *mm)
+{
+ kvfree(mm->futex_phash);
+}
+
+static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+{
+ struct mm_struct *mm = current->mm;
+ struct futex_private_hash *fph;
+ int i;
+
+ if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
+ return -EINVAL;
+
+ if (mm->futex_phash)
+ return -EALREADY;
+
+ if (!thread_group_empty(current))
+ return -EINVAL;
+
+ fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+ if (!fph)
+ return -ENOMEM;
+
+ fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
+ fph->custom = custom;
+ fph->mm = mm;
+
+ for (i = 0; i < hash_slots; i++)
+ futex_hash_bucket_init(&fph->queues[i], fph);
+
+ mm->futex_phash = fph;
+ return 0;
+}
+
+static int futex_hash_get_slots(void)
+{
+ struct futex_private_hash *fph;
+
+ fph = current->mm->futex_phash;
+ if (fph && fph->hash_mask)
+ return fph->hash_mask + 1;
+ return 0;
+}
+
+#else
+
+static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+{
+ return -EINVAL;
+}
+
+static int futex_hash_get_slots(void)
+{
+ return 0;
+}
+#endif
+
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
+{
+ int ret;
+
+ switch (arg2) {
+ case PR_FUTEX_HASH_SET_SLOTS:
+ if (arg4 != 0)
+ return -EINVAL;
+ ret = futex_hash_allocate(arg3, true);
+ break;
+
+ case PR_FUTEX_HASH_GET_SLOTS:
+ ret = futex_hash_get_slots();
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
static int __init futex_init(void)
{
unsigned long hashsize, i;
@@ -1185,7 +1357,7 @@ static int __init futex_init(void)
hashsize = 1UL << futex_shift;
for (i = 0; i < hashsize; i++)
- futex_hash_bucket_init(&futex_queues[i]);
+ futex_hash_bucket_init(&futex_queues[i], NULL);
futex_hashmask = hashsize - 1;
return 0;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 26e6933..899aed5 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -118,6 +118,7 @@ struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
+ struct futex_private_hash *priv;
} ____cacheline_aligned_in_smp;
/*
@@ -204,6 +205,7 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+#ifdef CONFIG_FUTEX_PRIVATE_HASH
extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
@@ -211,6 +213,14 @@ extern struct futex_private_hash *futex_private_hash(void);
extern bool futex_private_hash_get(struct futex_private_hash *fph);
extern void futex_private_hash_put(struct futex_private_hash *fph);
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
+static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
+static inline struct futex_private_hash *futex_private_hash(void) { return NULL; }
+static inline bool futex_private_hash_get(void) { return false; }
+static inline void futex_private_hash_put(struct futex_private_hash *fph) { }
+#endif
+
DEFINE_CLASS(hb, struct futex_hash_bucket *,
if (_T) futex_hash_put(_T),
futex_hash(key), union futex_key *key);
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968..adc0de0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
#include <linux/user_namespace.h>
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
+#include <linux/futex.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2820,6 +2821,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = posixtimer_create_prctl(arg2);
break;
+ case PR_FUTEX_HASH:
+ error = futex_hash_prctl(arg2, arg3, arg4);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
* [tip: locking/futex] futex: Create helper function to initialize a hash slot
2025-04-16 16:29 ` [PATCH v12 11/21] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 9a9bdfdd687395a3dc949d3ae3323494395a93d4
Gitweb: https://git.kernel.org/tip/9a9bdfdd687395a3dc949d3ae3323494395a93d4
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:11 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:07 +02:00
futex: Create helper function to initialize a hash slot
Factor out the futex_hash_bucket initialisation into a helper function.
The helper will be used in a follow-up patch implementing
process-private hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-12-bigeasy@linutronix.de
---
kernel/futex/core.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 1443a98..afc6678 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1160,6 +1160,13 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+{
+ atomic_set(&fhb->waiters, 0);
+ plist_head_init(&fhb->chain);
+ spin_lock_init(&fhb->lock);
+}
+
static int __init futex_init(void)
{
unsigned long hashsize, i;
@@ -1177,11 +1184,8 @@ static int __init futex_init(void)
hashsize, hashsize);
hashsize = 1UL << futex_shift;
- for (i = 0; i < hashsize; i++) {
- atomic_set(&futex_queues[i].waiters, 0);
- plist_head_init(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
+ for (i = 0; i < hashsize; i++)
+ futex_hash_bucket_init(&futex_queues[i]);
futex_hashmask = hashsize - 1;
return 0;
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Introduce futex_q_lockptr_lock()
2025-04-16 16:29 ` [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
2025-05-08 19:06 ` [PATCH v12 10/21] " André Almeida
1 sibling, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: b04b8f3032aae6121303bfa324c768faba032242
Gitweb: https://git.kernel.org/tip/b04b8f3032aae6121303bfa324c768faba032242
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:10 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:07 +02:00
futex: Introduce futex_q_lockptr_lock()
futex_lock_pi() and __fixup_pi_state_owner() acquire the
futex_q::lock_ptr without holding a reference assuming the previously
obtained hash bucket and the assigned lock_ptr are still valid. This
isn't the case once the private hash can be resized and becomes invalid
after the reference drop.
Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in
futex_q::lock_ptr. The lock pointer is read in an RCU section to ensure
that it does not go away if the hash bucket has been replaced and the
old pointer has been observed. After locking, the pointer needs to be
compared to check whether it changed. If so, the hash bucket has been
replaced, the user has been moved to the new one, and lock_ptr has
been updated. The lock operation needs to be redone in this case.
The locked hash bucket is not returned.
A special case is an early return in futex_lock_pi() (due to signal or
timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
futex_q::lock_ptr is expected (and its matching hash bucket) but since
the waiter has been removed from the hash this can no longer be
guaranteed. Therefore, before the waiter is removed, a reference is
acquired which is later dropped by the waiter to avoid a resize.
Add futex_q_lockptr_lock() and use it.
Acquire an additional reference in requeue_pi_wake_futex() and
futex_unlock_pi() while the futex_q is removed, denote this extra
reference in futex_q::drop_hb_ref and let the waiter drop the reference
in this case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-11-bigeasy@linutronix.de
---
kernel/futex/core.c | 25 +++++++++++++++++++++++++
kernel/futex/futex.h | 3 ++-
kernel/futex/pi.c | 15 +++++++++++++--
kernel/futex/requeue.c | 16 +++++++++++++---
4 files changed, 53 insertions(+), 6 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 5e70cb8..1443a98 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -134,6 +134,13 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
return &futex_queues[hash & futex_hashmask];
}
+/**
+ * futex_hash_get - Get an additional reference for the local hash.
+ * @hb: ptr to the private local hash.
+ *
+ * Obtain an additional reference for the already obtained hash bucket. The
+ * caller must already own a reference.
+ */
void futex_hash_get(struct futex_hash_bucket *hb) { }
void futex_hash_put(struct futex_hash_bucket *hb) { }
@@ -615,6 +622,24 @@ retry:
return ret;
}
+void futex_q_lockptr_lock(struct futex_q *q)
+{
+ spinlock_t *lock_ptr;
+
+ /*
+ * See futex_unqueue() why lock_ptr can change.
+ */
+ guard(rcu)();
+retry:
+ lock_ptr = READ_ONCE(q->lock_ptr);
+ spin_lock(lock_ptr);
+
+ if (unlikely(lock_ptr != q->lock_ptr)) {
+ spin_unlock(lock_ptr);
+ goto retry;
+ }
+}
+
/*
* PI futexes can not be requeued and must remove themselves from the hash
* bucket. The hash bucket lock (i.e. lock_ptr) is held.
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index bc76e36..26e6933 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -183,6 +183,7 @@ struct futex_q {
union futex_key *requeue_pi_key;
u32 bitset;
atomic_t requeue_state;
+ bool drop_hb_ref;
#ifdef CONFIG_PREEMPT_RT
struct rcuwait requeue_wait;
#endif
@@ -197,7 +198,7 @@ enum futex_access {
extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
enum futex_access rw);
-
+extern void futex_q_lockptr_lock(struct futex_q *q);
extern struct hrtimer_sleeper *
futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index e52f540..dacb233 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -806,7 +806,7 @@ handle_err:
break;
}
- spin_lock(q->lock_ptr);
+ futex_q_lockptr_lock(q);
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
/*
@@ -1072,7 +1072,7 @@ cleanup:
* spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
* the
*/
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
/*
* Waiter is unqueued.
*/
@@ -1092,6 +1092,11 @@ no_block:
futex_unqueue_pi(&q);
spin_unlock(q.lock_ptr);
+ if (q.drop_hb_ref) {
+ CLASS(hb, hb)(&q.key);
+ /* Additional reference from futex_unlock_pi() */
+ futex_hash_put(hb);
+ }
goto out;
out_unlock_put_key:
@@ -1200,6 +1205,12 @@ retry_hb:
*/
rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
if (!rt_waiter) {
+ /*
+ * Acquire a reference for the leaving waiter to ensure
+ * valid futex_q::lock_ptr.
+ */
+ futex_hash_get(hb);
+ top_waiter->drop_hb_ref = true;
__futex_unqueue(top_waiter);
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
goto retry_hb;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 023c028..b0e64fd 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -231,7 +231,12 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
WARN_ON(!q->rt_waiter);
q->rt_waiter = NULL;
-
+ /*
+ * Acquire a reference for the waiter to ensure valid
+ * futex_q::lock_ptr.
+ */
+ futex_hash_get(hb);
+ q->drop_hb_ref = true;
q->lock_ptr = &hb->lock;
/* Signal locked state to the waiter */
@@ -826,7 +831,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
case Q_REQUEUE_PI_LOCKED:
/* The requeue acquired the lock */
if (q.pi_state && (q.pi_state->owner != current)) {
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
ret = fixup_pi_owner(uaddr2, &q, true);
/*
* Drop the reference to the pi state which the
@@ -853,7 +858,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
ret = 0;
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
@@ -885,6 +890,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
default:
BUG();
}
+ if (q.drop_hb_ref) {
+ CLASS(hb, hb)(&q.key);
+ /* Additional reference from requeue_pi_wake_futex() */
+ futex_hash_put(hb);
+ }
out:
if (to) {
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Decrease the waiter count before the unlock operation
2025-04-16 16:29 ` [PATCH v12 09/21] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: fe00e88d217a7bf7a4d0268d08f51e624d40ee53
Gitweb: https://git.kernel.org/tip/fe00e88d217a7bf7a4d0268d08f51e624d40ee53
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:09 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:06 +02:00
futex: Decrease the waiter count before the unlock operation
To support runtime resizing of the process private hash, it's required
to not use the obtained hash bucket once the reference count has been
dropped. The reference will be dropped after the unlock of the hash
bucket.
The number of waiters is decremented after the unlock operation. There
is no requirement that this needs to happen after the unlock. The
increment happens before acquiring the lock to signal early that there
will be a waiter. The waker can avoid blocking on the lock if it is
known that there are no waiters.
There is no difference in terms of ordering if the decrement happens
before or after the unlock.
Decrease the waiter count before the unlock operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-10-bigeasy@linutronix.de
---
kernel/futex/core.c | 2 +-
kernel/futex/requeue.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 6a1d6b1..5e70cb8 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -537,8 +537,8 @@ void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb)
void futex_q_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
- spin_unlock(&hb->lock);
futex_hb_waiters_dec(hb);
+ spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb,
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 992e3ce..023c028 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -456,8 +456,8 @@ retry_private:
ret = futex_get_value_locked(&curval, uaddr1);
if (unlikely(ret)) {
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
ret = get_user(curval, uaddr1);
if (ret)
@@ -542,8 +542,8 @@ retry_private:
* waiter::requeue_state is correct.
*/
case -EFAULT:
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -556,8 +556,8 @@ retry_private:
* exit to complete.
* - EAGAIN: The user space value changed.
*/
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -674,8 +674,8 @@ retry_private:
put_pi_state(pi_state);
out_unlock:
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
}
wake_up_q(&wake_q);
return ret ? ret : task_count;
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Acquire a hash reference in futex_wait_multiple_setup()
2025-04-16 16:29 ` [PATCH v12 08/21] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 3f6b233018af2a6fb449faa324d94a437e2e47ce
Gitweb: https://git.kernel.org/tip/3f6b233018af2a6fb449faa324d94a437e2e47ce
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:08 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:06 +02:00
futex: Acquire a hash reference in futex_wait_multiple_setup()
futex_wait_multiple_setup() changes task_struct::__state to
!TASK_RUNNING and then enqueues on multiple futexes. Every
futex_q_lock() acquires a reference on the global hash which is
dropped later.
If a rehash is in progress then the loop will block on
mm_struct::futex_hash_bucket for the rehash to complete and this will
lose the previously set task_struct::__state.
Acquire a reference on the local hash to avoid blocking on
mm_struct::futex_hash_bucket.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-9-bigeasy@linutronix.de
---
kernel/futex/waitwake.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index d52541b..bd8fef0 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -407,6 +407,12 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
u32 uval;
/*
+ * Make sure to have a reference on the private_hash such that we
+ * don't block on rehash after changing the task state below.
+ */
+ guard(private_hash)();
+
+ /*
* Enqueuing multiple futexes is tricky, because we need to enqueue
* each futex on the list before dealing with the next one to avoid
* deadlocking on the hash bucket. But, before enqueuing, we need to
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Create private_hash() get/put class
2025-04-16 16:29 ` [PATCH v12 07/21] futex: Create private_hash() " Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior, x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: d854e4e7850e6d3ed24f863a877abc2279d60506
Gitweb: https://git.kernel.org/tip/d854e4e7850e6d3ed24f863a877abc2279d60506
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 18:29:07 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:06 +02:00
futex: Create private_hash() get/put class
This gets us:
fph = futex_private_hash(key) /* gets fph and inc users */
futex_private_hash_get(fph) /* inc users */
futex_private_hash_put(fph) /* dec users */
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-8-bigeasy@linutronix.de
---
kernel/futex/core.c | 12 ++++++++++++
kernel/futex/futex.h | 8 ++++++++
2 files changed, 20 insertions(+)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 56a5653..6a1d6b1 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -107,6 +107,18 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
+struct futex_private_hash *futex_private_hash(void)
+{
+ return NULL;
+}
+
+bool futex_private_hash_get(struct futex_private_hash *fph)
+{
+ return false;
+}
+
+void futex_private_hash_put(struct futex_private_hash *fph) { }
+
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 77d9b35..bc76e36 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -206,10 +206,18 @@ extern struct futex_hash_bucket *futex_hash(union futex_key *key);
extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
+extern struct futex_private_hash *futex_private_hash(void);
+extern bool futex_private_hash_get(struct futex_private_hash *fph);
+extern void futex_private_hash_put(struct futex_private_hash *fph);
+
DEFINE_CLASS(hb, struct futex_hash_bucket *,
if (_T) futex_hash_put(_T),
futex_hash(key), union futex_key *key);
+DEFINE_CLASS(private_hash, struct futex_private_hash *,
+ if (_T) futex_private_hash_put(_T),
+ futex_private_hash(), void);
+
/**
* futex_match - Check whether two futex keys are equal
* @key1: Pointer to key1
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Create futex_hash() get/put class
2025-04-16 16:29 ` [PATCH v12 06/21] futex: Create futex_hash() get/put class Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior, x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 6c67f8d880c0950215b8e6f8539562ad1971a05a
Gitweb: https://git.kernel.org/tip/6c67f8d880c0950215b8e6f8539562ad1971a05a
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 18:29:06 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:06 +02:00
futex: Create futex_hash() get/put class
This gets us:
hb = futex_hash(key) /* gets hb and inc users */
futex_hash_get(hb) /* inc users */
futex_hash_put(hb) /* dec users */
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-7-bigeasy@linutronix.de
---
kernel/futex/core.c | 6 +++---
kernel/futex/futex.h | 7 +++++++
kernel/futex/pi.c | 16 ++++++++++++----
kernel/futex/requeue.c | 10 +++-------
kernel/futex/waitwake.c | 15 +++++----------
5 files changed, 30 insertions(+), 24 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index e4cb5ce..56a5653 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -122,6 +122,8 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
return &futex_queues[hash & futex_hashmask];
}
+void futex_hash_get(struct futex_hash_bucket *hb) { }
+void futex_hash_put(struct futex_hash_bucket *hb) { }
/**
* futex_setup_timer - set up the sleeping hrtimer.
@@ -957,9 +959,7 @@ static void exit_pi_state_list(struct task_struct *curr)
pi_state = list_entry(next, struct futex_pi_state, list);
key = pi_state->key;
if (1) {
- struct futex_hash_bucket *hb;
-
- hb = futex_hash(&key);
+ CLASS(hb, hb)(&key);
/*
* We can race against put_pi_state() removing itself from the
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index a219903..77d9b35 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -7,6 +7,7 @@
#include <linux/sched/wake_q.h>
#include <linux/compat.h>
#include <linux/uaccess.h>
+#include <linux/cleanup.h>
#ifdef CONFIG_PREEMPT_RT
#include <linux/rcuwait.h>
@@ -202,6 +203,12 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+extern void futex_hash_get(struct futex_hash_bucket *hb);
+extern void futex_hash_put(struct futex_hash_bucket *hb);
+
+DEFINE_CLASS(hb, struct futex_hash_bucket *,
+ if (_T) futex_hash_put(_T),
+ futex_hash(key), union futex_key *key);
/**
* futex_match - Check whether two futex keys are equal
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index a56f28f..e52f540 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -939,9 +939,8 @@ retry:
retry_private:
if (1) {
- struct futex_hash_bucket *hb;
+ CLASS(hb, hb)(&q.key);
- hb = futex_hash(&q.key);
futex_q_lock(&q, hb);
ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
@@ -995,6 +994,16 @@ retry_private:
}
/*
+ * Caution; releasing @hb in-scope. The hb->lock is still locked
+ * while the reference is dropped. The reference can not be dropped
+ * after the unlock because if a user initiated resize is in progress
+ * then we might need to wake him. This can not be done after the
+ * rt_mutex_pre_schedule() invocation. The hb will remain valid because
+ * the thread, performing resize, will block on hb->lock during
+ * the requeue.
+ */
+ futex_hash_put(no_free_ptr(hb));
+ /*
* Must be done before we enqueue the waiter, here is unfortunately
* under the hb lock, but that *should* work because it does nothing.
*/
@@ -1119,7 +1128,6 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
u32 curval, uval, vpid = task_pid_vnr(current);
union futex_key key = FUTEX_KEY_INIT;
- struct futex_hash_bucket *hb;
struct futex_q *top_waiter;
int ret;
@@ -1139,7 +1147,7 @@ retry:
if (ret)
return ret;
- hb = futex_hash(&key);
+ CLASS(hb, hb)(&key);
spin_lock(&hb->lock);
retry_hb:
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 209794c..992e3ce 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -444,10 +444,8 @@ retry:
retry_private:
if (1) {
- struct futex_hash_bucket *hb1, *hb2;
-
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
+ CLASS(hb, hb1)(&key1);
+ CLASS(hb, hb2)(&key2);
futex_hb_waiters_inc(hb2);
double_lock_hb(hb1, hb2);
@@ -817,9 +815,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
switch (futex_requeue_pi_wakeup_sync(&q)) {
case Q_REQUEUE_PI_IGNORE:
{
- struct futex_hash_bucket *hb;
-
- hb = futex_hash(&q.key);
+ CLASS(hb, hb)(&q.key);
/* The waiter is still on uaddr1 */
spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, to);
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 7dc35be..d52541b 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -154,7 +154,6 @@ void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q)
*/
int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
- struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
DEFINE_WAKE_Q(wake_q);
@@ -170,7 +169,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if ((flags & FLAGS_STRICT) && !nr_wake)
return 0;
- hb = futex_hash(&key);
+ CLASS(hb, hb)(&key);
/* Make sure we really have tasks to wakeup */
if (!futex_hb_waiters_pending(hb))
@@ -267,10 +266,8 @@ retry:
retry_private:
if (1) {
- struct futex_hash_bucket *hb1, *hb2;
-
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
+ CLASS(hb, hb1)(&key1);
+ CLASS(hb, hb2)(&key2);
double_lock_hb(hb1, hb2);
op_ret = futex_atomic_op_inuser(op, uaddr2);
@@ -444,9 +441,8 @@ retry:
u32 val = vs[i].w.val;
if (1) {
- struct futex_hash_bucket *hb;
+ CLASS(hb, hb)(&q->key);
- hb = futex_hash(&q->key);
futex_q_lock(q, hb);
ret = futex_get_value_locked(&uval, uaddr);
@@ -618,9 +614,8 @@ retry:
retry_private:
if (1) {
- struct futex_hash_bucket *hb;
+ CLASS(hb, hb)(&q->key);
- hb = futex_hash(&q->key);
futex_q_lock(q, hb);
ret = futex_get_value_locked(&uval, uaddr);
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Create hb scopes
2025-04-16 16:29 ` [PATCH v12 05/21] futex: Create hb scopes Sebastian Andrzej Siewior
2025-05-06 23:45 ` André Almeida
@ 2025-05-08 10:33 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 109+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior, x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 8486d12f558ff9e4e90331e8ef841d84bf3a8c24
Gitweb: https://git.kernel.org/tip/8486d12f558ff9e4e90331e8ef841d84bf3a8c24
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 18:29:05 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:05 +02:00
futex: Create hb scopes
Create explicit scopes for hb variables; almost pure re-indent.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-6-bigeasy@linutronix.de
---
kernel/futex/core.c | 83 ++++----
kernel/futex/pi.c | 282 +++++++++++++--------------
kernel/futex/requeue.c | 413 +++++++++++++++++++--------------------
kernel/futex/waitwake.c | 189 +++++++++---------
4 files changed, 493 insertions(+), 474 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 7adc914..e4cb5ce 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -944,7 +944,6 @@ static void exit_pi_state_list(struct task_struct *curr)
{
struct list_head *next, *head = &curr->pi_state_list;
struct futex_pi_state *pi_state;
- struct futex_hash_bucket *hb;
union futex_key key = FUTEX_KEY_INIT;
/*
@@ -957,50 +956,54 @@ static void exit_pi_state_list(struct task_struct *curr)
next = head->next;
pi_state = list_entry(next, struct futex_pi_state, list);
key = pi_state->key;
- hb = futex_hash(&key);
-
- /*
- * We can race against put_pi_state() removing itself from the
- * list (a waiter going away). put_pi_state() will first
- * decrement the reference count and then modify the list, so
- * its possible to see the list entry but fail this reference
- * acquire.
- *
- * In that case; drop the locks to let put_pi_state() make
- * progress and retry the loop.
- */
- if (!refcount_inc_not_zero(&pi_state->refcount)) {
+ if (1) {
+ struct futex_hash_bucket *hb;
+
+ hb = futex_hash(&key);
+
+ /*
+ * We can race against put_pi_state() removing itself from the
+ * list (a waiter going away). put_pi_state() will first
+ * decrement the reference count and then modify the list, so
+ * its possible to see the list entry but fail this reference
+ * acquire.
+ *
+ * In that case; drop the locks to let put_pi_state() make
+ * progress and retry the loop.
+ */
+ if (!refcount_inc_not_zero(&pi_state->refcount)) {
+ raw_spin_unlock_irq(&curr->pi_lock);
+ cpu_relax();
+ raw_spin_lock_irq(&curr->pi_lock);
+ continue;
+ }
raw_spin_unlock_irq(&curr->pi_lock);
- cpu_relax();
- raw_spin_lock_irq(&curr->pi_lock);
- continue;
- }
- raw_spin_unlock_irq(&curr->pi_lock);
- spin_lock(&hb->lock);
- raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
- raw_spin_lock(&curr->pi_lock);
- /*
- * We dropped the pi-lock, so re-check whether this
- * task still owns the PI-state:
- */
- if (head->next != next) {
- /* retain curr->pi_lock for the loop invariant */
- raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+ spin_lock(&hb->lock);
+ raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+ raw_spin_lock(&curr->pi_lock);
+ /*
+ * We dropped the pi-lock, so re-check whether this
+ * task still owns the PI-state:
+ */
+ if (head->next != next) {
+ /* retain curr->pi_lock for the loop invariant */
+ raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+ spin_unlock(&hb->lock);
+ put_pi_state(pi_state);
+ continue;
+ }
+
+ WARN_ON(pi_state->owner != curr);
+ WARN_ON(list_empty(&pi_state->list));
+ list_del_init(&pi_state->list);
+ pi_state->owner = NULL;
+
+ raw_spin_unlock(&curr->pi_lock);
+ raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
- put_pi_state(pi_state);
- continue;
}
- WARN_ON(pi_state->owner != curr);
- WARN_ON(list_empty(&pi_state->list));
- list_del_init(&pi_state->list);
- pi_state->owner = NULL;
-
- raw_spin_unlock(&curr->pi_lock);
- raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
- spin_unlock(&hb->lock);
-
rt_mutex_futex_unlock(&pi_state->pi_mutex);
put_pi_state(pi_state);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 3bf942e..a56f28f 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -920,7 +920,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
struct hrtimer_sleeper timeout, *to;
struct task_struct *exiting = NULL;
struct rt_mutex_waiter rt_waiter;
- struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
DEFINE_WAKE_Q(wake_q);
int res, ret;
@@ -939,152 +938,169 @@ retry:
goto out;
retry_private:
- hb = futex_hash(&q.key);
- futex_q_lock(&q, hb);
+ if (1) {
+ struct futex_hash_bucket *hb;
- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
- &exiting, 0);
- if (unlikely(ret)) {
- /*
- * Atomic work succeeded and we got the lock,
- * or failed. Either way, we do _not_ block.
- */
- switch (ret) {
- case 1:
- /* We got the lock. */
- ret = 0;
- goto out_unlock_put_key;
- case -EFAULT:
- goto uaddr_faulted;
- case -EBUSY:
- case -EAGAIN:
- /*
- * Two reasons for this:
- * - EBUSY: Task is exiting and we just wait for the
- * exit to complete.
- * - EAGAIN: The user space value changed.
- */
- futex_q_unlock(hb);
+ hb = futex_hash(&q.key);
+ futex_q_lock(&q, hb);
+
+ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
+ &exiting, 0);
+ if (unlikely(ret)) {
/*
- * Handle the case where the owner is in the middle of
- * exiting. Wait for the exit to complete otherwise
- * this task might loop forever, aka. live lock.
+ * Atomic work succeeded and we got the lock,
+ * or failed. Either way, we do _not_ block.
*/
- wait_for_owner_exiting(ret, exiting);
- cond_resched();
- goto retry;
- default:
- goto out_unlock_put_key;
+ switch (ret) {
+ case 1:
+ /* We got the lock. */
+ ret = 0;
+ goto out_unlock_put_key;
+ case -EFAULT:
+ goto uaddr_faulted;
+ case -EBUSY:
+ case -EAGAIN:
+ /*
+ * Two reasons for this:
+ * - EBUSY: Task is exiting and we just wait for the
+ * exit to complete.
+ * - EAGAIN: The user space value changed.
+ */
+ futex_q_unlock(hb);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock_put_key;
+ }
}
- }
- WARN_ON(!q.pi_state);
+ WARN_ON(!q.pi_state);
- /*
- * Only actually queue now that the atomic ops are done:
- */
- __futex_queue(&q, hb, current);
+ /*
+ * Only actually queue now that the atomic ops are done:
+ */
+ __futex_queue(&q, hb, current);
- if (trylock) {
- ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
- /* Fixup the trylock return value: */
- ret = ret ? 0 : -EWOULDBLOCK;
- goto no_block;
- }
+ if (trylock) {
+ ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
+ /* Fixup the trylock return value: */
+ ret = ret ? 0 : -EWOULDBLOCK;
+ goto no_block;
+ }
- /*
- * Must be done before we enqueue the waiter, here is unfortunately
- * under the hb lock, but that *should* work because it does nothing.
- */
- rt_mutex_pre_schedule();
+ /*
+ * Must be done before we enqueue the waiter, here is unfortunately
+ * under the hb lock, but that *should* work because it does nothing.
+ */
+ rt_mutex_pre_schedule();
- rt_mutex_init_waiter(&rt_waiter);
+ rt_mutex_init_waiter(&rt_waiter);
- /*
- * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
- * hold it while doing rt_mutex_start_proxy(), because then it will
- * include hb->lock in the blocking chain, even through we'll not in
- * fact hold it while blocking. This will lead it to report -EDEADLK
- * and BUG when futex_unlock_pi() interleaves with this.
- *
- * Therefore acquire wait_lock while holding hb->lock, but drop the
- * latter before calling __rt_mutex_start_proxy_lock(). This
- * interleaves with futex_unlock_pi() -- which does a similar lock
- * handoff -- such that the latter can observe the futex_q::pi_state
- * before __rt_mutex_start_proxy_lock() is done.
- */
- raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
- spin_unlock(q.lock_ptr);
- /*
- * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
- * such that futex_unlock_pi() is guaranteed to observe the waiter when
- * it sees the futex_q::pi_state.
- */
- ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
- raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q);
+ /*
+ * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
+ * hold it while doing rt_mutex_start_proxy(), because then it will
+ * include hb->lock in the blocking chain, even though we'll not in
+ * fact hold it while blocking. This will lead it to report -EDEADLK
+ * and BUG when futex_unlock_pi() interleaves with this.
+ *
+ * Therefore acquire wait_lock while holding hb->lock, but drop the
+ * latter before calling __rt_mutex_start_proxy_lock(). This
+ * interleaves with futex_unlock_pi() -- which does a similar lock
+ * handoff -- such that the latter can observe the futex_q::pi_state
+ * before __rt_mutex_start_proxy_lock() is done.
+ */
+ raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
+ spin_unlock(q.lock_ptr);
+ /*
+ * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+ * such that futex_unlock_pi() is guaranteed to observe the waiter when
+ * it sees the futex_q::pi_state.
+ */
+ ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
+ raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q);
- if (ret) {
- if (ret == 1)
- ret = 0;
- goto cleanup;
- }
+ if (ret) {
+ if (ret == 1)
+ ret = 0;
+ goto cleanup;
+ }
- if (unlikely(to))
- hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
+ if (unlikely(to))
+ hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
- ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+ ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
cleanup:
- /*
- * If we failed to acquire the lock (deadlock/signal/timeout), we must
- * must unwind the above, however we canont lock hb->lock because
- * rt_mutex already has a waiter enqueued and hb->lock can itself try
- * and enqueue an rt_waiter through rtlock.
- *
- * Doing the cleanup without holding hb->lock can cause inconsistent
- * state between hb and pi_state, but only in the direction of not
- * seeing a waiter that is leaving.
- *
- * See futex_unlock_pi(), it deals with this inconsistency.
- *
- * There be dragons here, since we must deal with the inconsistency on
- * the way out (here), it is impossible to detect/warn about the race
- * the other way around (missing an incoming waiter).
- *
- * What could possibly go wrong...
- */
- if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
- ret = 0;
+ /*
+ * If we failed to acquire the lock (deadlock/signal/timeout), we must
+ * unwind the above, however we cannot lock hb->lock because
+ * rt_mutex already has a waiter enqueued and hb->lock can itself try
+ * and enqueue an rt_waiter through rtlock.
+ *
+ * Doing the cleanup without holding hb->lock can cause inconsistent
+ * state between hb and pi_state, but only in the direction of not
+ * seeing a waiter that is leaving.
+ *
+ * See futex_unlock_pi(), it deals with this inconsistency.
+ *
+ * There be dragons here, since we must deal with the inconsistency on
+ * the way out (here), it is impossible to detect/warn about the race
+ * the other way around (missing an incoming waiter).
+ *
+ * What could possibly go wrong...
+ */
+ if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
+ ret = 0;
- /*
- * Now that the rt_waiter has been dequeued, it is safe to use
- * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
- * the
- */
- spin_lock(q.lock_ptr);
- /*
- * Waiter is unqueued.
- */
- rt_mutex_post_schedule();
+ /*
+ * Now that the rt_waiter has been dequeued, it is safe to use
+ * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
+ * the
+ */
+ spin_lock(q.lock_ptr);
+ /*
+ * Waiter is unqueued.
+ */
+ rt_mutex_post_schedule();
no_block:
- /*
- * Fixup the pi_state owner and possibly acquire the lock if we
- * haven't already.
- */
- res = fixup_pi_owner(uaddr, &q, !ret);
- /*
- * If fixup_pi_owner() returned an error, propagate that. If it acquired
- * the lock, clear our -ETIMEDOUT or -EINTR.
- */
- if (res)
- ret = (res < 0) ? res : 0;
+ /*
+ * Fixup the pi_state owner and possibly acquire the lock if we
+ * haven't already.
+ */
+ res = fixup_pi_owner(uaddr, &q, !ret);
+ /*
+ * If fixup_pi_owner() returned an error, propagate that. If it acquired
+ * the lock, clear our -ETIMEDOUT or -EINTR.
+ */
+ if (res)
+ ret = (res < 0) ? res : 0;
- futex_unqueue_pi(&q);
- spin_unlock(q.lock_ptr);
- goto out;
+ futex_unqueue_pi(&q);
+ spin_unlock(q.lock_ptr);
+ goto out;
out_unlock_put_key:
- futex_q_unlock(hb);
+ futex_q_unlock(hb);
+ goto out;
+
+uaddr_faulted:
+ futex_q_unlock(hb);
+
+ ret = fault_in_user_writeable(uaddr);
+ if (ret)
+ goto out;
+
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
+
+ goto retry;
+ }
out:
if (to) {
@@ -1092,18 +1108,6 @@ out:
destroy_hrtimer_on_stack(&to->timer);
}
return ret != -EINTR ? ret : -ERESTARTNOINTR;
-
-uaddr_faulted:
- futex_q_unlock(hb);
-
- ret = fault_in_user_writeable(uaddr);
- if (ret)
- goto out;
-
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
-
- goto retry;
}
/*
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 0e55975..209794c 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -371,7 +371,6 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
int task_count = 0, ret;
struct futex_pi_state *pi_state = NULL;
- struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
DEFINE_WAKE_Q(wake_q);
@@ -443,240 +442,244 @@ retry:
if (requeue_pi && futex_match(&key1, &key2))
return -EINVAL;
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
-
retry_private:
- futex_hb_waiters_inc(hb2);
- double_lock_hb(hb1, hb2);
+ if (1) {
+ struct futex_hash_bucket *hb1, *hb2;
- if (likely(cmpval != NULL)) {
- u32 curval;
+ hb1 = futex_hash(&key1);
+ hb2 = futex_hash(&key2);
- ret = futex_get_value_locked(&curval, uaddr1);
+ futex_hb_waiters_inc(hb2);
+ double_lock_hb(hb1, hb2);
- if (unlikely(ret)) {
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
+ if (likely(cmpval != NULL)) {
+ u32 curval;
- ret = get_user(curval, uaddr1);
- if (ret)
- return ret;
+ ret = futex_get_value_locked(&curval, uaddr1);
- if (!(flags1 & FLAGS_SHARED))
- goto retry_private;
+ if (unlikely(ret)) {
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
- goto retry;
- }
- if (curval != *cmpval) {
- ret = -EAGAIN;
- goto out_unlock;
- }
- }
+ ret = get_user(curval, uaddr1);
+ if (ret)
+ return ret;
- if (requeue_pi) {
- struct task_struct *exiting = NULL;
+ if (!(flags1 & FLAGS_SHARED))
+ goto retry_private;
- /*
- * Attempt to acquire uaddr2 and wake the top waiter. If we
- * intend to requeue waiters, force setting the FUTEX_WAITERS
- * bit. We force this here where we are able to easily handle
- * faults rather in the requeue loop below.
- *
- * Updates topwaiter::requeue_state if a top waiter exists.
- */
- ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
- &key2, &pi_state,
- &exiting, nr_requeue);
+ goto retry;
+ }
+ if (curval != *cmpval) {
+ ret = -EAGAIN;
+ goto out_unlock;
+ }
+ }
- /*
- * At this point the top_waiter has either taken uaddr2 or
- * is waiting on it. In both cases pi_state has been
- * established and an initial refcount on it. In case of an
- * error there's nothing.
- *
- * The top waiter's requeue_state is up to date:
- *
- * - If the lock was acquired atomically (ret == 1), then
- * the state is Q_REQUEUE_PI_LOCKED.
- *
- * The top waiter has been dequeued and woken up and can
- * return to user space immediately. The kernel/user
- * space state is consistent. In case that there must be
- * more waiters requeued the WAITERS bit in the user
- * space futex is set so the top waiter task has to go
- * into the syscall slowpath to unlock the futex. This
- * will block until this requeue operation has been
- * completed and the hash bucket locks have been
- * dropped.
- *
- * - If the trylock failed with an error (ret < 0) then
- * the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
- * happened", or Q_REQUEUE_PI_IGNORE when there was an
- * interleaved early wakeup.
- *
- * - If the trylock did not succeed (ret == 0) then the
- * state is either Q_REQUEUE_PI_IN_PROGRESS or
- * Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
- * This will be cleaned up in the loop below, which
- * cannot fail because futex_proxy_trylock_atomic() did
- * the same sanity checks for requeue_pi as the loop
- * below does.
- */
- switch (ret) {
- case 0:
- /* We hold a reference on the pi state. */
- break;
+ if (requeue_pi) {
+ struct task_struct *exiting = NULL;
- case 1:
/*
- * futex_proxy_trylock_atomic() acquired the user space
- * futex. Adjust task_count.
+ * Attempt to acquire uaddr2 and wake the top waiter. If we
+ * intend to requeue waiters, force setting the FUTEX_WAITERS
+ * bit. We force this here where we are able to easily handle
+ * faults rather than in the requeue loop below.
+ *
+ * Updates topwaiter::requeue_state if a top waiter exists.
*/
- task_count++;
- ret = 0;
- break;
+ ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
+ &key2, &pi_state,
+ &exiting, nr_requeue);
- /*
- * If the above failed, then pi_state is NULL and
- * waiter::requeue_state is correct.
- */
- case -EFAULT:
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
- ret = fault_in_user_writeable(uaddr2);
- if (!ret)
- goto retry;
- return ret;
- case -EBUSY:
- case -EAGAIN:
- /*
- * Two reasons for this:
- * - EBUSY: Owner is exiting and we just wait for the
- * exit to complete.
- * - EAGAIN: The user space value changed.
- */
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
/*
- * Handle the case where the owner is in the middle of
- * exiting. Wait for the exit to complete otherwise
- * this task might loop forever, aka. live lock.
+ * At this point the top_waiter has either taken uaddr2 or
+ * is waiting on it. In both cases pi_state has been
+ * established and an initial refcount on it. In case of an
+ * error there's nothing.
+ *
+ * The top waiter's requeue_state is up to date:
+ *
+ * - If the lock was acquired atomically (ret == 1), then
+ * the state is Q_REQUEUE_PI_LOCKED.
+ *
+ * The top waiter has been dequeued and woken up and can
+ * return to user space immediately. The kernel/user
+ * space state is consistent. In case that there must be
+ * more waiters requeued the WAITERS bit in the user
+ * space futex is set so the top waiter task has to go
+ * into the syscall slowpath to unlock the futex. This
+ * will block until this requeue operation has been
+ * completed and the hash bucket locks have been
+ * dropped.
+ *
+ * - If the trylock failed with an error (ret < 0) then
+ * the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
+ * happened", or Q_REQUEUE_PI_IGNORE when there was an
+ * interleaved early wakeup.
+ *
+ * - If the trylock did not succeed (ret == 0) then the
+ * state is either Q_REQUEUE_PI_IN_PROGRESS or
+ * Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
+ * This will be cleaned up in the loop below, which
+ * cannot fail because futex_proxy_trylock_atomic() did
+ * the same sanity checks for requeue_pi as the loop
+ * below does.
*/
- wait_for_owner_exiting(ret, exiting);
- cond_resched();
- goto retry;
- default:
- goto out_unlock;
- }
- }
-
- plist_for_each_entry_safe(this, next, &hb1->chain, list) {
- if (task_count - nr_wake >= nr_requeue)
- break;
-
- if (!futex_match(&this->key, &key1))
- continue;
-
- /*
- * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
- * be paired with each other and no other futex ops.
- *
- * We should never be requeueing a futex_q with a pi_state,
- * which is awaiting a futex_unlock_pi().
- */
- if ((requeue_pi && !this->rt_waiter) ||
- (!requeue_pi && this->rt_waiter) ||
- this->pi_state) {
- ret = -EINVAL;
- break;
- }
-
- /* Plain futexes just wake or requeue and are done */
- if (!requeue_pi) {
- if (++task_count <= nr_wake)
- this->wake(&wake_q, this);
- else
- requeue_futex(this, hb1, hb2, &key2);
- continue;
+ switch (ret) {
+ case 0:
+ /* We hold a reference on the pi state. */
+ break;
+
+ case 1:
+ /*
+ * futex_proxy_trylock_atomic() acquired the user space
+ * futex. Adjust task_count.
+ */
+ task_count++;
+ ret = 0;
+ break;
+
+ /*
+ * If the above failed, then pi_state is NULL and
+ * waiter::requeue_state is correct.
+ */
+ case -EFAULT:
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ ret = fault_in_user_writeable(uaddr2);
+ if (!ret)
+ goto retry;
+ return ret;
+ case -EBUSY:
+ case -EAGAIN:
+ /*
+ * Two reasons for this:
+ * - EBUSY: Owner is exiting and we just wait for the
+ * exit to complete.
+ * - EAGAIN: The user space value changed.
+ */
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock;
+ }
}
- /* Ensure we requeue to the expected futex for requeue_pi. */
- if (!futex_match(this->requeue_pi_key, &key2)) {
- ret = -EINVAL;
- break;
- }
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
+ if (task_count - nr_wake >= nr_requeue)
+ break;
- /*
- * Requeue nr_requeue waiters and possibly one more in the case
- * of requeue_pi if we couldn't acquire the lock atomically.
- *
- * Prepare the waiter to take the rt_mutex. Take a refcount
- * on the pi_state and store the pointer in the futex_q
- * object of the waiter.
- */
- get_pi_state(pi_state);
+ if (!futex_match(&this->key, &key1))
+ continue;
- /* Don't requeue when the waiter is already on the way out. */
- if (!futex_requeue_pi_prepare(this, pi_state)) {
/*
- * Early woken waiter signaled that it is on the
- * way out. Drop the pi_state reference and try the
- * next waiter. @this->pi_state is still NULL.
+ * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
+ * be paired with each other and no other futex ops.
+ *
+ * We should never be requeueing a futex_q with a pi_state,
+ * which is awaiting a futex_unlock_pi().
*/
- put_pi_state(pi_state);
- continue;
- }
-
- ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
- this->rt_waiter,
- this->task);
+ if ((requeue_pi && !this->rt_waiter) ||
+ (!requeue_pi && this->rt_waiter) ||
+ this->pi_state) {
+ ret = -EINVAL;
+ break;
+ }
+
+ /* Plain futexes just wake or requeue and are done */
+ if (!requeue_pi) {
+ if (++task_count <= nr_wake)
+ this->wake(&wake_q, this);
+ else
+ requeue_futex(this, hb1, hb2, &key2);
+ continue;
+ }
+
+ /* Ensure we requeue to the expected futex for requeue_pi. */
+ if (!futex_match(this->requeue_pi_key, &key2)) {
+ ret = -EINVAL;
+ break;
+ }
- if (ret == 1) {
- /*
- * We got the lock. We do neither drop the refcount
- * on pi_state nor clear this->pi_state because the
- * waiter needs the pi_state for cleaning up the
- * user space value. It will drop the refcount
- * after doing so. this::requeue_state is updated
- * in the wakeup as well.
- */
- requeue_pi_wake_futex(this, &key2, hb2);
- task_count++;
- } else if (!ret) {
- /* Waiter is queued, move it to hb2 */
- requeue_futex(this, hb1, hb2, &key2);
- futex_requeue_pi_complete(this, 0);
- task_count++;
- } else {
- /*
- * rt_mutex_start_proxy_lock() detected a potential
- * deadlock when we tried to queue that waiter.
- * Drop the pi_state reference which we took above
- * and remove the pointer to the state from the
- * waiters futex_q object.
- */
- this->pi_state = NULL;
- put_pi_state(pi_state);
- futex_requeue_pi_complete(this, ret);
/*
- * We stop queueing more waiters and let user space
- * deal with the mess.
+ * Requeue nr_requeue waiters and possibly one more in the case
+ * of requeue_pi if we couldn't acquire the lock atomically.
+ *
+ * Prepare the waiter to take the rt_mutex. Take a refcount
+ * on the pi_state and store the pointer in the futex_q
+ * object of the waiter.
*/
- break;
+ get_pi_state(pi_state);
+
+ /* Don't requeue when the waiter is already on the way out. */
+ if (!futex_requeue_pi_prepare(this, pi_state)) {
+ /*
+ * Early woken waiter signaled that it is on the
+ * way out. Drop the pi_state reference and try the
+ * next waiter. @this->pi_state is still NULL.
+ */
+ put_pi_state(pi_state);
+ continue;
+ }
+
+ ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
+ this->rt_waiter,
+ this->task);
+
+ if (ret == 1) {
+ /*
+ * We got the lock. We do neither drop the refcount
+ * on pi_state nor clear this->pi_state because the
+ * waiter needs the pi_state for cleaning up the
+ * user space value. It will drop the refcount
+ * after doing so. this::requeue_state is updated
+ * in the wakeup as well.
+ */
+ requeue_pi_wake_futex(this, &key2, hb2);
+ task_count++;
+ } else if (!ret) {
+ /* Waiter is queued, move it to hb2 */
+ requeue_futex(this, hb1, hb2, &key2);
+ futex_requeue_pi_complete(this, 0);
+ task_count++;
+ } else {
+ /*
+ * rt_mutex_start_proxy_lock() detected a potential
+ * deadlock when we tried to queue that waiter.
+ * Drop the pi_state reference which we took above
+ * and remove the pointer to the state from the
+ * waiters futex_q object.
+ */
+ this->pi_state = NULL;
+ put_pi_state(pi_state);
+ futex_requeue_pi_complete(this, ret);
+ /*
+ * We stop queueing more waiters and let user space
+ * deal with the mess.
+ */
+ break;
+ }
}
- }
- /*
- * We took an extra initial reference to the pi_state in
- * futex_proxy_trylock_atomic(). We need to drop it here again.
- */
- put_pi_state(pi_state);
+ /*
+ * We took an extra initial reference to the pi_state in
+ * futex_proxy_trylock_atomic(). We need to drop it here again.
+ */
+ put_pi_state(pi_state);
out_unlock:
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ }
wake_up_q(&wake_q);
- futex_hb_waiters_dec(hb2);
return ret ? ret : task_count;
}
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 1108f37..7dc35be 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -253,7 +253,6 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int nr_wake, int nr_wake2, int op)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
- struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
int ret, op_ret;
DEFINE_WAKE_Q(wake_q);
@@ -266,67 +265,71 @@ retry:
if (unlikely(ret != 0))
return ret;
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
-
retry_private:
- double_lock_hb(hb1, hb2);
- op_ret = futex_atomic_op_inuser(op, uaddr2);
- if (unlikely(op_ret < 0)) {
- double_unlock_hb(hb1, hb2);
-
- if (!IS_ENABLED(CONFIG_MMU) ||
- unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
- /*
- * we don't get EFAULT from MMU faults if we don't have
- * an MMU, but we might get them from range checking
- */
- ret = op_ret;
- return ret;
- }
-
- if (op_ret == -EFAULT) {
- ret = fault_in_user_writeable(uaddr2);
- if (ret)
+ if (1) {
+ struct futex_hash_bucket *hb1, *hb2;
+
+ hb1 = futex_hash(&key1);
+ hb2 = futex_hash(&key2);
+
+ double_lock_hb(hb1, hb2);
+ op_ret = futex_atomic_op_inuser(op, uaddr2);
+ if (unlikely(op_ret < 0)) {
+ double_unlock_hb(hb1, hb2);
+
+ if (!IS_ENABLED(CONFIG_MMU) ||
+ unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
+ /*
+ * we don't get EFAULT from MMU faults if we don't have
+ * an MMU, but we might get them from range checking
+ */
+ ret = op_ret;
return ret;
- }
-
- cond_resched();
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
- goto retry;
- }
+ }
- plist_for_each_entry_safe(this, next, &hb1->chain, list) {
- if (futex_match (&this->key, &key1)) {
- if (this->pi_state || this->rt_waiter) {
- ret = -EINVAL;
- goto out_unlock;
+ if (op_ret == -EFAULT) {
+ ret = fault_in_user_writeable(uaddr2);
+ if (ret)
+ return ret;
}
- this->wake(&wake_q, this);
- if (++ret >= nr_wake)
- break;
+
+ cond_resched();
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
+ goto retry;
}
- }
- if (op_ret > 0) {
- op_ret = 0;
- plist_for_each_entry_safe(this, next, &hb2->chain, list) {
- if (futex_match (&this->key, &key2)) {
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
+ if (futex_match(&this->key, &key1)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
goto out_unlock;
}
this->wake(&wake_q, this);
- if (++op_ret >= nr_wake2)
+ if (++ret >= nr_wake)
break;
}
}
- ret += op_ret;
- }
+
+ if (op_ret > 0) {
+ op_ret = 0;
+ plist_for_each_entry_safe(this, next, &hb2->chain, list) {
+ if (futex_match(&this->key, &key2)) {
+ if (this->pi_state || this->rt_waiter) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+ this->wake(&wake_q, this);
+ if (++op_ret >= nr_wake2)
+ break;
+ }
+ }
+ ret += op_ret;
+ }
out_unlock:
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb(hb1, hb2);
+ }
wake_up_q(&wake_q);
return ret;
}
@@ -402,7 +405,6 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
*/
int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
{
- struct futex_hash_bucket *hb;
bool retry = false;
int ret, i;
u32 uval;
@@ -441,21 +443,25 @@ retry:
struct futex_q *q = &vs[i].q;
u32 val = vs[i].w.val;
- hb = futex_hash(&q->key);
- futex_q_lock(q, hb);
- ret = futex_get_value_locked(&uval, uaddr);
+ if (1) {
+ struct futex_hash_bucket *hb;
- if (!ret && uval == val) {
- /*
- * The bucket lock can't be held while dealing with the
- * next futex. Queue each futex at this moment so hb can
- * be unlocked.
- */
- futex_queue(q, hb, current);
- continue;
- }
+ hb = futex_hash(&q->key);
+ futex_q_lock(q, hb);
+ ret = futex_get_value_locked(&uval, uaddr);
- futex_q_unlock(hb);
+ if (!ret && uval == val) {
+ /*
+ * The bucket lock can't be held while dealing with the
+ * next futex. Queue each futex at this moment so hb can
+ * be unlocked.
+ */
+ futex_queue(q, hb, current);
+ continue;
+ }
+
+ futex_q_unlock(hb);
+ }
__set_current_state(TASK_RUNNING);
/*
@@ -584,7 +590,6 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, union futex_key *key2,
struct task_struct *task)
{
- struct futex_hash_bucket *hb;
u32 uval;
int ret;
@@ -612,43 +617,47 @@ retry:
return ret;
retry_private:
- hb = futex_hash(&q->key);
- futex_q_lock(q, hb);
+ if (1) {
+ struct futex_hash_bucket *hb;
+
+ hb = futex_hash(&q->key);
+ futex_q_lock(q, hb);
- ret = futex_get_value_locked(&uval, uaddr);
+ ret = futex_get_value_locked(&uval, uaddr);
- if (ret) {
- futex_q_unlock(hb);
+ if (ret) {
+ futex_q_unlock(hb);
- ret = get_user(uval, uaddr);
- if (ret)
- return ret;
+ ret = get_user(uval, uaddr);
+ if (ret)
+ return ret;
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
- goto retry;
- }
+ goto retry;
+ }
- if (uval != val) {
- futex_q_unlock(hb);
- return -EWOULDBLOCK;
- }
+ if (uval != val) {
+ futex_q_unlock(hb);
+ return -EWOULDBLOCK;
+ }
- if (key2 && futex_match(&q->key, key2)) {
- futex_q_unlock(hb);
- return -EINVAL;
- }
+ if (key2 && futex_match(&q->key, key2)) {
+ futex_q_unlock(hb);
+ return -EINVAL;
+ }
- /*
- * The task state is guaranteed to be set before another task can
- * wake it. set_current_state() is implemented using smp_store_mb() and
- * futex_queue() calls spin_unlock() upon completion, both serializing
- * access to the hash list and forcing another memory barrier.
- */
- if (task == current)
- set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
- futex_queue(q, hb, task);
+ /*
+ * The task state is guaranteed to be set before another task can
+ * wake it. set_current_state() is implemented using smp_store_mb() and
+ * futex_queue() calls spin_unlock() upon completion, both serializing
+ * access to the hash list and forcing another memory barrier.
+ */
+ if (task == current)
+ set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+ futex_queue(q, hb, task);
+ }
return ret;
}
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Pull futex_hash() out of futex_q_lock()
2025-04-16 16:29 ` [PATCH v12 04/21] futex: Pull futex_hash() out of futex_q_lock() Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior, x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 2fb292096d950a67a1941949a08a60ddd3193da3
Gitweb: https://git.kernel.org/tip/2fb292096d950a67a1941949a08a60ddd3193da3
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 18:29:04 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:05 +02:00
futex: Pull futex_hash() out of futex_q_lock()
Getting the hash bucket and queuing it are two distinct actions. In
light of wanting to add a put hash bucket function later, untangle
them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-5-bigeasy@linutronix.de
---
kernel/futex/core.c | 7 +------
kernel/futex/futex.h | 2 +-
kernel/futex/pi.c | 3 ++-
kernel/futex/waitwake.c | 6 ++++--
4 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cca1585..7adc914 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -502,13 +502,9 @@ void __futex_unqueue(struct futex_q *q)
}
/* The key must be already stored in q->key. */
-struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
+void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb)
__acquires(&hb->lock)
{
- struct futex_hash_bucket *hb;
-
- hb = futex_hash(&q->key);
-
/*
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
@@ -522,7 +518,6 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
- return hb;
}
void futex_q_unlock(struct futex_hash_bucket *hb)
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 16aafd0..a219903 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -354,7 +354,7 @@ static inline int futex_hb_waiters_pending(struct futex_hash_bucket *hb)
#endif
}
-extern struct futex_hash_bucket *futex_q_lock(struct futex_q *q);
+extern void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb);
extern void futex_q_unlock(struct futex_hash_bucket *hb);
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 7a94184..3bf942e 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -939,7 +939,8 @@ retry:
goto out;
retry_private:
- hb = futex_q_lock(&q);
+ hb = futex_hash(&q.key);
+ futex_q_lock(&q, hb);
ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
&exiting, 0);
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 6cf1070..1108f37 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -441,7 +441,8 @@ retry:
struct futex_q *q = &vs[i].q;
u32 val = vs[i].w.val;
- hb = futex_q_lock(q);
+ hb = futex_hash(&q->key);
+ futex_q_lock(q, hb);
ret = futex_get_value_locked(&uval, uaddr);
if (!ret && uval == val) {
@@ -611,7 +612,8 @@ retry:
return ret;
retry_private:
- hb = futex_q_lock(q);
+ hb = futex_hash(&q->key);
+ futex_q_lock(q, hb);
ret = futex_get_value_locked(&uval, uaddr);
* [tip: locking/futex] mm: Add vmalloc_huge_node()
2025-04-16 16:29 ` [PATCH v12 02/21] mm: Add vmalloc_huge_node() Sebastian Andrzej Siewior
@ 2025-05-08 10:33 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior,
Christoph Hellwig, x86, linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 55284f70134f01fdc9cc4c4905551cc1f37abd34
Gitweb: https://git.kernel.org/tip/55284f70134f01fdc9cc4c4905551cc1f37abd34
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 18:29:02 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:05 +02:00
mm: Add vmalloc_huge_node()
To enable node specific hash-tables using huge pages if possible.
[bigeasy: use __vmalloc_node_range_noprof(), add nommu bits, inline
vmalloc_huge]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250416162921.513656-3-bigeasy@linutronix.de
---
include/linux/vmalloc.h | 9 +++++++--
mm/nommu.c | 18 +++++++++++++++++-
mm/vmalloc.c | 11 ++++++-----
3 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd..de95794 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -168,8 +168,13 @@ void *__vmalloc_node_noprof(unsigned long size, unsigned long align, gfp_t gfp_m
int node, const void *caller) __alloc_size(1);
#define __vmalloc_node(...) alloc_hooks(__vmalloc_node_noprof(__VA_ARGS__))
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
-#define vmalloc_huge(...) alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
+#define vmalloc_huge_node(...) alloc_hooks(vmalloc_huge_node_noprof(__VA_ARGS__))
+
+static inline void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+{
+ return vmalloc_huge_node(size, gfp_mask, NUMA_NO_NODE);
+}
extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
#define __vmalloc_array(...) alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
diff --git a/mm/nommu.c b/mm/nommu.c
index 617e7ba..70f92f9 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -200,7 +200,23 @@ void *vmalloc_noprof(unsigned long size)
}
EXPORT_SYMBOL(vmalloc_noprof);
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
+/*
+ * vmalloc_huge_node - allocate virtually contiguous memory, on a node
+ *
+ * @size: allocation size
+ * @gfp_mask: flags for the page level allocator
+ * @node: node to use for allocation or NUMA_NO_NODE
+ *
+ * Allocate enough pages to cover @size from the page level
+ * allocator and map them into contiguous kernel virtual space.
+ *
+ * Due to NOMMU implications the node argument and HUGE page attribute
+ * are ignored.
+ */
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+ return __vmalloc_noprof(size, gfp_mask);
+}
/*
* vzalloc - allocate virtually contiguous memory with zero fill
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3ed720a..8b9f6d3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3943,9 +3943,10 @@ void *vmalloc_noprof(unsigned long size)
EXPORT_SYMBOL(vmalloc_noprof);
/**
- * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
+ * vmalloc_huge_node - allocate virtually contiguous memory, allow huge pages
* @size: allocation size
* @gfp_mask: flags for the page level allocator
+ * @node: node to use for allocation or NUMA_NO_NODE
*
* Allocate enough pages to cover @size from the page level
* allocator and map them into contiguous kernel virtual space.
@@ -3954,13 +3955,13 @@ EXPORT_SYMBOL(vmalloc_noprof);
*
* Return: pointer to the allocated memory or %NULL on error
*/
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
{
return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
- gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
- NUMA_NO_NODE, __builtin_return_address(0));
+ gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+ node, __builtin_return_address(0));
}
-EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
+EXPORT_SYMBOL_GPL(vmalloc_huge_node_noprof);
/**
* vzalloc - allocate virtually contiguous memory with zero fill
^ permalink raw reply related [flat|nested] 109+ messages in thread
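The refactoring pattern in the patch above, where the node-aware variant becomes the real implementation and the old name is kept as a thin wrapper passing NUMA_NO_NODE, can be sketched in userspace C. The allocator core and names below are stand-ins for illustration, not the kernel's implementation:

```c
#include <assert.h>
#include <stdlib.h>

#define NUMA_NO_NODE (-1)

/* Stand-in for the real page-level allocator core. */
static void *alloc_core(size_t size, int node)
{
	(void)node;	/* a real implementation would prefer pages on @node */
	return calloc(1, size);
}

/* New node-aware entry point, mirroring vmalloc_huge_node(). */
static void *vmalloc_huge_node(size_t size, int node)
{
	return alloc_core(size, node);
}

/* Old entry point kept as a trivial wrapper, as the patch does. */
static void *vmalloc_huge(size_t size)
{
	return vmalloc_huge_node(size, NUMA_NO_NODE);
}
```

Existing callers of the old name keep working unchanged, while new callers can express a node preference.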
* [tip: locking/futex] futex: Move futex_queue() into futex_wait_setup()
2025-04-16 16:29 ` [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup() Sebastian Andrzej Siewior
2025-05-05 21:43 ` André Almeida
@ 2025-05-08 10:33 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 109+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-05-08 10:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior, x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 93f1b6d79a73b520b6875cf3babf4a09acc4eef0
Gitweb: https://git.kernel.org/tip/93f1b6d79a73b520b6875cf3babf4a09acc4eef0
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 16 Apr 2025 18:29:03 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:05 +02:00
futex: Move futex_queue() into futex_wait_setup()
futex_wait_setup() has a weird calling convention in order to return
hb to use as an argument to futex_queue().
Mostly such that requeue can have an extra test in between.
Reorder code a little to get rid of this and keep the hb usage inside
futex_wait_setup().
[bigeasy: fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
---
io_uring/futex.c | 4 +---
kernel/futex/futex.h | 6 ++---
kernel/futex/requeue.c | 28 +++++++++---------------
kernel/futex/waitwake.c | 47 ++++++++++++++++++++++------------------
4 files changed, 42 insertions(+), 43 deletions(-)
diff --git a/io_uring/futex.c b/io_uring/futex.c
index 0ea4820..e89c089 100644
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -273,7 +273,6 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
struct io_ring_ctx *ctx = req->ctx;
struct io_futex_data *ifd = NULL;
- struct futex_hash_bucket *hb;
int ret;
if (!iof->futex_mask) {
@@ -295,12 +294,11 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
ifd->req = req;
ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
- &ifd->q, &hb);
+ &ifd->q, NULL, NULL);
if (!ret) {
hlist_add_head(&req->hash_node, &ctx->futex_list);
io_ring_submit_unlock(ctx, issue_flags);
- futex_queue(&ifd->q, hb, NULL);
return IOU_ISSUE_SKIP_COMPLETE;
}
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 6b2f4c7..16aafd0 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -219,9 +219,9 @@ static inline int futex_match(union futex_key *key1, union futex_key *key2)
}
extern int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
- struct futex_q *q, struct futex_hash_bucket **hb);
-extern void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
- struct hrtimer_sleeper *timeout);
+ struct futex_q *q, union futex_key *key2,
+ struct task_struct *task);
+extern void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout);
extern bool __futex_wake_mark(struct futex_q *q);
extern void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q);
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b47bb76..0e55975 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -769,7 +769,6 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
{
struct hrtimer_sleeper timeout, *to;
struct rt_mutex_waiter rt_waiter;
- struct futex_hash_bucket *hb;
union futex_key key2 = FUTEX_KEY_INIT;
struct futex_q q = futex_q_init;
struct rt_mutex_base *pi_mutex;
@@ -805,29 +804,24 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
- ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+ ret = futex_wait_setup(uaddr, val, flags, &q, &key2, current);
if (ret)
goto out;
- /*
- * The check above which compares uaddrs is not sufficient for
- * shared futexes. We need to compare the keys:
- */
- if (futex_match(&q.key, &key2)) {
- futex_q_unlock(hb);
- ret = -EINVAL;
- goto out;
- }
-
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
- futex_wait_queue(hb, &q, to);
+ futex_do_wait(&q, to);
switch (futex_requeue_pi_wakeup_sync(&q)) {
case Q_REQUEUE_PI_IGNORE:
- /* The waiter is still on uaddr1 */
- spin_lock(&hb->lock);
- ret = handle_early_requeue_pi_wakeup(hb, &q, to);
- spin_unlock(&hb->lock);
+ {
+ struct futex_hash_bucket *hb;
+
+ hb = futex_hash(&q.key);
+ /* The waiter is still on uaddr1 */
+ spin_lock(&hb->lock);
+ ret = handle_early_requeue_pi_wakeup(hb, &q, to);
+ spin_unlock(&hb->lock);
+ }
break;
case Q_REQUEUE_PI_LOCKED:
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 25877d4..6cf1070 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -339,18 +339,8 @@ static long futex_wait_restart(struct restart_block *restart);
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
-void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
- struct hrtimer_sleeper *timeout)
+void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout)
{
- /*
- * The task state is guaranteed to be set before another task can
- * wake it. set_current_state() is implemented using smp_store_mb() and
- * futex_queue() calls spin_unlock() upon completion, both serializing
- * access to the hash list and forcing another memory barrier.
- */
- set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
- futex_queue(q, hb, current);
-
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
@@ -578,7 +568,8 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
* @val: the expected value
* @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
- * @hb: storage for hash_bucket pointer to be returned to caller
+ * @key2: the second futex_key if used for requeue PI
+ * @task: the task queueing this futex
*
* Setup the futex_q and locate the hash_bucket. Get the futex value and
* compare it with the expected value. Handle atomic faults internally.
@@ -589,8 +580,10 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
* - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
- struct futex_q *q, struct futex_hash_bucket **hb)
+ struct futex_q *q, union futex_key *key2,
+ struct task_struct *task)
{
+ struct futex_hash_bucket *hb;
u32 uval;
int ret;
@@ -618,12 +611,12 @@ retry:
return ret;
retry_private:
- *hb = futex_q_lock(q);
+ hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
- futex_q_unlock(*hb);
+ futex_q_unlock(hb);
ret = get_user(uval, uaddr);
if (ret)
@@ -636,10 +629,25 @@ retry_private:
}
if (uval != val) {
- futex_q_unlock(*hb);
- ret = -EWOULDBLOCK;
+ futex_q_unlock(hb);
+ return -EWOULDBLOCK;
}
+ if (key2 && futex_match(&q->key, key2)) {
+ futex_q_unlock(hb);
+ return -EINVAL;
+ }
+
+ /*
+ * The task state is guaranteed to be set before another task can
+ * wake it. set_current_state() is implemented using smp_store_mb() and
+ * futex_queue() calls spin_unlock() upon completion, both serializing
+ * access to the hash list and forcing another memory barrier.
+ */
+ if (task == current)
+ set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+ futex_queue(q, hb, task);
+
return ret;
}
@@ -647,7 +655,6 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
- struct futex_hash_bucket *hb;
int ret;
if (!bitset)
@@ -660,12 +667,12 @@ retry:
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
- ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+ ret = futex_wait_setup(uaddr, val, flags, &q, NULL, current);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
- futex_wait_queue(hb, &q, to);
+ futex_do_wait(&q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
^ permalink raw reply related [flat|nested] 109+ messages in thread
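The reworked futex_wait_setup() above folds the value check, the requeue-PI key check and the enqueue under a single bucket lock instead of returning the bucket to the caller. A simplified userspace model of that control flow (the types, names and the modelled locking are stand-ins for illustration only):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Minimal model of a queued waiter. */
struct model_q {
	int key;
	bool queued;
};

/*
 * Modelled futex_wait_setup(): everything from the value check to the
 * enqueue happens while the (imaginary) bucket lock is held, so the
 * caller never sees the bucket at all.
 */
static int model_wait_setup(const int *uaddr, int val,
			    struct model_q *q, const int *key2)
{
	/* bucket "locked" here */
	if (*uaddr != val)
		return -EWOULDBLOCK;	/* value changed: do not sleep */
	if (key2 && q->key == *key2)
		return -EINVAL;		/* requeue-PI onto the same key */
	q->queued = true;		/* queue before dropping the lock */
	/* bucket "unlocked" here */
	return 0;
}
```

On success the waiter is already queued, matching the new convention where futex_do_wait() only arms the timer and schedules.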
* [tip: locking/futex] rcuref: Provide rcuref_is_dead()
2025-04-16 16:29 ` [PATCH v12 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
2025-05-05 21:09 ` André Almeida
@ 2025-05-08 10:34 ` tip-bot2 for Sebastian Andrzej Siewior
1 sibling, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-05-08 10:34 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 3efa66ce6ee1b55ab687b316e48e1e9ddc1f780a
Gitweb: https://git.kernel.org/tip/3efa66ce6ee1b55ab687b316e48e1e9ddc1f780a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 16 Apr 2025 18:29:01 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 03 May 2025 12:02:04 +02:00
rcuref: Provide rcuref_is_dead()
rcuref_read() returns the number of references that are currently held.
If 0 is returned then it is not safe to assume that the object can be
scheduled for deconstruction because it is marked DEAD. This happens if
the return value of rcuref_put() is ignored and assumptions are made.
If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
If rcuref_put() did not return to the caller then the counter did not
yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
is still a chance that the counter will transition from RCUREF_NOREF to
0 meaning it is still valid and must not be deconstructed. In this brief
window rcuref_read() will return 0.
Provide rcuref_is_dead() to determine if the counter is marked as
RCUREF_DEAD.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-2-bigeasy@linutronix.de
---
include/linux/rcuref.h | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c..2fb2af6 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
* rcuref_read - Read the number of held reference counts of a rcuref
* @ref: Pointer to the reference count
*
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuref_is_dead().
*/
static inline unsigned int rcuref_read(rcuref_t *ref)
{
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
return c >= RCUREF_RELEASED ? 0 : c + 1;
}
+/**
+ * rcuref_is_dead - Check if the rcuref has been already marked dead
+ * @ref: Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+ unsigned int c = atomic_read(&ref->refcnt);
+
+ return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
/**
^ permalink raw reply related [flat|nested] 109+ messages in thread
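The NOREF-versus-DEAD window described in the commit message can be illustrated with a userspace model of the counter encoding. The constant values below follow include/linux/rcuref.h but should be treated as assumptions of this sketch:

```c
#include <assert.h>

/* Assumed counter encoding, mirroring include/linux/rcuref.h. */
#define RCUREF_RELEASED	0xC0000000U
#define RCUREF_DEAD	0xE0000000U
#define RCUREF_NOREF	0xFFFFFFFFU	/* counter just after the last put */

/* rcuref_read(): 0 for both the NOREF window and the DEAD state. */
static unsigned int model_rcuref_read(unsigned int c)
{
	return c >= RCUREF_RELEASED ? 0 : c + 1;
}

/*
 * rcuref_is_dead(): true only once the counter reached the released/dead
 * range. NOREF alone is not enough because the counter may still be
 * revived by a concurrent get before rcuref_put() marks it DEAD.
 */
static int model_rcuref_is_dead(unsigned int c)
{
	return c >= RCUREF_RELEASED && c < RCUREF_NOREF;
}
```

Note how a counter sitting at NOREF reads as 0 references but is not yet dead, which is exactly the brief window the commit message warns about.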
* Re: [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock()
2025-04-16 16:29 ` [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
@ 2025-05-08 19:06 ` André Almeida
2025-05-16 12:18 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 109+ messages in thread
From: André Almeida @ 2025-05-08 19:06 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli,
Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long,
linux-kernel
Em 16/04/2025 13:29, Sebastian Andrzej Siewior escreveu:
> futex_lock_pi() and __fixup_pi_state_owner() acquire the
> futex_q::lock_ptr without holding a reference assuming the previously
> obtained hash bucket and the assigned lock_ptr are still valid. This
> isn't the case once the private hash can be resized and becomes invalid
> after the reference drop.
>
> Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in
> futex_q::lock_ptr. The lock pointer is read in a RCU section to ensure
> that it does not go away if the hash bucket has been replaced and the
> old pointer has been observed. After locking the pointer needs to be
> compared to check if it changed. If so then the hash bucket has been
> replaced and the user has been moved to the new one and lock_ptr has
> been updated. The lock operation needs to be redone in this case.
>
> The locked hash bucket is not returned.
>
> A special case is an early return in futex_lock_pi() (due to signal or
> timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
> futex_q::lock_ptr is expected (and its matching hash bucket) but since
> the waiter has been removed from the hash this can no longer be
> guaranteed. Therefore, before the waiter is removed, a reference is
> acquired which is later dropped by the waiter to avoid a resize.
>
> Add futex_q_lockptr_lock() and use it.
> Acquire an additional reference in requeue_pi_wake_futex() and
> futex_unlock_pi() while the futex_q is removed, denote this extra
> reference in futex_q::drop_hb_ref and let the waiter drop the reference
> in this case.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> kernel/futex/core.c | 25 +++++++++++++++++++++++++
> kernel/futex/futex.h | 3 ++-
> kernel/futex/pi.c | 15 +++++++++++++--
> kernel/futex/requeue.c | 16 +++++++++++++---
> 4 files changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index 5e70cb8eb2507..1443a98dfa7fa 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -134,6 +134,13 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
> return &futex_queues[hash & futex_hashmask];
> }
>
> +/**
> + * futex_hash_get - Get an additional reference for the local hash.
> + * @hb: ptr to the private local hash.
> + *
> + * Obtain an additional reference for the already obtained hash bucket. The
> + * caller must already own a reference.
> + */
This comment should come with patch 6 (that creates the function) or
patch 14 (that implements the function).
> void futex_hash_get(struct futex_hash_bucket *hb) { }
> void futex_hash_put(struct futex_hash_bucket *hb) { }
>
> @@ -615,6 +622,24 @@ int futex_unqueue(struct futex_q *q)
> return ret;
> }
>
> +void futex_q_lockptr_lock(struct futex_q *q)
> +{
> + spinlock_t *lock_ptr;
> +
> + /*
> + * See futex_unqueue() why lock_ptr can change.
> + */
> + guard(rcu)();
> +retry:
> + lock_ptr = READ_ONCE(q->lock_ptr);
> + spin_lock(lock_ptr);
> +
> + if (unlikely(lock_ptr != q->lock_ptr)) {
> + spin_unlock(lock_ptr);
> + goto retry;
> + }
> +}
> +
> /*
> * PI futexes can not be requeued and must remove themselves from the hash
> * bucket. The hash bucket lock (i.e. lock_ptr) is held.
> diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
> index bc76e366f9a77..26e69333cb745 100644
> --- a/kernel/futex/futex.h
> +++ b/kernel/futex/futex.h
> @@ -183,6 +183,7 @@ struct futex_q {
> union futex_key *requeue_pi_key;
> u32 bitset;
> atomic_t requeue_state;
> + bool drop_hb_ref;
This new member needs a comment:
* @drop_hb_ref: True if an extra reference was acquired by a pi operation, and needs an extra put()
^ permalink raw reply [flat|nested] 109+ messages in thread
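The retry loop in futex_q_lockptr_lock() quoted above can be modelled in userspace: lock the pointer that was read, then re-check that it is still the current one, and retry if a resize moved the queue to another bucket in between. The lock type and names below are stand-ins for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in lock; a real implementation would use a spinlock. */
struct model_lock {
	bool held;
};

static void model_lock_acquire(struct model_lock *l) { l->held = true; }
static void model_lock_release(struct model_lock *l) { l->held = false; }

struct model_waiter {
	struct model_lock *lock_ptr;	/* may be redirected by a resize */
};

/* Modelled futex_q_lockptr_lock(): lock, re-check, retry on a race. */
static struct model_lock *model_lockptr_lock(struct model_waiter *q)
{
	struct model_lock *lock_ptr;

	for (;;) {
		lock_ptr = q->lock_ptr;		/* READ_ONCE() in the kernel */
		model_lock_acquire(lock_ptr);
		if (lock_ptr == q->lock_ptr)
			return lock_ptr;	/* still the current bucket */
		model_lock_release(lock_ptr);	/* raced with a resize */
	}
}
```

The re-check after acquiring the lock is what makes the stale-pointer race benign: a waiter that was moved simply locks the old bucket, notices the change, and starts over.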
* Re: [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-04-16 16:29 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
@ 2025-05-08 20:32 ` André Almeida
2025-05-16 10:49 ` Sebastian Andrzej Siewior
2025-05-10 8:45 ` [PATCH] futex: Fix futex_mm_init() build failure on older compilers, remove rcu_assign_pointer() Ingo Molnar
2025-06-01 7:39 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Lai, Yi
3 siblings, 1 reply; 109+ messages in thread
From: André Almeida @ 2025-05-08 20:32 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli,
Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long
Em 16/04/2025 13:29, Sebastian Andrzej Siewior escreveu:
> The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
> replacement. The futex_hash_allocate()/ PR_FUTEX_HASH_SET_SLOTS
> operation can now be invoked at runtime and resize an already existing
> internal private futex_hash_bucket to another size.
>
> The reallocation is based on an idea by Thomas Gleixner: The initial
> allocation of struct futex_private_hash sets the reference count
> to one. Every user acquires a reference on the local hash before using
> it and drops it after it enqueued itself on the hash bucket. There is no
> reference held while the task is scheduled out while waiting for the
> wake up.
> The resize process allocates a new struct futex_private_hash and drops
> the initial reference. Synchronized with mm_struct::futex_hash_lock it
> is checked if the reference counter for the currently used
> mm_struct::futex_phash is marked as DEAD. If so, then all users enqueued
> on the current private hash are requeued on the new private hash and the
> new private hash is set to mm_struct::futex_phash. Otherwise the newly
> allocated private hash is saved as mm_struct::futex_phash_new and the
> rehashing and reassigning is delayed to the futex_hash() caller once the
> reference counter is marked DEAD.
> The replacement is not performed at rcuref_put() time because certain
> callers, such as futex_wait_queue(), drop their reference after changing
> the task state. This change will be destroyed once the futex_hash_lock
> is acquired.
>
> The user can change the number slots with PR_FUTEX_HASH_SET_SLOTS
> multiple times. An increase and decrease is allowed and request blocks
> until the assignment is done.
>
> The private hash allocated at thread creation is changed from 16 to
> 16 <= 4 * number_of_threads <= global_hash_size
> where number_of_threads can not exceed the number of online CPUs. Should
> the user invoke PR_FUTEX_HASH_SET_SLOTS then the auto scaling is disabled.
>
> [peterz: reorganize the code to avoid state tracking and simplify new
> object handling, block the user until changes are in effect, allow
> increase and decrease of the hash].
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> include/linux/futex.h | 3 +-
> include/linux/mm_types.h | 4 +-
> kernel/futex/core.c | 290 ++++++++++++++++++++++++++++++++++++---
> kernel/futex/requeue.c | 5 +
> 4 files changed, 281 insertions(+), 21 deletions(-)
>
[...]
> static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> @@ -1273,16 +1442,23 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
> return -EINVAL;
>
> - if (mm->futex_phash)
> - return -EALREADY;
> -
> - if (!thread_group_empty(current))
> - return -EINVAL;
> + /*
> + * Once we've disabled the global hash there is no way back.
> + */
> + scoped_guard(rcu) {
> + fph = rcu_dereference(mm->futex_phash);
> + if (fph && !fph->hash_mask) {
> + if (custom)
> + return -EBUSY;
> + return 0;
> + }
> + }
>
> fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
> if (!fph)
> return -ENOMEM;
>
> + rcuref_init(&fph->users, 1);
> fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
> fph->custom = custom;
> fph->mm = mm;
> @@ -1290,26 +1466,102 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> for (i = 0; i < hash_slots; i++)
> futex_hash_bucket_init(&fph->queues[i], fph);
>
> - mm->futex_phash = fph;
If (hash_slots == 0), do we still need to do all of this work below? I
thought that using the global hash would allow us to skip this.
> + if (custom) {
> + /*
> + * Only let prctl() wait / retry; don't unduly delay clone().
> + */
> +again:
> + wait_var_event(mm, futex_pivot_pending(mm));
> + }
> +
> + scoped_guard(mutex, &mm->futex_hash_lock) {
> + struct futex_private_hash *free __free(kvfree) = NULL;
> + struct futex_private_hash *cur, *new;
> +
> + cur = rcu_dereference_protected(mm->futex_phash,
> + lockdep_is_held(&mm->futex_hash_lock));
> + new = mm->futex_phash_new;
> + mm->futex_phash_new = NULL;
> +
> + if (fph) {
> + if (cur && !new) {
> + /*
> + * If we have an existing hash, but do not yet have
> + * allocated a replacement hash, drop the initial
> + * reference on the existing hash.
> + */
> + futex_private_hash_put(cur);
> + }
> +
> + if (new) {
> + /*
> + * Two updates raced; throw out the lesser one.
> + */
> + if (futex_hash_less(new, fph)) {
> + free = new;
> + new = fph;
> + } else {
> + free = fph;
> + }
> + } else {
> + new = fph;
> + }
> + fph = NULL;
> + }
> +
> + if (new) {
> + /*
> + * Will set mm->futex_phash_new on failure;
> + * futex_private_hash_get() will try again.
> + */
> + if (!__futex_pivot_hash(mm, new) && custom)
> + goto again;
Is it safe to use a goto inside a scoped_guard(){}?
^ permalink raw reply [flat|nested] 109+ messages in thread
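The auto-scaling rule from the commit message above (16 <= 4 * number_of_threads <= global_hash_size, with the thread count capped at the number of online CPUs and the slot count a power of two) can be sketched as a small helper. The exact clamping order is an assumption of this sketch, not the kernel's code:

```c
#include <assert.h>

/* Round up to the next power of two (slot counts must be powers of two). */
static unsigned int roundup_pow2(unsigned int v)
{
	unsigned int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Assumed auto-scaling: 16 <= 4 * threads <= global, threads <= CPUs. */
static unsigned int model_hash_slots(unsigned int threads,
				     unsigned int online_cpus,
				     unsigned int global_size)
{
	unsigned int n;

	if (threads > online_cpus)
		threads = online_cpus;
	n = 4 * threads;
	if (n < 16)
		n = 16;
	if (n > global_size)
		n = global_size;
	return roundup_pow2(n);
}
```

With these assumptions, a single-threaded process stays at 16 slots, while a process with 8 threads on 8 CPUs gets 32.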
* Re: [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
2025-05-06 7:36 ` Peter Zijlstra
@ 2025-05-09 11:41 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-09 11:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-05-06 09:36:11 [+0200], Peter Zijlstra wrote:
> Well, if you do stupid things, you get to keep the pieces or something
> along those lines. Same as when userspace goes scribble the node value
> while another thread is waiting and all that.
>
> Even with the unconditional write back you're going to have a problem
> with concurrent wait on the same futex.
We could add a global lock for the write back case to ensure there is
only one at a time. However let me document the current behaviour of the
new pieces and tick it off ;)
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-04-16 16:29 ` [PATCH v12 20/21] selftests/futex: Add futex_priv_hash Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
@ 2025-05-09 21:22 ` André Almeida
2025-05-16 7:38 ` Sebastian Andrzej Siewior
2025-05-27 11:28 ` Mark Brown
2 siblings, 1 reply; 109+ messages in thread
From: André Almeida @ 2025-05-09 21:22 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli,
Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long
Hi Sebastian,
Thank you for adding a selftest for the new uAPI. The recently accepted
futex selftests use the kselftest helpers in a different way than you
have used them here. I've attached a diff to exemplify how I would
write this selftest. The advantage is a TAP output that can be consumed
more easily by automated testing, and that does not stop when the first
test fails.
Em 16/04/2025 13:29, Sebastian Andrzej Siewior escreveu:
> Test the basic functionality of the private hash:
> - Upon start, with no threads there is no private hash.
> - The first thread initializes the private hash.
> - More than four threads will increase the size of the private hash if
> the system has more than 16 CPUs online.
> - Once the user sets the size of private hash, auto scaling is disabled.
> - The user is only allowed to use numbers to the power of two.
> - The user may request the global or make the hash immutable.
> - Once the global hash has been set or the hash has been made immutable,
> further changes are not allowed.
> - Futex operations should work the whole time. It must be possible to
> hold a lock, such as a PI-initialised mutex, during the resize operation.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
-- >8 --
---
.../futex/functional/futex_priv_hash.c | 31 ++++++++++++-------
1 file changed, 19 insertions(+), 12 deletions(-)
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index 4d37650baa19..33fa9ad11d69 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -51,15 +51,17 @@ static void futex_hash_slots_set_verify(int slots)
ret = futex_hash_slots_set(slots, 0);
if (ret != 0) {
- error("Failed to set slots to %d\n", errno, slots);
- exit(1);
+ ksft_test_result_fail("Failed to set slots to %d: %s\n", slots, strerror(errno));
+ return;
}
ret = futex_hash_slots_get();
if (ret != slots) {
- error("Set %d slots but PR_FUTEX_HASH_GET_SLOTS returns: %d\n",
- errno, slots, ret);
- exit(1);
+ ksft_test_result_fail("Set %d slots but PR_FUTEX_HASH_GET_SLOTS returns: %d\n",
+ slots, ret);
+ return;
}
+
+ ksft_test_result_pass("futex_hash_slots_set() and get() succeeded (%d slots)\n", slots);
}
static void futex_hash_slots_set_must_fail(int slots, int immutable)
@@ -67,12 +69,14 @@ static void futex_hash_slots_set_must_fail(int slots, int immutable)
int ret;
ret = futex_hash_slots_set(slots, immutable);
- if (ret < 0)
+ if (ret < 0) {
+ ksft_test_result_pass("invalid futex_hash_slots_set(%d, %d) succeeded.\n",
+ slots, immutable);
return;
+ }
- fail("futex_hash_slots_set(%d, %d) expected to fail but succeeded.\n",
+ ksft_test_result_fail("futex_hash_slots_set(%d, %d) expected to fail but succeeded.\n",
slots, immutable);
- exit(1);
}
static void *thread_return_fn(void *arg)
@@ -156,6 +160,8 @@ int main(int argc, char *argv[])
}
}
+ ksft_print_header();
+ ksft_set_plan(13);
ret = pthread_mutexattr_init(&mutex_attr_pi);
ret |= pthread_mutexattr_setprotocol(&mutex_attr_pi,
PTHREAD_PRIO_INHERIT);
@@ -235,13 +241,13 @@ int main(int argc, char *argv[])
ret = futex_hash_slots_set(15, 0);
if (ret >= 0) {
- fail("Expected to fail with 15 slots but succeeded: %d.\n", ret);
+ ksft_test_result_fail("Expected to fail with 15 slots but succeeded: %d.\n", ret);
return 1;
}
futex_hash_slots_set_verify(2);
join_max_threads();
if (counter != MAX_THREADS) {
- fail("Expected thread counter at %d but is %d\n",
+ ksft_test_result_fail("Expected thread counter at %d but is %d\n",
MAX_THREADS, counter);
return 1;
}
@@ -254,8 +260,7 @@ int main(int argc, char *argv[])
ret = futex_hash_slots_get();
if (ret != 2) {
- printf("Expected 2 slots, no auto-resize, got %d\n", ret);
- return 1;
+ ksft_test_result_fail("Expected 2 slots, no auto-resize, got %d\n", ret);
}
futex_hash_slots_set_must_fail(1 << 29, 0);
@@ -311,5 +316,7 @@ int main(int argc, char *argv[])
fail("Expected immutable private hash, got %d\n", ret);
return 1;
}
+
+ ksft_print_cnts();
return 0;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 109+ messages in thread
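The TAP-style reporting that the kselftest helpers provide, and that the review above asks for, can be reduced to a minimal sketch: declare a plan up front and report each result rather than exiting on the first failure. The tap_* names below are stand-ins for the real ksft_* API:

```c
#include <stdio.h>

static int tap_count;

/* Announce how many test results will follow (TAP plan line). */
static void tap_plan(int n)
{
	printf("1..%d\n", n);
}

/* Report one result; the run continues either way. */
static void tap_pass(const char *what)
{
	printf("ok %d %s\n", ++tap_count, what);
}

static void tap_fail(const char *what)
{
	printf("not ok %d %s\n", ++tap_count, what);
}
```

A harness consuming this output can tell exactly which of the planned tests failed, instead of only seeing a nonzero exit code from the first `exit(1)`.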
* [PATCH] futex: Fix futex_mm_init() build failure on older compilers, remove rcu_assign_pointer()
2025-04-16 16:29 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-05-08 20:32 ` [PATCH v12 14/21] " André Almeida
@ 2025-05-10 8:45 ` Ingo Molnar
2025-05-11 8:11 ` [tip: locking/futex] futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init() tip-bot2 for Ingo Molnar
2025-06-01 7:39 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Lai, Yi
3 siblings, 1 reply; 109+ messages in thread
From: Ingo Molnar @ 2025-05-10 8:45 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
* Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 1d3f7555825ec..40bc778b2bb45 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -85,7 +85,8 @@ void futex_hash_free(struct mm_struct *mm);
>
> static inline void futex_mm_init(struct mm_struct *mm)
> {
> - mm->futex_phash = NULL;
> + rcu_assign_pointer(mm->futex_phash, NULL);
> + mutex_init(&mm->futex_hash_lock);
> }
This breaks the build on older compilers - I tried gcc-9, x86-64
defconfig:
CC io_uring/futex.o
In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
from ./include/linux/compiler.h:390,
from ./include/linux/array_size.h:5,
from ./include/linux/kernel.h:16,
from io_uring/futex.c:2:
./include/linux/futex.h: In function 'futex_mm_init':
./include/linux/rcupdate.h:555:36: error: dereferencing pointer to incomplete type 'struct futex_private_hash'
555 | #define RCU_INITIALIZER(v) (typeof(*(v)) __force __rcu *)(v)
| ^~~~
./include/asm-generic/rwonce.h:55:33: note: in definition of macro '__WRITE_ONCE'
55 | *(volatile typeof(x) *)&(x) = (val); \
| ^~~
./arch/x86/include/asm/barrier.h:63:2: note: in expansion of macro 'WRITE_ONCE'
63 | WRITE_ONCE(*p, v); \
| ^~~~~~~~~~
./include/asm-generic/barrier.h:172:55: note: in expansion of macro '__smp_store_release'
172 | #define smp_store_release(p, v) do { kcsan_release(); __smp_store_release(p, v); } while (0)
| ^~~~~~~~~~~~~~~~~~~
./include/linux/rcupdate.h:596:3: note: in expansion of macro 'smp_store_release'
596 | smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \
| ^~~~~~~~~~~~~~~~~
./include/linux/rcupdate.h:596:25: note: in expansion of macro 'RCU_INITIALIZER'
596 | smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \
| ^~~~~~~~~~~~~~~
./include/linux/futex.h:91:2: note: in expansion of macro 'rcu_assign_pointer'
91 | rcu_assign_pointer(mm->futex_phash, NULL);
| ^~~~~~~~~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:203: io_uring/futex.o] Error 1
make[2]: *** [scripts/Makefile.build:461: io_uring] Error 2
make[1]: *** [/home/mingo/tip/Makefile:2004: .] Error 2
make: *** [Makefile:248: __sub-make] Error 2
The problem appears to be that this variant of rcu_assign_pointer()
wants to know the full type of 'struct futex_private_hash', which type
is local to futex.c:
kernel/futex/core.c:struct futex_private_hash {
So either we uninline futex_mm_init() and move it into futex/core.c, or
we share the structure definition with kernel/fork.c. Both have
disadvantages.
A third solution would be to just initialize mm->futex_phash with NULL,
like the patch below; it's not like this new MM's ->futex_phash can be
observed externally until the task is inserted into the task list -
which guarantees full store ordering.
This relaxation of this initialization might also give a tiny speedup
on certain platforms.
But an Ack from PeterZ on that assumption would be nice.
Thanks,
Ingo
=====================================>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
include/linux/futex.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index eccc99751bd9..168ffd5996b4 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -88,7 +88,14 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
- rcu_assign_pointer(mm->futex_phash, NULL);
+ /*
+ * No need for rcu_assign_pointer() here, as we can rely on
+ * tasklist_lock write-ordering in copy_process(), before
+ * the task's MM becomes visible and the ->futex_phash
+ * becomes externally observable:
+ */
+ mm->futex_phash = NULL;
+
mutex_init(&mm->futex_hash_lock);
}
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/futex] futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()
2025-05-10 8:45 ` [PATCH] futex: Fix futex_mm_init() build failure on older compilers, remove rcu_assign_pointer() Ingo Molnar
@ 2025-05-11 8:11 ` tip-bot2 for Ingo Molnar
0 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Ingo Molnar @ 2025-05-11 8:11 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ingo Molnar, andrealmeid, Darren Hart, Davidlohr Bueso,
Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
Valentin Schneider, Waiman Long, x86, linux-kernel
The following commit has been merged into the locking/futex branch of tip:
Commit-ID: 094ac8cff7858bee5fa4554f6ea66c964f8e160e
Gitweb: https://git.kernel.org/tip/094ac8cff7858bee5fa4554f6ea66c964f8e160e
Author: Ingo Molnar <mingo@kernel.org>
AuthorDate: Sat, 10 May 2025 10:45:28 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 11 May 2025 10:02:12 +02:00
futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()
The following commit added an rcu_assign_pointer() assignment to
futex_mm_init() in <linux/futex.h>:
bd54df5ea7ca ("futex: Allow to resize the private local hash")
Which breaks the build on older compilers (gcc-9, x86-64 defconfig):
CC io_uring/futex.o
In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
from ./include/linux/compiler.h:390,
from ./include/linux/array_size.h:5,
from ./include/linux/kernel.h:16,
from io_uring/futex.c:2:
./include/linux/futex.h: In function 'futex_mm_init':
./include/linux/rcupdate.h:555:36: error: dereferencing pointer to incomplete type 'struct futex_private_hash'
The problem is that this variant of rcu_assign_pointer() wants to
know the full type of 'struct futex_private_hash', which type
is local to futex.c:
kernel/futex/core.c:struct futex_private_hash {
There are a couple of mechanical solutions for this bug:
- we can uninline futex_mm_init() and move it into futex/core.c
- or we can share the structure definition with kernel/fork.c.
But both of these solutions have disadvantages: the first one adds
runtime overhead, while the second one dis-encapsulates private
futex types.
A third solution, implemented by this patch, is to just initialize
mm->futex_phash with NULL, like the patch below; it's not like this
new MM's ->futex_phash can be observed externally until the task
is inserted into the task list, which guarantees full store ordering.
The relaxation of this initialization might also give a tiny speedup
on certain platforms.
Fixes: bd54df5ea7ca ("futex: Allow to resize the private local hash")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/aB8SI00EHBri23lB@gmail.com
---
include/linux/futex.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index eccc997..168ffd5 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -88,7 +88,14 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
- rcu_assign_pointer(mm->futex_phash, NULL);
+ /*
+ * No need for rcu_assign_pointer() here, as we can rely on
+ * tasklist_lock write-ordering in copy_process(), before
+ * the task's MM becomes visible and the ->futex_phash
+ * becomes externally observable:
+ */
+ mm->futex_phash = NULL;
+
mutex_init(&mm->futex_hash_lock);
}
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-05-09 21:22 ` [PATCH v12 20/21] " André Almeida
@ 2025-05-16 7:38 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-16 7:38 UTC (permalink / raw)
To: André Almeida
Cc: linux-kernel, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-05-09 18:22:18 [-0300], André Almeida wrote:
> Hi Sebastian,
Hi,
> Thank you for adding a selftest for the new uAPI. The recent futex selftests
> accepted uses the kselftest helpers in a different way than the way you have
> used. I've attached a diff to exemplify how I would write this selftest. The
> advantage is to have a TAP output that can be easier used with automated
> testing, and that would not stop when the first test fails.
I copied it from one of the existing tests, I think. Let me try to adapt
it here…
Sebastian
* Re: [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-05-08 20:32 ` [PATCH v12 14/21] " André Almeida
@ 2025-05-16 10:49 ` Sebastian Andrzej Siewior
2025-05-16 13:00 ` André Almeida
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-16 10:49 UTC (permalink / raw)
To: André Almeida
Cc: linux-kernel, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-05-08 17:32:24 [-0300], André Almeida wrote:
> > @@ -1290,26 +1466,102 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> > for (i = 0; i < hash_slots; i++)
> > futex_hash_bucket_init(&fph->queues[i], fph);
> > - mm->futex_phash = fph;
>
> If (hash_slots == 0), do we still need to do all of this work below? I
> thought that using the global hash would allow us to skip this.
Not sure what you mean by below. We need to create a smaller struct
futex_private_hash and initialize it. We also need to move all current
futex waiters, which might be on the private hash that is going away,
over to the global hash. So yes, all this is needed.
> > + if (custom) {
> > + /*
> > + * Only let prctl() wait / retry; don't unduly delay clone().
> > + */
> > +again:
> > + wait_var_event(mm, futex_pivot_pending(mm));
> > + }
> > +
> > + scoped_guard(mutex, &mm->futex_hash_lock) {
> > + struct futex_private_hash *free __free(kvfree) = NULL;
> > + struct futex_private_hash *cur, *new;
> > +
> > + cur = rcu_dereference_protected(mm->futex_phash,
> > + lockdep_is_held(&mm->futex_hash_lock));
> > + new = mm->futex_phash_new;
> > + mm->futex_phash_new = NULL;
> > +
> > + if (fph) {
> > + if (cur && !new) {
> > + /*
> > + * If we have an existing hash, but do not yet have
> > + * allocated a replacement hash, drop the initial
> > + * reference on the existing hash.
> > + */
> > + futex_private_hash_put(cur);
> > + }
> > +
> > + if (new) {
> > + /*
> > + * Two updates raced; throw out the lesser one.
> > + */
> > + if (futex_hash_less(new, fph)) {
> > + free = new;
> > + new = fph;
> > + } else {
> > + free = fph;
> > + }
> > + } else {
> > + new = fph;
> > + }
> > + fph = NULL;
> > + }
> > +
> > + if (new) {
> > + /*
> > + * Will set mm->futex_phash_new on failure;
> > + * futex_private_hash_get() will try again.
> > + */
> > + if (!__futex_pivot_hash(mm, new) && custom)
> > + goto again;
>
> Is it safe to use a goto inside a scoped_guard(){}?
We jump outside of the scoped_guard(), and while testing I looked at the
generated assembly; gcc did the right thing. So I would say: why not? The
alternative would be manual lock/unlock, with an explicit unlock just
before the goto statement, so that the code reads "easier".
Sebastian
* Re: [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock()
2025-05-08 19:06 ` [PATCH v12 10/21] " André Almeida
@ 2025-05-16 12:18 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-16 12:18 UTC (permalink / raw)
To: André Almeida
Cc: Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli,
Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long,
linux-kernel
On 2025-05-08 16:06:28 [-0300], André Almeida wrote:
> > +/**
> > + * futex_hash_get - Get an additional reference for the local hash.
> > + * @hb: ptr to the private local hash.
> > + *
> > + * Obtain an additional reference for the already obtained hash bucket. The
> > + * caller must already own a reference.
> > + */
>
> This comment should come with patch 6 (that creates the function) or patch
> 14 (that implements the function).
It is too late for that.
> > --- a/kernel/futex/futex.h
> > +++ b/kernel/futex/futex.h
> > @@ -183,6 +183,7 @@ struct futex_q {
> > union futex_key *requeue_pi_key;
> > u32 bitset;
> > atomic_t requeue_state;
> > + bool drop_hb_ref;
>
> This new member needs a comment:
>
> * @drop_hb_ref: True if an extra reference was acquired by a pi operation,
> and needs an extra put()
This is done as of today.
Sebastian
* Re: [PATCH v12 05/21] futex: Create hb scopes
2025-05-06 23:45 ` André Almeida
@ 2025-05-16 12:20 ` Sebastian Andrzej Siewior
2025-05-16 13:23 ` Peter Zijlstra
1 sibling, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-16 12:20 UTC (permalink / raw)
To: André Almeida
Cc: linux-kernel, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-05-06 20:45:06 [-0300], André Almeida wrote:
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -957,50 +956,54 @@ static void exit_pi_state_list(struct task_struct *curr)
> > next = head->next;
> > pi_state = list_entry(next, struct futex_pi_state, list);
> > key = pi_state->key;
> > - hb = futex_hash(&key);
> > + if (1) {
>
> Couldn't those explicit scopes be achieved without the if (1), just {}?
I don't see why not. I guess it looks nicer that way.
> > + struct futex_hash_bucket *hb;
>
Sebastian
* Re: [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup()
2025-05-05 21:43 ` André Almeida
@ 2025-05-16 12:53 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-16 12:53 UTC (permalink / raw)
To: André Almeida
Cc: Darren Hart, linux-kernel, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-05-05 18:43:16 [-0300], André Almeida wrote:
> > --- a/kernel/futex/waitwake.c
> > +++ b/kernel/futex/waitwake.c
> > @@ -339,18 +339,8 @@ static long futex_wait_restart(struct restart_block *restart);
> > * @q: the futex_q to queue up on
> > * @timeout: the prepared hrtimer_sleeper, or null for no timeout
> > */
> > -void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
> > - struct hrtimer_sleeper *timeout)
> > +void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout)
>
> Update the name in the kernel doc comment as well. Also drop from the
> comment the part that says "futex_queue() and ..."
This has been done.
…
> > @@ -636,10 +629,25 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> > }
> > if (uval != val) {
> > - futex_q_unlock(*hb);
> > - ret = -EWOULDBLOCK;
> > + futex_q_unlock(hb);
> > + return -EWOULDBLOCK;
> > }
> > + if (key2 && futex_match(&q->key, key2)) {
> > + futex_q_unlock(hb);
> > + return -EINVAL;
>
> Please add this new ret value in the kernel doc too.
I'm going to add this:
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -585,7 +585,8 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
*
* Return:
* - 0 - uaddr contains val and hb has been locked;
- * - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
+ * - <0 - On error and the hb is unlocked. A possible reason: the uaddr can not
+ * be read, does not contain the expected value or is not properly aligned.
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, union futex_key *key2,
Sebastian
* Re: [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-05-16 10:49 ` Sebastian Andrzej Siewior
@ 2025-05-16 13:00 ` André Almeida
0 siblings, 0 replies; 109+ messages in thread
From: André Almeida @ 2025-05-16 13:00 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
Em 16/05/2025 07:49, Sebastian Andrzej Siewior escreveu:
> On 2025-05-08 17:32:24 [-0300], André Almeida wrote:
>>> + if (!__futex_pivot_hash(mm, new) && custom)
>>> + goto again;
>>
>> Is it safe to use a goto inside a scoped_guard(){}?
>
> We jump outside of the scoped_guard() and while testing I've been
> looking at the assembly and gcc did the right thing. So I would say why
> not. The alternative would be to do manual lock/unlock and think about
> the unlock just before the goto statement so this looks "easier".
>
Ok, thanks for confirming it! I wasn't sure about the goto, but now it's
clear to me.
* Re: [PATCH v12 05/21] futex: Create hb scopes
2025-05-06 23:45 ` André Almeida
2025-05-16 12:20 ` Sebastian Andrzej Siewior
@ 2025-05-16 13:23 ` Peter Zijlstra
1 sibling, 0 replies; 109+ messages in thread
From: Peter Zijlstra @ 2025-05-16 13:23 UTC (permalink / raw)
To: André Almeida
Cc: Sebastian Andrzej Siewior, linux-kernel, Darren Hart,
Davidlohr Bueso, Ingo Molnar, Juri Lelli, Thomas Gleixner,
Valentin Schneider, Waiman Long
On Tue, May 06, 2025 at 08:45:06PM -0300, André Almeida wrote:
> Em 16/04/2025 13:29, Sebastian Andrzej Siewior escreveu:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > Create explicit scopes for hb variables; almost pure re-indent.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > ---
> > kernel/futex/core.c | 81 ++++----
> > kernel/futex/pi.c | 282 +++++++++++++-------------
> > kernel/futex/requeue.c | 433 ++++++++++++++++++++--------------------
> > kernel/futex/waitwake.c | 193 +++++++++---------
> > 4 files changed, 504 insertions(+), 485 deletions(-)
> >
> > diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> > index 7adc914878933..e4cb5ce9785b1 100644
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -944,7 +944,6 @@ static void exit_pi_state_list(struct task_struct *curr)
> > {
> > struct list_head *next, *head = &curr->pi_state_list;
> > struct futex_pi_state *pi_state;
> > - struct futex_hash_bucket *hb;
> > union futex_key key = FUTEX_KEY_INIT;
> > /*
> > @@ -957,50 +956,54 @@ static void exit_pi_state_list(struct task_struct *curr)
> > next = head->next;
> > pi_state = list_entry(next, struct futex_pi_state, list);
> > key = pi_state->key;
> > - hb = futex_hash(&key);
> > + if (1) {
>
> Couldn't those explicit scopes be achieved without the if (1), just {}?
Yes, this is possible. I have some experience with people getting
confused with that style though. But yeah, whatever :-)
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-04-16 16:29 ` [PATCH v12 20/21] selftests/futex: Add futex_priv_hash Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-05-09 21:22 ` [PATCH v12 20/21] " André Almeida
@ 2025-05-27 11:28 ` Mark Brown
2025-05-27 12:23 ` Sebastian Andrzej Siewior
2 siblings, 1 reply; 109+ messages in thread
From: Mark Brown @ 2025-05-27 11:28 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
On Wed, Apr 16, 2025 at 06:29:20PM +0200, Sebastian Andrzej Siewior wrote:
> Test the basic functionality of the private hash:
> - Upon start, with no threads there is no private hash.
> - The first thread initializes the private hash.
> - More than four threads will increase the size of the private hash if
> the system has more than 16 CPUs online.
This newly added test is not running successfully on arm64, it looks
like it's just a straightforward integration issue:
# Usage: futex_priv_hash
# -c Use color
# -g Test global hash instead intead local immutable
# -h Display this help message
# -v L Verbosity level: 0=QUIET 1=CRITICAL 2=INFO
# Usage: futex_priv_hash
# -c Use color
# -g Test global hash instead intead local immutable
# -h Display this help message
# -v L Verbosity level: 0=QUIET 1=CRITICAL 2=INFO
not ok 1 selftests: futex: run.sh # exit=1
Full log:
https://lava.sirena.org.uk/scheduler/job/1414260#L9910
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-05-27 11:28 ` Mark Brown
@ 2025-05-27 12:23 ` Sebastian Andrzej Siewior
2025-05-27 12:35 ` Mark Brown
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-27 12:23 UTC (permalink / raw)
To: Mark Brown
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
On 2025-05-27 12:28:00 [+0100], Mark Brown wrote:
> On Wed, Apr 16, 2025 at 06:29:20PM +0200, Sebastian Andrzej Siewior wrote:
> > Test the basic functionality of the private hash:
> > - Upon start, with no threads there is no private hash.
> > - The first thread initializes the private hash.
> > - More than four threads will increase the size of the private hash if
> > the system has more than 16 CPUs online.
>
> This newly added test is not running successfully on arm64, it looks
> like it's just a straightforward integration issue:
>
> # Usage: futex_priv_hash
> # -c Use color
> # -g Test global hash instead intead local immutable
> # -h Display this help message
> # -v L Verbosity level: 0=QUIET 1=CRITICAL 2=INFO
> # Usage: futex_priv_hash
> # -c Use color
> # -g Test global hash instead intead local immutable
> # -h Display this help message
> # -v L Verbosity level: 0=QUIET 1=CRITICAL 2=INFO
> not ok 1 selftests: futex: run.sh # exit=1
That is odd. If I run ./run.sh it passes. I tried it both with COLOR=-c
forced and without it; that is the only option that gets passed. That was
on x86, however, though I doubt arm64 is doing anything special here.
A bit puzzled here.
> Full log:
>
> https://lava.sirena.org.uk/scheduler/job/1414260#L9910
Sebastian
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-05-27 12:23 ` Sebastian Andrzej Siewior
@ 2025-05-27 12:35 ` Mark Brown
2025-05-27 12:43 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Mark Brown @ 2025-05-27 12:35 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
On Tue, May 27, 2025 at 02:23:32PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-05-27 12:28:00 [+0100], Mark Brown wrote:
> > This newly added test is not running successfully on arm64, it looks
> > like it's just a straightforward integration issue:
> > # Usage: futex_priv_hash
> > # -c Use color
> > # -g Test global hash instead intead local immutable
> > # -h Display this help message
> > # -v L Verbosity level: 0=QUIET 1=CRITICAL 2=INFO
> That is odd. If I run ./run.sh then it passes. I tried it with forcing
> COLOR=-c and without it. This is the only option that is passed. That is on
> x86 however but I doubt arm64 is doing anything special here.
> A bit puzzled here.
Yeah, I was a bit confused as well. This is running with an installed
copy of the selftests, and IIRC the build is out of tree, so it's
possible something is different with that path compared to what you're
doing? That's a common source of problems.
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-05-27 12:35 ` Mark Brown
@ 2025-05-27 12:43 ` Sebastian Andrzej Siewior
2025-05-27 12:59 ` Mark Brown
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-27 12:43 UTC (permalink / raw)
To: Mark Brown
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
On 2025-05-27 13:35:50 [+0100], Mark Brown wrote:
> On Tue, May 27, 2025 at 02:23:32PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-05-27 12:28:00 [+0100], Mark Brown wrote:
>
> > > This newly added test is not running successfully on arm64, it looks
> > > like it's just a straightforward integration issue:
>
> > > # Usage: futex_priv_hash
> > > # -c Use color
> > > # -g Test global hash instead intead local immutable
> > > # -h Display this help message
> > > # -v L Verbosity level: 0=QUIET 1=CRITICAL 2=INFO
>
> > That is odd. If I run ./run.sh then it passes. I tried it with forcing
> > COLOR=-c and without it. This is the only option that is passed. That is on
> > x86 however but I doubt arm64 is doing anything special here.
>
> > A bit puzzled here.
>
> Yeah, I was a bit confused as well. This is running with an installed
> copy of the selftests and IIRC the build is out of tree so it's possible
> something is different with that path compared to what you're doing?
> It's a common source of problems?
It shouldn't be. The test is self-contained, and as long as the run.sh
script invokes it, all is good. I don't see what futex_priv_hash might be
doing differently compared to the previous tests.
I just noticed that after the two futex_priv_hash invocations there
should be one futex_numa_mpol invocation. That one is missing in the
output. Any idea why?
Sebastian
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-05-27 12:43 ` Sebastian Andrzej Siewior
@ 2025-05-27 12:59 ` Mark Brown
2025-05-27 13:25 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Mark Brown @ 2025-05-27 12:59 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
On Tue, May 27, 2025 at 02:43:27PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-05-27 13:35:50 [+0100], Mark Brown wrote:
> > Yeah, I was a bit confused as well. This is running with an installed
> > copy of the selftests and IIRC the build is out of tree so it's possible
> > something is different with that path compared to what you're doing?
> > It's a common source of problems?
> It shouldn't be. The test is self-contained and as long as the run.sh
> script invokes all is good. I don't see what futex_priv_hash might be
> is doing different compared to the previous test.
> I just noticed that after the two futex_priv_hash invocations there
> should be one futex_numa_mpol invocation. That one is missing in the
> output. Any idea why?
I'm not seeing that test being built or in the binary:
https://builds.sirena.org.uk/cda95faef7bcf26ba3f54c3cddce66d50116d146/arm64/defconfig/build.log
https://builds.sirena.org.uk/cda95faef7bcf26ba3f54c3cddce66d50116d146/arm64/defconfig/kselftest.tar.xz
(note that this is the specific commit that I'm replying to the patch
for, not -next.) It looks like something is getting misbuilt, or there's
some logic bug in the argument parsing: if I run the binary with -h it
exits with return code 0 rather than 1.
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-05-27 12:59 ` Mark Brown
@ 2025-05-27 13:25 ` Sebastian Andrzej Siewior
2025-05-27 13:40 ` Mark Brown
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-27 13:25 UTC (permalink / raw)
To: Mark Brown
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
On 2025-05-27 13:59:38 [+0100], Mark Brown wrote:
> I'm not seeing that test being built or in the binary:
>
> https://builds.sirena.org.uk/cda95faef7bcf26ba3f54c3cddce66d50116d146/arm64/defconfig/build.log
> https://builds.sirena.org.uk/cda95faef7bcf26ba3f54c3cddce66d50116d146/arm64/defconfig/kselftest.tar.xz
>
> (note that this is the specific commit that I'm replying to the patch
Ach, okay. I assumed you had today's master branch. The whole KTAP /
machine-readable output was added later.
> for, not -next.) It looks like it's something's getting mistbuilt or
> there's some logic bug with the argument parsing, if I run the binary
> with -h it exits with return code 0 rather than 1.
I copied the logic from the other tests in that folder. If you pass -h (a
valid argument) it exits with 0. If you pass an invalid argument it
exits with 1.
But now that I run the binary myself, it fails the same way. This
cures it:
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index 2dca18fefedcd..24a92dc94eb86 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -130,7 +130,7 @@ int main(int argc, char *argv[])
pthread_mutexattr_t mutex_attr_pi;
int use_global_hash = 0;
int ret;
- char c;
+ int c;
while ((c = getopt(argc, argv, "cghv:")) != -1) {
switch (c) {
Sebastian
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-05-27 13:25 ` Sebastian Andrzej Siewior
@ 2025-05-27 13:40 ` Mark Brown
2025-05-27 13:45 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Mark Brown @ 2025-05-27 13:40 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
On Tue, May 27, 2025 at 03:25:33PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-05-27 13:59:38 [+0100], Mark Brown wrote:
> > https://builds.sirena.org.uk/cda95faef7bcf26ba3f54c3cddce66d50116d146/arm64/defconfig/build.log
> > https://builds.sirena.org.uk/cda95faef7bcf26ba3f54c3cddce66d50116d146/arm64/defconfig/kselftest.tar.xz
> > (note that this is the specific commit that I'm replying to the patch
> Ach, okay. I assumed you had the master branch as of today. The whole
> KTAP/ machine readable output was added later.
> > for, not -next.) It looks like it's something's getting mistbuilt or
> > there's some logic bug with the argument parsing, if I run the binary
> > with -h it exits with return code 0 rather than 1.
> I copied the logic from the other tests in that folder. If you set -h (a
> valid argument) then it exits with 0. If you an invalid argument it
> exits with 1.
Yeah, so it was actually parsing arguments.
> But now that I start the binary myself, it ends the same way. This
> cures it:
> int ret;
> - char c;
> + int c;
>
> while ((c = getopt(argc, argv, "cghv:")) != -1) {
Ah, yes - that'd do it. Looking at the other tests there, they do have c
as an int.
* Re: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
2025-05-27 13:40 ` Mark Brown
@ 2025-05-27 13:45 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-27 13:45 UTC (permalink / raw)
To: Mark Brown
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long
On 2025-05-27 14:40:22 [+0100], Mark Brown wrote:
> > int ret;
> > - char c;
> > + int c;
> >
> > while ((c = getopt(argc, argv, "cghv:")) != -1) {
>
> Ah, yes - that'd do it. Looking at the other tests there they do have c
> as int.
And in the mpol. I'm going to send a patch later…
Thank you.
Sebastian
* Re: [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-04-16 16:29 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2025-05-10 8:45 ` [PATCH] futex: Fix futex_mm_init() build failure on older compilers, remove rcu_assign_pointer() Ingo Molnar
@ 2025-06-01 7:39 ` Lai, Yi
2025-06-02 11:00 ` Sebastian Andrzej Siewior
3 siblings, 1 reply; 109+ messages in thread
From: Lai, Yi @ 2025-06-01 7:39 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long, yi1.lai
On Wed, Apr 16, 2025 at 06:29:14PM +0200, Sebastian Andrzej Siewior wrote:
> The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
> replacement. The futex_hash_allocate()/ PR_FUTEX_HASH_SET_SLOTS
> operation can now be invoked at runtime and resize an already existing
> internal private futex_hash_bucket to another size.
>
> The reallocation is based on an idea by Thomas Gleixner: The initial
> allocation of struct futex_private_hash sets the reference count
> to one. Every user acquires a reference on the local hash before using
> it and drops it after it has enqueued itself on the hash bucket. No
> reference is held while the task is scheduled out waiting for the
> wake up.
> The resize process allocates a new struct futex_private_hash and drops
> the initial reference. Synchronized with mm_struct::futex_hash_lock it
> is checked if the reference counter for the currently used
> mm_struct::futex_phash is marked as DEAD. If so, then all users enqueued
> on the current private hash are requeued on the new private hash and the
> new private hash is set to mm_struct::futex_phash. Otherwise the newly
> allocated private hash is saved as mm_struct::futex_phash_new and the
> rehashing and reassigning is delayed to the futex_hash() caller once the
> reference counter is marked DEAD.
> The replacement is not performed at rcuref_put() time because certain
> callers, such as futex_wait_queue(), drop their reference after changing
> the task state. Acquiring the futex_hash_lock at that point would
> destroy the task state that was just set.
>
> The user can change the number of slots with PR_FUTEX_HASH_SET_SLOTS
> multiple times. Both an increase and a decrease are allowed, and the
> request blocks until the assignment is done.
>
> The private hash allocated at thread creation is changed from 16 to
> 16 <= 4 * number_of_threads <= global_hash_size
> where number_of_threads cannot exceed the number of online CPUs. Should
> the user issue PR_FUTEX_HASH_SET_SLOTS, the auto scaling is disabled.
>
> [peterz: reorganize the code to avoid state tracking and simplify new
> object handling, block the user until changes are in effect, allow
> increase and decrease of the hash].
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> include/linux/futex.h | 3 +-
> include/linux/mm_types.h | 4 +-
> kernel/futex/core.c | 290 ++++++++++++++++++++++++++++++++++++---
> kernel/futex/requeue.c | 5 +
> 4 files changed, 281 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 1d3f7555825ec..40bc778b2bb45 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -85,7 +85,8 @@ void futex_hash_free(struct mm_struct *mm);
>
> static inline void futex_mm_init(struct mm_struct *mm)
> {
> - mm->futex_phash = NULL;
> + rcu_assign_pointer(mm->futex_phash, NULL);
> + mutex_init(&mm->futex_hash_lock);
> }
>
> #else /* !CONFIG_FUTEX_PRIVATE_HASH */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index a4b5661e41770..32ba5126e2214 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1033,7 +1033,9 @@ struct mm_struct {
> seqcount_t mm_lock_seq;
> #endif
> #ifdef CONFIG_FUTEX_PRIVATE_HASH
> - struct futex_private_hash *futex_phash;
> + struct mutex futex_hash_lock;
> + struct futex_private_hash __rcu *futex_phash;
> + struct futex_private_hash *futex_phash_new;
> #endif
>
> unsigned long hiwater_rss; /* High-watermark of RSS usage */
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index 53b3a00a92539..9e7dad52abea8 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -40,6 +40,7 @@
> #include <linux/fault-inject.h>
> #include <linux/slab.h>
> #include <linux/prctl.h>
> +#include <linux/rcuref.h>
>
> #include "futex.h"
> #include "../locking/rtmutex_common.h"
> @@ -57,7 +58,9 @@ static struct {
> #define futex_hashmask (__futex_data.hashmask)
>
> struct futex_private_hash {
> + rcuref_t users;
> unsigned int hash_mask;
> + struct rcu_head rcu;
> void *mm;
> bool custom;
> struct futex_hash_bucket queues[];
> @@ -129,11 +132,14 @@ static inline bool futex_key_is_private(union futex_key *key)
>
> bool futex_private_hash_get(struct futex_private_hash *fph)
> {
> - return false;
> + return rcuref_get(&fph->users);
> }
>
> void futex_private_hash_put(struct futex_private_hash *fph)
> {
> + /* Ignore return value, last put is verified via rcuref_is_dead() */
> + if (rcuref_put(&fph->users))
> + wake_up_var(fph->mm);
> }
>
> /**
> @@ -143,8 +149,23 @@ void futex_private_hash_put(struct futex_private_hash *fph)
> * Obtain an additional reference for the already obtained hash bucket. The
> * caller must already own a reference.
> */
> -void futex_hash_get(struct futex_hash_bucket *hb) { }
> -void futex_hash_put(struct futex_hash_bucket *hb) { }
> +void futex_hash_get(struct futex_hash_bucket *hb)
> +{
> + struct futex_private_hash *fph = hb->priv;
> +
> + if (!fph)
> + return;
> + WARN_ON_ONCE(!futex_private_hash_get(fph));
> +}
> +
> +void futex_hash_put(struct futex_hash_bucket *hb)
> +{
> + struct futex_private_hash *fph = hb->priv;
> +
> + if (!fph)
> + return;
> + futex_private_hash_put(fph);
> +}
>
> static struct futex_hash_bucket *
> __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
> @@ -155,7 +176,7 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
> return NULL;
>
> if (!fph)
> - fph = key->private.mm->futex_phash;
> + fph = rcu_dereference(key->private.mm->futex_phash);
> if (!fph || !fph->hash_mask)
> return NULL;
>
> @@ -165,21 +186,119 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
> return &fph->queues[hash & fph->hash_mask];
> }
>
> +static void futex_rehash_private(struct futex_private_hash *old,
> + struct futex_private_hash *new)
> +{
> + struct futex_hash_bucket *hb_old, *hb_new;
> + unsigned int slots = old->hash_mask + 1;
> + unsigned int i;
> +
> + for (i = 0; i < slots; i++) {
> + struct futex_q *this, *tmp;
> +
> + hb_old = &old->queues[i];
> +
> + spin_lock(&hb_old->lock);
> + plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
> +
> + plist_del(&this->list, &hb_old->chain);
> + futex_hb_waiters_dec(hb_old);
> +
> + WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
> +
> + hb_new = __futex_hash(&this->key, new);
> + futex_hb_waiters_inc(hb_new);
> + /*
> + * The new pointer isn't published yet but an already
> + * moved user can be unqueued due to timeout or signal.
> + */
> + spin_lock_nested(&hb_new->lock, SINGLE_DEPTH_NESTING);
> + plist_add(&this->list, &hb_new->chain);
> + this->lock_ptr = &hb_new->lock;
> + spin_unlock(&hb_new->lock);
> + }
> + spin_unlock(&hb_old->lock);
> + }
> +}
> +
> +static bool __futex_pivot_hash(struct mm_struct *mm,
> + struct futex_private_hash *new)
> +{
> + struct futex_private_hash *fph;
> +
> + WARN_ON_ONCE(mm->futex_phash_new);
> +
> + fph = rcu_dereference_protected(mm->futex_phash,
> + lockdep_is_held(&mm->futex_hash_lock));
> + if (fph) {
> + if (!rcuref_is_dead(&fph->users)) {
> + mm->futex_phash_new = new;
> + return false;
> + }
> +
> + futex_rehash_private(fph, new);
> + }
> + rcu_assign_pointer(mm->futex_phash, new);
> + kvfree_rcu(fph, rcu);
> + return true;
> +}
> +
Hi Sebastian Andrzej Siewior,
Greetings!
I used Syzkaller and found a KASAN: null-ptr-deref read in __futex_pivot_hash in linux-next next-20250527.
After bisection, the first bad commit is:
"
bd54df5ea7ca futex: Allow to resize the private local hash
"
All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250531_004606___futex_pivot_hash
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250531_004606___futex_pivot_hash/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250531_004606___futex_pivot_hash/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250531_004606___futex_pivot_hash/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250531_004606___futex_pivot_hash/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250531_004606___futex_pivot_hash/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250531_004606___futex_pivot_hash/bzImage_fefff2755f2aa4125dce2a1edfe7e545c7c621f2
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250531_004606___futex_pivot_hash/bzImage_fefff2755f2aa4125dce2a1edfe7e545c7c621f2
"
[ 266.064649] Adding 124996k swap on ./swap-file. Priority:0 extents:1 across:124996k
[ 266.075472] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#11] SMP I
[ 266.075983] KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
[ 266.076337] CPU: 0 UID: 0 PID: 1168 Comm: repro Tainted: G B D 6.15.0-next-20250527-fefff2755f2a #1
[ 266.076882] Tainted: [B]=BAD_PAGE, [D]=DIE
[ 266.077073] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.o4
[ 266.077594] RIP: 0010:plist_del+0xf3/0x2d0
[ 266.077803] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 a6 01 00 00 49 8d 7f 08 4c 8b 73 10 48 b8 00 00 00 00 00 0
[ 266.078640] RSP: 0018:ffff8880159dfc40 EFLAGS: 00010202
[ 266.078886] RAX: dffffc0000000000 RBX: ffff88800f2397e8 RCX: ffffffff85ca6b25
[ 266.079327] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 0000000000000008
[ 266.079658] RBP: ffff8880159dfc70 R08: 0000000000000001 R09: ffffed1002b3bf7d
[ 266.079989] R10: 0000000000000003 R11: 000000000000000c R12: ffff88800f239800
[ 266.080311] R13: ffff88800f2397f0 R14: 0000000000000000 R15: 0000000000000000
[ 266.080635] FS: 00007f8c127ff640(0000) GS:ffff8880e355f000(0000) knlGS:0000000000000000
[ 266.080998] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 266.081260] CR2: 00007f8c127fee38 CR3: 00000000149da003 CR4: 0000000000770ef0
[ 266.081594] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 266.081919] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 266.082248] PKRU: 55555554
[ 266.082377] Call Trace:
[ 266.082496] <TASK>
[ 266.082605] __futex_pivot_hash+0x2b0/0x520
[ 266.082815] futex_hash_allocate+0xb26/0x10b0
[ 266.083028] ? __pfx_futex_hash_allocate+0x10/0x10
[ 266.083261] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 266.083508] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 266.083756] ? static_key_count+0x69/0x80
[ 266.083948] futex_hash_prctl+0x20c/0x650
[ 266.084146] __do_sys_prctl+0x1a0d/0x2170
[ 266.084347] ? __pfx___do_sys_prctl+0x10/0x10
[ 266.084563] __x64_sys_prctl+0xc6/0x150
[ 266.084742] ? syscall_trace_enter+0x14d/0x280
[ 266.084956] x64_sys_call+0x1a25/0x2150
[ 266.085144] do_syscall_64+0x6d/0x2e0
[ 266.085324] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 266.085558] RIP: 0033:0x7f8c1283ee5d
[ 266.085731] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 8
[ 266.086550] RSP: 002b:00007f8c127fed48 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
[ 266.086895] RAX: ffffffffffffffda RBX: 00007f8c127ff640 RCX: 00007f8c1283ee5d
[ 266.087219] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000004e
[ 266.087546] RBP: 00007f8c127fed60 R08: 0000000000000000 R09: 0000000000000000
[ 266.087869] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8c127ff640
[ 266.088191] R13: 0000000000000013 R14: 00007f8c1289f560 R15: 0000000000000000
[ 266.088521] </TASK>
[ 266.088631] Modules linked in:
[ 266.088810] ---[ end trace 0000000000000000 ]---
[ 266.089030] RIP: 0010:__futex_pivot_hash+0x271/0x520
[ 266.089265] Code: e8 84 a5 58 04 48 8b 45 d0 48 c1 e8 03 42 80 3c 28 00 0f 85 5e 02 00 00 48 8b 45 d0 4c 8b 30 4c 0
[ 266.090087] RSP: 0018:ffff88801b43fc80 EFLAGS: 00010206
[ 266.090332] RAX: 0007c018e000003c RBX: 003e00c7000001c9 RCX: ffffffff81799536
[ 266.090660] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8880227e8888
[ 266.090983] RBP: ffff88801b43fcf8 R08: 0000000000000001 R09: ffffed1003687f7d
[ 266.091309] R10: 0000000000000003 R11: 6e696c6261736944 R12: ffff888014430d68
[ 266.091634] R13: dffffc0000000000 R14: 003e00c7000001e1 R15: ffff888014430a80
[ 266.091950] FS: 00007f8c127ff640(0000) GS:ffff8880e355f000(0000) knlGS:0000000000000000
[ 266.092319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 266.092582] CR2: 00007f8c127fee38 CR3: 00000000149da003 CR4: 0000000000770ef0
[ 266.092915] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 266.093243] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 266.093608] PKRU: 55555554
[ 266.093738] note: repro[1168] exited with preempt_count 1
"
I also tried the latest linux-next tag next-20250530. This issue can be reproduced. Here is the log:
"
[ 50.554828] Adding 124996k swap on ./swap-file. Priority:0 extents:1 across:124996k
[ 50.563846] Oops: general protection fault, probably for non-canonical address 0xe028fc18c0000065: 0000 [#4] SMP KI
[ 50.564384] KASAN: maybe wild-memory-access in range [0x014800c600000328-0x014800c60000032f]
[ 50.564774] CPU: 1 UID: 0 PID: 813 Comm: repro Tainted: G B D 6.15.0-next-20250530-kvm #3 PREEMPT(v
[ 50.565314] Tainted: [B]=BAD_PAGE, [D]=DIE
[ 50.565514] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.o4
[ 50.566028] RIP: 0010:__futex_pivot_hash+0x204/0x530
[ 50.566278] Code: e8 f1 e6 5b 04 48 8b 45 d0 48 c1 e8 03 42 80 3c 28 00 0f 85 d1 02 00 00 48 8b 45 d0 4c 8b 30 4c 0
[ 50.567119] RSP: 0018:ffff88801241fc80 EFLAGS: 00010206
[ 50.567372] RAX: 00290018c0000065 RBX: 014800c600000310 RCX: ffffffff8179ecdc
[ 50.567706] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88801d5d1708
[ 50.568036] RBP: ffff88801241fcf8 R08: 0000000000000001 R09: ffffed1002483f7d
[ 50.568364] R10: 0000000000000003 R11: 00000000bd9dfb48 R12: ffff88801429bf00
[ 50.568699] R13: dffffc0000000000 R14: 014800c600000328 R15: 0000000000000001
[ 50.569035] FS: 00007f183fe43640(0000) GS:ffff8880e3652000(0000) knlGS:0000000000000000
[ 50.569415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.569691] CR2: 00007f183fe42e38 CR3: 000000001115c005 CR4: 0000000000770ef0
[ 50.570026] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.570349] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 50.570684] PKRU: 55555554
[ 50.570820] Call Trace:
[ 50.570946] <TASK>
[ 50.571060] futex_hash_allocate+0xb3a/0x1060
[ 50.571279] ? sigprocmask+0x24e/0x370
[ 50.571470] ? __pfx_futex_hash_allocate+0x10/0x10
[ 50.571703] ? rcu_is_watching+0x19/0xc0
[ 50.571899] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 50.572152] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 50.572416] ? static_key_count+0x63/0x80
[ 50.572608] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 50.572870] futex_hash_prctl+0x1fe/0x650
[ 50.573069] __do_sys_prctl+0x4a3/0x2110
[ 50.573270] ? __pfx___do_sys_prctl+0x10/0x10
[ 50.573486] ? __audit_syscall_entry+0x39f/0x500
[ 50.573714] __x64_sys_prctl+0xc6/0x150
[ 50.573905] ? syscall_trace_enter+0x14d/0x280
[ 50.574120] x64_sys_call+0x1a2f/0x1fa0
[ 50.574314] do_syscall_64+0x6d/0x2e0
[ 50.574497] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 50.574736] RIP: 0033:0x7f183fc3ee5d
[ 50.574911] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 8
[ 50.575748] RSP: 002b:00007f183fe42d48 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
[ 50.576105] RAX: ffffffffffffffda RBX: 00007f183fe43640 RCX: 00007f183fc3ee5d
[ 50.576434] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000004e
[ 50.576768] RBP: 00007f183fe42d60 R08: 0000000000000000 R09: 0000000000000000
[ 50.577105] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f183fe43640
[ 50.577444] R13: 000000000000000c R14: 00007f183fc9f560 R15: 0000000000000000
[ 50.577781] </TASK>
[ 50.577887] Modules linked in:
[ 50.578095] ---[ end trace 0000000000000000 ]---
[ 50.578316] RIP: 0010:__futex_pivot_hash+0x204/0x530
[ 50.578559] Code: e8 f1 e6 5b 04 48 8b 45 d0 48 c1 e8 03 42 80 3c 28 00 0f 85 d1 02 00 00 48 8b 45 d0 4c 8b 30 4c 0
[ 50.579394] RSP: 0018:ffff888012557c80 EFLAGS: 00010206
[ 50.579643] RAX: 00798018e0000056 RBX: 03cc00c700000299 RCX: ffffffff8179ecdc
[ 50.579975] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8880117f8488
[ 50.580303] RBP: ffff888012557cf8 R08: 0000000000000001 R09: ffffed10024aaf7d
[ 50.580669] R10: 0000000000000003 R11: 6e696c6261736944 R12: ffff888012cf0000
[ 50.581597] R13: dffffc0000000000 R14: 03cc00c7000002b1 R15: 0000000000000001
[ 50.581937] FS: 00007f183fe43640(0000) GS:ffff8880e3652000(0000) knlGS:0000000000000000
[ 50.582309] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.582583] CR2: 00007f183fe42e38 CR3: 000000001115c005 CR4: 0000000000770ef0
[ 50.582977] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.583294] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 50.583622] PKRU: 55555554
[ 50.583758] note: repro[813] exited with preempt_count 1
"
Hope this could be insightful to you.
Regards,
Yi Lai
---
If you don't need the following environment to reproduce the problem, or if you
already have a reproduction environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost
After logging in to the VM (virtual machine) successfully, you can transfer the
reproducer binary to the VM as below and reproduce the problem in the VM:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/
Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage // x should be equal to or less than the number of CPUs your PC has
Fill the bzImage file into above start3.sh to load the target kernel in vm.
Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
> +static void futex_pivot_hash(struct mm_struct *mm)
> +{
> + scoped_guard(mutex, &mm->futex_hash_lock) {
> + struct futex_private_hash *fph;
> +
> + fph = mm->futex_phash_new;
> + if (fph) {
> + mm->futex_phash_new = NULL;
> + __futex_pivot_hash(mm, fph);
> + }
> + }
> +}
> +
> struct futex_private_hash *futex_private_hash(void)
> {
> struct mm_struct *mm = current->mm;
> - struct futex_private_hash *fph;
> + /*
> + * Ideally we don't loop. If there is a replacement in progress
> + * then a new private hash is already prepared and a reference can't be
> + * obtained once the last user dropped its reference.
> + * In that case we block on mm_struct::futex_hash_lock and either have
> + * to perform the replacement or wait while someone else is doing the
> + * job. Either way, on the second iteration we acquire a reference on the
> + * new private hash or loop again because a new replacement has been
> + * requested.
> + */
> +again:
> + scoped_guard(rcu) {
> + struct futex_private_hash *fph;
>
> - fph = mm->futex_phash;
> - return fph;
> + fph = rcu_dereference(mm->futex_phash);
> + if (!fph)
> + return NULL;
> +
> + if (rcuref_get(&fph->users))
> + return fph;
> + }
> + futex_pivot_hash(mm);
> + goto again;
> }
>
> struct futex_hash_bucket *futex_hash(union futex_key *key)
> {
> + struct futex_private_hash *fph;
> struct futex_hash_bucket *hb;
>
> - hb = __futex_hash(key, NULL);
> - return hb;
> +again:
> + scoped_guard(rcu) {
> + hb = __futex_hash(key, NULL);
> + fph = hb->priv;
> +
> + if (!fph || futex_private_hash_get(fph))
> + return hb;
> + }
> + futex_pivot_hash(key->private.mm);
> + goto again;
> }
>
> #else /* !CONFIG_FUTEX_PRIVATE_HASH */
> @@ -664,6 +783,8 @@ int futex_unqueue(struct futex_q *q)
> spinlock_t *lock_ptr;
> int ret = 0;
>
> + /* RCU so lock_ptr is not going away during locking. */
> + guard(rcu)();
> /* In the common case we don't take the spinlock, which is nice. */
> retry:
> /*
> @@ -1065,6 +1186,10 @@ static void exit_pi_state_list(struct task_struct *curr)
> struct futex_pi_state *pi_state;
> union futex_key key = FUTEX_KEY_INIT;
>
> + /*
> + * The mutex mm_struct::futex_hash_lock might be acquired.
> + */
> + might_sleep();
> /*
> * Ensure the hash remains stable (no resize) during the while loop
> * below. The hb pointer is acquired under the pi_lock so we can't block
> @@ -1261,7 +1386,51 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
> #ifdef CONFIG_FUTEX_PRIVATE_HASH
> void futex_hash_free(struct mm_struct *mm)
> {
> - kvfree(mm->futex_phash);
> + struct futex_private_hash *fph;
> +
> + kvfree(mm->futex_phash_new);
> + fph = rcu_dereference_raw(mm->futex_phash);
> + if (fph) {
> + WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
> + kvfree(fph);
> + }
> +}
> +
> +static bool futex_pivot_pending(struct mm_struct *mm)
> +{
> + struct futex_private_hash *fph;
> +
> + guard(rcu)();
> +
> + if (!mm->futex_phash_new)
> + return true;
> +
> + fph = rcu_dereference(mm->futex_phash);
> + return rcuref_is_dead(&fph->users);
> +}
> +
> +static bool futex_hash_less(struct futex_private_hash *a,
> + struct futex_private_hash *b)
> +{
> + /* user provided always wins */
> + if (!a->custom && b->custom)
> + return true;
> + if (a->custom && !b->custom)
> + return false;
> +
> + /* zero-sized hash wins */
> + if (!b->hash_mask)
> + return true;
> + if (!a->hash_mask)
> + return false;
> +
> + /* keep the biggest */
> + if (a->hash_mask < b->hash_mask)
> + return true;
> + if (a->hash_mask > b->hash_mask)
> + return false;
> +
> + return false; /* equal */
> }
>
> static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> @@ -1273,16 +1442,23 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
> return -EINVAL;
>
> - if (mm->futex_phash)
> - return -EALREADY;
> -
> - if (!thread_group_empty(current))
> - return -EINVAL;
> + /*
> + * Once we've disabled the global hash there is no way back.
> + */
> + scoped_guard(rcu) {
> + fph = rcu_dereference(mm->futex_phash);
> + if (fph && !fph->hash_mask) {
> + if (custom)
> + return -EBUSY;
> + return 0;
> + }
> + }
>
> fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
> if (!fph)
> return -ENOMEM;
>
> + rcuref_init(&fph->users, 1);
> fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
> fph->custom = custom;
> fph->mm = mm;
> @@ -1290,26 +1466,102 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
> for (i = 0; i < hash_slots; i++)
> futex_hash_bucket_init(&fph->queues[i], fph);
>
> - mm->futex_phash = fph;
> + if (custom) {
> + /*
> + * Only let prctl() wait / retry; don't unduly delay clone().
> + */
> +again:
> + wait_var_event(mm, futex_pivot_pending(mm));
> + }
> +
> + scoped_guard(mutex, &mm->futex_hash_lock) {
> + struct futex_private_hash *free __free(kvfree) = NULL;
> + struct futex_private_hash *cur, *new;
> +
> + cur = rcu_dereference_protected(mm->futex_phash,
> + lockdep_is_held(&mm->futex_hash_lock));
> + new = mm->futex_phash_new;
> + mm->futex_phash_new = NULL;
> +
> + if (fph) {
> + if (cur && !new) {
> + /*
> + * If we have an existing hash, but do not yet have
> + * allocated a replacement hash, drop the initial
> + * reference on the existing hash.
> + */
> + futex_private_hash_put(cur);
> + }
> +
> + if (new) {
> + /*
> + * Two updates raced; throw out the lesser one.
> + */
> + if (futex_hash_less(new, fph)) {
> + free = new;
> + new = fph;
> + } else {
> + free = fph;
> + }
> + } else {
> + new = fph;
> + }
> + fph = NULL;
> + }
> +
> + if (new) {
> + /*
> + * Will set mm->futex_phash_new on failure;
> + * futex_private_hash_get() will try again.
> + */
> + if (!__futex_pivot_hash(mm, new) && custom)
> + goto again;
> + }
> + }
> return 0;
> }
>
> int futex_hash_allocate_default(void)
> {
> + unsigned int threads, buckets, current_buckets = 0;
> + struct futex_private_hash *fph;
> +
> if (!current->mm)
> return 0;
>
> - if (current->mm->futex_phash)
> + scoped_guard(rcu) {
> + threads = min_t(unsigned int,
> + get_nr_threads(current),
> + num_online_cpus());
> +
> + fph = rcu_dereference(current->mm->futex_phash);
> + if (fph) {
> + if (fph->custom)
> + return 0;
> +
> + current_buckets = fph->hash_mask + 1;
> + }
> + }
> +
> + /*
> + * The default allocation will remain within
> + * 16 <= threads * 4 <= global hash size
> + */
> + buckets = roundup_pow_of_two(4 * threads);
> + buckets = clamp(buckets, 16, futex_hashmask + 1);
> +
> + if (current_buckets >= buckets)
> return 0;
>
> - return futex_hash_allocate(16, false);
> + return futex_hash_allocate(buckets, false);
> }
>
> static int futex_hash_get_slots(void)
> {
> struct futex_private_hash *fph;
>
> - fph = current->mm->futex_phash;
> + guard(rcu)();
> + fph = rcu_dereference(current->mm->futex_phash);
> if (fph && fph->hash_mask)
> return fph->hash_mask + 1;
> return 0;
> diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
> index b0e64fd454d96..c716a66f86929 100644
> --- a/kernel/futex/requeue.c
> +++ b/kernel/futex/requeue.c
> @@ -87,6 +87,11 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
> futex_hb_waiters_inc(hb2);
> plist_add(&q->list, &hb2->chain);
> q->lock_ptr = &hb2->lock;
> + /*
> + * hb1 and hb2 belong to the same futex_private_hash because
> + * if we managed to get a reference on hb1 then it can't be
> + * replaced. Therefore we avoid put(hb1)+get(hb2) here.
> + */
> }
> q->key = *key2;
> }
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-06-01 7:39 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Lai, Yi
@ 2025-06-02 11:00 ` Sebastian Andrzej Siewior
2025-06-02 14:36 ` Lai, Yi
` (2 more replies)
0 siblings, 3 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-02 11:00 UTC (permalink / raw)
To: Lai, Yi
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long, yi1.lai
On 2025-06-01 15:39:47 [+0800], Lai, Yi wrote:
> Hi Sebastian Andrzej Siewior,
Hi Yi,
> Greetings!
>
> I used Syzkaller and found that there is KASAN: null-ptr-deref Read in __futex_pivot_hash in linux-next next-20250527.
>
> After bisection and the first bad commit is:
> "
> bd54df5ea7ca futex: Allow to resize the private local hash
> "
Thank you for the report. Next time please trim your report. There is no
need to put your report in the middle of the patch.
The following fixes it:
----------->8--------------
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Mon, 2 Jun 2025 12:11:13 +0200
Subject: [PATCH] futex: Verify under the lock if global hash is in use
Once the global hash is requested there is no way back to switch back to
the per-task private hash. This is checked at the begin of the function.
It is possible that two threads simultaneously request the global hash
and both pass the initial check and block later on the
mm::futex_hash_lock. In this case the first thread performs the switch
to the global hash. The second thread will also attempt to switch to the
global hash and while doing so, accessing the nonexisting slot 1 of the
struct futex_private_hash.
This has been reported by Yi Lai.
Verify under mm_struct::futex_phash that the global hash is not in use.
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aDwDw9Aygqo6oAx+@ly-workstation/
Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 1cd3a646c91fd..abbd97c2fcba8 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1629,6 +1629,16 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
mm->futex_phash_new = NULL;
if (fph) {
+ if (cur && !cur->hash_mask) {
+ /*
+ * If two threads simultaneously request the global
+ * hash then the first one performs the switch,
+ * the second one returns here.
+ */
+ free = fph;
+ mm->futex_phash_new = new;
+ return -EBUSY;
+ }
if (cur && !new) {
/*
* If we have an existing hash, but do not yet have
--
2.49.0
Sebastian
^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-06-02 11:00 ` Sebastian Andrzej Siewior
@ 2025-06-02 14:36 ` Lai, Yi
2025-06-02 14:44 ` Sebastian Andrzej Siewior
2025-06-11 9:20 ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
2025-06-11 14:39 ` tip-bot2 for Sebastian Andrzej Siewior
2 siblings, 1 reply; 109+ messages in thread
From: Lai, Yi @ 2025-06-02 14:36 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long, yi1.lai
On Mon, Jun 02, 2025 at 01:00:27PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-01 15:39:47 [+0800], Lai, Yi wrote:
> > Hi Sebastian Andrzej Siewior,
> Hi Yi,
> > Greetings!
> >
> > I used Syzkaller and found that there is KASAN: null-ptr-deref Read in __futex_pivot_hash in linux-next next-20250527.
> >
> > After bisection and the first bad commit is:
> > "
> > bd54df5ea7ca futex: Allow to resize the private local hash
> > "
>
> Thank you for the report. Next time please trim your report. There is no
> need to put your report in the middle of the patch.
>
> The following fixes it:
>
Will trim my report next time.
After applying the following patch on top of the latest linux-next, the issue
cannot be reproduced. Thanks.
Regards,
Yi Lai
> ----------->8--------------
>
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Date: Mon, 2 Jun 2025 12:11:13 +0200
> Subject: [PATCH] futex: Verify under the lock if global hash is in use
>
> Once the global hash is requested there is no way to switch back to
> the per-task private hash. This is checked at the beginning of the
> function.
>
> It is possible that two threads simultaneously request the global hash
> and both pass the initial check and block later on the
> mm::futex_hash_lock. In this case the first thread performs the switch
> to the global hash. The second thread will also attempt to switch to
> the global hash and, while doing so, accesses the nonexistent slot 1
> of the struct futex_private_hash.
> This has been reported by Yi Lai.
>
> Verify under mm_struct::futex_phash that the global hash is not in use.
>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aDwDw9Aygqo6oAx+@ly-workstation/
> Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> kernel/futex/core.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index 1cd3a646c91fd..abbd97c2fcba8 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1629,6 +1629,16 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
> mm->futex_phash_new = NULL;
>
> if (fph) {
> + if (cur && !cur->hash_mask) {
> + /*
> + * If two threads simultaneously request the global
> + * hash then the first one performs the switch,
> + * the second one returns here.
> + */
> + free = fph;
> + mm->futex_phash_new = new;
> + return -EBUSY;
> + }
> if (cur && !new) {
> /*
> * If we have an existing hash, but do not yet have
> --
> 2.49.0
>
>
> Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-06-02 14:36 ` Lai, Yi
@ 2025-06-02 14:44 ` Sebastian Andrzej Siewior
2025-06-02 15:00 ` Lai, Yi
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-02 14:44 UTC (permalink / raw)
To: Lai, Yi
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long, yi1.lai
On 2025-06-02 22:36:45 [+0800], Lai, Yi wrote:
> Will trim my report next time.
Thank you.
> After applying the following patch on top of latest linux-next, the issue
> cannot be reproduced. Thanks.
Does this statement above count as
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
?
> Regards,
> Yi Lai
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH v12 14/21] futex: Allow to resize the private local hash
2025-06-02 14:44 ` Sebastian Andrzej Siewior
@ 2025-06-02 15:00 ` Lai, Yi
0 siblings, 0 replies; 109+ messages in thread
From: Lai, Yi @ 2025-06-02 15:00 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner,
Valentin Schneider, Waiman Long, yi1.lai
On Mon, Jun 02, 2025 at 04:44:22PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-02 22:36:45 [+0800], Lai, Yi wrote:
> > Will trim my report next time.
> Thank you.
>
> > After applying the following patch on top of latest linux-next, the issue
> > cannot be reproduced. Thanks.
>
> Does this statement above count as
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
>
> ?
>
Yes. Please kindly include it.
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> > Regards,
> > Yi Lai
>
> Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-02 11:00 ` Sebastian Andrzej Siewior
2025-06-02 14:36 ` Lai, Yi
@ 2025-06-11 9:20 ` tip-bot2 for Sebastian Andrzej Siewior
2025-06-11 14:39 ` tip-bot2 for Sebastian Andrzej Siewior
2 siblings, 0 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-06-11 9:20 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the locking/urgent branch of tip:
Commit-ID: cdd0f803c1f9b69785f5ff865864cfea11081c91
Gitweb: https://git.kernel.org/tip/cdd0f803c1f9b69785f5ff865864cfea11081c91
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Mon, 02 Jun 2025 13:00:27 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 05 Jun 2025 14:37:59 +02:00
futex: Allow to resize the private local hash
On 2025-06-01 15:39:47 [+0800], Lai, Yi wrote:
> Hi Sebastian Andrzej Siewior,
Hi Yi,
> Greetings!
>
> I used Syzkaller and found that there is KASAN: null-ptr-deref Read in __futex_pivot_hash in linux-next next-20250527.
>
> After bisection and the first bad commit is:
> "
> bd54df5ea7ca futex: Allow to resize the private local hash
> "
Thank you for the report. Next time please trim your report. There is no
need to put your report in the middle of the patch.
The following fixes it:
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250602110027.wfqbHgzb@linutronix.de
---
kernel/futex/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b652d2f..33b3643 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1629,6 +1629,16 @@ again:
mm->futex_phash_new = NULL;
if (fph) {
+ if (cur && !cur->hash_mask) {
+ /*
+ * If two threads simultaneously request the global
+ * hash then the first one performs the switch,
+ * the second one returns here.
+ */
+ free = fph;
+ mm->futex_phash_new = new;
+ return -EBUSY;
+ }
if (cur && !new) {
/*
* If we have an existing hash, but do not yet have
^ permalink raw reply related [flat|nested] 109+ messages in thread
* [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-02 11:00 ` Sebastian Andrzej Siewior
2025-06-02 14:36 ` Lai, Yi
2025-06-11 9:20 ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
@ 2025-06-11 14:39 ` tip-bot2 for Sebastian Andrzej Siewior
2025-06-11 14:43 ` Sebastian Andrzej Siewior
2025-06-16 17:14 ` Calvin Owens
2 siblings, 2 replies; 109+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-06-11 14:39 UTC (permalink / raw)
To: linux-tip-commits
Cc: Lai, Yi, Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the locking/urgent branch of tip:
Commit-ID: 703b5f31aee5bda47868c09a3522a78823c1bb77
Gitweb: https://git.kernel.org/tip/703b5f31aee5bda47868c09a3522a78823c1bb77
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Mon, 02 Jun 2025 13:00:27 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 11 Jun 2025 16:26:44 +02:00
futex: Allow to resize the private local hash
Once the global hash is requested there is no way to switch back to
the per-task private hash. This is checked at the beginning of the
function.
It is possible that two threads simultaneously request the global hash
and both pass the initial check and block later on the
mm::futex_hash_lock. In this case the first thread performs the switch
to the global hash. The second thread will also attempt to switch to
the global hash and, while doing so, access the nonexistent slot 1 of
struct futex_private_hash.
This has been reported by Yi Lai.
Verify under mm_struct::futex_phash that the global hash is not in use.
Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash")
Closes: https://lore.kernel.org/all/aDwDw9Aygqo6oAx+@ly-workstation/
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250602110027.wfqbHgzb@linutronix.de
---
kernel/futex/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b652d2f..33b3643 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1629,6 +1629,16 @@ again:
mm->futex_phash_new = NULL;
if (fph) {
+ if (cur && !cur->hash_mask) {
+ /*
+ * If two threads simultaneously request the global
+ * hash then the first one performs the switch,
+ * the second one returns here.
+ */
+ free = fph;
+ mm->futex_phash_new = new;
+ return -EBUSY;
+ }
if (cur && !new) {
/*
* If we have an existing hash, but do not yet have
^ permalink raw reply related [flat|nested] 109+ messages in thread
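[Editorial aside: the race the patch above closes can be sketched as a small Python toy model. This is not kernel code; only the names futex_phash, futex_hash_lock, and hash_mask mirror the kernel's, everything else is invented for illustration. Both threads pass the unlocked early check, then serialize on the lock, and with the fix the loser now backs out with -EBUSY instead of indexing a slot that does not exist.]

```python
import threading

EBUSY = 16

class PrivateHash:
    """Toy stand-in for struct futex_private_hash."""
    def __init__(self, nslots):
        # As in the kernel patch: the global hash is the one with hash_mask == 0.
        self.hash_mask = nslots - 1 if nslots else 0
        self.slots = [object() for _ in range(nslots)]

class MM:
    def __init__(self):
        self.futex_phash = PrivateHash(4)      # current 4-slot private hash
        self.futex_hash_lock = threading.Lock()

mm = MM()
results = {}
barrier = threading.Barrier(2)                 # force both threads past the check

def request_global_hash(name):
    # Unlocked early check: once global, there is no way back.
    if not mm.futex_phash.hash_mask:
        results[name] = -EBUSY
        return
    barrier.wait()                             # both threads have now passed it
    with mm.futex_hash_lock:
        cur = mm.futex_phash
        if cur and not cur.hash_mask:
            # The fix: re-check under the lock. The second thread bails out
            # here instead of touching slot 1 of a slot-less hash.
            results[name] = -EBUSY
            return
        mm.futex_phash = PrivateHash(0)        # perform the switch to global
        results[name] = 0

threads = [threading.Thread(target=request_global_hash, args=(n,))
           for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.values()))                # [-16, 0]
```

Exactly one thread wins the switch; the barrier makes the "both passed the initial check" window deterministic, which in the kernel only a syzkaller-style prctl() storm would hit.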
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-11 14:39 ` tip-bot2 for Sebastian Andrzej Siewior
@ 2025-06-11 14:43 ` Sebastian Andrzej Siewior
2025-06-11 15:11 ` Peter Zijlstra
2025-06-16 17:14 ` Calvin Owens
1 sibling, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-11 14:43 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel), x86
On 2025-06-11 14:39:16 [-0000], tip-bot2 for Sebastian Andrzej Siewior wrote:
> The following commit has been merged into the locking/urgent branch of tip:
>
> Commit-ID: 703b5f31aee5bda47868c09a3522a78823c1bb77
> Gitweb: https://git.kernel.org/tip/703b5f31aee5bda47868c09a3522a78823c1bb77
> Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> AuthorDate: Mon, 02 Jun 2025 13:00:27 +02:00
> Committer: Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Wed, 11 Jun 2025 16:26:44 +02:00
>
> futex: Allow to resize the private local hash
>
> Once the global hash is requested there is no way to switch back to
> the per-task private hash. This is checked at the beginning of the
> function.
>
> It is possible that two threads simultaneously request the global hash
> and both pass the initial check and block later on the
> mm::futex_hash_lock. In this case the first thread performs the switch
> to the global hash. The second thread will also attempt to switch to
> the global hash and, while doing so, access the nonexistent slot 1 of
> struct futex_private_hash.
> This has been reported by Yi Lai.
>
> Verify under mm_struct::futex_phash that the global hash is not in use.
Could you please replace it with
https://lore.kernel.org/all/20250610104400.1077266-5-bigeasy@linutronix.de/
The subject also duplicates that of commit bd54df5ea7cad ("futex: Allow
to resize the private local hash").
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-11 14:43 ` Sebastian Andrzej Siewior
@ 2025-06-11 15:11 ` Peter Zijlstra
2025-06-11 15:20 ` Peter Zijlstra
0 siblings, 1 reply; 109+ messages in thread
From: Peter Zijlstra @ 2025-06-11 15:11 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: linux-kernel, linux-tip-commits, Lai, Yi, x86
On Wed, Jun 11, 2025 at 04:43:02PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-11 14:39:16 [-0000], tip-bot2 for Sebastian Andrzej Siewior wrote:
> > The following commit has been merged into the locking/urgent branch of tip:
> >
> > Commit-ID: 703b5f31aee5bda47868c09a3522a78823c1bb77
> > Gitweb: https://git.kernel.org/tip/703b5f31aee5bda47868c09a3522a78823c1bb77
> > Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > AuthorDate: Mon, 02 Jun 2025 13:00:27 +02:00
> > Committer: Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Wed, 11 Jun 2025 16:26:44 +02:00
> >
> > futex: Allow to resize the private local hash
> >
> > Once the global hash is requested there is no way to switch back to
> > the per-task private hash. This is checked at the beginning of the
> > function.
> >
> > It is possible that two threads simultaneously request the global hash
> > and both pass the initial check and block later on the
> > mm::futex_hash_lock. In this case the first thread performs the switch
> > to the global hash. The second thread will also attempt to switch to
> > the global hash and, while doing so, access the nonexistent slot 1 of
> > struct futex_private_hash.
> > This has been reported by Yi Lai.
> >
> > Verify under mm_struct::futex_phash that the global hash is not in use.
>
> Could you please replace it with
> https://lore.kernel.org/all/20250610104400.1077266-5-bigeasy@linutronix.de/
>
> It also looks like the subject from commit bd54df5ea7cad ("futex: Allow
> to resize the private local hash")
Now done so, unless I messed up again :/
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-11 15:11 ` Peter Zijlstra
@ 2025-06-11 15:20 ` Peter Zijlstra
2025-06-11 15:35 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Peter Zijlstra @ 2025-06-11 15:20 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: linux-kernel, linux-tip-commits, Lai, Yi, x86
On Wed, Jun 11, 2025 at 05:11:13PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 11, 2025 at 04:43:02PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-06-11 14:39:16 [-0000], tip-bot2 for Sebastian Andrzej Siewior wrote:
> > > The following commit has been merged into the locking/urgent branch of tip:
> > >
> > > Commit-ID: 703b5f31aee5bda47868c09a3522a78823c1bb77
> > > Gitweb: https://git.kernel.org/tip/703b5f31aee5bda47868c09a3522a78823c1bb77
> > > Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > AuthorDate: Mon, 02 Jun 2025 13:00:27 +02:00
> > > Committer: Peter Zijlstra <peterz@infradead.org>
> > > CommitterDate: Wed, 11 Jun 2025 16:26:44 +02:00
> > >
> > > futex: Allow to resize the private local hash
> > >
> > > Once the global hash is requested there is no way to switch back to
> > > the per-task private hash. This is checked at the beginning of the
> > > function.
> > >
> > > It is possible that two threads simultaneously request the global hash
> > > and both pass the initial check and block later on the
> > > mm::futex_hash_lock. In this case the first thread performs the switch
> > > to the global hash. The second thread will also attempt to switch to
> > > the global hash and, while doing so, access the nonexistent slot 1 of
> > > struct futex_private_hash.
> > > This has been reported by Yi Lai.
> > >
> > > Verify under mm_struct::futex_phash that the global hash is not in use.
> >
> > Could you please replace it with
> > https://lore.kernel.org/all/20250610104400.1077266-5-bigeasy@linutronix.de/
> >
> > It also looks like the subject from commit bd54df5ea7cad ("futex: Allow
> > to resize the private local hash")
>
> Now done so, unless I messed up again :/
ARGH, let me try that again :-(
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-11 15:20 ` Peter Zijlstra
@ 2025-06-11 15:35 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-11 15:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, linux-tip-commits, Lai, Yi, x86
On 2025-06-11 17:20:19 [+0200], Peter Zijlstra wrote:
> ARGH, let me try that again :-(
That last commit 69a14d146f3b87819f3fb73ed5d1de3e1fa680c1 looks great.
Thank you.
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-11 14:39 ` tip-bot2 for Sebastian Andrzej Siewior
2025-06-11 14:43 ` Sebastian Andrzej Siewior
@ 2025-06-16 17:14 ` Calvin Owens
2025-06-17 7:16 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-16 17:14 UTC (permalink / raw)
To: linux-kernel
Cc: linux-tip-commits, Lai, Yi, Sebastian Andrzej Siewior,
Peter Zijlstra (Intel), x86
On Wednesday 06/11 at 14:39 -0000, tip-bot2 for Sebastian Andrzej Siewior wrote:
> <snip>
> It is possible that two threads simultaneously request the global hash
> and both pass the initial check and block later on the
> mm::futex_hash_lock. In this case the first thread performs the switch
> to the global hash. The second thread will also attempt to switch to
> the global hash and, while doing so, access the nonexistent slot 1 of
> struct futex_private_hash.
In case it's interesting to anyone, I'm hitting this one in real life,
one of my build machines got stuck overnight:
Jun 16 02:51:34 beethoven kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Jun 16 02:51:34 beethoven kernel: rcu: 16-....: (59997 ticks this GP) idle=eaf4/1/0x4000000000000000 softirq=14417247/14470115 fqs=21169
Jun 16 02:51:34 beethoven kernel: rcu: (t=60000 jiffies g=21453525 q=663214 ncpus=24)
Jun 16 02:51:34 beethoven kernel: CPU: 16 UID: 1000 PID: 2028199 Comm: cargo Not tainted 6.16.0-rc1-lto-00236-g8c6bc74c7f89 #1 PREEMPT
Jun 16 02:51:34 beethoven kernel: Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
Jun 16 02:51:34 beethoven kernel: RIP: 0010:queued_spin_lock_slowpath+0x162/0x1d0
Jun 16 02:51:34 beethoven kernel: Code: 0f 1f 84 00 00 00 00 00 f3 90 83 7a 08 00 74 f8 48 8b 32 48 85 f6 74 09 0f 0d 0e eb 0d 31 f6 eb 09 31 f6 eb 05 0f 1f 00 f3 90 <8b> 07 66 85 c0 75 f7 39 c8 75 13 41 b8 01 00 00 00 89 c8 f0 44 0f
Jun 16 02:51:34 beethoven kernel: RSP: 0018:ffffc9002fb1fc38 EFLAGS: 00000206
Jun 16 02:51:34 beethoven kernel: RAX: 0000000000447f3a RBX: ffffc9003029fdf0 RCX: 0000000000440000
Jun 16 02:51:34 beethoven kernel: RDX: ffff88901fea5100 RSI: 0000000000000000 RDI: ffff888127e7d844
Jun 16 02:51:34 beethoven kernel: RBP: ffff8883a3c07248 R08: 0000000000000000 R09: 00000000b69b409a
Jun 16 02:51:34 beethoven kernel: R10: 000000001bd29fd9 R11: 0000000069b409ab R12: ffff888127e7d844
Jun 16 02:51:34 beethoven kernel: R13: ffff888127e7d840 R14: ffffc9003029fde0 R15: ffff8883a3c07248
Jun 16 02:51:34 beethoven kernel: FS: 00007f61c23d85c0(0000) GS:ffff88909b9f6000(0000) knlGS:0000000000000000
Jun 16 02:51:34 beethoven kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 16 02:51:34 beethoven kernel: CR2: 000056407760f3e0 CR3: 0000000905f29000 CR4: 0000000000750ef0
Jun 16 02:51:34 beethoven kernel: PKRU: 55555554
Jun 16 02:51:34 beethoven kernel: Call Trace:
Jun 16 02:51:34 beethoven kernel: <TASK>
Jun 16 02:51:34 beethoven kernel: __futex_pivot_hash+0x1f8/0x2e0
Jun 16 02:51:34 beethoven kernel: futex_hash+0x95/0xe0
Jun 16 02:51:34 beethoven kernel: futex_wait_setup+0x7e/0x230
Jun 16 02:51:34 beethoven kernel: __futex_wait+0x66/0x130
Jun 16 02:51:34 beethoven kernel: ? __futex_wake_mark+0xc0/0xc0
Jun 16 02:51:34 beethoven kernel: futex_wait+0xee/0x180
Jun 16 02:51:34 beethoven kernel: ? hrtimer_setup_sleeper_on_stack+0xe0/0xe0
Jun 16 02:51:34 beethoven kernel: do_futex+0x86/0x120
Jun 16 02:51:34 beethoven kernel: __se_sys_futex+0x16d/0x1e0
Jun 16 02:51:34 beethoven kernel: do_syscall_64+0x47/0x170
Jun 16 02:51:34 beethoven kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
Jun 16 02:51:34 beethoven kernel: RIP: 0033:0x7f61c1d18779
Jun 16 02:51:34 beethoven kernel: Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 86 0d 00 f7 d8 64 89 01 48
Jun 16 02:51:34 beethoven kernel: RSP: 002b:00007ffcd3f6e3f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Jun 16 02:51:34 beethoven kernel: RAX: ffffffffffffffda RBX: 00007f61c1d18760 RCX: 00007f61c1d18779
Jun 16 02:51:34 beethoven kernel: RDX: 00000000000000a9 RSI: 0000000000000089 RDI: 0000564077580bb0
Jun 16 02:51:34 beethoven kernel: RBP: 00007ffcd3f6e450 R08: 0000000000000000 R09: 00007ffcffffffff
Jun 16 02:51:34 beethoven kernel: R10: 00007ffcd3f6e410 R11: 0000000000000246 R12: 000000001dcd6401
Jun 16 02:51:34 beethoven kernel: R13: 00007f61c1c33fd0 R14: 0000564077580bb0 R15: 00000000000000a9
Jun 16 02:51:34 beethoven kernel: </TASK>
<repeats forever until I wake up and kill the machine>
It seems like this is well understood already, but let me know if
there's any debug info I can send that might be useful.
Thanks,
Calvin
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-16 17:14 ` Calvin Owens
@ 2025-06-17 7:16 ` Sebastian Andrzej Siewior
2025-06-17 9:23 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-17 7:16 UTC (permalink / raw)
To: Calvin Owens
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On 2025-06-16 10:14:24 [-0700], Calvin Owens wrote:
> On Wednesday 06/11 at 14:39 -0000, tip-bot2 for Sebastian Andrzej Siewior wrote:
> > <snip>
> > It is possible that two threads simultaneously request the global hash
> > and both pass the initial check and block later on the
> > mm::futex_hash_lock. In this case the first thread performs the switch
> > to the global hash. The second thread will also attempt to switch to
> > the global hash and, while doing so, access the nonexistent slot 1 of
> > struct futex_private_hash.
>
> In case it's interesting to anyone, I'm hitting this one in real life,
> one of my build machines got stuck overnight:
The scenario in the commit description is not something that happens on
its own; the bot explicitly "asked" for it. It won't occur in a "normal"
scenario where you do not explicitly request a specific hash via the
prctl() interface.
> Jun 16 02:51:34 beethoven kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
> Jun 16 02:51:34 beethoven kernel: rcu: 16-....: (59997 ticks this GP) idle=eaf4/1/0x4000000000000000 softirq=14417247/14470115 fqs=21169
> Jun 16 02:51:34 beethoven kernel: rcu: (t=60000 jiffies g=21453525 q=663214 ncpus=24)
> Jun 16 02:51:34 beethoven kernel: CPU: 16 UID: 1000 PID: 2028199 Comm: cargo Not tainted 6.16.0-rc1-lto-00236-g8c6bc74c7f89 #1 PREEMPT
> Jun 16 02:51:34 beethoven kernel: Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
> Jun 16 02:51:34 beethoven kernel: RIP: 0010:queued_spin_lock_slowpath+0x162/0x1d0
> Jun 16 02:51:34 beethoven kernel: Code: 0f 1f 84 00 00 00 00 00 f3 90 83 7a 08 00 74 f8 48 8b 32 48 85 f6 74 09 0f 0d 0e eb 0d 31 f6 eb 09 31 f6 eb 05 0f 1f 00 f3 90 <8b> 07 66 85 c0 75 f7 39 c8 75 13 41 b8 01 00 00 00 89 c8 f0 44 0f
…
> Jun 16 02:51:34 beethoven kernel: Call Trace:
> Jun 16 02:51:34 beethoven kernel: <TASK>
> Jun 16 02:51:34 beethoven kernel: __futex_pivot_hash+0x1f8/0x2e0
> Jun 16 02:51:34 beethoven kernel: futex_hash+0x95/0xe0
> Jun 16 02:51:34 beethoven kernel: futex_wait_setup+0x7e/0x230
> Jun 16 02:51:34 beethoven kernel: __futex_wait+0x66/0x130
> Jun 16 02:51:34 beethoven kernel: ? __futex_wake_mark+0xc0/0xc0
> Jun 16 02:51:34 beethoven kernel: futex_wait+0xee/0x180
> Jun 16 02:51:34 beethoven kernel: ? hrtimer_setup_sleeper_on_stack+0xe0/0xe0
> Jun 16 02:51:34 beethoven kernel: do_futex+0x86/0x120
> Jun 16 02:51:34 beethoven kernel: __se_sys_futex+0x16d/0x1e0
> Jun 16 02:51:34 beethoven kernel: do_syscall_64+0x47/0x170
> Jun 16 02:51:34 beethoven kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
…
> <repeats forever until I wake up and kill the machine>
>
> It seems like this is well understood already, but let me know if
> there's any debug info I can send that might be useful.
This is with LTO enabled.
Based on the backtrace: there was a resize request (probably because a
thread was created) and the resize was delayed because the hash was in
use. The hash was then released and now this thread moves all enqueued
users from the old hash to the new one. The RIP says it is stuck on a
spin lock, which is either the new or the old hash bucket lock.
If this livelocks then someone else must have locked it and never
released it.
Is this the only thread stuck or is there more?
I'm puzzled here. It looks as if there was an unlock missing.
> Thanks,
> Calvin
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
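[Editorial aside: the pivot Sebastian describes can be modeled roughly as below. This is a hedged Python toy, not the real __futex_pivot_hash(), which walks struct futex_hash_bucket chains under spinlocks; the names Bucket, FutexHash, and pivot_hash are invented for illustration.]

```python
import threading

class Bucket:
    def __init__(self):
        self.lock = threading.Lock()
        self.waiters = []          # queued futex waiters, keyed by address

class FutexHash:
    def __init__(self, nslots):
        self.buckets = [Bucket() for _ in range(nslots)]
        self.mask = nslots - 1

    def bucket_for(self, key):
        return self.buckets[hash(key) & self.mask]

def pivot_hash(old, new):
    """Re-home every enqueued waiter from the old hash to the new one.

    Each step takes the old bucket lock, then the matching new bucket
    lock. If some other path left one of these locks held, this loop is
    exactly where a CPU would spin forever, matching the stall backtrace
    in __futex_pivot_hash()."""
    for ob in old.buckets:
        with ob.lock:
            while ob.waiters:
                key, w = ob.waiters.pop()
                nb = new.bucket_for(key)
                with nb.lock:
                    nb.waiters.append((key, w))

old, new = FutexHash(4), FutexHash(16)
for i in range(10):
    old.bucket_for(i).waiters.append((i, "waiter-%d" % i))
pivot_hash(old, new)
print(sum(len(b.waiters) for b in new.buckets))    # 10
```

The point of the toy is the lock pairing: a waiter can only cross from old to new while both bucket locks are held, so a leaked lock on either side stalls the whole pivot.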
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-17 7:16 ` Sebastian Andrzej Siewior
@ 2025-06-17 9:23 ` Calvin Owens
2025-06-17 9:50 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-17 9:23 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On Tuesday 06/17 at 09:16 +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-16 10:14:24 [-0700], Calvin Owens wrote:
> > On Wednesday 06/11 at 14:39 -0000, tip-bot2 for Sebastian Andrzej Siewior wrote:
> > > <snip>
> > > It is possible that two threads simultaneously request the global hash
> > > and both pass the initial check and block later on the
> > > mm::futex_hash_lock. In this case the first thread performs the switch
> > > to the global hash. The second thread will also attempt to switch to
> > > the global hash and, while doing so, access the nonexistent slot 1 of
> > > struct futex_private_hash.
> >
> > In case it's interesting to anyone, I'm hitting this one in real life,
> > one of my build machines got stuck overnight:
>
> The scenario described in the description is not something that happens
> on its own. The bot explicitly "asked" for it. This won't happen in a
> "normal" scenario where you do not explicitly ask for specific hash via
> the prctl() interface.
Ugh, I'm sorry, I was in too much of a hurry this morning... cargo is
obviously not calling PR_FUTEX_HASH which is new in 6.16 :/
> > Jun 16 02:51:34 beethoven kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
> > Jun 16 02:51:34 beethoven kernel: rcu: 16-....: (59997 ticks this GP) idle=eaf4/1/0x4000000000000000 softirq=14417247/14470115 fqs=21169
> > Jun 16 02:51:34 beethoven kernel: rcu: (t=60000 jiffies g=21453525 q=663214 ncpus=24)
> > Jun 16 02:51:34 beethoven kernel: CPU: 16 UID: 1000 PID: 2028199 Comm: cargo Not tainted 6.16.0-rc1-lto-00236-g8c6bc74c7f89 #1 PREEMPT
> > Jun 16 02:51:34 beethoven kernel: Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
> > Jun 16 02:51:34 beethoven kernel: RIP: 0010:queued_spin_lock_slowpath+0x162/0x1d0
> > Jun 16 02:51:34 beethoven kernel: Code: 0f 1f 84 00 00 00 00 00 f3 90 83 7a 08 00 74 f8 48 8b 32 48 85 f6 74 09 0f 0d 0e eb 0d 31 f6 eb 09 31 f6 eb 05 0f 1f 00 f3 90 <8b> 07 66 85 c0 75 f7 39 c8 75 13 41 b8 01 00 00 00 89 c8 f0 44 0f
> …
> > Jun 16 02:51:34 beethoven kernel: Call Trace:
> > Jun 16 02:51:34 beethoven kernel: <TASK>
> > Jun 16 02:51:34 beethoven kernel: __futex_pivot_hash+0x1f8/0x2e0
> > Jun 16 02:51:34 beethoven kernel: futex_hash+0x95/0xe0
> > Jun 16 02:51:34 beethoven kernel: futex_wait_setup+0x7e/0x230
> > Jun 16 02:51:34 beethoven kernel: __futex_wait+0x66/0x130
> > Jun 16 02:51:34 beethoven kernel: ? __futex_wake_mark+0xc0/0xc0
> > Jun 16 02:51:34 beethoven kernel: futex_wait+0xee/0x180
> > Jun 16 02:51:34 beethoven kernel: ? hrtimer_setup_sleeper_on_stack+0xe0/0xe0
> > Jun 16 02:51:34 beethoven kernel: do_futex+0x86/0x120
> > Jun 16 02:51:34 beethoven kernel: __se_sys_futex+0x16d/0x1e0
> > Jun 16 02:51:34 beethoven kernel: do_syscall_64+0x47/0x170
> > Jun 16 02:51:34 beethoven kernel: entry_SYSCALL_64_after_hwframe+0x4b/0x53
> …
> > <repeats forever until I wake up and kill the machine>
> >
> > It seems like this is well understood already, but let me know if
> > there's any debug info I can send that might be useful.
>
> This is with LTO enabled.
Full lto with llvm-20.1.7.
> Based on the backtrace: there was a resize request (probably because a
> thread was created) and the resize was delayed because the hash was in
> use. The hash was released and now this thread moves all enqueued users
> from the old the hash to the new. RIP says it is a spin lock that it is
> stuck on. This is either the new or the old hash bucket lock.
> If this lifelocks then someone else must have it locked and not
> released.
> Is this the only thread stuck or is there more?
> I'm puzzled here. It looks as if there was an unlock missing.
Nothing showed up in the logs but the RCU stalls on CPU16, always in
queued_spin_lock_slowpath().
I'll run the build it was doing when it happened in a loop overnight and
see if I can trigger it again.
> > Thanks,
> > Calvin
>
> Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-17 9:23 ` Calvin Owens
@ 2025-06-17 9:50 ` Sebastian Andrzej Siewior
2025-06-17 16:11 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-17 9:50 UTC (permalink / raw)
To: Calvin Owens
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On 2025-06-17 02:23:08 [-0700], Calvin Owens wrote:
> Ugh, I'm sorry, I was in too much of a hurry this morning... cargo is
> obviously not calling PR_FUTEX_HASH which is new in 6.16 :/
No worries.
> > This is with LTO enabled.
>
> Full lto with llvm-20.1.7.
>
…
> Nothing showed up in the logs but the RCU stalls on CPU16, always in
> queued_spin_lock_slowpath().
>
> I'll run the build it was doing when it happened in a loop overnight and
> see if I can trigger it again.
Please check if you can reproduce it and, if so, whether it also happens
without LTO.
I have no idea why one spinlock_t remains locked. It is either genuinely
locked or hit by a stray memory write.
Oh, and lockdep adds quite some overhead, but it would complain if a
spinlock_t were still locked while returning to userland.
> > > Thanks,
> > > Calvin
> >
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
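[Editorial aside: the lockdep suggestion above corresponds to enabling a debug fragment along these lines. These are real kconfig symbols, but check your tree's Kconfig for dependencies.]

```
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
```

With CONFIG_PROVE_LOCKING, a spinlock held across the return to userspace triggers a "lock held when returning to user space" splat, which would name the culprit directly.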
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-17 9:50 ` Sebastian Andrzej Siewior
@ 2025-06-17 16:11 ` Calvin Owens
2025-06-18 2:15 ` Calvin Owens
2025-06-18 16:03 ` Sebastian Andrzej Siewior
0 siblings, 2 replies; 109+ messages in thread
From: Calvin Owens @ 2025-06-17 16:11 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On Tuesday 06/17 at 11:50 +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-17 02:23:08 [-0700], Calvin Owens wrote:
> > Ugh, I'm sorry, I was in too much of a hurry this morning... cargo is
> > obviously not calling PR_FUTEX_HASH which is new in 6.16 :/
> No worries.
>
> > > This is with LTO enabled.
> >
> > Full lto with llvm-20.1.7.
> >
> …
> > Nothing showed up in the logs but the RCU stalls on CPU16, always in
> > queued_spin_lock_slowpath().
> >
> > I'll run the build it was doing when it happened in a loop overnight and
> > see if I can trigger it again.
Actually got an oops this time:
Oops: general protection fault, probably for non-canonical address 0xfdd92c90843cf111: 0000 [#1] SMP
CPU: 3 UID: 1000 PID: 323127 Comm: cargo Not tainted 6.16.0-rc2-lto-00024-g9afe652958c3 #1 PREEMPT
Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
RIP: 0010:queued_spin_lock_slowpath+0x12a/0x1d0
Code: c8 c1 e8 10 66 87 47 02 66 85 c0 74 48 0f b7 c0 49 c7 c0 f8 ff ff ff 89 c6 c1 ee 02 83 e0 03 49 8b b4 f0 00 21 67 83 c1 e0 04 <48> 89 94 30 00 f1 4a 84 83 7a 08 00 75 10 0f 1f 84 00 00 00 00 00
RSP: 0018:ffffc9002c953d20 EFLAGS: 00010256
RAX: 0000000000000000 RBX: ffff88814e78be40 RCX: 0000000000100000
RDX: ffff88901fce5100 RSI: fdd92c90fff20011 RDI: ffff8881c2ae9384
RBP: 000000000000002b R08: fffffffffffffff8 R09: 00000000002ab900
R10: 000000000000b823 R11: 0000000000000c00 R12: ffff88814e78be40
R13: ffffc9002c953d48 R14: ffffc9002c953d48 R15: ffff8881c2ae9384
FS: 00007f086efb6600(0000) GS:ffff88909b836000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ced9c42650 CR3: 000000034b88e000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
<TASK>
futex_unqueue+0x2e/0x110
__futex_wait+0xc5/0x130
? __futex_wake_mark+0xc0/0xc0
futex_wait+0xee/0x180
? hrtimer_setup_sleeper_on_stack+0xe0/0xe0
do_futex+0x86/0x120
__se_sys_futex+0x16d/0x1e0
? __x64_sys_write+0xba/0xc0
do_syscall_64+0x47/0x170
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f086e918779
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 86 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc5815f678 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: ffffffffffffffda RBX: 00007f086e918760 RCX: 00007f086e918779
RDX: 000000000000002b RSI: 0000000000000089 RDI: 00005636f9fb60d0
RBP: 00007ffc5815f6d0 R08: 0000000000000000 R09: 00007ffcffffffff
R10: 00007ffc5815f690 R11: 0000000000000246 R12: 000000001dcd6401
R13: 00007f086e833fd0 R14: 00005636f9fb60d0 R15: 000000000000002b
</TASK>
---[ end trace 0000000000000000 ]---
RIP: 0010:queued_spin_lock_slowpath+0x12a/0x1d0
Code: c8 c1 e8 10 66 87 47 02 66 85 c0 74 48 0f b7 c0 49 c7 c0 f8 ff ff ff 89 c6 c1 ee 02 83 e0 03 49 8b b4 f0 00 21 67 83 c1 e0 04 <48> 89 94 30 00 f1 4a 84 83 7a 08 00 75 10 0f 1f 84 00 00 00 00 00
RSP: 0018:ffffc9002c953d20 EFLAGS: 00010256
RAX: 0000000000000000 RBX: ffff88814e78be40 RCX: 0000000000100000
RDX: ffff88901fce5100 RSI: fdd92c90fff20011 RDI: ffff8881c2ae9384
RBP: 000000000000002b R08: fffffffffffffff8 R09: 00000000002ab900
R10: 000000000000b823 R11: 0000000000000c00 R12: ffff88814e78be40
R13: ffffc9002c953d48 R14: ffffc9002c953d48 R15: ffff8881c2ae9384
FS: 00007f086efb6600(0000) GS:ffff88909b836000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ced9c42650 CR3: 000000034b88e000 CR4: 0000000000750ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---
This is a giant Yocto build, but the comm is always cargo, so hopefully
I can run those bits in isolation and hit it more quickly.
> Please check if you can reproduce it and if so if it also happens
> without lto.
> I have no idea why one spinlock_t remains locked. It is either locked or
> some stray memory.
> Oh. Lockdep adds quite some overhead but it should complain that a
> spinlock_t is still locked while returning to userland.
I'll report back when I've tried :)
I'll also try some of the mm debug configs.
Thanks,
Calvin
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-17 16:11 ` Calvin Owens
@ 2025-06-18 2:15 ` Calvin Owens
2025-06-18 16:47 ` Sebastian Andrzej Siewior
2025-06-18 16:03 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-18 2:15 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On Tuesday 06/17 at 09:11 -0700, Calvin Owens wrote:
> On Tuesday 06/17 at 11:50 +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-06-17 02:23:08 [-0700], Calvin Owens wrote:
> > > Ugh, I'm sorry, I was in too much of a hurry this morning... cargo is
> > > obviously not calling PR_FUTEX_HASH which is new in 6.16 :/
> > No worries.
> >
> > > > This is with LTO enabled.
> > >
> > > Full lto with llvm-20.1.7.
> > >
> > …
> > > Nothing showed up in the logs but the RCU stalls on CPU16, always in
> > > queued_spin_lock_slowpath().
> > >
> > > I'll run the build it was doing when it happened in a loop overnight and
> > > see if I can trigger it again.
>
> Actually got an oops this time:
>
> <snip>
>
> This is a giant Yocto build, but the comm is always cargo, so hopefully
> I can run those bits in isolation and hit it more quickly.
>
> > Please check if you can reproduce it and if so if it also happens
> > without lto.
It takes longer with LTO disabled, but I'm still seeing some crashes.
First this WARN:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1866190 at mm/slub.c:4753 free_large_kmalloc+0xa5/0xc0
CPU: 2 UID: 1000 PID: 1866190 Comm: python3 Not tainted 6.16.0-rc2-nolto-00024-g9afe652958c3 #1 PREEMPT
Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
RIP: 0010:free_large_kmalloc+0xa5/0xc0
Code: 02 00 00 74 01 fb 83 7b 30 ff 74 07 c7 43 30 ff ff ff ff f0 ff 4b 34 75 08 48 89 df e8 84 dd f9 ff 48 83 c4 08 5b 41 5e 5d c3 <0f> 0b 48 89 df 48 c7 c6 46 92 f5 82 48 83 c4 08 5b 41 5e 5d e9 42
RSP: 0018:ffffc90024d67ce8 EFLAGS: 00010206
RAX: 00000000ff000000 RBX: ffffea00051d5700 RCX: ffffea00042f2208
RDX: 0000000000053a55 RSI: ffff88814755c000 RDI: ffffea00051d5700
RBP: 0000000000000000 R08: fffffffffffdfce5 R09: ffffffff83d52928
R10: ffffea00047ae080 R11: 0000000000000003 R12: ffff8882cae5cd00
R13: ffff88819bb19c08 R14: ffff88819bb194c0 R15: ffff8883a24df900
FS: 0000000000000000(0000) GS:ffff88909bf54000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055842ea1e3f0 CR3: 0000000d82b9d000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
<TASK>
futex_hash_free+0x10/0x40
__mmput+0xb4/0xd0
exec_mmap+0x1e2/0x210
begin_new_exec+0x491/0x6c0
load_elf_binary+0x25d/0x1050
? load_misc_binary+0x19a/0x2d0
bprm_execve+0x1d5/0x370
do_execveat_common+0x29e/0x300
__x64_sys_execve+0x33/0x40
do_syscall_64+0x48/0xfb0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fd8ec8e7dd7
Code: Unable to access opcode bytes at 0x7fd8ec8e7dad.
RSP: 002b:00007fd8adff9e88 EFLAGS: 00000206 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 00007fd8adffb6c0 RCX: 00007fd8ec8e7dd7
RDX: 000055842ed3ce60 RSI: 00007fd8eaea3870 RDI: 00007fd8eae87940
RBP: 00007fd8adff9e90 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000206 R12: 00007fd8ed12da28
R13: 00007fd8eae87940 R14: 00007fd8eaea3870 R15: 0000000000000001
</TASK>
---[ end trace 0000000000000000 ]---
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x13d1507b pfn:0x14755c
flags: 0x2000000000000000(node=0|zone=1)
raw: 2000000000000000 ffffea00042f2208 ffff88901fd66b00 0000000000000000
raw: 0000000013d1507b 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: Not a kmalloc allocation
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x13d1507b pfn:0x14755c
flags: 0x2000000000000000(node=0|zone=1)
raw: 2000000000000000 ffffea00042f2208 ffff88901fd66b00 0000000000000000
raw: 0000000013d1507b 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: Not a kmalloc allocation
...and then it oopsed (same stack as my last mail) about twenty minutes
later when I hit Ctrl+C to stop the build:
BUG: unable to handle page fault for address: 00000008849281a9
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] SMP
CPU: 13 UID: 1000 PID: 1864338 Comm: python3 Tainted: G W 6.16.0-rc2-nolto-00024-g9afe652958c3 #1 PREEMPT
Tainted: [W]=WARN
Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
RIP: 0010:queued_spin_lock_slowpath+0x112/0x1a0
Code: c8 c1 e8 10 66 87 47 02 66 85 c0 74 40 0f b7 c0 49 c7 c0 f8 ff ff ff 89 c6 c1 ee 02 83 e0 03 49 8b b4 f0 40 8b 06 83 c1 e0 04 <48> 89 94 30 00 12 d5 83 83 7a 08 00 75 08 f3 90 83 7a 08 00 74 f8
RSP: 0018:ffffc9002b35fd20 EFLAGS: 00010212
RAX: 0000000000000020 RBX: ffffc9002b35fd50 RCX: 0000000000380000
RDX: ffff88901fde5200 RSI: 0000000900bd6f89 RDI: ffff88814755d204
RBP: 0000000000000000 R08: fffffffffffffff8 R09: 00000000002ab900
R10: 0000000000000065 R11: 0000000000001000 R12: ffff88906c343e40
R13: ffffc9002b35fd50 R14: ffff88814755d204 R15: 00007fd8eb6feac0
FS: 00007fd8eb6ff6c0(0000) GS:ffff88909c094000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000008849281a9 CR3: 0000001fcf611000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
<TASK>
futex_unqueue+0x21/0x90
__futex_wait+0xb7/0x120
? __futex_wake_mark+0x40/0x40
futex_wait+0x5b/0xd0
do_futex+0x86/0x120
__se_sys_futex+0x10d/0x180
do_syscall_64+0x48/0xfb0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fd8ec8a49ee
Code: 08 0f 85 f5 4b ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 80 00 00 00 00 48 83 ec 08
RSP: 002b:00007fd8eb6fe9b8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: ffffffffffffffda RBX: 00007fd8eb6ff6c0 RCX: 00007fd8ec8a49ee
RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007fd8eb6feac0
RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fd8eb6fea00
R13: 0000000000001de0 R14: 00007fd8ececa240 R15: 00000000000000ef
</TASK>
CR2: 00000008849281a9
---[ end trace 0000000000000000 ]---
I enabled lockdep and I've got it running again.
I set up a little git repo with a copy of all the traces so far, and the
kconfigs I'm running:
https://github.com/jcalvinowens/lkml-debug-616
...and I pushed the actual vmlinux binaries here:
https://github.com/jcalvinowens/lkml-debug-616/releases/tag/20250617
There were some block warnings on another machine running the same
workload, but of course they aren't necessarily related.
> > I have no idea why one spinlock_t remains locked. It is either locked or
> > some stray memory.
> > Oh. Lockdep adds quite some overhead but it should complain that a
> > spinlock_t is still locked while returning to userland.
>
> I'll report back when I've tried :)
>
> I'll also try some of the mm debug configs.
>
> Thanks,
> Calvin
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-17 16:11 ` Calvin Owens
2025-06-18 2:15 ` Calvin Owens
@ 2025-06-18 16:03 ` Sebastian Andrzej Siewior
2025-06-18 16:49 ` Calvin Owens
1 sibling, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-18 16:03 UTC (permalink / raw)
To: Calvin Owens
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On 2025-06-17 09:11:06 [-0700], Calvin Owens wrote:
> Actually got an oops this time:
>
> Oops: general protection fault, probably for non-canonical address 0xfdd92c90843cf111: 0000 [#1] SMP
> CPU: 3 UID: 1000 PID: 323127 Comm: cargo Not tainted 6.16.0-rc2-lto-00024-g9afe652958c3 #1 PREEMPT
> Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
> RIP: 0010:queued_spin_lock_slowpath+0x12a/0x1d0
…
> Call Trace:
> <TASK>
> futex_unqueue+0x2e/0x110
> __futex_wait+0xc5/0x130
> futex_wait+0xee/0x180
> do_futex+0x86/0x120
> __se_sys_futex+0x16d/0x1e0
> do_syscall_64+0x47/0x170
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> RIP: 0033:0x7f086e918779
The lock_ptr is pointing to invalid memory. It explodes within
queued_spin_lock_slowpath(), which looks like decode_tail() returned a
wrong pointer/offset.
futex_queue() adds a local futex_q to the list, with its lock_ptr
pointing to the hb lock. Then we schedule(); after a successful wake
the lock_ptr is NULL, otherwise it still points to
futex_hash_bucket::lock.
Since futex_unqueue() attempts to acquire the lock, there was no
wakeup; a timeout or a signal ended the wait. The lock_ptr can
change during a resize.
During the resize, futex_rehash_private() moves the futex_q members from
the old queue to the new one. The lock is accessed within RCU and the
lock_ptr value is compared against the old value after locking. That
means it is accessed either before the rehash moved it to the new hash
bucket or afterwards.
I don't see how this pointer can become invalid. RCU protects against
cleanup and the pointer compare ensures that it is the "current"
pointer.
I've been looking at clang's assembly of futex_unqueue() and it looks
correct. And futex_rehash_private() iterates over all slots.
> This is a giant Yocto build, but the comm is always cargo, so hopefully
> I can run those bits in isolation and hit it more quickly.
If it still explodes without LTO, would you mind trying gcc?
> Thanks,
> Calvin
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-18 2:15 ` Calvin Owens
@ 2025-06-18 16:47 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-18 16:47 UTC (permalink / raw)
To: Calvin Owens
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On 2025-06-17 19:15:37 [-0700], Calvin Owens wrote:
> It takes longer with LTO disabled, but I'm still seeing some crashes.
>
> First this WARN:
>
> ------------[ cut here ]------------
> WARNING: CPU: 2 PID: 1866190 at mm/slub.c:4753 free_large_kmalloc+0xa5/0xc0
> CPU: 2 UID: 1000 PID: 1866190 Comm: python3 Not tainted 6.16.0-rc2-nolto-00024-g9afe652958c3 #1 PREEMPT
…
> RIP: 0010:free_large_kmalloc+0xa5/0xc0
…
> Call Trace:
> <TASK>
> futex_hash_free+0x10/0x40
This points me to kernel/futex/core.c:1535, which is futex_phash_new.
Thanks for the provided vmlinux.
This is odd. The assignment happens only under &mm->futex_hash_lock,
and it is a bad pointer. The kvmalloc() pointer is stored there and only
remains there if a rehash did not happen before the task ended.
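The lifecycle being described can be modeled in a few lines. This is an illustrative sketch only, assuming a simplified consume-or-free scheme: mock_mm, phash_cur, and phash_new are invented names, and the kernel does all of this under mm->futex_hash_lock. A pending replacement hash is parked in the mm; either a later rehash consumes it, or teardown must free it exactly once. A corrupted pending pointer at teardown is what a free_large_kmalloc() warning like the one above would flag.

```c
#include <stdlib.h>

/* Illustrative model, not the kernel structures. */
struct mock_mm {
    void *phash_cur;   /* hash currently in use */
    void *phash_new;   /* pending replacement, possibly never consumed */
};

void mock_set_pending(struct mock_mm *mm, void *new_hash)
{
    free(mm->phash_new);      /* drop an older, unconsumed pending hash */
    mm->phash_new = new_hash;
}

/* Returns 1 if it consumed a pending hash, 0 otherwise. */
int mock_rehash(struct mock_mm *mm)
{
    if (!mm->phash_new)
        return 0;
    free(mm->phash_cur);
    mm->phash_cur = mm->phash_new;
    mm->phash_new = NULL;
    return 1;
}

/* The __mmput() counterpart: release whatever is still parked. */
void mock_hash_free(struct mock_mm *mm)
{
    free(mm->phash_cur);
    free(mm->phash_new);      /* pending hash that was never consumed */
    mm->phash_cur = NULL;
    mm->phash_new = NULL;
}
```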
> __mmput+0xb4/0xd0
> exec_mmap+0x1e2/0x210
> begin_new_exec+0x491/0x6c0
> load_elf_binary+0x25d/0x1050
…
> ...and then it oopsed (same stack as my last mail) about twenty minutes
> later when I hit Ctrl+C to stop the build:
>
…
> I enabled lockdep and I've got it running again.
>
> I set up a little git repo with a copy of all the traces so far, and the
> kconfigs I'm running:
>
> https://github.com/jcalvinowens/lkml-debug-616
>
> ...and I pushed the actual vmlinux binaries here:
>
> https://github.com/jcalvinowens/lkml-debug-616/releases/tag/20250617
>
> There were some block warnings on another machine running the same
> workload, but of course they aren't necessarily related.
I have no explanation so far.
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-18 16:03 ` Sebastian Andrzej Siewior
@ 2025-06-18 16:49 ` Calvin Owens
2025-06-18 17:09 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-18 16:49 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On Wednesday 06/18 at 18:03 +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-17 09:11:06 [-0700], Calvin Owens wrote:
> > Actually got an oops this time:
> >
> > Oops: general protection fault, probably for non-canonical address 0xfdd92c90843cf111: 0000 [#1] SMP
> > CPU: 3 UID: 1000 PID: 323127 Comm: cargo Not tainted 6.16.0-rc2-lto-00024-g9afe652958c3 #1 PREEMPT
> > Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
> > RIP: 0010:queued_spin_lock_slowpath+0x12a/0x1d0
> …
> > Call Trace:
> > <TASK>
> > futex_unqueue+0x2e/0x110
> > __futex_wait+0xc5/0x130
> > futex_wait+0xee/0x180
> > do_futex+0x86/0x120
> > __se_sys_futex+0x16d/0x1e0
> > do_syscall_64+0x47/0x170
> > entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > RIP: 0033:0x7f086e918779
>
> The lock_ptr is pointing to invalid memory. It explodes within
> queued_spin_lock_slowpath(), which looks like decode_tail() returned a
> wrong pointer/offset.
>
> futex_queue() adds a local futex_q to the list, with its lock_ptr
> pointing to the hb lock. Then we schedule(); after a successful wake
> the lock_ptr is NULL, otherwise it still points to
> futex_hash_bucket::lock.
>
> Since futex_unqueue() attempts to acquire the lock, there was no
> wakeup; a timeout or a signal ended the wait. The lock_ptr can
> change during a resize.
> During the resize, futex_rehash_private() moves the futex_q members from
> the old queue to the new one. The lock is accessed within RCU and the
> lock_ptr value is compared against the old value after locking. That
> means it is accessed either before the rehash moved it to the new hash
> bucket or afterwards.
> I don't see how this pointer can become invalid. RCU protects against
> cleanup and the pointer compare ensures that it is the "current"
> pointer.
> I've been looking at clang's assembly of futex_unqueue() and it looks
> correct. And futex_rehash_private() iterates over all slots.
Didn't get much out of lockdep unfortunately.
It notices the corruption in the spinlock:
BUG: spinlock bad magic on CPU#2, cargo/4129172
lock: 0xffff8881410ecdc8, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
CPU: 2 UID: 1000 PID: 4129172 Comm: cargo Not tainted 6.16.0-rc2-nolto-lockdep-00047-g52da431bf03b #1 PREEMPT
Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
Call Trace:
<TASK>
dump_stack_lvl+0x5a/0x80
do_raw_spin_lock+0x6a/0xd0
futex_wait_setup+0x8e/0x200
__futex_wait+0x63/0x120
? __futex_wake_mark+0x40/0x40
futex_wait+0x5b/0xd0
? hrtimer_dummy_timeout+0x10/0x10
do_futex+0x86/0x120
__se_sys_futex+0x10d/0x180
? entry_SYSCALL_64_after_hwframe+0x4b/0x53
do_syscall_64+0x6a/0x1070
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7ff7e7ffb779
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 86 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007fff29bee078 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: ffffffffffffffda RBX: 00007ff7e7ffb760 RCX: 00007ff7e7ffb779
RDX: 00000000000000b6 RSI: 0000000000000089 RDI: 000055a5e2b9c1a0
RBP: 00007fff29bee0d0 R08: 0000000000000000 R09: 00007fffffffffff
R10: 00007fff29bee090 R11: 0000000000000246 R12: 000000001dcd6401
R13: 00007ff7e7f16fd0 R14: 000055a5e2b9c1a0 R15: 00000000000000b6
</TASK>
That was followed by this WARN:
------------[ cut here ]------------
rcuref - imbalanced put()
WARNING: CPU: 2 PID: 4129172 at lib/rcuref.c:266 rcuref_put_slowpath+0x55/0x70
CPU: 2 UID: 1000 PID: 4129172 Comm: cargo Not tainted 6.16.0-rc2-nolto-lockdep-00047-g52da431bf03b #1 PREEMPT
Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
RIP: 0010:rcuref_put_slowpath+0x55/0x70
Code: 00 00 00 c0 73 2a 85 f6 79 06 c7 07 00 00 00 a0 31 c0 c3 53 48 89 fb 48 c7 c7 da 7f 32 83 c6 05 7f 9c 35 02 01 e8 1b 83 9f ff <0f> 0b 48 89 df 5b 31 c0 c7 07 00 00 00 e0 c3 cc cc cc cc cc cc cc
RSP: 0018:ffffc90026e7fca8 EFLAGS: 00010282
RAX: 0000000000000019 RBX: ffff8881410ec000 RCX: 0000000000000027
RDX: 00000000ffff7fff RSI: 0000000000000002 RDI: ffff88901fc9c008
RBP: 0000000000000000 R08: 0000000000007fff R09: ffffffff83676870
R10: 0000000000017ffd R11: 00000000ffff7fff R12: 00000000000000b7
R13: 000055a5e2b9c1a0 R14: ffff8881410ecdc0 R15: 0000000000000001
FS: 00007ff7e875c600(0000) GS:ffff88909b96a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd4b8001028 CR3: 0000000fd7d31000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
<TASK>
futex_private_hash_put+0xa7/0xc0
futex_wait_setup+0x1c0/0x200
__futex_wait+0x63/0x120
? __futex_wake_mark+0x40/0x40
futex_wait+0x5b/0xd0
? hrtimer_dummy_timeout+0x10/0x10
do_futex+0x86/0x120
__se_sys_futex+0x10d/0x180
? entry_SYSCALL_64_after_hwframe+0x4b/0x53
do_syscall_64+0x6a/0x1070
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7ff7e7ffb779
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 86 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007fff29bee078 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: ffffffffffffffda RBX: 00007ff7e7ffb760 RCX: 00007ff7e7ffb779
RDX: 00000000000000b6 RSI: 0000000000000089 RDI: 000055a5e2b9c1a0
RBP: 00007fff29bee0d0 R08: 0000000000000000 R09: 00007fffffffffff
R10: 00007fff29bee090 R11: 0000000000000246 R12: 000000001dcd6401
R13: 00007ff7e7f16fd0 R14: 000055a5e2b9c1a0 R15: 00000000000000b6
</TASK>
irq event stamp: 59385407
hardirqs last enabled at (59385407): [<ffffffff8274264c>] _raw_spin_unlock_irqrestore+0x2c/0x50
hardirqs last disabled at (59385406): [<ffffffff8274250d>] _raw_spin_lock_irqsave+0x1d/0x60
softirqs last enabled at (59341786): [<ffffffff8133cc1e>] __irq_exit_rcu+0x4e/0xd0
softirqs last disabled at (59341781): [<ffffffff8133cc1e>] __irq_exit_rcu+0x4e/0xd0
---[ end trace 0000000000000000 ]---
The oops after that is from a different task this time, but it just
looks like slab corruption:
BUG: unable to handle page fault for address: 0000000000001300
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP
CPU: 4 UID: 1000 PID: 4170542 Comm: zstd Tainted: G W 6.16.0-rc2-nolto-lockdep-00047-g52da431bf03b #1 PREEMPT
Tainted: [W]=WARN
Hardware name: ASRock B850 Pro-A/B850 Pro-A, BIOS 3.11 11/12/2024
RIP: 0010:__kvmalloc_node_noprof+0x1a2/0x4a0
Code: 0f 84 a3 01 00 00 41 83 f8 ff 74 10 48 8b 03 48 c1 e8 3f 41 39 c0 0f 85 8d 01 00 00 41 8b 46 28 49 8b 36 48 8d 4d 20 48 89 ea <4a> 8b 1c 20 4c 89 e0 65 48 0f c7 0e 74 4e eb 9f 41 83 f8 ff 75 b4
RSP: 0018:ffffc90036a87c00 EFLAGS: 00010246
RAX: 0000000000001000 RBX: ffffea0005043a00 RCX: 0000000000054764
RDX: 0000000000054744 RSI: ffffffff84347c80 RDI: 0000000000000080
RBP: 0000000000054744 R08: 00000000ffffffff R09: 0000000000000000
R10: ffffffff8140972d R11: 0000000000000000 R12: 0000000000000300
R13: 00000000004029c0 R14: ffff888100044800 R15: 0000000000001040
FS: 00007fca63240740(0000) GS:ffff88909b9ea000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001300 CR3: 00000004fcac3000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
<TASK>
? futex_hash_allocate+0x17f/0x400
futex_hash_allocate+0x17f/0x400
? futex_hash_allocate+0x4d/0x400
? futex_hash_allocate_default+0x2b/0x1e0
? futex_hash_allocate_default+0x2b/0x1e0
? copy_process+0x35e/0x12a0
? futex_hash_allocate_default+0x2b/0x1e0
? copy_process+0x35e/0x12a0
copy_process+0xcf3/0x12a0
? entry_SYSCALL_64_after_hwframe+0x4b/0x53
kernel_clone+0x7f/0x310
? copy_clone_args_from_user+0x93/0x1e0
? entry_SYSCALL_64_after_hwframe+0x4b/0x53
__se_sys_clone3+0xbb/0xc0
? _copy_to_user+0x1f/0x60
? __se_sys_rt_sigprocmask+0xf2/0x120
? trace_hardirqs_off+0x40/0xb0
do_syscall_64+0x6a/0x1070
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fca6335f7a9
Code: 90 b8 01 00 00 00 b9 01 00 00 00 eb ec 0f 1f 40 00 b8 ea ff ff ff 48 85 ff 74 28 48 85 d2 74 23 49 89 c8 b8 b3 01 00 00 0f 05 <48> 85 c0 7c 14 74 01 c3 31 ed 4c 89 c7 ff d2 48 89 c7 b8 3c 00 00
RSP: 002b:00007ffcfe17fe78 EFLAGS: 00000202 ORIG_RAX: 00000000000001b3
RAX: ffffffffffffffda RBX: 00007fca632e18e0 RCX: 00007fca6335f7a9
RDX: 00007fca632e18e0 RSI: 0000000000000058 RDI: 00007ffcfe17fed0
RBP: 00007fca60f666c0 R08: 00007fca60f666c0 R09: 00007ffcfe17ffc7
R10: 0000000000000008 R11: 0000000000000202 R12: ffffffffffffff88
R13: 0000000000000002 R14: 00007ffcfe17fed0 R15: 00007fca60766000
</TASK>
CR2: 0000000000001300
---[ end trace 0000000000000000 ]---
RIP: 0010:__kvmalloc_node_noprof+0x1a2/0x4a0
Code: 0f 84 a3 01 00 00 41 83 f8 ff 74 10 48 8b 03 48 c1 e8 3f 41 39 c0 0f 85 8d 01 00 00 41 8b 46 28 49 8b 36 48 8d 4d 20 48 89 ea <4a> 8b 1c 20 4c 89 e0 65 48 0f c7 0e 74 4e eb 9f 41 83 f8 ff 75 b4
RSP: 0018:ffffc90036a87c00 EFLAGS: 00010246
RAX: 0000000000001000 RBX: ffffea0005043a00 RCX: 0000000000054764
RDX: 0000000000054744 RSI: ffffffff84347c80 RDI: 0000000000000080
RBP: 0000000000054744 R08: 00000000ffffffff R09: 0000000000000000
R10: ffffffff8140972d R11: 0000000000000000 R12: 0000000000000300
R13: 00000000004029c0 R14: ffff888100044800 R15: 0000000000001040
FS: 00007fca63240740(0000) GS:ffff88909b9ea000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001300 CR3: 00000004fcac3000 CR4: 0000000000750ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---
No lock/rcu splats at all.
> > This is a giant Yocto build, but the comm is always cargo, so hopefully
> > I can run those bits in isolation and hit it more quickly.
>
> If it still explodes without LTO, would you mind trying gcc?
Will do.
Haven't had much luck isolating what triggers it, but if I run two copies
of these large build jobs in a loop, it reliably triggers in 6-8 hours.
Just to be clear, I can only trigger this on the one machine. I ran it
through memtest86+ yesterday and it passed, FWIW, but I'm a little
suspicious of the hardware right now too. I double checked that
everything in the BIOS related to power/perf is at factory settings.
Note that READ_ONLY_THP_FOR_FS and NO_PAGE_MAPCOUNT are both off.
> > Thanks,
> > Calvin
>
> Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-18 16:49 ` Calvin Owens
@ 2025-06-18 17:09 ` Sebastian Andrzej Siewior
2025-06-18 20:56 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-18 17:09 UTC (permalink / raw)
To: Calvin Owens
Cc: linux-kernel, linux-tip-commits, Lai, Yi, Peter Zijlstra (Intel),
x86
On 2025-06-18 09:49:18 [-0700], Calvin Owens wrote:
> Didn't get much out of lockdep unfortunately.
>
> It notices the corruption in the spinlock:
>
> BUG: spinlock bad magic on CPU#2, cargo/4129172
> lock: 0xffff8881410ecdc8, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
Yes, which is what I assumed when I suggested this. But it complains
about bad magic while reporting the magic as 0xdead4ead, which is
SPINLOCK_MAGIC itself. I was expecting any value but this one.
> That was followed by this WARN:
>
> ------------[ cut here ]------------
> rcuref - imbalanced put()
> WARNING: CPU: 2 PID: 4129172 at lib/rcuref.c:266 rcuref_put_slowpath+0x55/0x70
This is "reasonable". If the lock is broken, the remaining memory is
probably garbage anyway. It complains there that a reference was put
on an invalid counter.
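As a toy model of that failure mode (simplified semantics, not the kernel's actual rcuref encoding; all names here are invented): a put() that finds no reference outstanding is an imbalance, which is what the "rcuref - imbalanced put()" warning reports.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy refcount: put() on a counter that is already at or below zero
 * means there was no matching get(), i.e. an imbalanced put. */
struct toy_ref { atomic_int cnt; };

void toy_ref_init(struct toy_ref *r, int n)
{
    atomic_store(&r->cnt, n);
}

void toy_ref_get(struct toy_ref *r)
{
    atomic_fetch_add(&r->cnt, 1);
}

/* Returns true if this put released the last reference; sets
 * *imbalanced if no reference was held (the warning case). */
bool toy_ref_put(struct toy_ref *r, bool *imbalanced)
{
    int old = atomic_fetch_sub(&r->cnt, 1);
    if (old <= 0) {
        *imbalanced = true;   /* put without a matching get */
        return false;
    }
    return old == 1;
}
```

If the spinlock next to the counter is already corrupted, the counter is plausibly garbage too, so the imbalance warning is a downstream symptom rather than the root cause.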
…
> The oops after that is from a different task this time, but it just
> looks like slab corruption:
>
…
The previous one complained about an invalid free from within exec.
> No lock/rcu splats at all.
It exploded before that could happen.
> > If it still explodes without LTO, would you mind trying gcc?
>
> Will do.
Thank you.
> Haven't had much luck isolating what triggers it, but if I run two copies
> of these large build jobs in a loop, it reliably triggers in 6-8 hours.
>
> Just to be clear, I can only trigger this on the one machine. I ran it
> through memtest86+ yesterday and it passed, FWIW, but I'm a little
> suspicious of the hardware right now too. I double checked that
> everything in the BIOS related to power/perf is at factory settings.
But then it is kind of odd that it happens only with the futex code.
Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-18 17:09 ` Sebastian Andrzej Siewior
@ 2025-06-18 20:56 ` Calvin Owens
2025-06-18 22:47 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-18 20:56 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
( Dropping linux-tip-commits from Cc )
On Wednesday 06/18 at 19:09 +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-18 09:49:18 [-0700], Calvin Owens wrote:
> > Didn't get much out of lockdep unfortunately.
> >
> > It notices the corruption in the spinlock:
> >
> > BUG: spinlock bad magic on CPU#2, cargo/4129172
> > lock: 0xffff8881410ecdc8, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
>
> Yes, which is what I assumed when I suggested this. But it complains
> about bad magic while reporting the magic as 0xdead4ead, which is
> SPINLOCK_MAGIC itself. I was expecting any value but this one.
>
> > That was followed by this WARN:
> >
> > ------------[ cut here ]------------
> > rcuref - imbalanced put()
> > WARNING: CPU: 2 PID: 4129172 at lib/rcuref.c:266 rcuref_put_slowpath+0x55/0x70
>
> This is "reasonable". If the lock is broken, the remaining memory is
> probably garbage anyway. It complains there that a reference was put
> on an invalid counter.
>
> …
> > The oops after that is from a different task this time, but it just
> > looks like slab corruption:
> >
> …
>
> The previous one complained about an invalid free from within exec.
>
> > No lock/rcu splats at all.
> It exploded before that could happen.
>
> > > If it still explodes without LTO, would you mind trying gcc?
> >
> > Will do.
>
> Thank you.
>
> > Haven't had much luck isolating what triggers it, but if I run two copies
> > of these large build jobs in a loop, it reliably triggers in 6-8 hours.
> >
> > Just to be clear, I can only trigger this on the one machine. I ran it
> > through memtest86+ yesterday and it passed, FWIW, but I'm a little
> > suspicious of the hardware right now too. I double checked that
> > everything in the BIOS related to power/perf is at factory settings.
>
> But then it is kind of odd that it happens only with the futex code.
I think the missing ingredient was PREEMPT: the 2nd machine has been
trying for over a day, but I rebuilt its kernel with PREEMPT_FULL this
morning (still llvm), and it just hit a similar oops.
Oops: general protection fault, probably for non-canonical address 0x74656d2f74696750: 0000 [#1] SMP
CPU: 10 UID: 1000 PID: 542469 Comm: cargo Not tainted 6.16.0-rc2-00045-g4663747812d1 #1 PREEMPT
Hardware name: Gigabyte Technology Co., Ltd. A620I AX/A620I AX, BIOS F3 07/10/2023
RIP: 0010:futex_hash+0x23/0x90
Code: 1f 84 00 00 00 00 00 41 57 41 56 53 48 89 fb e8 b3 04 fe ff 48 89 df 31 f6 e8 79 00 00 00 48 8b 78 18 49 89 c6 48 85 ff 74 55 <80> 7f 21 00 75 4f f0 83 07 01 79 49 e8 fc 17 37 00 84 c0 75 40 e8
RSP: 0018:ffffc9002e46fcd8 EFLAGS: 00010202
RAX: ffff888a68e25c40 RBX: ffffc9002e46fda0 RCX: 0000000036616534
RDX: 00000000ffffffff RSI: 0000000910180c00 RDI: 74656d2f7469672f
RBP: 00000000000000b0 R08: 000000000318dd0d R09: 000000002e117cb0
R10: 00000000318dd0d0 R11: 000000000000001b R12: 0000000000000000
R13: 000055e79b431170 R14: ffff888a68e25c40 R15: ffff8881ea0ae900
FS: 00007f1b6037b580(0000) GS:ffff8898a528b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555830170098 CR3: 0000000d73e93000 CR4: 0000000000350ef0
Call Trace:
<TASK>
futex_wait_setup+0x7e/0x1d0
__futex_wait+0x63/0x120
? __futex_wake_mark+0x40/0x40
futex_wait+0x5b/0xd0
? hrtimer_dummy_timeout+0x10/0x10
do_futex+0x86/0x120
__x64_sys_futex+0x10a/0x180
do_syscall_64+0x48/0x4f0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
I also enabled DEBUG_PREEMPT, but that didn't print any additional info.
I'm testing a GCC kernel on both machines now.
Thanks,
Calvin
> Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-18 20:56 ` Calvin Owens
@ 2025-06-18 22:47 ` Calvin Owens
2025-06-19 21:07 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-18 22:47 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
On Wednesday 06/18 at 13:56 -0700, Calvin Owens wrote:
> ( Dropping linux-tip-commits from Cc )
>
> On Wednesday 06/18 at 19:09 +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-06-18 09:49:18 [-0700], Calvin Owens wrote:
> > > Didn't get much out of lockdep unfortunately.
> > >
> > > It notices the corruption in the spinlock:
> > >
> > > BUG: spinlock bad magic on CPU#2, cargo/4129172
> > > lock: 0xffff8881410ecdc8, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
> >
> > Yes, which is what I assumed when I suggested this. But it complains
> > about bad magic while reporting the magic as 0xdead4ead, which is
> > SPINLOCK_MAGIC itself. I was expecting any value but this one.
> >
> > > That was followed by this WARN:
> > >
> > > ------------[ cut here ]------------
> > > rcuref - imbalanced put()
> > > WARNING: CPU: 2 PID: 4129172 at lib/rcuref.c:266 rcuref_put_slowpath+0x55/0x70
> >
> > This is "reasonable". If the lock is broken, the remaining memory is
> > probably garbage anyway. It complains there that a reference was put
> > on an invalid counter.
> >
> > …
> > > The oops after that is from a different task this time, but it just
> > > looks like slab corruption:
> > >
> > …
> >
> > The previous one complained about an invalid free from within exec.
> >
> > > No lock/rcu splats at all.
> > It exploded before that could happen.
> >
> > > > If it still explodes without LTO, would you mind trying gcc?
> > >
> > > Will do.
> >
> > Thank you.
> >
> > > Haven't had much luck isolating what triggers it, but if I run two copies
> > > of these large build jobs in a loop, it reliably triggers in 6-8 hours.
> > >
> > > Just to be clear, I can only trigger this on the one machine. I ran it
> > > through memtest86+ yesterday and it passed, FWIW, but I'm a little
> > > suspicious of the hardware right now too. I double checked that
> > > everything in the BIOS related to power/perf is at factory settings.
> >
> > But then it is kind of odd that it happens only with the futex code.
>
> I think the missing ingredient was PREEMPT: the 2nd machine has been
> trying for over a day, but I rebuilt its kernel with PREEMPT_FULL this
> morning (still llvm), and it just hit a similar oops.
>
> Oops: general protection fault, probably for non-canonical address 0x74656d2f74696750: 0000 [#1] SMP
> CPU: 10 UID: 1000 PID: 542469 Comm: cargo Not tainted 6.16.0-rc2-00045-g4663747812d1 #1 PREEMPT
> Hardware name: Gigabyte Technology Co., Ltd. A620I AX/A620I AX, BIOS F3 07/10/2023
> RIP: 0010:futex_hash+0x23/0x90
> Code: 1f 84 00 00 00 00 00 41 57 41 56 53 48 89 fb e8 b3 04 fe ff 48 89 df 31 f6 e8 79 00 00 00 48 8b 78 18 49 89 c6 48 85 ff 74 55 <80> 7f 21 00 75 4f f0 83 07 01 79 49 e8 fc 17 37 00 84 c0 75 40 e8
> RSP: 0018:ffffc9002e46fcd8 EFLAGS: 00010202
> RAX: ffff888a68e25c40 RBX: ffffc9002e46fda0 RCX: 0000000036616534
> RDX: 00000000ffffffff RSI: 0000000910180c00 RDI: 74656d2f7469672f
> RBP: 00000000000000b0 R08: 000000000318dd0d R09: 000000002e117cb0
> R10: 00000000318dd0d0 R11: 000000000000001b R12: 0000000000000000
> R13: 000055e79b431170 R14: ffff888a68e25c40 R15: ffff8881ea0ae900
> FS: 00007f1b6037b580(0000) GS:ffff8898a528b000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000555830170098 CR3: 0000000d73e93000 CR4: 0000000000350ef0
> Call Trace:
> <TASK>
> futex_wait_setup+0x7e/0x1d0
> __futex_wait+0x63/0x120
> ? __futex_wake_mark+0x40/0x40
> futex_wait+0x5b/0xd0
> ? hrtimer_dummy_timeout+0x10/0x10
> do_futex+0x86/0x120
> __x64_sys_futex+0x10a/0x180
> do_syscall_64+0x48/0x4f0
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> I also enabled DEBUG_PREEMPT, but that didn't print any additional info.
>
> I'm testing a GCC kernel on both machines now.
Machine #2 oopsed with the GCC kernel after just over an hour:
BUG: unable to handle page fault for address: ffff88a91eac4458
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 4401067 P4D 4401067 PUD 0
Oops: Oops: 0000 [#1] SMP
CPU: 4 UID: 1000 PID: 881756 Comm: cargo Not tainted 6.16.0-rc2-gcc-00045-g4663747812d1 #1 PREEMPT
Hardware name: Gigabyte Technology Co., Ltd. A620I AX/A620I AX, BIOS F3 07/10/2023
RIP: 0010:futex_hash+0x16/0x90
Code: 4d 85 e4 74 99 4c 89 e7 e8 07 51 80 00 eb 8f 0f 1f 44 00 00 41 54 55 48 89 fd 53 e8 14 f2 fd ff 48 89 ef 31 f6 e8 da f6 ff ff <48> 8b 78 18 48 89 c3 48 85 ff 74 0c 80 7f 21 00 75 06 f0 83 07 01
RSP: 0018:ffffc9002973fcf8 EFLAGS: 00010282
RAX: ffff88a91eac4440 RBX: ffff888d5a170000 RCX: 00000000add26115
RDX: 0000001c49080440 RSI: 00000000236034e8 RDI: 00000000f1a67530
RBP: ffffc9002973fdb8 R08: 00000000eb13f1af R09: ffffffff829c0fc0
R10: 0000000000000246 R11: 0000000000000000 R12: ffff888d5a1700f0
R13: ffffc9002973fdb8 R14: ffffc9002973fd70 R15: 0000000000000002
FS: 00007f64614ba9c0(0000) GS:ffff888cccceb000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88a91eac4458 CR3: 000000015e508000 CR4: 0000000000350ef0
Call Trace:
<TASK>
futex_wait_setup+0x51/0x1b0
__futex_wait+0xc0/0x120
? __futex_wake_mark+0x50/0x50
futex_wait+0x55/0xe0
? hrtimer_setup_sleeper_on_stack+0x30/0x30
do_futex+0x91/0x120
__x64_sys_futex+0xfc/0x1d0
do_syscall_64+0x44/0x1130
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f64615bd74d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab c6 0b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffea50a6cc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: ffffffffffffffda RBX: 00007f64615bd730 RCX: 00007f64615bd74d
RDX: 0000000000000080 RSI: 0000000000000089 RDI: 000055bb7e399d90
RBP: 00007ffea50a6d20 R08: 0000000000000000 R09: 00007ffeffffffff
R10: 00007ffea50a6ce0 R11: 0000000000000246 R12: 000000001dcd6401
R13: 00007f64614e3710 R14: 000055bb7e399d90 R15: 0000000000000080
</TASK>
CR2: ffff88a91eac4458
---[ end trace 0000000000000000 ]---
Two CPUs oopsed at once with that same stack, the config and vmlinux are
uploaded in the git (https://github.com/jcalvinowens/lkml-debug-616).
I tried reproducing with DEBUG_PAGEALLOC, but the bug doesn't happen
with it turned on.
> Thanks,
> Calvin
>
> > Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-18 22:47 ` Calvin Owens
@ 2025-06-19 21:07 ` Calvin Owens
2025-06-20 10:31 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-19 21:07 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
On Wednesday 06/18 at 15:47 -0700, Calvin Owens wrote:
> On Wednesday 06/18 at 13:56 -0700, Calvin Owens wrote:
> > ( Dropping linux-tip-commits from Cc )
> >
> > On Wednesday 06/18 at 19:09 +0200, Sebastian Andrzej Siewior wrote:
> > > On 2025-06-18 09:49:18 [-0700], Calvin Owens wrote:
> > > > Didn't get much out of lockdep unfortunately.
> > > >
> > > > It notices the corruption in the spinlock:
> > > >
> > > > BUG: spinlock bad magic on CPU#2, cargo/4129172
> > > > lock: 0xffff8881410ecdc8, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
> > >
> > > Yes. Which is what I assumed while I suggested this. But it complains
> > > about bad magic. It says the magic is 0xdead4ead but this is
> > > SPINLOCK_MAGIC. I was expecting any value but this one.
> > >
> > > > That was followed by this WARN:
> > > >
> > > > ------------[ cut here ]------------
> > > > rcuref - imbalanced put()
> > > > WARNING: CPU: 2 PID: 4129172 at lib/rcuref.c:266 rcuref_put_slowpath+0x55/0x70
> > >
> > > This is "reasonable". If the lock is broken, the remaining memory is
> > > probably garbage anyway. It complains there that the reference was put
> > > while the counter was invalid.
> > >
> > > …
> > > > The oops after that is from a different task this time, but it just
> > > > looks like slab corruption:
> > > >
> > > …
> > >
> > > The previous one complained about an invalid free from within the exec.
> > >
> > > > No lock/rcu splats at all.
> > > It exploded before that could happen.
> > >
> > > > > If it still explodes without LTO, would you mind trying gcc?
> > > >
> > > > Will do.
> > >
> > > Thank you.
> > >
> > > > Haven't had much luck isolating what triggers it, but if I run two copies
> > > > of these large build jobs in a loop, it reliably triggers in 6-8 hours.
> > > >
> > > > Just to be clear, I can only trigger this on the one machine. I ran it
> > > > through memtest86+ yesterday and it passed, FWIW, but I'm a little
> > > > suspicious of the hardware right now too. I double checked that
> > > > everything in the BIOS related to power/perf is at factory settings.
> > >
> > > But then it is kind of odd that it happens only with the futex code.
> >
> > I think the missing ingredient was PREEMPT: the 2nd machine has been
> > trying for over a day, but I rebuilt its kernel with PREEMPT_FULL this
> > morning (still llvm), and it just hit a similar oops.
> >
> > Oops: general protection fault, probably for non-canonical address 0x74656d2f74696750: 0000 [#1] SMP
> > CPU: 10 UID: 1000 PID: 542469 Comm: cargo Not tainted 6.16.0-rc2-00045-g4663747812d1 #1 PREEMPT
> > Hardware name: Gigabyte Technology Co., Ltd. A620I AX/A620I AX, BIOS F3 07/10/2023
> > RIP: 0010:futex_hash+0x23/0x90
> > Code: 1f 84 00 00 00 00 00 41 57 41 56 53 48 89 fb e8 b3 04 fe ff 48 89 df 31 f6 e8 79 00 00 00 48 8b 78 18 49 89 c6 48 85 ff 74 55 <80> 7f 21 00 75 4f f0 83 07 01 79 49 e8 fc 17 37 00 84 c0 75 40 e8
> > RSP: 0018:ffffc9002e46fcd8 EFLAGS: 00010202
> > RAX: ffff888a68e25c40 RBX: ffffc9002e46fda0 RCX: 0000000036616534
> > RDX: 00000000ffffffff RSI: 0000000910180c00 RDI: 74656d2f7469672f
> > RBP: 00000000000000b0 R08: 000000000318dd0d R09: 000000002e117cb0
> > R10: 00000000318dd0d0 R11: 000000000000001b R12: 0000000000000000
> > R13: 000055e79b431170 R14: ffff888a68e25c40 R15: ffff8881ea0ae900
> > FS: 00007f1b6037b580(0000) GS:ffff8898a528b000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000555830170098 CR3: 0000000d73e93000 CR4: 0000000000350ef0
> > Call Trace:
> > <TASK>
> > futex_wait_setup+0x7e/0x1d0
> > __futex_wait+0x63/0x120
> > ? __futex_wake_mark+0x40/0x40
> > futex_wait+0x5b/0xd0
> > ? hrtimer_dummy_timeout+0x10/0x10
> > do_futex+0x86/0x120
> > __x64_sys_futex+0x10a/0x180
> > do_syscall_64+0x48/0x4f0
> > entry_SYSCALL_64_after_hwframe+0x4b/0x53
> >
> > I also enabled DEBUG_PREEMPT, but that didn't print any additional info.
> >
> > I'm testing a GCC kernel on both machines now.
>
> Machine #2 oopsed with the GCC kernel after just over an hour:
>
> BUG: unable to handle page fault for address: ffff88a91eac4458
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 4401067 P4D 4401067 PUD 0
> Oops: Oops: 0000 [#1] SMP
> CPU: 4 UID: 1000 PID: 881756 Comm: cargo Not tainted 6.16.0-rc2-gcc-00045-g4663747812d1 #1 PREEMPT
> Hardware name: Gigabyte Technology Co., Ltd. A620I AX/A620I AX, BIOS F3 07/10/2023
> RIP: 0010:futex_hash+0x16/0x90
> Code: 4d 85 e4 74 99 4c 89 e7 e8 07 51 80 00 eb 8f 0f 1f 44 00 00 41 54 55 48 89 fd 53 e8 14 f2 fd ff 48 89 ef 31 f6 e8 da f6 ff ff <48> 8b 78 18 48 89 c3 48 85 ff 74 0c 80 7f 21 00 75 06 f0 83 07 01
> RSP: 0018:ffffc9002973fcf8 EFLAGS: 00010282
> RAX: ffff88a91eac4440 RBX: ffff888d5a170000 RCX: 00000000add26115
> RDX: 0000001c49080440 RSI: 00000000236034e8 RDI: 00000000f1a67530
> RBP: ffffc9002973fdb8 R08: 00000000eb13f1af R09: ffffffff829c0fc0
> R10: 0000000000000246 R11: 0000000000000000 R12: ffff888d5a1700f0
> R13: ffffc9002973fdb8 R14: ffffc9002973fd70 R15: 0000000000000002
> FS: 00007f64614ba9c0(0000) GS:ffff888cccceb000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff88a91eac4458 CR3: 000000015e508000 CR4: 0000000000350ef0
> Call Trace:
> <TASK>
> futex_wait_setup+0x51/0x1b0
> __futex_wait+0xc0/0x120
> ? __futex_wake_mark+0x50/0x50
> futex_wait+0x55/0xe0
> ? hrtimer_setup_sleeper_on_stack+0x30/0x30
> do_futex+0x91/0x120
> __x64_sys_futex+0xfc/0x1d0
> do_syscall_64+0x44/0x1130
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> RIP: 0033:0x7f64615bd74d
> Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab c6 0b 00 f7 d8 64 89 01 48
> RSP: 002b:00007ffea50a6cc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> RAX: ffffffffffffffda RBX: 00007f64615bd730 RCX: 00007f64615bd74d
> RDX: 0000000000000080 RSI: 0000000000000089 RDI: 000055bb7e399d90
> RBP: 00007ffea50a6d20 R08: 0000000000000000 R09: 00007ffeffffffff
> R10: 00007ffea50a6ce0 R11: 0000000000000246 R12: 000000001dcd6401
> R13: 00007f64614e3710 R14: 000055bb7e399d90 R15: 0000000000000080
> </TASK>
> CR2: ffff88a91eac4458
> ---[ end trace 0000000000000000 ]---
>
> Two CPUs oopsed at once with that same stack, the config and vmlinux are
> uploaded in the git (https://github.com/jcalvinowens/lkml-debug-616).
>
> I tried reproducing with DEBUG_PAGEALLOC, but the bug doesn't happen
> with it turned on.
I've been rotating through debug options one at a time, I've reproduced
the oops with the following which yielded no additional console output:
* DEBUG_VM
* PAGE_POISONING (and page_poison=1)
* DEBUG_ATOMIC_SLEEP
* DEBUG_PREEMPT
(No poison patterns showed up at all in the oops traces either.)
I am not able to reproduce the oops at all with these options:
* DEBUG_PAGEALLOC_ENABLE_DEFAULT
* SLUB_DEBUG_ON
I'm also experimenting with stress-ng as a reproducer, no luck so far.
A third machine with an older Skylake CPU died overnight, but nothing
was logged over netconsole. Luckily it actually has a serial header on
the motherboard, so that's wired up and it's running again, maybe it
dies in a different way that might be a better clue...
> > Thanks,
> > Calvin
> >
> > > Sebastian
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-19 21:07 ` Calvin Owens
@ 2025-06-20 10:31 ` Sebastian Andrzej Siewior
2025-06-20 18:56 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-20 10:31 UTC (permalink / raw)
To: Calvin Owens; +Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
On 2025-06-19 14:07:30 [-0700], Calvin Owens wrote:
> > Machine #2 oopsed with the GCC kernel after just over an hour:
> >
> > BUG: unable to handle page fault for address: ffff88a91eac4458
> > RIP: 0010:futex_hash+0x16/0x90
…
> > Call Trace:
> > <TASK>
> > futex_wait_setup+0x51/0x1b0
…
The futex_hash_bucket pointer has an invalid ->priv pointer.
This could be use-after-free or double-free. I've been looking through
your config and you don't have CONFIG_SLAB_FREELIST_* set. I don't
remember which one but one of the two has a "primitive" double free
detection.
…
> I am not able to reproduce the oops at all with these options:
>
> * DEBUG_PAGEALLOC_ENABLE_DEFAULT
> * SLUB_DEBUG_ON
SLUB_DEBUG_ON is something that would "reliably" notice double free.
If you drop SLUB_DEBUG_ON (but keep SLUB_DEBUG) then you can boot with
slab_debug=f keeping only the consistency checks. The "poison" checks
would be excluded for instance. That allocation is kvzalloc() but it
should be small on your machine to avoid vmalloc() and use only
kmalloc().
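For reference, a minimal sketch of what that boot setup could look like (the file path and bootloader are assumptions; the `slab_debug=f` flag and its "consistency checks only" meaning are as described above):

```
# /etc/default/grub (assumed GRUB setup; any bootloader works the same way)
# f = consistency/sanity checks only; poisoning and red zoning stay off
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} slab_debug=f"
# then regenerate the config, e.g. update-grub or grub2-mkconfig, and reboot
```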
> I'm also experimenting with stress-ng as a reproducer, no luck so far.
Not sure what you are using there. I think cargo does:
- lock/ unlock in threads
- create new thread which triggers auto-resize
- auto-resize gets delayed due to lock/ unlock in other threads (the
reference is held)
And now something happens leading to what we see.
_Maybe_ the cargo application terminates/ execs before the new struct is
assigned in an unexpected way.
The regular hash bucket has reference counting so it should raise
warnings if it goes wrong. I haven't seen those.
> A third machine with an older Skylake CPU died overnight, but nothing
> was logged over netconsole. Luckily it actually has a serial header on
> the motherboard, so that's wired up and it's running again, maybe it
> dies in a different way that might be a better clue...
So far I *think* that cargo does something that I don't expect and this
leads to a memory double-free. The SLUB_DEBUG_ON hopefully delays the
process long enough that the double free does not trigger.
I think I'm going to look for a random Rust package that is using cargo
for building (unless you have a recommendation) and look what it is
doing. It was always cargo after all. Maybe this brings some light.
> > > Thanks,
> > > Calvin
Sebastian
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-20 10:31 ` Sebastian Andrzej Siewior
@ 2025-06-20 18:56 ` Calvin Owens
2025-06-21 1:02 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-20 18:56 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
On Friday 06/20 at 12:31 +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-19 14:07:30 [-0700], Calvin Owens wrote:
> > > Machine #2 oopsed with the GCC kernel after just over an hour:
> > >
> > > BUG: unable to handle page fault for address: ffff88a91eac4458
> > > RIP: 0010:futex_hash+0x16/0x90
> …
> > > Call Trace:
> > > <TASK>
> > > futex_wait_setup+0x51/0x1b0
> …
>
> The futex_hash_bucket pointer has an invalid ->priv pointer.
> This could be use-after-free or double-free. I've been looking through
> your config and you don't have CONFIG_SLAB_FREELIST_* set. I don't
> remember which one but one of the two has a "primitive" double free
> detection.
>
> …
> > I am not able to reproduce the oops at all with these options:
> >
> > * DEBUG_PAGEALLOC_ENABLE_DEFAULT
> > * SLUB_DEBUG_ON
>
> SLUB_DEBUG_ON is something that would "reliably" notice double free.
> If you drop SLUB_DEBUG_ON (but keep SLUB_DEBUG) then you can boot with
> slab_debug=f keeping only the consistency checks. The "poison" checks
> would be excluded for instance. That allocation is kvzalloc() but it
> should be small on your machine to avoid vmalloc() and use only
> kmalloc().
I'll try slab_debug=f next.
> > I'm also experimenting with stress-ng as a reproducer, no luck so far.
>
> Not sure what you are using there. I think cargo does:
> - lock/ unlock in threads
> - create new thread which triggers auto-resize
> - auto-resize gets delayed due to lock/ unlock in other threads (the
> reference is held)
I've tried various combinations of --io, --fork, --exec, --futex, --cpu,
--vm, and --forkheavy. It's not mixing the operations in threads as I
understand it, so I guess it won't ever do anything like what you're
describing no matter what stressors I run?
I did get this message once, something I haven't seen before:
[33024.247423] [ T281] sched: DL replenish lagged too much
...but maybe that's my fault for overloading it so much.
> And now something happens leading to what we see.
> _Maybe_ the cargo application terminates/ execs before the new struct is
> assigned in an unexpected way.
> The regular hash bucket has reference counting so it should raise
> warnings if it goes wrong. I haven't seen those.
>
> > A third machine with an older Skylake CPU died overnight, but nothing
> > was logged over netconsole. Luckily it actually has a serial header on
> > the motherboard, so that's wired up and it's running again, maybe it
> > dies in a different way that might be a better clue...
>
> So far I *think* that cargo does something that I don't expect and this
> leads to a memory double-free. The SLUB_DEBUG_ON hopefully delays the
> process long enough that the double free does not trigger.
>
> I think I'm going to look for a random Rust package that is using cargo
> for building (unless you have a recommendation) and look what it is
> doing. It was always cargo after all. Maybe this brings some light.
The list of things in my big build that use cargo is pretty short:
=== Dependendency Snapshot ===
Dep =mc:house:cargo-native.do_install
Package=mc:house:cargo-native.do_populate_sysroot
RDep =mc:house:cargo-c-native.do_prepare_recipe_sysroot
mc:house:cargo-native.do_create_spdx
mc:house:cbindgen-native.do_prepare_recipe_sysroot
mc:house:librsvg-native.do_prepare_recipe_sysroot
mc:house:librsvg.do_prepare_recipe_sysroot
mc:house:libstd-rs.do_prepare_recipe_sysroot
mc:house:python3-maturin-native.do_prepare_recipe_sysroot
mc:house:python3-maturin-native.do_populate_sysroot
mc:house:python3-rpds-py.do_prepare_recipe_sysroot
mc:house:python3-setuptools-rust-native.do_prepare_recipe_sysroot
I've tried building each of those targets alone (and all of them
together) in a loop, but that hasn't triggered anything. I guess that
other concurrent builds are necessary to trigger whatever this is.
I tried using stress-ng --vm and --cpu together to "load up" the machine
while running the isolated targets, but that hasn't worked either.
If you want to run *exactly* what I am, clone this unholy mess:
https://github.com/jcalvinowens/meta-house
...setup for yocto and install kas as described here:
https://docs.yoctoproject.org/ref-manual/system-requirements.html#ubuntu-and-debian
https://github.com/jcalvinowens/meta-house/blob/6f6a9c643169fc37ba809f7230261d0e5255b6d7/README.md#kas
...and run (for the 32-thread machine):
BB_NUMBER_THREADS="48" PARALLEL_MAKE="-j 36" kas build kas/walnascar.yaml -- -k
Fair warning, it needs a *lot* of RAM at the high concurrency, I have
96GB with 128GB of swap to spill into. It needs ~500GB of disk space if
it runs to completion and downloads ~15GB of tarballs when it starts.
Annoyingly it won't work if the system compiler is gcc-15 right now (the
version of glib it has won't build, haven't had a chance to fix it yet).
> > > > Thanks,
> > > > Calvin
>
> Sebastian
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-20 18:56 ` Calvin Owens
@ 2025-06-21 1:02 ` Calvin Owens
2025-06-21 7:24 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-21 1:02 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
On Friday 06/20 at 11:56 -0700, Calvin Owens wrote:
> On Friday 06/20 at 12:31 +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-06-19 14:07:30 [-0700], Calvin Owens wrote:
> > > > Machine #2 oopsed with the GCC kernel after just over an hour:
> > > >
> > > > BUG: unable to handle page fault for address: ffff88a91eac4458
> > > > RIP: 0010:futex_hash+0x16/0x90
> > …
> > > > Call Trace:
> > > > <TASK>
> > > > futex_wait_setup+0x51/0x1b0
> > …
> >
> > The futex_hash_bucket pointer has an invalid ->priv pointer.
> > This could be use-after-free or double-free. I've been looking through
> > your config and you don't have CONFIG_SLAB_FREELIST_* set. I don't
> > remember which one but one of the two has a "primitive" double free
> > detection.
> >
> > …
> > > I am not able to reproduce the oops at all with these options:
> > >
> > > * DEBUG_PAGEALLOC_ENABLE_DEFAULT
> > > * SLUB_DEBUG_ON
> >
> > SLUB_DEBUG_ON is something that would "reliably" notice double free.
> > If you drop SLUB_DEBUG_ON (but keep SLUB_DEBUG) then you can boot with
> > slab_debug=f keeping only the consistency checks. The "poison" checks
> > would be excluded for instance. That allocation is kvzalloc() but it
> > should be small on your machine to avoid vmalloc() and use only
> > kmalloc().
>
> I'll try slab_debug=f next.
I just hit the oops with SLUB_DEBUG and slab_debug=f, but nothing new
was logged.
> > > I'm also experimenting with stress-ng as a reproducer, no luck so far.
> >
> > Not sure what you are using there. I think cargo does:
> > - lock/ unlock in threads
> > - create new thread which triggers auto-resize
> > - auto-resize gets delayed due to lock/ unlock in other threads (the
> > reference is held)
>
> I've tried various combinations of --io, --fork, --exec, --futex, --cpu,
> --vm, and --forkheavy. It's not mixing the operations in threads as I
> understand it, so I guess it won't ever do anything like what you're
> describing no matter what stressors I run?
>
> I did get this message once, something I haven't seen before:
>
> [33024.247423] [ T281] sched: DL replenish lagged too much
>
> ...but maybe that's my fault for overloading it so much.
>
> > And now something happens leading to what we see.
> > _Maybe_ the cargo application terminates/ execs before the new struct is
> > assigned in an unexpected way.
> > The regular hash bucket has reference counting so it should raise
> > warnings if it goes wrong. I haven't seen those.
> >
> > > A third machine with an older Skylake CPU died overnight, but nothing
> > > was logged over netconsole. Luckily it actually has a serial header on
> > > the motherboard, so that's wired up and it's running again, maybe it
> > > dies in a different way that might be a better clue...
> >
> > So far I *think* that cargo does something that I don't expect and this
> > leads to a memory double-free. The SLUB_DEBUG_ON hopefully delays the
> > process long enough that the double free does not trigger.
> >
> > I think I'm going to look for a random Rust package that is using cargo
> > for building (unless you have a recommendation) and look what it is
> > doing. It was always cargo after all. Maybe this brings some light.
>
> The list of things in my big build that use cargo is pretty short:
>
> === Dependendency Snapshot ===
> Dep =mc:house:cargo-native.do_install
> Package=mc:house:cargo-native.do_populate_sysroot
> RDep =mc:house:cargo-c-native.do_prepare_recipe_sysroot
> mc:house:cargo-native.do_create_spdx
> mc:house:cbindgen-native.do_prepare_recipe_sysroot
> mc:house:librsvg-native.do_prepare_recipe_sysroot
> mc:house:librsvg.do_prepare_recipe_sysroot
> mc:house:libstd-rs.do_prepare_recipe_sysroot
> mc:house:python3-maturin-native.do_prepare_recipe_sysroot
> mc:house:python3-maturin-native.do_populate_sysroot
> mc:house:python3-rpds-py.do_prepare_recipe_sysroot
> mc:house:python3-setuptools-rust-native.do_prepare_recipe_sysroot
>
> I've tried building each of those targets alone (and all of them
> together) in a loop, but that hasn't triggered anything. I guess that
> other concurrent builds are necessary to trigger whatever this is.
>
> I tried using stress-ng --vm and --cpu together to "load up" the machine
> while running the isolated targets, but that hasn't worked either.
>
> If you want to run *exactly* what I am, clone this unholy mess:
>
> https://github.com/jcalvinowens/meta-house
>
> ...setup for yocto and install kas as described here:
>
> https://docs.yoctoproject.org/ref-manual/system-requirements.html#ubuntu-and-debian
> https://github.com/jcalvinowens/meta-house/blob/6f6a9c643169fc37ba809f7230261d0e5255b6d7/README.md#kas
>
> ...and run (for the 32-thread machine):
>
> BB_NUMBER_THREADS="48" PARALLEL_MAKE="-j 36" kas build kas/walnascar.yaml -- -k
>
> Fair warning, it needs a *lot* of RAM at the high concurrency, I have
> 96GB with 128GB of swap to spill into. It needs ~500GB of disk space if
> it runs to completion and downloads ~15GB of tarballs when it starts.
>
> Annoyingly it won't work if the system compiler is gcc-15 right now (the
> version of glib it has won't build, haven't had a chance to fix it yet).
>
> > > > > Thanks,
> > > > > Calvin
> >
> > Sebastian
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-21 1:02 ` Calvin Owens
@ 2025-06-21 7:24 ` Calvin Owens
2025-06-21 21:01 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 109+ messages in thread
From: Calvin Owens @ 2025-06-21 7:24 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
On Friday 06/20 at 18:02 -0700, Calvin Owens wrote:
> On Friday 06/20 at 11:56 -0700, Calvin Owens wrote:
> > On Friday 06/20 at 12:31 +0200, Sebastian Andrzej Siewior wrote:
> > > On 2025-06-19 14:07:30 [-0700], Calvin Owens wrote:
> > > > > Machine #2 oopsed with the GCC kernel after just over an hour:
> > > > >
> > > > > BUG: unable to handle page fault for address: ffff88a91eac4458
> > > > > RIP: 0010:futex_hash+0x16/0x90
> > > …
> > > > > Call Trace:
> > > > > <TASK>
> > > > > futex_wait_setup+0x51/0x1b0
> > > …
> > >
> > > The futex_hash_bucket pointer has an invalid ->priv pointer.
> > > This could be use-after-free or double-free. I've been looking through
> > > your config and you don't have CONFIG_SLAB_FREELIST_* set. I don't
> > > remember which one but one of the two has a "primitive" double free
> > > detection.
> > >
> > > …
> > > > I am not able to reproduce the oops at all with these options:
> > > >
> > > > * DEBUG_PAGEALLOC_ENABLE_DEFAULT
> > > > * SLUB_DEBUG_ON
> > >
> > > SLUB_DEBUG_ON is something that would "reliably" notice double free.
> > > If you drop SLUB_DEBUG_ON (but keep SLUB_DEBUG) then you can boot with
> > > slab_debug=f keeping only the consistency checks. The "poison" checks
> > > would be excluded for instance. That allocation is kvzalloc() but it
> > > should be small on your machine to avoid vmalloc() and use only
> > > kmalloc().
> >
> > I'll try slab_debug=f next.
>
> I just hit the oops with SLUB_DEBUG and slab_debug=f, but nothing new
> was logged.
I went back to the original GCC config, and set up yocto to log what it
was doing over /dev/kmsg so maybe we can isolate the trigger.
I got a novel oops this time:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP
CPU: 6 UID: 0 PID: 12 Comm: kworker/u128:0 Not tainted 6.16.0-rc2-gcc-00269-g11313e2f7812 #1 PREEMPT
Hardware name: Gigabyte Technology Co., Ltd. A620I AX/A620I AX, BIOS F3 07/10/2023
Workqueue: netns cleanup_net
RIP: 0010:default_device_exit_batch+0xd0/0x2f0
Code: 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 44 00 00 <49> 8b 94 24 40 01 00 00 4c 89 e5 49 8d 84 24 40 01 00 00 48 39 04
RSP: 0018:ffffc900001c7d58 EFLAGS: 00010202
RAX: ffff888f1bacc140 RBX: ffffc900001c7e18 RCX: 0000000000000002
RDX: ffff888165232930 RSI: 0000000000000000 RDI: ffffffff82a00820
RBP: ffff888f1bacc000 R08: 0000036dae5dbcdb R09: ffff8881038c5300
R10: 000000000000036e R11: 0000000000000001 R12: fffffffffffffec0
R13: dead000000000122 R14: dead000000000100 R15: ffffc900001c7dd0
FS: 0000000000000000(0000) GS:ffff888cccd6b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000a414f4000 CR4: 0000000000350ef0
Call Trace:
<TASK>
ops_undo_list+0xd9/0x1e0
cleanup_net+0x1b2/0x2c0
process_one_work+0x148/0x240
worker_thread+0x2d7/0x410
? rescuer_thread+0x500/0x500
kthread+0xd5/0x1e0
? kthread_queue_delayed_work+0x70/0x70
ret_from_fork+0xa0/0xe0
? kthread_queue_delayed_work+0x70/0x70
? kthread_queue_delayed_work+0x70/0x70
ret_from_fork_asm+0x11/0x20
</TASK>
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
2025-06-20 23:47:28 - INFO - ##teamcity[message text='recipe libaio-0.3.113-r0: task do_populate_sysroot: Succeeded' status='NORMAL']
2025-06-20 23:47:28 - ERROR - ##teamcity[message text='recipe libaio-0.3.113-r0: task do_populate_sysroot: Succeeded' status='NORMAL']
RIP: 0010:default_device_exit_batch+0xd0/0x2f0
Code: 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 44 00 00 <49> 8b 94 24 40 01 00 00 4c 89 e5 49 8d 84 24 40 01 00 00 48 39 04
RSP: 0018:ffffc900001c7d58 EFLAGS: 00010202
RAX: ffff888f1bacc140 RBX: ffffc900001c7e18 RCX: 0000000000000002
RDX: ffff888165232930 RSI: 0000000000000000 RDI: ffffffff82a00820
RBP: ffff888f1bacc000 R08: 0000036dae5dbcdb R09: ffff8881038c5300
R10: 000000000000036e R11: 0000000000000001 R12: fffffffffffffec0
R13: dead000000000122 R14: dead000000000100 R15: ffffc900001c7dd0
FS: 0000000000000000(0000) GS:ffff888cccd6b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000361a000 CR4: 0000000000350ef0
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---
Based on subtracting the set of things that had completed do_compile from
the set of things that started, it was building:
clang-native, duktape, linux-upstream, nodejs-native, and zstd
...when it oopsed. The whole 5MB log is in "new-different-oops.txt".
> > > > I'm also experimenting with stress-ng as a reproducer, no luck so far.
> > >
> > > Not sure what you are using there. I think cargo does:
> > > - lock/ unlock in threads
> > > - create new thread which triggers auto-resize
> > > - auto-resize gets delayed due to lock/ unlock in other threads (the
> > > reference is held)
> >
> > I've tried various combinations of --io, --fork, --exec, --futex, --cpu,
> > --vm, and --forkheavy. It's not mixing the operations in threads as I
> > understand it, so I guess it won't ever do anything like what you're
> > describing no matter what stressors I run?
> >
> > I did get this message once, something I haven't seen before:
> >
> > [33024.247423] [ T281] sched: DL replenish lagged too much
> >
> > ...but maybe that's my fault for overloading it so much.
> >
> > > And now something happens leading to what we see.
> > > _Maybe_ the cargo application terminates/ execs before the new struct is
> > > assigned in an unexpected way.
> > > The regular hash bucket has reference counting so it should raise
> > > warnings if it goes wrong. I haven't seen those.
> > >
> > > > A third machine with an older Skylake CPU died overnight, but nothing
> > > > was logged over netconsole. Luckily it actually has a serial header on
> > > > the motherboard, so that's wired up and it's running again, maybe it
> > > > dies in a different way that might be a better clue...
> > >
> > > So far I *think* that cargo does something that I don't expect and this
> > > leads to a memory double-free. The SLUB_DEBUG_ON hopefully delays the
> > > process long enough that the double free does not trigger.
> > >
> > > I think I'm going to look for a random Rust package that is using cargo
> > > for building (unless you have a recommendation) and look what it is
> > > doing. It was always cargo after all. Maybe this brings some light.
> >
> > The list of things in my big build that use cargo is pretty short:
> >
> > === Dependendency Snapshot ===
> > Dep =mc:house:cargo-native.do_install
> > Package=mc:house:cargo-native.do_populate_sysroot
> > RDep =mc:house:cargo-c-native.do_prepare_recipe_sysroot
> > mc:house:cargo-native.do_create_spdx
> > mc:house:cbindgen-native.do_prepare_recipe_sysroot
> > mc:house:librsvg-native.do_prepare_recipe_sysroot
> > mc:house:librsvg.do_prepare_recipe_sysroot
> > mc:house:libstd-rs.do_prepare_recipe_sysroot
> > mc:house:python3-maturin-native.do_prepare_recipe_sysroot
> > mc:house:python3-maturin-native.do_populate_sysroot
> > mc:house:python3-rpds-py.do_prepare_recipe_sysroot
> > mc:house:python3-setuptools-rust-native.do_prepare_recipe_sysroot
> >
> > I've tried building each of those targets alone (and all of them
> > together) in a loop, but that hasn't triggered anything. I guess that
> > other concurrent builds are necessary to trigger whatever this is.
> >
> > I tried using stress-ng --vm and --cpu together to "load up" the machine
> > while running the isolated targets, but that hasn't worked either.
> >
> > If you want to run *exactly* what I am, clone this unholy mess:
> >
> > https://github.com/jcalvinowens/meta-house
> >
> > ...setup for yocto and install kas as described here:
> >
> > https://docs.yoctoproject.org/ref-manual/system-requirements.html#ubuntu-and-debian
> > https://github.com/jcalvinowens/meta-house/blob/6f6a9c643169fc37ba809f7230261d0e5255b6d7/README.md#kas
> >
> > ...and run (for the 32-thread machine):
> >
> > BB_NUMBER_THREADS="48" PARALLEL_MAKE="-j 36" kas build kas/walnascar.yaml -- -k
> >
> > Fair warning, it needs a *lot* of RAM at the high concurrency, I have
> > 96GB with 128GB of swap to spill into. It needs ~500GB of disk space if
> > it runs to completion and downloads ~15GB of tarballs when it starts.
> >
> > Annoyingly it won't work if the system compiler is gcc-15 right now (the
> > version of glib it has won't build; I haven't had a chance to fix it yet).
> >
> > > > > > Thanks,
> > > > > > Calvin
> > >
> > > Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-21 7:24 ` Calvin Owens
@ 2025-06-21 21:01 ` Sebastian Andrzej Siewior
2025-06-22 16:17 ` Calvin Owens
0 siblings, 1 reply; 109+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-21 21:01 UTC (permalink / raw)
To: Calvin Owens; +Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
On 2025-06-21 00:24:14 [-0700], Calvin Owens wrote:
>
> I went back to the original GCC config, and set up yocto to log what it
> was doing over /dev/kmsg so maybe we can isolate the trigger.
>
> I got a novel oops this time:
I think I got it:
Could you please try this:
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 005b040c4791b..b37193653e6b5 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -89,6 +89,7 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
RCU_INIT_POINTER(mm->futex_phash, NULL);
+ mm->futex_phash_new = NULL;
mutex_init(&mm->futex_hash_lock);
}
Sebastian
^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [tip: locking/urgent] futex: Allow to resize the private local hash
2025-06-21 21:01 ` Sebastian Andrzej Siewior
@ 2025-06-22 16:17 ` Calvin Owens
0 siblings, 0 replies; 109+ messages in thread
From: Calvin Owens @ 2025-06-22 16:17 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Lai, Yi, Peter Zijlstra (Intel), x86
On Saturday 06/21 at 23:01 +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-21 00:24:14 [-0700], Calvin Owens wrote:
> >
> > I went back to the original GCC config, and set up yocto to log what it
> > was doing over /dev/kmsg so maybe we can isolate the trigger.
> >
> > I got a novel oops this time:
>
> I think I got it:
>
> Could you please try this:
That did it!
Tested-by: Calvin Owens <calvin@wbinvd.org>
This was a fun little diversion, thanks :)
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 005b040c4791b..b37193653e6b5 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -89,6 +89,7 @@ void futex_hash_free(struct mm_struct *mm);
> static inline void futex_mm_init(struct mm_struct *mm)
> {
> RCU_INIT_POINTER(mm->futex_phash, NULL);
> + mm->futex_phash_new = NULL;
> mutex_init(&mm->futex_hash_lock);
> }
>
>
> Sebastian
^ permalink raw reply [flat|nested] 109+ messages in thread
end of thread, other threads:[~2025-06-22 16:17 UTC | newest]
Thread overview: 109+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-16 16:29 [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 01/21] rcuref: Provide rcuref_is_dead() Sebastian Andrzej Siewior
2025-05-05 21:09 ` André Almeida
2025-05-08 10:34 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 02/21] mm: Add vmalloc_huge_node() Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup() Sebastian Andrzej Siewior
2025-05-05 21:43 ` André Almeida
2025-05-16 12:53 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 04/21] futex: Pull futex_hash() out of futex_q_lock() Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 05/21] futex: Create hb scopes Sebastian Andrzej Siewior
2025-05-06 23:45 ` André Almeida
2025-05-16 12:20 ` Sebastian Andrzej Siewior
2025-05-16 13:23 ` Peter Zijlstra
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 06/21] futex: Create futex_hash() get/put class Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 07/21] futex: Create private_hash() " Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 08/21] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 09/21] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-05-08 19:06 ` [PATCH v12 10/21] " André Almeida
2025-05-16 12:18 ` Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 11/21] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 12/21] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 13/21] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-05-08 20:32 ` [PATCH v12 14/21] " André Almeida
2025-05-16 10:49 ` Sebastian Andrzej Siewior
2025-05-16 13:00 ` André Almeida
2025-05-10 8:45 ` [PATCH] futex: Fix futex_mm_init() build failure on older compilers, remove rcu_assign_pointer() Ingo Molnar
2025-05-11 8:11 ` [tip: locking/futex] futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init() tip-bot2 for Ingo Molnar
2025-06-01 7:39 ` [PATCH v12 14/21] futex: Allow to resize the private local hash Lai, Yi
2025-06-02 11:00 ` Sebastian Andrzej Siewior
2025-06-02 14:36 ` Lai, Yi
2025-06-02 14:44 ` Sebastian Andrzej Siewior
2025-06-02 15:00 ` Lai, Yi
2025-06-11 9:20 ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
2025-06-11 14:39 ` tip-bot2 for Sebastian Andrzej Siewior
2025-06-11 14:43 ` Sebastian Andrzej Siewior
2025-06-11 15:11 ` Peter Zijlstra
2025-06-11 15:20 ` Peter Zijlstra
2025-06-11 15:35 ` Sebastian Andrzej Siewior
2025-06-16 17:14 ` Calvin Owens
2025-06-17 7:16 ` Sebastian Andrzej Siewior
2025-06-17 9:23 ` Calvin Owens
2025-06-17 9:50 ` Sebastian Andrzej Siewior
2025-06-17 16:11 ` Calvin Owens
2025-06-18 2:15 ` Calvin Owens
2025-06-18 16:47 ` Sebastian Andrzej Siewior
2025-06-18 16:03 ` Sebastian Andrzej Siewior
2025-06-18 16:49 ` Calvin Owens
2025-06-18 17:09 ` Sebastian Andrzej Siewior
2025-06-18 20:56 ` Calvin Owens
2025-06-18 22:47 ` Calvin Owens
2025-06-19 21:07 ` Calvin Owens
2025-06-20 10:31 ` Sebastian Andrzej Siewior
2025-06-20 18:56 ` Calvin Owens
2025-06-21 1:02 ` Calvin Owens
2025-06-21 7:24 ` Calvin Owens
2025-06-21 21:01 ` Sebastian Andrzej Siewior
2025-06-22 16:17 ` Calvin Owens
2025-04-16 16:29 ` [PATCH v12 15/21] futex: Allow to make the private hash immutable Sebastian Andrzej Siewior
2025-05-02 18:01 ` Peter Zijlstra
2025-05-05 7:14 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 16/21] futex: Implement FUTEX2_NUMA Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 17/21] futex: Implement FUTEX2_MPOL Sebastian Andrzej Siewior
2025-05-02 18:45 ` Peter Zijlstra
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Peter Zijlstra
2025-04-16 16:29 ` [PATCH v12 18/21] tools headers: Synchronize prctl.h ABI header Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 19/21] tools/perf: Allow to select the number of hash buckets Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 20/21] selftests/futex: Add futex_priv_hash Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-05-09 21:22 ` [PATCH v12 20/21] " André Almeida
2025-05-16 7:38 ` Sebastian Andrzej Siewior
2025-05-27 11:28 ` Mark Brown
2025-05-27 12:23 ` Sebastian Andrzej Siewior
2025-05-27 12:35 ` Mark Brown
2025-05-27 12:43 ` Sebastian Andrzej Siewior
2025-05-27 12:59 ` Mark Brown
2025-05-27 13:25 ` Sebastian Andrzej Siewior
2025-05-27 13:40 ` Mark Brown
2025-05-27 13:45 ` Sebastian Andrzej Siewior
2025-04-16 16:29 ` [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol Sebastian Andrzej Siewior
2025-05-02 19:08 ` Peter Zijlstra
2025-05-05 7:33 ` Sebastian Andrzej Siewior
2025-05-02 19:16 ` Peter Zijlstra
2025-05-05 7:36 ` Sebastian Andrzej Siewior
2025-05-08 10:33 ` [tip: locking/futex] " tip-bot2 for Sebastian Andrzej Siewior
2025-04-16 16:31 ` [PATCH v12 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL Sebastian Andrzej Siewior
2025-05-02 19:48 ` Peter Zijlstra
2025-05-03 10:09 ` Peter Zijlstra
2025-05-05 7:30 ` Sebastian Andrzej Siewior
2025-05-06 7:36 ` Peter Zijlstra
2025-05-09 11:41 ` Sebastian Andrzej Siewior