* [PATCH v8 01/15] rcuref: Avoid false positive "imbalanced put" report.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 02/15] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
` (14 subsequent siblings)
15 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, kernel test robot, Sebastian Andrzej Siewior
From: Thomas Gleixner <tglx@linutronix.de>
The kernel test robot reported an "imbalanced put" which turned out to
be a false positive. Consider the following race:
ref = 0 (via rcuref_init(ref, 1))

T1                              T2
rcuref_put(ref)
 -> atomic_add_negative_release(-1, ref) # ref -> 0xffffffff
 -> rcuref_put_slowpath(ref)
                                rcuref_get(ref)
                                 -> atomic_add_negative_relaxed(1, &ref->refcnt)
                                 -> return true; # ref -> 0
                                rcuref_put(ref)
                                 -> atomic_add_negative_release(-1, ref) # ref -> 0xffffffff
                                 -> rcuref_put_slowpath()
                                 -> cnt = atomic_read(&ref->refcnt); # cnt -> 0xffffffff / RCUREF_NOREF
                                 -> atomic_try_cmpxchg_release(&ref->refcnt, &cnt, RCUREF_DEAD) # ref -> 0xe0000000 / RCUREF_DEAD
                                 -> return true
 -> cnt = atomic_read(&ref->refcnt); # cnt -> 0xe0000000 / RCUREF_DEAD
 -> if (cnt > RCUREF_RELEASED) # 0xe0000000 > 0xc0000000
 -> WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")
The problem is the additional read in the slow path (after the counter
decremented to RCUREF_NOREF), which can happen after the counter has
already been marked RCUREF_DEAD.
Avoid the false positive by reusing the return value of the decrement.
Now every "final" put uses RCUREF_NOREF in the slow path and attempts
the final cmpxchg() to RCUREF_DEAD.
Fixes: ee1ee6db07795 ("atomics: Provide rcuref - scalable reference counting")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202412311453.9d7636a2-lkp@intel.com
Debugged-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/rcuref.h | 9 ++++++---
lib/rcuref.c | 5 ++---
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 2c8bfd0f1b6b3..6322d8c1c6b42 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -71,27 +71,30 @@ static inline __must_check bool rcuref_get(rcuref_t *ref)
return rcuref_get_slowpath(ref);
}
-extern __must_check bool rcuref_put_slowpath(rcuref_t *ref);
+extern __must_check bool rcuref_put_slowpath(rcuref_t *ref, unsigned int cnt);
/*
* Internal helper. Do not invoke directly.
*/
static __always_inline __must_check bool __rcuref_put(rcuref_t *ref)
{
+ int cnt;
+
RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(),
"suspicious rcuref_put_rcusafe() usage");
/*
* Unconditionally decrease the reference count. The saturation and
* dead zones provide enough tolerance for this.
*/
- if (likely(!atomic_add_negative_release(-1, &ref->refcnt)))
+ cnt = atomic_sub_return_release(1, &ref->refcnt);
+ if (likely(cnt >= 0))
return false;
/*
* Handle the last reference drop and cases inside the saturation
* and dead zones.
*/
- return rcuref_put_slowpath(ref);
+ return rcuref_put_slowpath(ref, cnt);
}
/**
diff --git a/lib/rcuref.c b/lib/rcuref.c
index 97f300eca927c..5bd726b71e393 100644
--- a/lib/rcuref.c
+++ b/lib/rcuref.c
@@ -220,6 +220,7 @@ EXPORT_SYMBOL_GPL(rcuref_get_slowpath);
/**
* rcuref_put_slowpath - Slowpath of __rcuref_put()
* @ref: Pointer to the reference count
+ * @cnt: The resulting value of the fastpath decrement
*
* Invoked when the reference count is outside of the valid zone.
*
@@ -233,10 +234,8 @@ EXPORT_SYMBOL_GPL(rcuref_get_slowpath);
* with a concurrent get()/put() pair. Caller is not allowed to
* deconstruct the protected object.
*/
-bool rcuref_put_slowpath(rcuref_t *ref)
+bool rcuref_put_slowpath(rcuref_t *ref, unsigned int cnt)
{
- unsigned int cnt = atomic_read(&ref->refcnt);
-
/* Did this drop the last reference? */
if (likely(cnt == RCUREF_NOREF)) {
/*
--
2.47.2
* [PATCH v8 02/15] futex: Create helper function to initialize a hash slot.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 01/15] rcuref: Avoid false positive "imbalanced put" report Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
` (13 subsequent siblings)
15 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Factor out the futex_hash_bucket initialisation into a helper function.
The helper function will be used in a follow-up patch implementing
process-private hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index ebdd76b4ecbba..d1d3c7b358b23 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1124,6 +1124,13 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+{
+ atomic_set(&fhb->waiters, 0);
+ plist_head_init(&fhb->chain);
+ spin_lock_init(&fhb->lock);
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
@@ -1141,11 +1148,8 @@ static int __init futex_init(void)
futex_hashsize, futex_hashsize);
futex_hashsize = 1UL << futex_shift;
- for (i = 0; i < futex_hashsize; i++) {
- atomic_set(&futex_queues[i].waiters, 0);
- plist_head_init(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
+ for (i = 0; i < futex_hashsize; i++)
+ futex_hash_bucket_init(&futex_queues[i]);
return 0;
}
--
2.47.2
* [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 01/15] rcuref: Avoid false positive "imbalanced put" report Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 02/15] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 14:27 ` Peter Zijlstra
` (2 more replies)
2025-02-03 13:59 ` [PATCH v8 04/15] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
` (12 subsequent siblings)
15 siblings, 3 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The futex hashmap is system wide and shared by random tasks. Each slot
is hashed based on its address and VMA. Due to randomized VMAs (and
memory allocations) the same logical lock (pointer) can end up in a
different hash bucket on each invocation of the application. This in
turn means that different applications may share a hash bucket on the
first invocation but not on the second, and it is not always clear which
applications will be involved. This can result in high latencies when
acquiring the futex_hash_bucket::lock, especially if the lock owner is
limited to a CPU and cannot be effectively PI-boosted.
Introduce a task local hash map. The hashmap can be allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 0)
The `0' argument allocates a default number of 16 slots, a higher number
can be specified if desired. The current upper limit is 131072.
The allocated hashmap is used by all threads within a process.
A thread can check if the private map has been allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which returns the current number of slots.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 20 ++++++++
include/linux/mm_types.h | 6 ++-
include/uapi/linux/prctl.h | 5 ++
kernel/fork.c | 2 +
kernel/futex/core.c | 101 +++++++++++++++++++++++++++++++++++--
kernel/sys.c | 4 ++
6 files changed, 133 insertions(+), 5 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85c..943828db52234 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -77,6 +77,15 @@ void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3);
+int futex_hash_allocate_default(void);
+void futex_hash_free(struct mm_struct *mm);
+
+static inline void futex_mm_init(struct mm_struct *mm)
+{
+ mm->futex_hash_bucket = NULL;
+}
+
#else
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -88,6 +97,17 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
{
return -EINVAL;
}
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
+{
+ return -EINVAL;
+}
+static inline int futex_hash_allocate_default(void)
+{
+ return 0;
+}
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+
#endif
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b27db7f94963..c20f2310d78ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -30,6 +30,7 @@
#define INIT_PASID 0
struct address_space;
+struct futex_hash_bucket;
struct mem_cgroup;
/*
@@ -936,7 +937,10 @@ struct mm_struct {
*/
seqcount_t mm_lock_seq;
#endif
-
+#ifdef CONFIG_FUTEX
+ unsigned int futex_hash_mask;
+ struct futex_hash_bucket *futex_hash_bucket;
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 5c6080680cb27..55b843644c51a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -353,4 +353,9 @@ struct prctl_mm_map {
*/
#define PR_LOCK_SHADOW_STACK_STATUS 76
+/* FUTEX hash management */
+#define PR_FUTEX_HASH 77
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f32..80ac156adebbf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1287,6 +1287,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
+ futex_mm_init(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
mm->pmd_huge_pte = NULL;
#endif
@@ -1364,6 +1365,7 @@ static inline void __mmput(struct mm_struct *mm)
if (mm->binfmt)
module_put(mm->binfmt->module);
lru_gen_del_mm(mm);
+ futex_hash_free(mm);
mmdrop(mm);
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d1d3c7b358b23..26328d8072fee 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -39,6 +39,7 @@
#include <linux/memblock.h>
#include <linux/fault-inject.h>
#include <linux/slab.h>
+#include <linux/prctl.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -107,18 +108,40 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
+static inline bool futex_key_is_private(union futex_key *key)
+{
+ /*
+ * Relies on get_futex_key() to set either bit for shared
+ * futexes -- see comment with union futex_key.
+ */
+ return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
+}
+
/**
- * futex_hash - Return the hash bucket in the global hash
+ * futex_hash - Return the hash bucket in the global or local hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
- * corresponding hash bucket in the global hash.
+ * corresponding hash bucket in the global hash. If the futex is private and
+ * a local hash table has been allocated, that one is used instead.
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
- u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
- key->both.offset);
+ struct futex_hash_bucket *fhb;
+ u32 hash;
+ fhb = current->mm->futex_hash_bucket;
+ if (fhb && futex_key_is_private(key)) {
+ u32 hash_mask = current->mm->futex_hash_mask;
+
+ hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / 4,
+ key->both.offset);
+ return &fhb[hash & hash_mask];
+ }
+ hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / 4,
+ key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
@@ -1131,6 +1154,76 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
spin_lock_init(&fhb->lock);
}
+void futex_hash_free(struct mm_struct *mm)
+{
+ kvfree(mm->futex_hash_bucket);
+}
+
+static int futex_hash_allocate(unsigned int hash_slots)
+{
+ struct futex_hash_bucket *fhb;
+ int i;
+
+ if (current->mm->futex_hash_bucket)
+ return -EALREADY;
+
+ if (!thread_group_leader(current))
+ return -EINVAL;
+
+ if (hash_slots == 0)
+ hash_slots = 16;
+ if (hash_slots < 2)
+ hash_slots = 2;
+ if (hash_slots > 131072)
+ hash_slots = 131072;
+ if (!is_power_of_2(hash_slots))
+ hash_slots = rounddown_pow_of_two(hash_slots);
+
+ fhb = kvmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL_ACCOUNT);
+ if (!fhb)
+ return -ENOMEM;
+
+ current->mm->futex_hash_mask = hash_slots - 1;
+
+ for (i = 0; i < hash_slots; i++)
+ futex_hash_bucket_init(&fhb[i]);
+
+ current->mm->futex_hash_bucket = fhb;
+ return 0;
+}
+
+int futex_hash_allocate_default(void)
+{
+ return futex_hash_allocate(0);
+}
+
+static int futex_hash_get_slots(void)
+{
+ if (current->mm->futex_hash_bucket)
+ return current->mm->futex_hash_mask + 1;
+ return 0;
+}
+
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
+{
+ int ret;
+
+ switch (arg2) {
+ case PR_FUTEX_HASH_SET_SLOTS:
+ ret = futex_hash_allocate(arg3);
+ break;
+
+ case PR_FUTEX_HASH_GET_SLOTS:
+ ret = futex_hash_get_slots();
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
diff --git a/kernel/sys.c b/kernel/sys.c
index cb366ff8703af..e509ad9795103 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
#include <linux/user_namespace.h>
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
+#include <linux/futex.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2811,6 +2812,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_lock_shadow_stack_status(me, arg2);
break;
+ case PR_FUTEX_HASH:
+ error = futex_hash_prctl(arg2, arg3);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
--
2.47.2
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 13:59 ` [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2025-02-03 14:27 ` Peter Zijlstra
2025-02-03 15:51 ` Sebastian Andrzej Siewior
2025-02-03 14:29 ` Peter Zijlstra
2025-02-03 14:41 ` Peter Zijlstra
2 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-03 14:27 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:23PM +0100, Sebastian Andrzej Siewior wrote:
> The futex hashmap is system wide and shared by random tasks. Each slot
> is hashed based on its address and VMA. Due to randomized VMAs (and
> memory allocations) the same logical lock (pointer) can end up in a
> different hash bucket on each invocation of the application. This in
> turn means that different applications may share a hash bucket on the
> first invocation but not on the second an it is not always clear which
> applications will be involved. This can result in high latency's to
> acquire the futex_hash_bucket::lock especially if the lock owner is
> limited to a CPU and not be effectively PI boosted.
>
> Introduce a task local hash map. The hashmap can be allocated via
> prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 0)
>
> The `0' argument allocates a default number of 16 slots, a higher number
> can be specified if desired. The current upper limit is 131072.
Hmm, I would expect 0 to disable the local thing.
Now, I realize this is somewhat tricky, since there might be futexes
inside it. But mapping 0 to some default value seems.. well, odd.
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 14:27 ` Peter Zijlstra
@ 2025-02-03 15:51 ` Sebastian Andrzej Siewior
2025-02-04 10:34 ` Peter Zijlstra
0 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 15:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-03 15:27:43 [+0100], Peter Zijlstra wrote:
> On Mon, Feb 03, 2025 at 02:59:23PM +0100, Sebastian Andrzej Siewior wrote:
> > The futex hashmap is system wide and shared by random tasks. Each slot
> > is hashed based on its address and VMA. Due to randomized VMAs (and
> > memory allocations) the same logical lock (pointer) can end up in a
> > different hash bucket on each invocation of the application. This in
> > turn means that different applications may share a hash bucket on the
> > first invocation but not on the second an it is not always clear which
> > applications will be involved. This can result in high latency's to
> > acquire the futex_hash_bucket::lock especially if the lock owner is
> > limited to a CPU and not be effectively PI boosted.
> >
> > Introduce a task local hash map. The hashmap can be allocated via
> > prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 0)
> >
> > The `0' argument allocates a default number of 16 slots, a higher number
> > can be specified if desired. The current upper limit is 131072.
>
> Hmm, I would expect 0 to disable the local thing.
>
> Now, I realize this is somewhat tricky, since there might be futexes
> inside it. But mapping 0 to some default value seems.. well, odd.
Looking at this from the top of the series:
- The default value (currently 16) is supposed to be something sane so the
user does not have to worry about it. We could increase it if needed.
- Currently each thread will alter the limit if needed based on the
formula. So even if you set it to 2 or 16, once you have 16 threads
you will have 64 buckets.
- The number of buckets can only be increased, not shrunk. There is no
technical requirement for it other than "avoid a race where you use a
lower number if a lot of threads are fired at once".
- It is not expected to go back to the global hash.
- The upper limit is the same limit as for the global hash.
All the things I mentioned are open for debate:
- We could force the user to specify a power-of-two number of buckets
(due to the current implementation) and not round.
- We could allow to increase and shrink the number of buckets. Allowing
it to shrink by accident if many threads are fired at once.
- We could say the user knows best and disable the heuristic once a size
has been requested.
- We could use 0 to disable the local hash and use the global one
instead if this is a requirement. Doing this before a thread is
created is preferred. Later might get tricky.
Sebastian
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 15:51 ` Sebastian Andrzej Siewior
@ 2025-02-04 10:34 ` Peter Zijlstra
2025-02-05 8:39 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-04 10:34 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 04:51:14PM +0100, Sebastian Andrzej Siewior wrote:
> All the things I mentioned are open for debate:
So how about we add sysctl_futex_private; when set to 0 it will disable
the private hash for every new task, unless task explicitly calls
PR_FUTEX_HASH_SET. When non-zero, it is the base for the auto-scaling.
(then BASE_SMALL would have this default to 0, otherwise 4 as per this
patch set)
I would let PR_FUTEX_HASH_SET override and disable auto-scaling. There
is nothing more annoying than a computer that thinks it knows better.
User asked for N, user gets N etc.
If user sets 0, fall back to global hash.
Anyway, none of this solves anything when a process has both an active
RT part and an active !RT part (which isn't uncommon AFAICT).
Then the RT bits will still get interference from the !RT bits. Do we
want to complicate things and consider that?
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-04 10:34 ` Peter Zijlstra
@ 2025-02-05 8:39 ` Sebastian Andrzej Siewior
2025-02-07 9:41 ` Juri Lelli
0 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-05 8:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-04 11:34:47 [+0100], Peter Zijlstra wrote:
> On Mon, Feb 03, 2025 at 04:51:14PM +0100, Sebastian Andrzej Siewior wrote:
>
> > All the things I mentioned are open for debate:
>
> So how about we add sysctl_futex_private; when set to 0 it will disable
> the private hash for every new task, unless task explicitly calls
> PR_FUTEX_HASH_SET. When non-zero, it is the base for the auto-scaling.
>
> (then BASE_SMALL would have this default to 0, otherwise 4 as per this
> patch set)
>
> I would let PR_FUTEX_HASH_SET override and disable auto-scaling. There
> is nothing more annoying than a computer that thinks it knows better.
> User asked for N, user gets N etc.
>
> If user sets 0, fall back to global hash.
Looks sane. I started with PR_FUTEX_HASH_SET to avoid rehashing at
runtime. Now that I have rehashing working, PR_FUTEX_HASH_SET does not
look needed. But okay: having a default enable/disable via the sysctl,
plus a PR_FUTEX_HASH_SET which sets a requested value and disables
auto-scaling, looks okay.
Assuming the user knows best. Even then it is hard to get right given
randomized pointers and so on.
> Anyway, none of this solves anything when a process has both an active
> RT part and an active !RT part (which isn't uncommon AFAICT).
>
> Then the RT bits will still get interference from the !RT bits. Do we
> want to complicate things and consider that?
I don't think so. Active and inactive parts are common, but it is still
the same process, so you can expect it. The ugly part is when an
entirely different task is involved and it is random which one it is.
Sebastian
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-05 8:39 ` Sebastian Andrzej Siewior
@ 2025-02-07 9:41 ` Juri Lelli
2025-02-07 11:00 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 52+ messages in thread
From: Juri Lelli @ 2025-02-07 9:41 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Peter Zijlstra, linux-kernel, André Almeida, Darren Hart,
Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, Valentin Schneider,
Waiman Long
Hi,
On 05/02/25 09:39, Sebastian Andrzej Siewior wrote:
> On 2025-02-04 11:34:47 [+0100], Peter Zijlstra wrote:
...
> > Anyway, none of this solves anything when a process has both an active
> > RT part and an active !RT part (which isn't uncommon AFAICT).
> >
> > Then the RT bits will still get interference from the !RT bits. Do we
> > want to complicate things and consider that?
>
> I don't think so. The active and inactive are common but it is still the
> same process so you can expect it. The ugly part is when it is an
> entirely different task and it is random which one it is.
Not entirely sure we are thinking about the same situation, but it looks
like we have cases of RT tasks that are affected by the underlying issue
this set is about because they make use of libraries. So, in this case
we have a cross-process (RT/!RT) situation that I am not sure we can
address sanely. What do you think?
Thanks,
Juri
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-07 9:41 ` Juri Lelli
@ 2025-02-07 11:00 ` Sebastian Andrzej Siewior
2025-02-07 11:06 ` Peter Zijlstra
0 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-07 11:00 UTC (permalink / raw)
To: Juri Lelli
Cc: Peter Zijlstra, linux-kernel, André Almeida, Darren Hart,
Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-07 10:41:02 [+0100], Juri Lelli wrote:
> Hi,
>
> On 05/02/25 09:39, Sebastian Andrzej Siewior wrote:
> > On 2025-02-04 11:34:47 [+0100], Peter Zijlstra wrote:
>
> ...
>
> > > Anyway, none of this solves anything when a process has both an active
> > > RT part and an active !RT part (which isn't uncommon AFAICT).
> > >
> > > Then the RT bits will still get interference from the !RT bits. Do we
> > > want to complicate things and consider that?
> >
> > I don't think so. The active and inactive are common but it is still the
> > same process so you can expect it. The ugly part is when it is an
> > entirely different task and it is random which one it is.
>
> Not entirely sure we are thinking about the same situation, but it looks
> like we have cases of RT tasks that are affected by the underlying issue
> this set is about because they make use of libraries. So, in this case
> we have a cross-process (RT/!RT) situation that I am not sure we can
> address sanely. What do you think?
I wouldn't advise using "unknown" code in an RT application or even its
threads. Audit libs before using them rather than just collecting them.
A lock without PI in your RT thread is not good. A lot of locks, with
unknown hold times, are also not good. Such things work most of the
time due to the fastpath and only break from time to time.
Also, a thread may fork() once during start up and things may continue
to be good, but the problem may catch up eventually.
> Thanks,
> Juri
Sebastian
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-07 11:00 ` Sebastian Andrzej Siewior
@ 2025-02-07 11:06 ` Peter Zijlstra
2025-02-07 14:47 ` Juri Lelli
0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-07 11:06 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Juri Lelli, linux-kernel, André Almeida, Darren Hart,
Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Fri, Feb 07, 2025 at 12:00:50PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-02-07 10:41:02 [+0100], Juri Lelli wrote:
> > Hi,
> >
> > On 05/02/25 09:39, Sebastian Andrzej Siewior wrote:
> > > On 2025-02-04 11:34:47 [+0100], Peter Zijlstra wrote:
> >
> > ...
> >
> > > > Anyway, none of this solves anything when a process has both an active
> > > > RT part and an active !RT part (which isn't uncommon AFAICT).
> > > >
> > > > Then the RT bits will still get interference from the !RT bits. Do we
> > > > want to complicate things and consider that?
> > >
> > > I don't think so. The active and inactive are common but it is still the
> > > same process so you can expect it. The ugly part is when it is an
> > > entirely different task and it is random which one it is.
> >
> > Not entirely sure we are thinking about the same situation, but it looks
> > like we have cases of RT tasks that are affected by the underlying issue
> > this set is about because they make use of libraries. So, in this case
> > we have a cross-process (RT/!RT) situation that I am not sure we can
> > address sanely. What do you think?
>
> I wouldn't advice to use "unknown" code in a RT application and even
> threads. Audit libs before using them and not just collect them.
> A lock without PI in your RT thread is not good. A lot of locks, not
> knowing the "locked" time, also not good. Things that work most of the
> time due to the fastpath and only break from time to time.
> Also, a thread does fork() once during start up and things may continue
> to be good but may catch up eventually.
Anyway, supposing people want to tinker, it should be possible to split
the local hash based on if the address is mlocked or not.
This gives some obvious issues where people do futex_wait(), mlock() and
then expect futex_wake() to still work, but that should be rare. You
typically mlock before you start using the memory.
As always, mlockall is bad, do not use, like ever. But especially in
mixed mode RT programs.
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-07 11:06 ` Peter Zijlstra
@ 2025-02-07 14:47 ` Juri Lelli
0 siblings, 0 replies; 52+ messages in thread
From: Juri Lelli @ 2025-02-07 14:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Sebastian Andrzej Siewior, linux-kernel, André Almeida,
Darren Hart, Davidlohr Bueso, Ingo Molnar, Thomas Gleixner,
Valentin Schneider, Waiman Long
On 07/02/25 12:06, Peter Zijlstra wrote:
> On Fri, Feb 07, 2025 at 12:00:50PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-02-07 10:41:02 [+0100], Juri Lelli wrote:
> > > Hi,
> > >
> > > On 05/02/25 09:39, Sebastian Andrzej Siewior wrote:
> > > > On 2025-02-04 11:34:47 [+0100], Peter Zijlstra wrote:
> > >
> > > ...
> > >
> > > > > Anyway, none of this solves anything when a process has both an active
> > > > > RT part and an active !RT part (which isn't uncommon AFAICT).
> > > > >
> > > > > Then the RT bits will still get interference from the !RT bits. Do we
> > > > > want to complicate things and consider that?
> > > >
> > > > I don't think so. The active and inactive are common but it is still the
> > > > same process so you can expect it. The ugly part is when it is an
> > > > entirely different task and it is random which one it is.
> > >
> > > Not entirely sure we are thinking about the same situation, but it looks
> > > like we have cases of RT tasks that are affected by the underlying issue
> > > this set is about because they make use of libraries. So, in this case
> > > we have a cross-process (RT/!RT) situation that I am not sure we can
> > > address sanely. What do you think?
> >
> > I wouldn't advice to use "unknown" code in a RT application and even
> > threads. Audit libs before using them and not just collect them.
Indeed, that is the message. But, you know, existing/legacy applications
or applications "repurposed" for RT. :/
> > A lock without PI in your RT thread is not good. A lot of locks, not
> > knowing the "locked" time, also not good. Things that work most of the
> > time due to the fastpath and only break from time to time.
> > Also, a thread does fork() once during start up and things may continue
> > to be good but may catch up eventually.
>
> Anyway, supposing people want to tinker, it should be possible to split
> the local hash based on if the address is mlocked or not.
So, I fully agree that we don't want to implement changes just to target
currently broken usages, but I wondered if there are saner situations (Peter's
point above?) to which the current idea might be further extended.
> This gives some obvious issues where people do futex_wait(), mlock() and
> then expect futex_wake() to still work, but that should be rare. You
> typically mlock before you start using the memory.
>
> As always, mlockall is bad, do not use, like ever. But especially in
> mixed mode RT programs.
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 13:59 ` [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
2025-02-03 14:27 ` Peter Zijlstra
@ 2025-02-03 14:29 ` Peter Zijlstra
2025-02-03 14:41 ` Peter Zijlstra
2 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-03 14:29 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:23PM +0100, Sebastian Andrzej Siewior wrote:
> +static int futex_hash_allocate(unsigned int hash_slots)
> +{
> + struct futex_hash_bucket *fhb;
> + int i;
> +
> + if (current->mm->futex_hash_bucket)
> + return -EALREADY;
> +
> + if (!thread_group_leader(current))
> + return -EINVAL;
> +
> + if (hash_slots == 0)
> + hash_slots = 16;
> + if (hash_slots < 2)
> + hash_slots = 2;
> + if (hash_slots > 131072)
> + hash_slots = 131072;
> + if (!is_power_of_2(hash_slots))
> + hash_slots = rounddown_pow_of_two(hash_slots);
Why not require the argument is 2^n and fail otherwise?
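[For context, the clamping the quoted code performs can be exercised in isolation. A user-space sketch, with is_power_of_2()/rounddown_pow_of_two() reimplemented here as stand-ins for the kernel helpers:]

```c
#include <assert.h>

/* User-space stand-in for the kernel's is_power_of_2(). */
static int is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* User-space stand-in for rounddown_pow_of_two(); n must be non-zero. */
static unsigned long rounddown_pow_of_two(unsigned long n)
{
	while (!is_power_of_2(n))
		n &= n - 1;	/* clear the lowest set bit until one remains */
	return n;
}

/* Mirrors the slot clamping in the quoted futex_hash_allocate(). */
static unsigned int clamp_hash_slots(unsigned int hash_slots)
{
	if (hash_slots == 0)
		hash_slots = 16;
	if (hash_slots < 2)
		hash_slots = 2;
	if (hash_slots > 131072)
		hash_slots = 131072;
	if (!is_power_of_2(hash_slots))
		hash_slots = rounddown_pow_of_two(hash_slots);
	return hash_slots;
}
```

[So a non-power-of-2 request such as 100 is silently rounded down to 64 rather than rejected, which is what the question above is about.]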
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 13:59 ` [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
2025-02-03 14:27 ` Peter Zijlstra
2025-02-03 14:29 ` Peter Zijlstra
@ 2025-02-03 14:41 ` Peter Zijlstra
2025-02-03 15:39 ` Peter Zijlstra
2025-02-03 15:52 ` Sebastian Andrzej Siewior
2 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-03 14:41 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:23PM +0100, Sebastian Andrzej Siewior wrote:
> struct futex_hash_bucket *futex_hash(union futex_key *key)
> {
> - u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
> - key->both.offset);
> + struct futex_hash_bucket *fhb;
> + u32 hash;
>
> + fhb = current->mm->futex_hash_bucket;
> + if (fhb && futex_key_is_private(key)) {
> + u32 hash_mask = current->mm->futex_hash_mask;
> +
> + hash = jhash2((u32 *)key,
> + offsetof(typeof(*key), both.offset) / 4,
> + key->both.offset);
> + return &fhb[hash & hash_mask];
> + }
> + hash = jhash2((u32 *)key,
> + offsetof(typeof(*key), both.offset) / 4,
> + key->both.offset);
> return &futex_queues[hash & (futex_hashsize - 1)];
> }
Having read this later patch, I would still prefer this to look like:
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
struct futex_hash_bucket *fhb = current->mm->futex_hash_bucket;
u32 hash;
if (futex_key_is_private(key)) {
// shorter jhash invocation;
} else {
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
}
if (fhb)
return &fhb[hash & current->mm->futex_hash_mask];
return &futex_queues[hash & (futex_hashsize - 1)];
}
Because the shorter hash invocation only cares about being private, not
about having fhb.
Notably, I'm also still thinking about all that pending FUTEX2_NUMA
stuff.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 14:41 ` Peter Zijlstra
@ 2025-02-03 15:39 ` Peter Zijlstra
2025-02-03 15:52 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-03 15:39 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 03:41:02PM +0100, Peter Zijlstra wrote:
> Because the shorter hash invocation only cares about being private, not
> about having fhb.
While driving to pick up kid from high school, I realized I was full of
crap. Short hash is very much specific to local thing.
Carry on.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 14:41 ` Peter Zijlstra
2025-02-03 15:39 ` Peter Zijlstra
@ 2025-02-03 15:52 ` Sebastian Andrzej Siewior
2025-02-04 8:41 ` Peter Zijlstra
1 sibling, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 15:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-03 15:41:01 [+0100], Peter Zijlstra wrote:
> > Notably, I'm also still thinking about all that pending FUTEX2_NUMA
> stuff.
With local-hash there is still room for FUTEX2_NUMA?
Sebastian
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-03 15:52 ` Sebastian Andrzej Siewior
@ 2025-02-04 8:41 ` Peter Zijlstra
2025-02-04 9:28 ` Thomas Gleixner
0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-04 8:41 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 04:52:51PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-02-03 15:41:01 [+0100], Peter Zijlstra wrote:
> > Notably, I'm also still thinking about all that pending FUTEX2_NUMA
> > stuff.
>
> With local-hash there is still room for FUTEX2_NUMA?
Yes, a single process can span multiple nodes, and then you still get the
cross-node contention issues.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash.
2025-02-04 8:41 ` Peter Zijlstra
@ 2025-02-04 9:28 ` Thomas Gleixner
0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2025-02-04 9:28 UTC (permalink / raw)
To: Peter Zijlstra, Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Valentin Schneider, Waiman Long
On Tue, Feb 04 2025 at 09:41, Peter Zijlstra wrote:
> On Mon, Feb 03, 2025 at 04:52:51PM +0100, Sebastian Andrzej Siewior wrote:
>> On 2025-02-03 15:41:01 [+0100], Peter Zijlstra wrote:
>> > Notably, I'm also still thinking about all that pending FUTEX2_NUMA
>> > stuff.
>>
>> With local-hash there is still room for FUTEX2_NUMA?
>
> Yes, a single process can span multiple nodes, and then you still get the
> cross-node contention issues.
This is strictly about process private futexes, while FUTEX2_NUMA covers
both shared and private futexes and helps to avoid the hash collision
and resulting contention problems which are caused by the current global
hash.
With the private futex separation you obviously get the cross-node
memory access problems, but not the general contention issues which
FUTEX2_NUMA tries to address.
Thanks,
tglx
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v8 04/15] futex: Allow automatic allocation of process wide futex hash.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 03/15] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 14:36 ` Peter Zijlstra
2025-02-03 13:59 ` [PATCH v8 05/15] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
` (11 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Allocate a default futex hash if a task forks its first thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 12 ++++++++++++
kernel/fork.c | 24 ++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 943828db52234..bad377c30de5e 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -86,6 +86,13 @@ static inline void futex_mm_init(struct mm_struct *mm)
mm->futex_hash_bucket = NULL;
}
+static inline bool futex_hash_requires_allocation(void)
+{
+ if (current->mm->futex_hash_bucket)
+ return false;
+ return true;
+}
+
#else
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -108,6 +115,11 @@ static inline int futex_hash_allocate_default(void)
static inline void futex_hash_free(struct mm_struct *mm) { }
static inline void futex_mm_init(struct mm_struct *mm) { }
+static inline bool futex_hash_requires_allocation(void)
+{
+ return false;
+}
+
#endif
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 80ac156adebbf..824cc55d32ece 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2138,6 +2138,15 @@ static void rv_task_fork(struct task_struct *p)
#define rv_task_fork(p) do {} while (0)
#endif
+static bool need_futex_hash_allocate_default(u64 clone_flags)
+{
+ if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
+ return false;
+ if (!thread_group_empty(current))
+ return false;
+ return futex_hash_requires_allocation();
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2515,6 +2524,21 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cancel_cgroup;
+ /*
+ * Allocate a default futex hash for the user process once the first
+ * thread spawns.
+ */
+ if (need_futex_hash_allocate_default(clone_flags)) {
+ retval = futex_hash_allocate_default();
+ if (retval)
+ goto bad_fork_core_free;
+ /*
+ * If we fail beyond this point we don't free the allocated
+ * futex hash map. We assume that another thread will be created
+ * and will make use of it. The hash map will be freed once the main
+ * thread terminates.
+ */
+ }
/*
* From this point on we must avoid any synchronous user-space
* communication until we take the tasklist-lock. In particular, we do
--
2.47.2
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v8 04/15] futex: Allow automatic allocation of process wide futex hash.
2025-02-03 13:59 ` [PATCH v8 04/15] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2025-02-03 14:36 ` Peter Zijlstra
2025-02-03 15:54 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-03 14:36 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:24PM +0100, Sebastian Andrzej Siewior wrote:
> Allocate a default futex hash if a task forks its first thread.
This makes it so that private futexes are *always* on the mm local hash;
there is no way to disable this.
This seems somewhat extreme.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 04/15] futex: Allow automatic allocation of process wide futex hash.
2025-02-03 14:36 ` Peter Zijlstra
@ 2025-02-03 15:54 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 15:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-03 15:36:02 [+0100], Peter Zijlstra wrote:
> On Mon, Feb 03, 2025 at 02:59:24PM +0100, Sebastian Andrzej Siewior wrote:
> > Allocate a default futex hash if a task forks its first thread.
>
> This makes it so that private futexes are *always* on the mm local hash;
> there is no way to disable this.
>
> This seems somewhat extreme.
That was the plan. But disabling it before creating threads is doable if
needed.
Sebastian
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v8 05/15] futex: Hash only the address for private futexes.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (3 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 04/15] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 14:41 ` Peter Zijlstra
2025-02-03 13:59 ` [PATCH v8 06/15] futex: Move private hashing into its own function Sebastian Andrzej Siewior
` (10 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_hash() passes the whole futex_key to jhash2(). The first two members
are passed as the first argument and the offset as the "initial value".
For private futexes, the mm part is always the same and the key is used only
within the process. By excluding the mm part from the hash, the length
passed to jhash2() is reduced from 4 words (16 / 4) to 2 (8 / 4). This
avoids the __jhash_mix() part of jhash.
The resulting code is smaller and, based on testing, this variant performs
as well as the original or slightly better.
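[As a rough illustration of the above, the sketch below uses a toy mixer in place of the real jhash2() (so the values differ from the kernel's, and the struct is a simplified stand-in for union futex_key), showing that the mm part stops influencing the bucket choice for private keys:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified private futex key: (mm, address) plus the offset. */
struct futex_key {
	uint64_t mm;		/* stand-in for the mm_struct pointer */
	uint64_t address;
	uint32_t offset;
};

/* Toy stand-in for jhash2(): mixes 'len' u32 words with an initval.
 * The per-word step is a bijection, so differing inputs keep
 * differing hashes; it is NOT the real jhash2().
 */
static uint32_t toy_jhash2(const uint32_t *k, size_t len, uint32_t initval)
{
	uint32_t h = initval;

	while (len--)
		h = (h ^ *k++) * 2654435761u;	/* Knuth multiplicative mix */
	return h;
}

/* Old scheme: hash mm + address (4 words), offset as initval. */
static uint32_t hash_full(const struct futex_key *key)
{
	return toy_jhash2((const uint32_t *)key,
			  offsetof(struct futex_key, offset) / 4, key->offset);
}

/* New scheme for private futexes: hash only the address (2 words). */
static uint32_t hash_private(const struct futex_key *key)
{
	return toy_jhash2((const uint32_t *)&key->address,
			  sizeof(key->address) / 4, key->offset);
}
```

[Two keys that differ only in the mm part hash identically under the short variant, which is fine since all keys in a process-private hash share the same mm anyway.]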
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 26328d8072fee..f608cd6ccc032 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -134,8 +134,8 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
if (fhb && futex_key_is_private(key)) {
u32 hash_mask = current->mm->futex_hash_mask;
- hash = jhash2((u32 *)key,
- offsetof(typeof(*key), both.offset) / 4,
+ hash = jhash2((void *)&key->private.address,
+ sizeof(key->private.address) / 4,
key->both.offset);
return &fhb[hash & hash_mask];
}
--
2.47.2
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v8 05/15] futex: Hash only the address for private futexes.
2025-02-03 13:59 ` [PATCH v8 05/15] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
@ 2025-02-03 14:41 ` Peter Zijlstra
0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-03 14:41 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:25PM +0100, Sebastian Andrzej Siewior wrote:
> futex_hash() passes the whole futex_key to jhash2. The first two members
> are passed as the first argument and the offset as the "initial value".
>
> For private futexes, the mm-part is always the same and it is used only
> within the process. By excluding the mm part from the hash, we reduce
> the length passed to jhash2 from 4 (16 / 4) to 2 (8 / 4). This avoids
> the __jhash_mix() part of jhash.
>
> The resulting code is smaller and based on testing this variant performs
> as good as the original or slightly better.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> kernel/futex/core.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index 26328d8072fee..f608cd6ccc032 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -134,8 +134,8 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
> if (fhb && futex_key_is_private(key)) {
> u32 hash_mask = current->mm->futex_hash_mask;
>
> - hash = jhash2((u32 *)key,
> - offsetof(typeof(*key), both.offset) / 4,
> + hash = jhash2((void *)&key->private.address,
> + sizeof(key->private.address) / 4,
> key->both.offset);
> return &fhb[hash & hash_mask];
> }
Like just replied to that other email, this should probably be moved
earlier and be independent of fhb.
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v8 06/15] futex: Move private hashing into its own function.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (4 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 05/15] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-04 9:34 ` Peter Zijlstra
2025-02-03 13:59 ` [PATCH v8 07/15] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
` (9 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The hashing of private futexes is slightly different and will be needed
again when moving a futex_q entry to a different hash bucket after a
resize.
Move the private hashing into its own function.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index f608cd6ccc032..fdfc3402278a1 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -117,6 +117,18 @@ static inline bool futex_key_is_private(union futex_key *key)
return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
}
+static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
+ struct futex_hash_bucket *fhb,
+ u32 hash_mask)
+{
+ u32 hash;
+
+ hash = jhash2((void *)&key->private.address,
+ sizeof(key->private.address) / 4,
+ key->both.offset);
+ return &fhb[hash & hash_mask];
+}
+
/**
* futex_hash - Return the hash bucket in the global or local hash
* @key: Pointer to the futex key for which the hash is calculated
@@ -131,14 +143,9 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
u32 hash;
fhb = current->mm->futex_hash_bucket;
- if (fhb && futex_key_is_private(key)) {
- u32 hash_mask = current->mm->futex_hash_mask;
+ if (fhb && futex_key_is_private(key))
+ return futex_hash_private(key, fhb, current->mm->futex_hash_mask);
- hash = jhash2((void *)&key->private.address,
- sizeof(key->private.address) / 4,
- key->both.offset);
- return &fhb[hash & hash_mask];
- }
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
--
2.47.2
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v8 06/15] futex: Move private hashing into its own function.
2025-02-03 13:59 ` [PATCH v8 06/15] futex: Move private hashing into its own function Sebastian Andrzej Siewior
@ 2025-02-04 9:34 ` Peter Zijlstra
2025-02-05 7:51 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-04 9:34 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:26PM +0100, Sebastian Andrzej Siewior wrote:
> The hashing of the private is slightly different and will be needed
> again while moving a futex_q entry to a different hash bucket after the
> resize.
>
> Move the private hashing into its own function.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> kernel/futex/core.c | 21 ++++++++++++++-------
> 1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index f608cd6ccc032..fdfc3402278a1 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -117,6 +117,18 @@ static inline bool futex_key_is_private(union futex_key *key)
> return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
> }
>
> +static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
> + struct futex_hash_bucket *fhb,
> + u32 hash_mask)
> +{
> + u32 hash;
> +
> + hash = jhash2((void *)&key->private.address,
> + sizeof(key->private.address) / 4,
> + key->both.offset);
> + return &fhb[hash & hash_mask];
> +}
> +
> /**
> * futex_hash - Return the hash bucket in the global or local hash
> * @key: Pointer to the futex key for which the hash is calculated
> @@ -131,14 +143,9 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
> u32 hash;
>
> fhb = current->mm->futex_hash_bucket;
> - if (fhb && futex_key_is_private(key)) {
> - u32 hash_mask = current->mm->futex_hash_mask;
> + if (fhb && futex_key_is_private(key))
> + return futex_hash_private(key, fhb, current->mm->futex_hash_mask);
>
> - hash = jhash2((void *)&key->private.address,
> - sizeof(key->private.address) / 4,
> - key->both.offset);
> - return &fhb[hash & hash_mask];
> - }
> hash = jhash2((u32 *)key,
> offsetof(typeof(*key), both.offset) / 4,
> key->both.offset);
Humph, this is a lot of back and forth. Maybe just do it right once?
Anyway, perhaps write it like:
static struct futex_hash_bucket *futex_hash_private(union futex_key *key)
{
struct mm_struct *mm;
u32 hash;
if (!futex_key_is_private(key))
return NULL;
mm = key->private.mm;
if (!mm->futex_hash_bucket)
return NULL;
/*
* Since key->private.mm is the same for all keys in this hash
* table, it does not contribute anything, and skipping it
* allows jhash2() to take a shorter/faster path.
*/
hash = jhash2((void *)&key->private.address,
sizeof(key->private.address) / 4,
key->both.offset);
return &mm->futex_hash_bucket[hash & mm->futex_hash_mask];
}
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
struct futex_hash_bucket *fhb;
u32 hash;
fhb = futex_hash_private(key);
if (fhb)
return fhb;
hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 06/15] futex: Move private hashing into its own function.
2025-02-04 9:34 ` Peter Zijlstra
@ 2025-02-05 7:51 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-05 7:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-04 10:34:02 [+0100], Peter Zijlstra wrote:
> > index f608cd6ccc032..fdfc3402278a1 100644
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -131,14 +143,9 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
> > u32 hash;
> >
> > fhb = current->mm->futex_hash_bucket;
> > - if (fhb && futex_key_is_private(key)) {
> > - u32 hash_mask = current->mm->futex_hash_mask;
> > + if (fhb && futex_key_is_private(key))
> > + return futex_hash_private(key, fhb, current->mm->futex_hash_mask);
> >
> > - hash = jhash2((void *)&key->private.address,
> > - sizeof(key->private.address) / 4,
> > - key->both.offset);
> > - return &fhb[hash & hash_mask];
> > - }
> > hash = jhash2((u32 *)key,
> > offsetof(typeof(*key), both.offset) / 4,
> > key->both.offset);
>
> Humph, this is a lot of back and forth. Maybe just do it right once?
I pulled it in little steps and I replaced it step by step.
I stayed with the "smaller" jhash2 because it is less code. We might do
something like hash_long() which is even less. I had numbers at
https://lore.kernel.org/all/20241101110810.R3AnEqdu@linutronix.de/
But I can pull it in as suggested.
Sebastian
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v8 07/15] futex: Decrease the waiter count before the unlock operation.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (5 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 06/15] futex: Move private hashing into its own function Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 08/15] futex: Prepare for reference counting of the process private hash end of operation Sebastian Andrzej Siewior
` (8 subsequent siblings)
15 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
To support runtime resizing of the process private hash, the obtained
hash bucket must no longer be used once its reference count has been
dropped. The reference will be dropped after the hash bucket has been
unlocked.
The number of waiters is currently decremented after the unlock
operation, but there is no requirement for it to happen there. The
increment happens before acquiring the lock to signal early that there
will be a waiter; the waker can avoid blocking on the lock if it is
known that there are no waiters.
In terms of ordering there is no difference whether the decrement
happens before or after the unlock.
Decrease the waiter count before the unlock operation.
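[A minimal user-space sketch of the counter/lock ordering described above, using a pthread mutex and a C11 atomic as stand-ins (names are illustrative, not the kernel API):]

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative stand-in for a futex hash bucket. */
struct bucket {
	atomic_int	waiters;
	pthread_mutex_t	lock;
};

/* Waiter side: announce the waiter before taking the lock... */
static void bucket_lock(struct bucket *hb)
{
	atomic_fetch_add(&hb->waiters, 1);	/* implies the "(A)" barrier */
	pthread_mutex_lock(&hb->lock);
}

/* ...and retract it before releasing the lock (the patch's new
 * ordering). Wakers only need "waiters != 0" to hold while a waiter
 * may still be queued; whether the decrement happens before or after
 * the unlock does not change that.
 */
static void bucket_unlock(struct bucket *hb)
{
	atomic_fetch_sub(&hb->waiters, 1);
	pthread_mutex_unlock(&hb->lock);
}

/* Waker side: skip taking the lock entirely when nobody waits. */
static int bucket_has_waiters(struct bucket *hb)
{
	return atomic_load(&hb->waiters) != 0;
}
```

[The gain for the series is that after the unlock the bucket is not touched again, so the hash reference can be dropped immediately afterwards.]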
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 2 +-
kernel/futex/requeue.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index fdfc3402278a1..6d12614dad5e8 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -558,8 +558,8 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
void futex_q_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
- spin_unlock(&hb->lock);
futex_hb_waiters_dec(hb);
+ spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b47bb764b3520..fb69dcdf74da8 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -456,8 +456,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
ret = futex_get_value_locked(&curval, uaddr1);
if (unlikely(ret)) {
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
ret = get_user(curval, uaddr1);
if (ret)
@@ -542,8 +542,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
* waiter::requeue_state is correct.
*/
case -EFAULT:
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -556,8 +556,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
* exit to complete.
* - EAGAIN: The user space value changed.
*/
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -674,9 +674,9 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
put_pi_state(pi_state);
out_unlock:
+ futex_hb_waiters_dec(hb2);
double_unlock_hb(hb1, hb2);
wake_up_q(&wake_q);
- futex_hb_waiters_dec(hb2);
return ret ? ret : task_count;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v8 08/15] futex: Prepare for reference counting of the process private hash end of operation.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (6 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 07/15] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-04 9:49 ` Peter Zijlstra
2025-02-03 13:59 ` [PATCH v8 09/15] futex: Re-evaluate the hash bucket after dropping the lock Sebastian Andrzej Siewior
` (7 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
To support runtime resizing of the process private hash, it's
required to add a reference count to the hash structure. The
reference count ensures that the hash cannot be resized or freed
while a task is operating on it.
The reference count will be obtained within futex_hash() and dropped
once the hash bucket is unlocked and no longer required for the
particular operation (queue, unqueue, wakeup etc.).
This is achieved by:
- appending _put() to existing functions so it's clear that they
also put the hash reference and fixing up the usage sites
- providing new helpers, which combine common operations (unlock,
put), and using them at the appropriate places
- providing new helper for standalone reference counting
functionality and using them at places, where the unlock operation
needs to be separate.
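[A stripped-down sketch of the intended get/put lifetime rule (illustrative names and a plain C11 atomic; the real series layers this onto the futex hash, its locking and RCU):]

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative process-private hash object with a reference count. */
struct futex_private_hash {
	atomic_int	users;
	unsigned int	hash_mask;
	/* hash buckets would follow here */
};

/* futex_hash() would take a reference before handing out a bucket... */
static void hash_get(struct futex_private_hash *fph)
{
	atomic_fetch_add(&fph->users, 1);
}

/* ...and the *_put() helpers drop it once the bucket is unlocked and
 * no longer needed. Returns 1 when the last reference is gone, i.e.
 * when a resize may free or replace the hash.
 */
static int hash_put(struct futex_private_hash *fph)
{
	return atomic_fetch_sub(&fph->users, 1) == 1;
}
```

[Every futex_hash() is thus balanced by exactly one put, which is why the series renames the combined helpers to make the put visible at the call sites.]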
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
io_uring/futex.c | 2 +-
kernel/futex/core.c | 12 ++++++++----
kernel/futex/futex.h | 31 ++++++++++++++++++++-----------
kernel/futex/pi.c | 19 ++++++++++---------
kernel/futex/requeue.c | 12 ++++++------
kernel/futex/waitwake.c | 23 ++++++++++++-----------
6 files changed, 57 insertions(+), 42 deletions(-)
diff --git a/io_uring/futex.c b/io_uring/futex.c
index 3159a2b7eeca1..2141811077b81 100644
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -338,7 +338,7 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
hlist_add_head(&req->hash_node, &ctx->futex_list);
io_ring_submit_unlock(ctx, issue_flags);
- futex_queue(&ifd->q, hb);
+ futex_queue_put(&ifd->q, hb);
return IOU_ISSUE_SKIP_COMPLETE;
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 6d12614dad5e8..b54fcb1c6248d 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -152,6 +152,9 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
return &futex_queues[hash & (futex_hashsize - 1)];
}
+void futex_hash_put(struct futex_hash_bucket *hb)
+{
+}
/**
* futex_setup_timer - set up the sleeping hrtimer.
@@ -543,8 +546,8 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
* waiting for the spinlock. This is safe as all futex_q_lock()
- * users end up calling futex_queue(). Similarly, for housekeeping,
- * decrement the counter at futex_q_unlock() when some error has
+ * users end up calling futex_queue_put(). Similarly, for housekeeping,
+ * decrement the counter at futex_q_unlock_put() when some error has
* occurred and we don't end up adding the task to the list.
*/
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
@@ -555,11 +558,12 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
return hb;
}
-void futex_q_unlock(struct futex_hash_bucket *hb)
+void futex_q_unlock_put(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
futex_hb_waiters_dec(hb);
spin_unlock(&hb->lock);
+ futex_hash_put(hb);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
@@ -586,7 +590,7 @@ void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
* @q: The futex_q to unqueue
*
* The q->lock_ptr must not be held by the caller. A call to futex_unqueue() must
- * be paired with exactly one earlier call to futex_queue().
+ * be paired with exactly one earlier call to futex_queue_put().
*
* Return:
* - 1 - if the futex_q was still queued (and we removed unqueued it);
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 99b32e728c4ad..36627617f7ced 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -202,6 +202,7 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+extern void futex_hash_put(struct futex_hash_bucket *hb);
/**
* futex_match - Check whether two futex keys are equal
@@ -288,23 +289,29 @@ extern void __futex_unqueue(struct futex_q *q);
extern void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb);
extern int futex_unqueue(struct futex_q *q);
+static inline void futex_hb_unlock_put(struct futex_hash_bucket *hb)
+{
+ spin_unlock(&hb->lock);
+ futex_hash_put(hb);
+}
+
/**
- * futex_queue() - Enqueue the futex_q on the futex_hash_bucket
+ * futex_queue_put() - Enqueue the futex_q on the futex_hash_bucket
* @q: The futex_q to enqueue
* @hb: The destination hash bucket
*
- * The hb->lock must be held by the caller, and is released here. A call to
- * futex_queue() is typically paired with exactly one call to futex_unqueue(). The
- * exceptions involve the PI related operations, which may use futex_unqueue_pi()
- * or nothing if the unqueue is done as part of the wake process and the unqueue
- * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
- * an example).
+ * The hb->lock must be held by the caller, and is released here and the reference
+ * on the hb is dropped. A call to futex_queue_put() is typically paired with
+ * exactly one call to futex_unqueue(). The exceptions involve the PI related
+ * operations, which may use futex_unqueue_pi() or nothing if the unqueue is
+ * done as part of the wake process and the unqueue state is implicit in the
+ * state of woken task (see futex_wait_requeue_pi() for an example).
*/
-static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
+static inline void futex_queue_put(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
- spin_unlock(&hb->lock);
+ futex_hb_unlock_put(hb);
}
extern void futex_unqueue_pi(struct futex_q *q);
@@ -350,7 +357,7 @@ static inline int futex_hb_waiters_pending(struct futex_hash_bucket *hb)
}
extern struct futex_hash_bucket *futex_q_lock(struct futex_q *q);
-extern void futex_q_unlock(struct futex_hash_bucket *hb);
+extern void futex_q_unlock_put(struct futex_hash_bucket *hb);
extern int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
@@ -380,11 +387,13 @@ double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
}
static inline void
-double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
+double_unlock_hb_put(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
spin_unlock(&hb1->lock);
if (hb1 != hb2)
spin_unlock(&hb2->lock);
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
}
/* syscalls */
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index daea650b16f51..797270228665a 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -217,9 +217,9 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
/*
* We get here with hb->lock held, and having found a
* futex_top_waiter(). This means that futex_lock_pi() of said futex_q
- * has dropped the hb->lock in between futex_queue() and futex_unqueue_pi(),
- * which in turn means that futex_lock_pi() still has a reference on
- * our pi_state.
+ * has dropped the hb->lock in between futex_queue_put() and
+ * futex_unqueue_pi(), which in turn means that futex_lock_pi() still
+ * has a reference on our pi_state.
*
* The waiter holding a reference on @pi_state also protects against
* the unlocked put_pi_state() in futex_unlock_pi(), futex_lock_pi()
@@ -963,7 +963,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
* exit to complete.
* - EAGAIN: The user space value changed.
*/
- futex_q_unlock(hb);
+ futex_q_unlock_put(hb);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -1083,7 +1083,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
goto out;
out_unlock_put_key:
- futex_q_unlock(hb);
+ futex_q_unlock_put(hb);
out:
if (to) {
@@ -1093,7 +1093,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
return ret != -EINTR ? ret : -ERESTARTNOINTR;
uaddr_faulted:
- futex_q_unlock(hb);
+ futex_q_unlock_put(hb);
ret = fault_in_user_writeable(uaddr);
if (ret)
@@ -1193,7 +1193,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
}
get_pi_state(pi_state);
- spin_unlock(&hb->lock);
+ futex_hb_unlock_put(hb);
/* drops pi_state->pi_mutex.wait_lock */
ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
@@ -1232,7 +1232,8 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
* owner.
*/
if ((ret = futex_cmpxchg_value_locked(&curval, uaddr, uval, 0))) {
- spin_unlock(&hb->lock);
+ futex_hb_unlock_put(hb);
+
switch (ret) {
case -EFAULT:
goto pi_faulted;
@@ -1252,7 +1253,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
ret = (curval == uval) ? 0 : -EAGAIN;
out_unlock:
- spin_unlock(&hb->lock);
+ futex_hb_unlock_put(hb);
return ret;
pi_retry:
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index fb69dcdf74da8..217cec5c8302e 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -58,7 +58,7 @@ enum {
};
const struct futex_q futex_q_init = {
- /* list gets initialized in futex_queue()*/
+ /* list gets initialized in futex_queue_put()*/
.wake = futex_wake_mark,
.key = FUTEX_KEY_INIT,
.bitset = FUTEX_BITSET_MATCH_ANY,
@@ -457,7 +457,7 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
if (unlikely(ret)) {
futex_hb_waiters_dec(hb2);
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb_put(hb1, hb2);
ret = get_user(curval, uaddr1);
if (ret)
@@ -543,7 +543,7 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
*/
case -EFAULT:
futex_hb_waiters_dec(hb2);
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb_put(hb1, hb2);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -557,7 +557,7 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
* - EAGAIN: The user space value changed.
*/
futex_hb_waiters_dec(hb2);
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb_put(hb1, hb2);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -675,7 +675,7 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
out_unlock:
futex_hb_waiters_dec(hb2);
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb_put(hb1, hb2);
wake_up_q(&wake_q);
return ret ? ret : task_count;
}
@@ -814,7 +814,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
* shared futexes. We need to compare the keys:
*/
if (futex_match(&q.key, &key2)) {
- futex_q_unlock(hb);
+ futex_q_unlock_put(hb);
ret = -EINVAL;
goto out;
}
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index eb86a7ade06a2..8027195802ca1 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -195,7 +195,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
}
}
- spin_unlock(&hb->lock);
+ futex_hb_unlock_put(hb);
wake_up_q(&wake_q);
return ret;
}
@@ -273,7 +273,7 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
double_lock_hb(hb1, hb2);
op_ret = futex_atomic_op_inuser(op, uaddr2);
if (unlikely(op_ret < 0)) {
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb_put(hb1, hb2);
if (!IS_ENABLED(CONFIG_MMU) ||
unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
@@ -326,7 +326,7 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
}
out_unlock:
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb_put(hb1, hb2);
wake_up_q(&wake_q);
return ret;
}
@@ -334,7 +334,7 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
static long futex_wait_restart(struct restart_block *restart);
/**
- * futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
+ * futex_wait_queue() - futex_queue_put() and wait for wakeup, timeout, or signal
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
@@ -345,11 +345,11 @@ void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
- * futex_queue() calls spin_unlock() upon completion, both serializing
+ * futex_queue_put() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
- futex_queue(q, hb);
+ futex_queue_put(q, hb);
/* Arm the timer */
if (timeout)
@@ -460,11 +460,12 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
* next futex. Queue each futex at this moment so hb can
* be unlocked.
*/
- futex_queue(q, hb);
+ futex_queue_put(q, hb);
continue;
}
- futex_q_unlock(hb);
+ futex_q_unlock_put(hb);
+
__set_current_state(TASK_RUNNING);
/*
@@ -623,7 +624,7 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
- futex_q_unlock(*hb);
+ futex_q_unlock_put(*hb);
ret = get_user(uval, uaddr);
if (ret)
@@ -636,7 +637,7 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
}
if (uval != val) {
- futex_q_unlock(*hb);
+ futex_q_unlock_put(*hb);
ret = -EWOULDBLOCK;
}
@@ -664,7 +665,7 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
if (ret)
return ret;
- /* futex_queue and wait for wakeup, timeout, or a signal. */
+ /* futex_queue_put() and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
--
2.47.2
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v8 08/15] futex: Prepare for reference counting of the process private hash end of operation.
2025-02-03 13:59 ` [PATCH v8 08/15] futex: Prepare for reference counting of the process private hash end of operation Sebastian Andrzej Siewior
@ 2025-02-04 9:49 ` Peter Zijlstra
2025-02-05 7:54 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-04 9:49 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:28PM +0100, Sebastian Andrzej Siewior wrote:
> @@ -555,11 +558,12 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
> return hb;
> }
>
> -void futex_q_unlock(struct futex_hash_bucket *hb)
> +void futex_q_unlock_put(struct futex_hash_bucket *hb)
> __releases(&hb->lock)
> {
> futex_hb_waiters_dec(hb);
> spin_unlock(&hb->lock);
> + futex_hash_put(hb);
> }
Here you don't
> @@ -288,23 +289,29 @@ extern void __futex_unqueue(struct futex_q *q);
> extern void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb);
> extern int futex_unqueue(struct futex_q *q);
>
> +static inline void futex_hb_unlock_put(struct futex_hash_bucket *hb)
> +{
> + spin_unlock(&hb->lock);
> + futex_hash_put(hb);
> +}
> +
> /**
> - * futex_queue() - Enqueue the futex_q on the futex_hash_bucket
> + * futex_queue_put() - Enqueue the futex_q on the futex_hash_bucket
> * @q: The futex_q to enqueue
> * @hb: The destination hash bucket
> *
> - * The hb->lock must be held by the caller, and is released here. A call to
> - * futex_queue() is typically paired with exactly one call to futex_unqueue(). The
> - * exceptions involve the PI related operations, which may use futex_unqueue_pi()
> - * or nothing if the unqueue is done as part of the wake process and the unqueue
> - * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
> - * an example).
> + * The hb->lock must be held by the caller, and is released here and the reference
> + * on the hb is dropped. A call to futex_queue_put() is typically paired with
> + * exactly one call to futex_unqueue(). The exceptions involve the PI related
> + * operations, which may use futex_unqueue_pi() or nothing if the unqueue is
> + * done as part of the wake process and the unqueue state is implicit in the
> + * state of woken task (see futex_wait_requeue_pi() for an example).
> */
> -static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
> +static inline void futex_queue_put(struct futex_q *q, struct futex_hash_bucket *hb)
> __releases(&hb->lock)
> {
> __futex_queue(q, hb);
> - spin_unlock(&hb->lock);
> + futex_hb_unlock_put(hb);
> }
And here you do.
> @@ -380,11 +387,13 @@ double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> }
>
> static inline void
> -double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> +double_unlock_hb_put(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> {
> spin_unlock(&hb1->lock);
> if (hb1 != hb2)
> spin_unlock(&hb2->lock);
> + futex_hash_put(hb1);
> + futex_hash_put(hb2);
> }
>
This seems horribly inconsistent and makes my head hurt. Where are the
matching gets for double_lock_hb() ?
* Re: [PATCH v8 08/15] futex: Prepare for reference counting of the process private hash end of operation.
2025-02-04 9:49 ` Peter Zijlstra
@ 2025-02-05 7:54 ` Sebastian Andrzej Siewior
2025-02-05 9:37 ` Peter Zijlstra
0 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-05 7:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-04 10:49:22 [+0100], Peter Zijlstra wrote:
> On Mon, Feb 03, 2025 at 02:59:28PM +0100, Sebastian Andrzej Siewior wrote:
>
> > @@ -555,11 +558,12 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
> > return hb;
> > }
> >
> > -void futex_q_unlock(struct futex_hash_bucket *hb)
> > +void futex_q_unlock_put(struct futex_hash_bucket *hb)
> > __releases(&hb->lock)
> > {
> > futex_hb_waiters_dec(hb);
> > spin_unlock(&hb->lock);
> > + futex_hash_put(hb);
> > }
>
> Here you don't
unlock + put.
> > @@ -288,23 +289,29 @@ extern void __futex_unqueue(struct futex_q *q);
> > extern void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb);
> > extern int futex_unqueue(struct futex_q *q);
> >
> > +static inline void futex_hb_unlock_put(struct futex_hash_bucket *hb)
> > +{
> > + spin_unlock(&hb->lock);
> > + futex_hash_put(hb);
> > +}
> > +
> > /**
> > - * futex_queue() - Enqueue the futex_q on the futex_hash_bucket
> > + * futex_queue_put() - Enqueue the futex_q on the futex_hash_bucket
> > * @q: The futex_q to enqueue
> > * @hb: The destination hash bucket
> > *
> > - * The hb->lock must be held by the caller, and is released here. A call to
> > - * futex_queue() is typically paired with exactly one call to futex_unqueue(). The
> > - * exceptions involve the PI related operations, which may use futex_unqueue_pi()
> > - * or nothing if the unqueue is done as part of the wake process and the unqueue
> > - * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
> > - * an example).
> > + * The hb->lock must be held by the caller, and is released here and the reference
> > + * on the hb is dropped. A call to futex_queue_put() is typically paired with
> > + * exactly one call to futex_unqueue(). The exceptions involve the PI related
> > + * operations, which may use futex_unqueue_pi() or nothing if the unqueue is
> > + * done as part of the wake process and the unqueue state is implicit in the
> > + * state of woken task (see futex_wait_requeue_pi() for an example).
> > */
> > -static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
> > +static inline void futex_queue_put(struct futex_q *q, struct futex_hash_bucket *hb)
> > __releases(&hb->lock)
> > {
> > __futex_queue(q, hb);
> > - spin_unlock(&hb->lock);
> > + futex_hb_unlock_put(hb);
> > }
>
> And here you do.
unlock + put. What do I don't do?
> > @@ -380,11 +387,13 @@ double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> > }
> >
> > static inline void
> > -double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> > +double_unlock_hb_put(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> > {
> > spin_unlock(&hb1->lock);
> > if (hb1 != hb2)
> > spin_unlock(&hb2->lock);
> > + futex_hash_put(hb1);
> > + futex_hash_put(hb2);
> > }
> >
>
> This seems horribly inconsistent and makes my head hurt. Where are the
> matching gets for double_lock_hb() ?
There are in futex_hash().
Sebastian
* Re: [PATCH v8 08/15] futex: Prepare for reference counting of the process private hash end of operation.
2025-02-05 7:54 ` Sebastian Andrzej Siewior
@ 2025-02-05 9:37 ` Peter Zijlstra
0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-05 9:37 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Wed, Feb 05, 2025 at 08:54:05AM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-02-04 10:49:22 [+0100], Peter Zijlstra wrote:
> > On Mon, Feb 03, 2025 at 02:59:28PM +0100, Sebastian Andrzej Siewior wrote:
> >
> > > @@ -555,11 +558,12 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
> > > return hb;
> > > }
> > >
> > > -void futex_q_unlock(struct futex_hash_bucket *hb)
> > > +void futex_q_unlock_put(struct futex_hash_bucket *hb)
> > > __releases(&hb->lock)
> > > {
> > > futex_hb_waiters_dec(hb);
> > > spin_unlock(&hb->lock);
> > > + futex_hash_put(hb);
> > > }
> >
> > Here you don't
>
> unlock + put.
>
> > > @@ -288,23 +289,29 @@ extern void __futex_unqueue(struct futex_q *q);
> > > extern void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb);
> > > extern int futex_unqueue(struct futex_q *q);
> > >
> > > +static inline void futex_hb_unlock_put(struct futex_hash_bucket *hb)
> > > +{
> > > + spin_unlock(&hb->lock);
> > > + futex_hash_put(hb);
> > > +}
> > > +
> > > /**
> > > - * futex_queue() - Enqueue the futex_q on the futex_hash_bucket
> > > + * futex_queue_put() - Enqueue the futex_q on the futex_hash_bucket
> > > * @q: The futex_q to enqueue
> > > * @hb: The destination hash bucket
> > > *
> > > - * The hb->lock must be held by the caller, and is released here. A call to
> > > - * futex_queue() is typically paired with exactly one call to futex_unqueue(). The
> > > - * exceptions involve the PI related operations, which may use futex_unqueue_pi()
> > > - * or nothing if the unqueue is done as part of the wake process and the unqueue
> > > - * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
> > > - * an example).
> > > + * The hb->lock must be held by the caller, and is released here and the reference
> > > + * on the hb is dropped. A call to futex_queue_put() is typically paired with
> > > + * exactly one call to futex_unqueue(). The exceptions involve the PI related
> > > + * operations, which may use futex_unqueue_pi() or nothing if the unqueue is
> > > + * done as part of the wake process and the unqueue state is implicit in the
> > > + * state of woken task (see futex_wait_requeue_pi() for an example).
> > > */
> > > -static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
> > > +static inline void futex_queue_put(struct futex_q *q, struct futex_hash_bucket *hb)
> > > __releases(&hb->lock)
> > > {
> > > __futex_queue(q, hb);
> > > - spin_unlock(&hb->lock);
> > > + futex_hb_unlock_put(hb);
> > > }
> >
> > And here you do.
>
> unlock + put. What do I don't do?
Use this futex_hb_unlock_put() helper consistently :-)
> > > @@ -380,11 +387,13 @@ double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> > > }
> > >
> > > static inline void
> > > -double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> > > +double_unlock_hb_put(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> > > {
> > > spin_unlock(&hb1->lock);
> > > if (hb1 != hb2)
> > > spin_unlock(&hb2->lock);
> > > + futex_hash_put(hb1);
> > > + futex_hash_put(hb2);
> > > }
> > >
> >
> > This seems horribly inconsistent and makes my head hurt. Where are the
> > matching gets for double_lock_hb() ?
>
> There are in futex_hash().
Yeah, that took me a very long while to find. And also, ARGH at the
asymmetry of things.
* [PATCH v8 09/15] futex: Re-evaluate the hash bucket after dropping the lock
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (7 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 08/15] futex: Prepare for reference counting of the process private hash end of operation Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 10/15] futex: Introduce futex_get_locked_hb() Sebastian Andrzej Siewior
` (6 subsequent siblings)
15 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
In futex_requeue() and futex_wake_op() the hash bucket lock is
dropped in the failure paths for handling page faults and other
error scenarios. After that the code jumps back to retry_private
which relocks the hash bucket[s] under the assumption that the hash
bucket pointer which was retrieved via futex_hash() is still valid.
With resizable private hash buckets, that assumption is no longer
true as the waiters can be moved to a larger hash in the meantime.
Move the retry_private label above the futex_hash() invocations to handle
this correctly.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/requeue.c | 2 +-
kernel/futex/waitwake.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 217cec5c8302e..31ec543e7fdb3 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -443,10 +443,10 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
if (requeue_pi && futex_match(&key1, &key2))
return -EINVAL;
+retry_private:
hb1 = futex_hash(&key1);
hb2 = futex_hash(&key2);
-retry_private:
futex_hb_waiters_inc(hb2);
double_lock_hb(hb1, hb2);
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 8027195802ca1..98409ba9605a8 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -266,10 +266,10 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
if (unlikely(ret != 0))
return ret;
+retry_private:
hb1 = futex_hash(&key1);
hb2 = futex_hash(&key2);
-retry_private:
double_lock_hb(hb1, hb2);
op_ret = futex_atomic_op_inuser(op, uaddr2);
if (unlikely(op_ret < 0)) {
--
2.47.2
* [PATCH v8 10/15] futex: Introduce futex_get_locked_hb().
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (8 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 09/15] futex: Re-evaluate the hash bucket after dropping the lock Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 11/15] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
` (5 subsequent siblings)
15 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_lock_pi() and __fixup_pi_state_owner() acquire the
futex_q::lock_ptr without holding a reference assuming the previously
obtained hash bucket and the assigned lock_ptr are still valid. This
isn't the case once the private hash can be resized and becomes invalid
after the reference drop.
Introduce futex_get_locked_hb() to lock the hash bucket recorded in
futex_q::lock_ptr. The lock pointer is read in an RCU section to ensure
that it does not go away if the hash bucket has been replaced and the
old pointer has been observed. After locking, the pointer needs to be
compared to check whether it changed. If so, the hash bucket has been
replaced, the user has been moved to the new one, and lock_ptr has been
updated. The lock operation needs to be redone in this case.
Once the lock_ptr is the same, we can return the futex_hash_bucket it
belongs to as the locked hash bucket for the caller. This is important
because we don't own a reference, so the hash bucket is only valid as
long as we hold the lock.
This means that if the local hash is resized, this (old) hash bucket
remains valid as long as we hold the lock, because all users need to be
moved to the new hash bucket and have their lock_ptr updated first. The
task performing the resize will block on the held lock.
Special cases are an early return in futex_lock_pi() (due to a signal or
timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
futex_q::lock_ptr (and its matching hash bucket) is expected, but since
the waiter has been removed from the hash this can no longer be guaranteed.
Therefore a reference is acquired before the waiter is removed, and the
waiter drops it later; this prevents a resize in the meantime.
Add futex_get_locked_hb() and use it.
Acquire an additional reference in requeue_pi_wake_futex() and
futex_unlock_pi() while the futex_q is removed, denote this extra reference
in futex_q::drop_hb_ref and let the waiter drop the reference in this case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/futex/futex.h | 4 +++-
kernel/futex/pi.c | 17 +++++++++++++++--
kernel/futex/requeue.c | 16 ++++++++++++----
4 files changed, 71 insertions(+), 7 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b54fcb1c6248d..b0fb2b10a387c 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -156,6 +156,17 @@ void futex_hash_put(struct futex_hash_bucket *hb)
{
}
+/**
+ * futex_hash_get - Get an additional reference for the local hash.
+ * @hb: ptr to the private local hash.
+ *
+ * Obtain an additional reference for the already obtained hash bucket. The
+ * caller must already own a reference.
+ */
+void futex_hash_get(struct futex_hash_bucket *hb)
+{
+}
+
/**
* futex_setup_timer - set up the sleeping hrtimer.
* @time: ptr to the given timeout value
@@ -639,6 +650,36 @@ int futex_unqueue(struct futex_q *q)
return ret;
}
+struct futex_hash_bucket *futex_get_locked_hb(struct futex_q *q)
+{
+ struct futex_hash_bucket *hb;
+ spinlock_t *lock_ptr;
+
+ /*
+ * See futex_unqueue() why lock_ptr can change.
+ */
+ guard(rcu)();
+retry:
+ lock_ptr = READ_ONCE(q->lock_ptr);
+ spin_lock(lock_ptr);
+
+ if (unlikely(lock_ptr != q->lock_ptr)) {
+ spin_unlock(lock_ptr);
+ goto retry;
+ }
+
+ hb = container_of(lock_ptr, struct futex_hash_bucket, lock);
+ /*
+ * The caller needs to either hold a reference on the hash (to ensure
+ * that the hash is not resized) _or_ be enqueued on the hash. This
+ * ensures that futex_q::lock_ptr is updated while moved to the new
+ * hash during resize.
+ * Once the hash bucket is locked the resize operation, which might be
+ * in progress, will block on the lock.
+ */
+ return hb;
+}
+
/*
* PI futexes can not be requeued and must remove themselves from the hash
* bucket. The hash bucket lock (i.e. lock_ptr) is held.
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 36627617f7ced..5b33016648ecd 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -182,6 +182,7 @@ struct futex_q {
union futex_key *requeue_pi_key;
u32 bitset;
atomic_t requeue_state;
+ bool drop_hb_ref;
#ifdef CONFIG_PREEMPT_RT
struct rcuwait requeue_wait;
#endif
@@ -196,12 +197,13 @@ enum futex_access {
extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
enum futex_access rw);
-
+extern struct futex_hash_bucket *futex_get_locked_hb(struct futex_q *q);
extern struct hrtimer_sleeper *
futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
/**
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 797270228665a..c83fe575f954a 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -806,7 +806,7 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
break;
}
- spin_lock(q->lock_ptr);
+ futex_get_locked_hb(q);
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
/*
@@ -922,6 +922,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
struct rt_mutex_waiter rt_waiter;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
+ bool fast_path_ref_put = false;
DEFINE_WAKE_Q(wake_q);
int res, ret;
@@ -988,6 +989,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
/* Fixup the trylock return value: */
ret = ret ? 0 : -EWOULDBLOCK;
+ fast_path_ref_put = true;
goto no_block;
}
@@ -1014,6 +1016,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
*/
raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
spin_unlock(q.lock_ptr);
+ futex_hash_put(hb);
/*
* __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
* such that futex_unlock_pi() is guaranteed to observe the waiter when
@@ -1060,7 +1063,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
* spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
* the
*/
- spin_lock(q.lock_ptr);
+ hb = futex_get_locked_hb(&q);
/*
* Waiter is unqueued.
*/
@@ -1080,6 +1083,10 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
futex_unqueue_pi(&q);
spin_unlock(q.lock_ptr);
+ if (q.drop_hb_ref)
+ futex_hash_put(hb);
+ if (fast_path_ref_put)
+ futex_hash_put(hb);
goto out;
out_unlock_put_key:
@@ -1187,6 +1194,12 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
if (!rt_waiter) {
+ /*
+ * Acquire a reference for the leaving waiter to ensure
+ * valid futex_q::lock_ptr.
+ */
+ futex_hash_get(hb);
+ top_waiter->drop_hb_ref = true;
__futex_unqueue(top_waiter);
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
goto retry_hb;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 31ec543e7fdb3..167204e856fec 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -231,7 +231,12 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
WARN_ON(!q->rt_waiter);
q->rt_waiter = NULL;
-
+ /*
+ * Acquire a reference for the waiter to ensure valid
+ * futex_q::lock_ptr.
+ */
+ futex_hash_get(hb);
+ q->drop_hb_ref = true;
q->lock_ptr = &hb->lock;
/* Signal locked state to the waiter */
@@ -825,7 +830,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
switch (futex_requeue_pi_wakeup_sync(&q)) {
case Q_REQUEUE_PI_IGNORE:
/* The waiter is still on uaddr1 */
- spin_lock(&hb->lock);
+ hb = futex_get_locked_hb(&q);
+
ret = handle_early_requeue_pi_wakeup(hb, &q, to);
spin_unlock(&hb->lock);
break;
@@ -833,7 +839,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
case Q_REQUEUE_PI_LOCKED:
/* The requeue acquired the lock */
if (q.pi_state && (q.pi_state->owner != current)) {
- spin_lock(q.lock_ptr);
+ futex_get_locked_hb(&q);
ret = fixup_pi_owner(uaddr2, &q, true);
/*
* Drop the reference to the pi state which the
@@ -860,7 +866,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
ret = 0;
- spin_lock(q.lock_ptr);
+ futex_get_locked_hb(&q);
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
@@ -892,6 +898,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
default:
BUG();
}
+ if (q.drop_hb_ref)
+ futex_hash_put(hb);
out:
if (to) {
--
2.47.2
^ permalink raw reply related [flat|nested] 52+ messages in thread

* [PATCH v8 11/15] futex: Acquire a hash reference in futex_wait_multiple_setup().
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (9 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 10/15] futex: Introduce futex_get_locked_hb() Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 12/15] futex: Allow to re-allocate the private local hash Sebastian Andrzej Siewior
` (4 subsequent siblings)
15 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_wait_multiple_setup() changes task_struct::__state to
!TASK_RUNNING and then enqueues on multiple futexes. Every
futex_q_lock() acquires a reference on the global hash which is dropped
later.
If a rehash is in progress then the loop will block on
mm_struct::futex_hash_bucket until the rehash completes, and this
blocking destroys the previously set task_struct::__state.
Acquire a reference on the local hash up front to avoid blocking on
mm_struct::futex_hash_bucket.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 10 ++++++++++
kernel/futex/futex.h | 2 ++
kernel/futex/waitwake.c | 21 +++++++++++++++++++--
3 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b0fb2b10a387c..7130019aa9ec6 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -129,6 +129,11 @@ static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
return &fhb[hash & hash_mask];
}
+struct futex_private_hash *futex_get_private_hash(void)
+{
+ return NULL;
+}
+
/**
* futex_hash - Return the hash bucket in the global or local hash
* @key: Pointer to the futex key for which the hash is calculated
@@ -152,6 +157,11 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
return &futex_queues[hash & (futex_hashsize - 1)];
}
+bool futex_put_private_hash(struct futex_private_hash *hb_p)
+{
+ return false;
+}
+
void futex_hash_put(struct futex_hash_bucket *hb)
{
}
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 5b33016648ecd..d6fa6f663d9ad 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -205,6 +205,8 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
+extern struct futex_private_hash *futex_get_private_hash(void);
+extern bool futex_put_private_hash(struct futex_private_hash *hb_p);
/**
* futex_match - Check whether two futex keys are equal
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 98409ba9605a8..3d57b47692f57 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -395,7 +395,7 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
}
/**
- * futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
+ * __futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
* @vs: The futex list to wait on
* @count: The size of the list
* @woken: Index of the last woken futex, if any. Used to notify the
@@ -410,7 +410,7 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
* - 0 - Success
* - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
*/
-int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
+static int __futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
{
struct futex_hash_bucket *hb;
bool retry = false;
@@ -499,6 +499,23 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
return 0;
}
+int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
+{
+ struct futex_private_hash *hb_p;
+ int ret;
+
+ /*
+ * Assume the futex is private and acquire a reference on the private
+ * hash to avoid blocking on mm_struct::futex_hash_bucket during rehash
+ * after changing the task state.
+ */
+ hb_p = futex_get_private_hash();
+ ret = __futex_wait_multiple_setup(vs, count, woken);
+ if (hb_p)
+ futex_put_private_hash(hb_p);
+ return ret;
+}
+
/**
* futex_sleep_multiple - Check sleeping conditions and sleep
* @vs: List of futexes to wait for
--
2.47.2
* [PATCH v8 12/15] futex: Allow to re-allocate the private local hash.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (10 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 11/15] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-04 11:05 ` Peter Zijlstra
2025-02-03 13:59 ` [PATCH v8 13/15] futex: Resize local futex hash table based on number of threads Sebastian Andrzej Siewior
` (3 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The mm_struct::futex_hash_lock guards assignment and replacement of the
futex_hash_bucket. The futex_hash_allocate()/PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime to resize an already existing
internal private futex_hash_bucket.
The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it has enqueued itself on the hash bucket. No
reference is held while the task is scheduled out waiting for the
wake up.
The resize allocates a new struct futex_private_hash and drops the
initial reference under the mm_struct::futex_hash_lock. If the reference
drop results in destruction of the object then users currently queued on
the local hash will be requeued on the new local hash. At the end
mm_struct::futex_phash is updated, the old pointer is RCU freed
and the mutex is dropped.
If the reference drop does not result in destruction of the object then
the new pointer is saved as mm_struct::futex_phash_new and the
replacement is delayed. The user dropping the last reference is not
always in a position to perform the replacement. For instance
futex_wait_queue() drops the reference after changing its task state,
and that state would be clobbered again while acquiring the
futex_hash_lock. Therefore the replacement is deferred to the next task
that acquires a reference on the current local hash.
This scheme preserves the requirement that all waiters and wakers of the
same address always block on the same futex_hash_bucket::lock.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 5 +-
include/linux/mm_types.h | 7 +-
kernel/futex/core.c | 252 +++++++++++++++++++++++++++++++++++----
kernel/futex/futex.h | 1 +
kernel/futex/requeue.c | 5 +
kernel/futex/waitwake.c | 4 +-
6 files changed, 243 insertions(+), 31 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index bad377c30de5e..bfb38764bac7a 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -83,12 +83,13 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
- mm->futex_hash_bucket = NULL;
+ rcu_assign_pointer(mm->futex_phash, NULL);
+ mutex_init(&mm->futex_hash_lock);
}
static inline bool futex_hash_requires_allocation(void)
{
- if (current->mm->futex_hash_bucket)
+ if (current->mm->futex_phash)
return false;
return true;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c20f2310d78ca..19abbc870e0a9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -30,7 +30,7 @@
#define INIT_PASID 0
struct address_space;
-struct futex_hash_bucket;
+struct futex_private_hash;
struct mem_cgroup;
/*
@@ -938,8 +938,9 @@ struct mm_struct {
seqcount_t mm_lock_seq;
#endif
#ifdef CONFIG_FUTEX
- unsigned int futex_hash_mask;
- struct futex_hash_bucket *futex_hash_bucket;
+ struct mutex futex_hash_lock;
+ struct futex_private_hash __rcu *futex_phash;
+ struct futex_private_hash *futex_phash_new;
#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 7130019aa9ec6..e1bf43f7eb277 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
+#include <linux/rcuref.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -56,6 +57,14 @@ static struct {
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
+struct futex_private_hash {
+ rcuref_t users;
+ unsigned int hash_mask;
+ struct rcu_head rcu;
+ bool initial_ref_dropped;
+ bool released;
+ struct futex_hash_bucket queues[];
+};
/*
* Fault injections for futexes.
@@ -129,9 +138,122 @@ static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
return &fhb[hash & hash_mask];
}
+static void futex_rehash_current_users(struct futex_private_hash *old,
+ struct futex_private_hash *new)
+{
+ struct futex_hash_bucket *hb_old, *hb_new;
+ unsigned int slots = old->hash_mask + 1;
+ u32 hash_mask = new->hash_mask;
+ unsigned int i;
+
+ for (i = 0; i < slots; i++) {
+ struct futex_q *this, *tmp;
+
+ hb_old = &old->queues[i];
+
+ spin_lock(&hb_old->lock);
+ plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
+
+ plist_del(&this->list, &hb_old->chain);
+ futex_hb_waiters_dec(hb_old);
+
+ WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
+
+ hb_new = futex_hash_private(&this->key, new->queues, hash_mask);
+ futex_hb_waiters_inc(hb_new);
+ /*
+ * The new pointer isn't published yet but an already
+ * moved user can be unqueued due to timeout or signal.
+ */
+ spin_lock_nested(&hb_new->lock, SINGLE_DEPTH_NESTING);
+ plist_add(&this->list, &hb_new->chain);
+ this->lock_ptr = &hb_new->lock;
+ spin_unlock(&hb_new->lock);
+ }
+ spin_unlock(&hb_old->lock);
+ }
+}
+
+static void futex_assign_new_hash(struct futex_private_hash *hb_p_new,
+ struct mm_struct *mm)
+{
+ bool drop_init_ref = hb_p_new != NULL;
+ struct futex_private_hash *hb_p;
+
+ if (!hb_p_new) {
+ hb_p_new = mm->futex_phash_new;
+ mm->futex_phash_new = NULL;
+ }
+ /* Someone was quicker, the current mask is valid */
+ if (!hb_p_new)
+ return;
+
+ hb_p = rcu_dereference_check(mm->futex_phash,
+ lockdep_is_held(&mm->futex_hash_lock));
+ if (hb_p) {
+ if (hb_p->hash_mask >= hb_p_new->hash_mask) {
+ /* It was increased again while we were waiting */
+ kvfree(hb_p_new);
+ return;
+ }
+ /*
+ * If the caller started the resize then the initial reference
+ * needs to be dropped. If the object can not be deconstructed
+ * we save hb_p_new for later and ensure the reference counter
+ * is not dropped again.
+ */
+ if (drop_init_ref &&
+ (hb_p->initial_ref_dropped || !futex_put_private_hash(hb_p))) {
+ mm->futex_phash_new = hb_p_new;
+ hb_p->initial_ref_dropped = true;
+ return;
+ }
+ if (!READ_ONCE(hb_p->released)) {
+ mm->futex_phash_new = hb_p_new;
+ return;
+ }
+
+ futex_rehash_current_users(hb_p, hb_p_new);
+ }
+ rcu_assign_pointer(mm->futex_phash, hb_p_new);
+ kvfree_rcu(hb_p, rcu);
+}
+
struct futex_private_hash *futex_get_private_hash(void)
{
- return NULL;
+ struct mm_struct *mm = current->mm;
+ /*
+ * Ideally we don't loop. If there is a replacement in progress
+ * then a new private hash is already prepared and a reference can't be
+ * obtained once the last user has dropped its reference.
+ * In that case we block on mm_struct::futex_hash_lock and either have
+ * to perform the replacement or wait while someone else is doing the
+ * job. Either way, on the second iteration we acquire a reference on the
+ * new private hash or loop again because a new replacement has been
+ * requested.
+ */
+again:
+ scoped_guard(rcu) {
+ struct futex_private_hash *hb_p;
+
+ hb_p = rcu_dereference(mm->futex_phash);
+ if (!hb_p)
+ return NULL;
+
+ if (rcuref_get(&hb_p->users))
+ return hb_p;
+ }
+ scoped_guard(mutex, ¤t->mm->futex_hash_lock)
+ futex_assign_new_hash(NULL, mm);
+ goto again;
+}
+
+static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
+{
+ if (!futex_key_is_private(key))
+ return NULL;
+
+ return futex_get_private_hash();
}
/**
@@ -144,12 +266,12 @@ struct futex_private_hash *futex_get_private_hash(void)
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
- struct futex_hash_bucket *fhb;
+ struct futex_private_hash *hb_p;
u32 hash;
- fhb = current->mm->futex_hash_bucket;
- if (fhb && futex_key_is_private(key))
- return futex_hash_private(key, fhb, current->mm->futex_hash_mask);
+ hb_p = futex_get_private_hb(key);
+ if (hb_p)
+ return futex_hash_private(key, hb_p->queues, hb_p->hash_mask);
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / 4,
@@ -159,11 +281,25 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
bool futex_put_private_hash(struct futex_private_hash *hb_p)
{
- return false;
+ bool released;
+
+ guard(preempt)();
+ released = rcuref_put_rcusafe(&hb_p->users);
+ if (released)
+ WRITE_ONCE(hb_p->released, true);
+ return released;
}
void futex_hash_put(struct futex_hash_bucket *hb)
{
+ struct futex_private_hash *hb_p;
+
+ if (hb->hb_slot == 0)
+ return;
+ hb_p = container_of(hb, struct futex_private_hash,
+ queues[hb->hb_slot - 1]);
+
+ futex_put_private_hash(hb_p);
}
/**
@@ -175,6 +311,14 @@ void futex_hash_put(struct futex_hash_bucket *hb)
*/
void futex_hash_get(struct futex_hash_bucket *hb)
{
+ struct futex_private_hash *hb_p;
+
+ if (hb->hb_slot == 0)
+ return;
+ hb_p = container_of(hb, struct futex_private_hash,
+ queues[hb->hb_slot - 1]);
+
+ WARN_ON_ONCE(!rcuref_get(&hb_p->users));
}
/**
@@ -622,6 +766,8 @@ int futex_unqueue(struct futex_q *q)
spinlock_t *lock_ptr;
int ret = 0;
+ /* RCU so lock_ptr is not going away during locking. */
+ guard(rcu)();
/* In the common case we don't take the spinlock, which is nice. */
retry:
/*
@@ -1032,10 +1178,22 @@ static void compat_exit_robust_list(struct task_struct *curr)
static void exit_pi_state_list(struct task_struct *curr)
{
struct list_head *next, *head = &curr->pi_state_list;
+ struct futex_private_hash *hb_p;
struct futex_pi_state *pi_state;
struct futex_hash_bucket *hb;
union futex_key key = FUTEX_KEY_INIT;
+ /*
+ * The mutex mm_struct::futex_hash_lock might be acquired.
+ */
+ might_sleep();
+ /*
+ * Ensure the hash remains stable (no resize) during the while loop
+ * below. The hb pointer is acquired under the pi_lock so we can't block
+ * on the mutex.
+ */
+ WARN_ON(curr != current);
+ hb_p = futex_get_private_hash();
/*
* We are a ZOMBIE and nobody can enqueue itself on
* pi_state_list anymore, but we have to be careful
@@ -1061,6 +1219,7 @@ static void exit_pi_state_list(struct task_struct *curr)
if (!refcount_inc_not_zero(&pi_state->refcount)) {
raw_spin_unlock_irq(&curr->pi_lock);
cpu_relax();
+ futex_hash_put(hb);
raw_spin_lock_irq(&curr->pi_lock);
continue;
}
@@ -1076,7 +1235,7 @@ static void exit_pi_state_list(struct task_struct *curr)
if (head->next != next) {
/* retain curr->pi_lock for the loop invariant */
raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
- spin_unlock(&hb->lock);
+ futex_hb_unlock_put(hb);
put_pi_state(pi_state);
continue;
}
@@ -1088,7 +1247,7 @@ static void exit_pi_state_list(struct task_struct *curr)
raw_spin_unlock(&curr->pi_lock);
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
- spin_unlock(&hb->lock);
+ futex_hb_unlock_put(hb);
rt_mutex_futex_unlock(&pi_state->pi_mutex);
put_pi_state(pi_state);
@@ -1096,6 +1255,8 @@ static void exit_pi_state_list(struct task_struct *curr)
raw_spin_lock_irq(&curr->pi_lock);
}
raw_spin_unlock_irq(&curr->pi_lock);
+ if (hb_p)
+ futex_put_private_hash(hb_p);
}
#else
static inline void exit_pi_state_list(struct task_struct *curr) { }
@@ -1209,8 +1370,9 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
-static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb, unsigned int slot)
{
+ fhb->hb_slot = slot;
atomic_set(&fhb->waiters, 0);
plist_head_init(&fhb->chain);
spin_lock_init(&fhb->lock);
@@ -1218,20 +1380,33 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
void futex_hash_free(struct mm_struct *mm)
{
- kvfree(mm->futex_hash_bucket);
+ struct futex_private_hash *hb_p;
+
+ kvfree(mm->futex_phash_new);
+ /*
+ * The mm_struct belonging to the task is removed so all threads, that
+ * ever accessed the private hash, are gone and the pointer can be
+ * accessed directly (omitting a RCU-read section).
+ * Since there can not be a thread holding a reference to the private
+ * hash we free it immediately.
+ */
+ hb_p = rcu_access_pointer(mm->futex_phash);
+ if (!hb_p)
+ return;
+
+ if (!hb_p->initial_ref_dropped && WARN_ON(!futex_put_private_hash(hb_p)))
+ return;
+
+ kvfree(hb_p);
}
static int futex_hash_allocate(unsigned int hash_slots)
{
- struct futex_hash_bucket *fhb;
+ struct futex_private_hash *hb_p, *hb_tofree = NULL;
+ struct mm_struct *mm = current->mm;
+ size_t alloc_size;
int i;
- if (current->mm->futex_hash_bucket)
- return -EALREADY;
-
- if (!thread_group_leader(current))
- return -EINVAL;
-
if (hash_slots == 0)
hash_slots = 16;
if (hash_slots < 2)
@@ -1241,16 +1416,39 @@ static int futex_hash_allocate(unsigned int hash_slots)
if (!is_power_of_2(hash_slots))
hash_slots = rounddown_pow_of_two(hash_slots);
- fhb = kvmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL_ACCOUNT);
- if (!fhb)
+ if (unlikely(check_mul_overflow(hash_slots, sizeof(struct futex_hash_bucket),
+ &alloc_size)))
return -ENOMEM;
- current->mm->futex_hash_mask = hash_slots - 1;
+ if (unlikely(check_add_overflow(alloc_size, sizeof(struct futex_private_hash),
+ &alloc_size)))
+ return -ENOMEM;
+
+ hb_p = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
+ if (!hb_p)
+ return -ENOMEM;
+
+ rcuref_init(&hb_p->users, 1);
+ hb_p->initial_ref_dropped = false;
+ hb_p->released = false;
+ hb_p->hash_mask = hash_slots - 1;
for (i = 0; i < hash_slots; i++)
- futex_hash_bucket_init(&fhb[i]);
+ futex_hash_bucket_init(&hb_p->queues[i], i + 1);
- current->mm->futex_hash_bucket = fhb;
+ scoped_guard(mutex, &mm->futex_hash_lock) {
+ if (mm->futex_phash_new) {
+ if (mm->futex_phash_new->hash_mask <= hb_p->hash_mask) {
+ hb_tofree = mm->futex_phash_new;
+ } else {
+ hb_tofree = hb_p;
+ hb_p = mm->futex_phash_new;
+ }
+ mm->futex_phash_new = NULL;
+ }
+ futex_assign_new_hash(hb_p, mm);
+ }
+ kvfree(hb_tofree);
return 0;
}
@@ -1261,8 +1459,12 @@ int futex_hash_allocate_default(void)
static int futex_hash_get_slots(void)
{
- if (current->mm->futex_hash_bucket)
- return current->mm->futex_hash_mask + 1;
+ struct futex_private_hash *hb_p;
+
+ guard(rcu)();
+ hb_p = rcu_dereference(current->mm->futex_phash);
+ if (hb_p)
+ return hb_p->hash_mask + 1;
return 0;
}
@@ -1304,7 +1506,7 @@ static int __init futex_init(void)
futex_hashsize = 1UL << futex_shift;
for (i = 0; i < futex_hashsize; i++)
- futex_hash_bucket_init(&futex_queues[i]);
+ futex_hash_bucket_init(&futex_queues[i], 0);
return 0;
}
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index d6fa6f663d9ad..127f333d3b0d5 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -115,6 +115,7 @@ static inline bool should_fail_futex(bool fshared)
*/
struct futex_hash_bucket {
atomic_t waiters;
+ unsigned int hb_slot;
spinlock_t lock;
struct plist_head chain;
} ____cacheline_aligned_in_smp;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 167204e856fec..eb506428c9574 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -87,6 +87,11 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
futex_hb_waiters_inc(hb2);
plist_add(&q->list, &hb2->chain);
q->lock_ptr = &hb2->lock;
+ /*
+ * hb1 and hb2 belong to the same futex_private_hash
+ * because if we managed to get a reference on hb1 then it can't be
+ * replaced. Therefore we avoid put(hb1)+get(hb2) here.
+ */
}
q->key = *key2;
}
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 3d57b47692f57..7dd75781c9b6b 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -173,8 +173,10 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
hb = futex_hash(&key);
/* Make sure we really have tasks to wakeup */
- if (!futex_hb_waiters_pending(hb))
+ if (!futex_hb_waiters_pending(hb)) {
+ futex_hash_put(hb);
return ret;
+ }
spin_lock(&hb->lock);
--
2.47.2
* Re: [PATCH v8 12/15] futex: Allow to re-allocate the private local hash.
2025-02-03 13:59 ` [PATCH v8 12/15] futex: Allow to re-allocate the private local hash Sebastian Andrzej Siewior
@ 2025-02-04 11:05 ` Peter Zijlstra
2025-02-05 8:00 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-04 11:05 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:32PM +0100, Sebastian Andrzej Siewior wrote:
> @@ -159,11 +281,25 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
>
> bool futex_put_private_hash(struct futex_private_hash *hb_p)
> {
> - return false;
> + bool released;
> +
> + guard(preempt)();
> + released = rcuref_put_rcusafe(&hb_p->users);
> + if (released)
> + WRITE_ONCE(hb_p->released, true);
> + return released;
> }
>
> void futex_hash_put(struct futex_hash_bucket *hb)
> {
> + struct futex_private_hash *hb_p;
> +
> + if (hb->hb_slot == 0)
> + return;
> + hb_p = container_of(hb, struct futex_private_hash,
> + queues[hb->hb_slot - 1]);
> +
> + futex_put_private_hash(hb_p);
> }
>
> /**
> @@ -175,6 +311,14 @@ void futex_hash_put(struct futex_hash_bucket *hb)
> */
> void futex_hash_get(struct futex_hash_bucket *hb)
> {
> + struct futex_private_hash *hb_p;
> +
> + if (hb->hb_slot == 0)
> + return;
> + hb_p = container_of(hb, struct futex_private_hash,
> + queues[hb->hb_slot - 1]);
> +
> + WARN_ON_ONCE(!rcuref_get(&hb_p->users));
> }
> --- a/kernel/futex/futex.h
> +++ b/kernel/futex/futex.h
> @@ -115,6 +115,7 @@ static inline bool should_fail_futex(bool fshared)
> */
> struct futex_hash_bucket {
> atomic_t waiters;
> + unsigned int hb_slot;
> spinlock_t lock;
> struct plist_head chain;
> } ____cacheline_aligned_in_smp;
*groan*
Just sticking a pointer in was too complicated?
void futex_hash_put(struct futex_hash_bucket *hb)
{
struct futex_private_hash *ph = hb->private;
if (!ph)
return;
futex_put_private_hash(ph);
}
2025-02-04 11:05 ` Peter Zijlstra
@ 2025-02-05 8:00 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-05 8:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-04 12:05:33 [+0100], Peter Zijlstra wrote:
> > @@ -175,6 +311,14 @@ void futex_hash_put(struct futex_hash_bucket *hb)
> > */
> > void futex_hash_get(struct futex_hash_bucket *hb)
> > {
> > + struct futex_private_hash *hb_p;
> > +
> > + if (hb->hb_slot == 0)
> > + return;
> > + hb_p = container_of(hb, struct futex_private_hash,
> > + queues[hb->hb_slot - 1]);
> > +
> > + WARN_ON_ONCE(!rcuref_get(&hb_p->users));
> > }
>
> > --- a/kernel/futex/futex.h
> > +++ b/kernel/futex/futex.h
> > @@ -115,6 +115,7 @@ static inline bool should_fail_futex(bool fshared)
> > */
> > struct futex_hash_bucket {
> > atomic_t waiters;
> > + unsigned int hb_slot;
> > spinlock_t lock;
> > struct plist_head chain;
> > } ____cacheline_aligned_in_smp;
>
> *groan*
>
> Just sticking a pointer in was too complicated?
It looked simpler. But okay.
> void futex_hash_put(struct futex_hash_bucket *hb)
> {
> struct futex_private_hash *ph = hb->private;
> if (!ph)
> return;
>
> futex_put_private_hash(ph);
> }
Sebastian
* [PATCH v8 13/15] futex: Resize local futex hash table based on number of threads.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (11 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 12/15] futex: Allow to re-allocate the private local hash Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-04 10:21 ` Peter Zijlstra
2025-02-03 13:59 ` [PATCH v8 14/15] futex: Use a hashmask instead of hashsize Sebastian Andrzej Siewior
` (2 subsequent siblings)
15 siblings, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Automatically size the local hash based on the number of threads. The
logic allocates 4 * number-of-threads slots, clamped between 16 and
futex_hashsize (the size of the system-wide hash table).
On CONFIG_BASE_SMALL configs the suggested size is always 2.
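The sizing rule above can be restated as a small standalone function. This is a userspace sketch (default_buckets and the local roundup_pow_of_two re-implementation are invented names for illustration; global_size stands in for futex_hashsize and is assumed to be a power of two):

```c
#include <assert.h>

/* Minimal re-implementation for the sketch; the kernel provides its own. */
static unsigned int roundup_pow_of_two(unsigned int n)
{
	unsigned int r = 1;

	while (r < n)
		r <<= 1;
	return r;
}

/* Default slot count: 16 <= roundup_pow_of_two(4 * threads) <= global_size. */
static unsigned int default_buckets(unsigned int threads, unsigned int global_size)
{
	unsigned int buckets = roundup_pow_of_two(4 * threads);

	if (buckets < 16)
		buckets = 16;
	if (buckets > global_size)
		buckets = global_size;
	return buckets;
}
```

So a single-threaded process still gets the 16-slot floor, a 10-thread process gets 64 slots (40 rounded up to a power of two), and very large thread counts are capped at the global hash size.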
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 12 ------------
kernel/fork.c | 4 +---
kernel/futex/core.c | 34 +++++++++++++++++++++++++++++++---
3 files changed, 32 insertions(+), 18 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index bfb38764bac7a..6469aeb76a150 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -87,13 +87,6 @@ static inline void futex_mm_init(struct mm_struct *mm)
mutex_init(&mm->futex_hash_lock);
}
-static inline bool futex_hash_requires_allocation(void)
-{
- if (current->mm->futex_phash)
- return false;
- return true;
-}
-
#else
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -116,11 +109,6 @@ static inline int futex_hash_allocate_default(void)
static inline void futex_hash_free(struct mm_struct *mm) { }
static inline void futex_mm_init(struct mm_struct *mm) { }
-static inline bool futex_hash_requires_allocation(void)
-{
- return false;
-}
-
#endif
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 824cc55d32ece..5e15e5b24f289 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2142,9 +2142,7 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
{
if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
return false;
- if (!thread_group_empty(current))
- return false;
- return futex_hash_requires_allocation();
+ return true;
}
/*
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index e1bf43f7eb277..9a12dccb1c995 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1411,8 +1411,8 @@ static int futex_hash_allocate(unsigned int hash_slots)
hash_slots = 16;
if (hash_slots < 2)
hash_slots = 2;
- if (hash_slots > 131072)
- hash_slots = 131072;
+ if (hash_slots > futex_hashsize)
+ hash_slots = futex_hashsize;
if (!is_power_of_2(hash_slots))
hash_slots = rounddown_pow_of_two(hash_slots);
@@ -1454,7 +1454,35 @@ static int futex_hash_allocate(unsigned int hash_slots)
int futex_hash_allocate_default(void)
{
- return futex_hash_allocate(0);
+ unsigned int threads, buckets, current_buckets = 0;
+ struct futex_private_hash *hb_p;
+
+ if (!current->mm)
+ return 0;
+
+ scoped_guard(rcu) {
+ threads = get_nr_threads(current);
+ hb_p = rcu_dereference(current->mm->futex_phash);
+ if (hb_p)
+ current_buckets = hb_p->hash_mask + 1;
+ }
+
+ if (IS_ENABLED(CONFIG_BASE_SMALL)) {
+ buckets = 2;
+
+ } else {
+ /*
+ * The default allocation will remain within
+ * 16 <= threads * 4 <= global hash size
+ */
+ buckets = roundup_pow_of_two(4 * threads);
+ buckets = max(buckets, 16);
+ buckets = min(buckets, futex_hashsize);
+ }
+ if (current_buckets >= buckets)
+ return 0;
+
+ return futex_hash_allocate(buckets);
}
static int futex_hash_get_slots(void)
--
2.47.2
* Re: [PATCH v8 13/15] futex: Resize local futex hash table based on number of threads.
2025-02-03 13:59 ` [PATCH v8 13/15] futex: Resize local futex hash table based on number of threads Sebastian Andrzej Siewior
@ 2025-02-04 10:21 ` Peter Zijlstra
2025-02-05 8:05 ` Sebastian Andrzej Siewior
2025-02-07 9:07 ` Sebastian Andrzej Siewior
0 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-04 10:21 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:33PM +0100, Sebastian Andrzej Siewior wrote:
> Automatically size the local hash based on the number of threads. The
> logic tries to allocate between 16 and futex_hashsize (the default for
> the system wide hash bucket) and uses 4 * number-of-threads.
> On CONFIG_BASE_SMALL configs the suggested size is always 2.
> + scoped_guard(rcu) {
> + threads = get_nr_threads(current);
How about something like:
threads = min(get_nr_threads(), num_online_cpus())
?
> + hb_p = rcu_dereference(current->mm->futex_phash);
> + if (hb_p)
> + current_buckets = hb_p->hash_mask + 1;
> + }
> +
> + if (IS_ENABLED(CONFIG_BASE_SMALL)) {
> + buckets = 2;
> +
Or... you just disable the local thing entirely for BASE_SMALL and have
it fall back to the global hash.
> + } else {
> + /*
> + * The default allocation will remain within
> + * 16 <= threads * 4 <= global hash size
> + */
> + buckets = roundup_pow_of_two(4 * threads);
> + buckets = max(buckets, 16);
> + buckets = min(buckets, futex_hashsize);
> + }
> + if (current_buckets >= buckets)
> + return 0;
> +
> + return futex_hash_allocate(buckets);
> }
>
> static int futex_hash_get_slots(void)
> --
> 2.47.2
>
2025-02-04 10:21 ` Peter Zijlstra
@ 2025-02-05 8:05 ` Sebastian Andrzej Siewior
2025-02-07 9:07 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-05 8:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-04 11:21:46 [+0100], Peter Zijlstra wrote:
> On Mon, Feb 03, 2025 at 02:59:33PM +0100, Sebastian Andrzej Siewior wrote:
> > Automatically size the local hash based on the number of threads. The
> > logic tries to allocate between 16 and futex_hashsize (the default for
> > the system wide hash bucket) and uses 4 * number-of-threads.
> > On CONFIG_BASE_SMALL configs the suggested size is always 2.
>
> > + scoped_guard(rcu) {
> > + threads = get_nr_threads(current);
>
> How about something like:
>
> threads = min(get_nr_threads(), num_online_cpus())
>
> ?
makes sense.
> > + hb_p = rcu_dereference(current->mm->futex_phash);
> > + if (hb_p)
> > + current_buckets = hb_p->hash_mask + 1;
> > + }
> > +
> > + if (IS_ENABLED(CONFIG_BASE_SMALL)) {
> > + buckets = 2;
> > +
>
> Or... you just disable the local thing entirely for BASE_SMALL and have
> it fall back to the global hash.
Okay, why not.
> > + } else {
> > + /*
> > + * The default allocation will remain within
> > + * 16 <= threads * 4 <= global hash size
> > + */
> > + buckets = roundup_pow_of_two(4 * threads);
> > + buckets = max(buckets, 16);
> > + buckets = min(buckets, futex_hashsize);
> > + }
> > + if (current_buckets >= buckets)
> > + return 0;
> > +
> > + return futex_hash_allocate(buckets);
> > }
> >
> > static int futex_hash_get_slots(void)
Sebastian
^ permalink raw reply [flat|nested] 52+ messages in thread* Re: [PATCH v8 13/15] futex: Resize local futex hash table based on number of threads.
2025-02-04 10:21 ` Peter Zijlstra
2025-02-05 8:05 ` Sebastian Andrzej Siewior
@ 2025-02-07 9:07 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-07 9:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-04 11:21:46 [+0100], Peter Zijlstra wrote:
> > + if (IS_ENABLED(CONFIG_BASE_SMALL)) {
> > + buckets = 2;
> > +
>
> Or... you just disable the local thing entirely for BASE_SMALL and have
> it fall back to the global hash.
If we don't assign a local hash on auto-resize for CONFIG_BASE_SMALL
builds then we also need to disable PR_FUTEX_HASH_SET_SLOTS. Not against
it at all, just pointing it out.
The reason is that there is at least one spot (exit_pi_state_list())
where I need a stable view of the hash. I ensure this by grabbing a
reference so that the pointer does not change.
If the local hash is NULL then we are single-threaded and it cannot be
assigned concurrently. If we are multi-threaded then the pointer is not
NULL and a PR_FUTEX_HASH_SET_SLOTS based assignment will be delayed.
But if we were multi-threaded with the local hash set to NULL, then it
could be assigned concurrently, at which point the whole logic breaks.
Sebastian
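The reference-pinning argument above can be modelled in miniature. This is a single-threaded toy sketch (toy_* names are invented, and the real code uses rcuref under RCU, not a plain counter): a holder pins the current hash, and replacing it must be delayed while any pin is outstanding.

```c
#include <assert.h>

/* Toy model of mm->futex_phash pinning; not the kernel types. */
struct toy_mm {
	int phash_users;	/* references held on the installed hash */
	int phash_id;		/* identity of the installed hash */
};

/* "Grab a reference": the hash seen here stays stable for the holder. */
static int toy_pin_hash(struct toy_mm *mm)
{
	mm->phash_users++;
	return mm->phash_id;
}

static void toy_unpin_hash(struct toy_mm *mm)
{
	mm->phash_users--;
}

/* Replacing the hash is delayed while somebody pins it. */
static int toy_try_replace_hash(struct toy_mm *mm, int new_id)
{
	if (mm->phash_users > 0)
		return 0;
	mm->phash_id = new_id;
	return 1;
}

/* Walk the scenario from the mail; returns the final hash id. */
static int toy_scenario(void)
{
	struct toy_mm mm = { 0, 1 };

	if (toy_pin_hash(&mm) != 1)
		return -1;
	if (toy_try_replace_hash(&mm, 2) != 0)	/* delayed while pinned */
		return -2;
	toy_unpin_hash(&mm);
	if (toy_try_replace_hash(&mm, 2) != 1)	/* applied once unpinned */
		return -3;
	return mm.phash_id;
}
```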
^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v8 14/15] futex: Use a hashmask instead of hashsize.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (12 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 13/15] futex: Resize local futex hash table based on number of threads Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-03 13:59 ` [PATCH v8 15/15] futex: Avoid allocating new local hash if there is something pending Sebastian Andrzej Siewior
2025-02-04 15:14 ` [PATCH v8 00/15] futex: Add support task local hash maps Peter Zijlstra
15 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The global hash uses futex_hashsize to store the number of hash
buckets that have been allocated during system boot. On each
futex_hash() invocation one is subtracted from this number to get the
mask. This can be optimized by storing the mask directly, avoiding the
subtraction on each futex_hash() invocation.
Rename futex_hashsize to futex_hashmask and store the mask of the
allocated hash map.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 9a12dccb1c995..9e32bfa7ba4ab 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -52,10 +52,10 @@
*/
static struct {
struct futex_hash_bucket *queues;
- unsigned long hashsize;
+ unsigned long hashmask;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues (__futex_data.queues)
-#define futex_hashsize (__futex_data.hashsize)
+#define futex_hashmask (__futex_data.hashmask)
struct futex_private_hash {
rcuref_t users;
@@ -276,7 +276,7 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
- return &futex_queues[hash & (futex_hashsize - 1)];
+ return &futex_queues[hash & futex_hashmask];
}
bool futex_put_private_hash(struct futex_private_hash *hb_p)
@@ -1411,8 +1411,8 @@ static int futex_hash_allocate(unsigned int hash_slots)
hash_slots = 16;
if (hash_slots < 2)
hash_slots = 2;
- if (hash_slots > futex_hashsize)
- hash_slots = futex_hashsize;
+ if (hash_slots > futex_hashmask + 1)
+ hash_slots = futex_hashmask + 1;
if (!is_power_of_2(hash_slots))
hash_slots = rounddown_pow_of_two(hash_slots);
@@ -1477,7 +1477,7 @@ int futex_hash_allocate_default(void)
*/
buckets = roundup_pow_of_two(4 * threads);
buckets = max(buckets, 16);
- buckets = min(buckets, futex_hashsize);
+ buckets = min(buckets, futex_hashmask + 1);
}
if (current_buckets >= buckets)
return 0;
@@ -1518,24 +1518,25 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
static int __init futex_init(void)
{
+ unsigned long i, hashsize;
unsigned int futex_shift;
- unsigned long i;
#ifdef CONFIG_BASE_SMALL
- futex_hashsize = 16;
+ hashsize = 16;
#else
- futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+ hashsize = roundup_pow_of_two(256 * num_possible_cpus());
#endif
futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
- futex_hashsize, 0, 0,
+ hashsize, 0, 0,
&futex_shift, NULL,
- futex_hashsize, futex_hashsize);
- futex_hashsize = 1UL << futex_shift;
+ hashsize, hashsize);
+ hashsize = 1UL << futex_shift;
- for (i = 0; i < futex_hashsize; i++)
+ for (i = 0; i < hashsize; i++)
futex_hash_bucket_init(&futex_queues[i], 0);
+ futex_hashmask = hashsize - 1;
return 0;
}
core_initcall(futex_init);
--
2.47.2
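The optimisation rests on the standard identity for power-of-two table sizes: hash % size == hash & (size - 1), so the mask can be computed once at boot instead of per lookup. A minimal standalone sketch (bucket_index() is an invented name, not the kernel function):

```c
#include <assert.h>

/*
 * For a power-of-two table, indexing with a precomputed mask is
 * equivalent to hash % size, without the per-call subtraction.
 */
static unsigned long bucket_index(unsigned long hash, unsigned long hashmask)
{
	return hash & hashmask;
}
```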
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v8 15/15] futex: Avoid allocating new local hash if there is something pending.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (13 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 14/15] futex: Use a hashmask instead of hashsize Sebastian Andrzej Siewior
@ 2025-02-03 13:59 ` Sebastian Andrzej Siewior
2025-02-04 15:14 ` [PATCH v8 00/15] futex: Add support task local hash maps Peter Zijlstra
15 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-03 13:59 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
If a new private hash has been prepared but not yet applied, then
creating a new thread leads to the allocation of yet another private
hash, which is freed again later.
This is an optional optimisation: avoid allocating a new private hash
if the already prepared one matches the current requirement.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 9e32bfa7ba4ab..7f61f734ca5d9 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1424,6 +1424,13 @@ static int futex_hash_allocate(unsigned int hash_slots)
&alloc_size)))
return -ENOMEM;
+ if (mm->futex_phash_new) {
+ scoped_guard(mutex, &mm->futex_hash_lock) {
+ if (mm->futex_phash_new && mm->futex_phash_new->hash_mask >= hash_slots - 1)
+ return 0;
+ }
+ }
+
hb_p = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
if (!hb_p)
return -ENOMEM;
--
2.47.2
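The check added by the hunk compares the pending hash's mask against the requested slot count. A standalone sketch of that predicate (pending_hash_suffices() is an invented name; hash_mask is slots - 1, as in the series):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the new check: a prepared-but-unapplied hash can be
 * reused when its mask already covers the requested slot count.
 */
static bool pending_hash_suffices(bool has_pending,
				  unsigned int pending_mask,
				  unsigned int hash_slots)
{
	return has_pending && pending_mask >= hash_slots - 1;
}
```

In the patch itself the test is re-done under mm->futex_hash_lock, since futex_phash_new can change between the unlocked peek and the locked check.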
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-03 13:59 [PATCH v8 00/15] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (14 preceding siblings ...)
2025-02-03 13:59 ` [PATCH v8 15/15] futex: Avoid allocating new local hash if there is something pending Sebastian Andrzej Siewior
@ 2025-02-04 15:14 ` Peter Zijlstra
2025-02-05 8:46 ` Sebastian Andrzej Siewior
2025-02-05 12:20 ` Sebastian Andrzej Siewior
15 siblings, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-04 15:14 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Feb 03, 2025 at 02:59:20PM +0100, Sebastian Andrzej Siewior wrote:
> futex: Decrease the waiter count before the unlock operation.
> futex: Prepare for reference counting of the process private hash end
> of operation.
> futex: Re-evaluate the hash bucket after dropping the lock
> futex: Introduce futex_get_locked_hb().
I must admit to not being a fan of these patches. I find them very hard
to review.
In part because while you go stick _put() on various things, there is no
corresponding _get(), things like futex_hash() become an implicit get
-- this took a while to figure out.
I tried tying the whole get/put to the hb variable itself, using the
cleanup stuff. The patch is pretty massive because fairly large chunks
of code got re-indented, but perhaps the final result is clearer, not
sure.
---
io_uring/futex.c | 5 +-
kernel/futex/core.c | 10 +-
kernel/futex/futex.h | 22 ++-
kernel/futex/pi.c | 280 +++++++++++++++----------------
kernel/futex/requeue.c | 427 ++++++++++++++++++++++++------------------------
kernel/futex/waitwake.c | 184 +++++++++++----------
6 files changed, 468 insertions(+), 460 deletions(-)
diff --git a/io_uring/futex.c b/io_uring/futex.c
index 3159a2b7eeca..18cd5ccde36d 100644
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -311,7 +311,6 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
struct io_ring_ctx *ctx = req->ctx;
struct io_futex_data *ifd = NULL;
- struct futex_hash_bucket *hb;
int ret;
if (!iof->futex_mask) {
@@ -332,13 +331,13 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
ifd->q.wake = io_futex_wake_fn;
ifd->req = req;
+ // XXX task->state is messed up
ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
- &ifd->q, &hb);
+ &ifd->q, NULL);
if (!ret) {
hlist_add_head(&req->hash_node, &ctx->futex_list);
io_ring_submit_unlock(ctx, issue_flags);
- futex_queue(&ifd->q, hb);
return IOU_ISSUE_SKIP_COMPLETE;
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index ebdd76b4ecbb..4b2fcaa60d5b 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -505,9 +505,7 @@ void __futex_unqueue(struct futex_q *q)
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
- struct futex_hash_bucket *hb;
-
- hb = futex_hash(&q->key);
+ CLASS(hb, hb)(&q->key);
/*
* Increment the counter before taking the lock so that
@@ -522,7 +520,7 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
- return hb;
+ return no_free_ptr(hb);
}
void futex_q_unlock(struct futex_hash_bucket *hb)
@@ -948,7 +946,6 @@ static void exit_pi_state_list(struct task_struct *curr)
{
struct list_head *next, *head = &curr->pi_state_list;
struct futex_pi_state *pi_state;
- struct futex_hash_bucket *hb;
union futex_key key = FUTEX_KEY_INIT;
/*
@@ -961,7 +958,8 @@ static void exit_pi_state_list(struct task_struct *curr)
next = head->next;
pi_state = list_entry(next, struct futex_pi_state, list);
key = pi_state->key;
- hb = futex_hash(&key);
+
+ CLASS(hb, hb)(&key);
/*
* We can race against put_pi_state() removing itself from the
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 99b32e728c4a..cd40f71987b3 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -7,6 +7,7 @@
#include <linux/sched/wake_q.h>
#include <linux/compat.h>
#include <linux/uaccess.h>
+#include <linux/cleanup.h>
#ifdef CONFIG_PREEMPT_RT
#include <linux/rcuwait.h>
@@ -119,6 +120,19 @@ struct futex_hash_bucket {
struct plist_head chain;
} ____cacheline_aligned_in_smp;
+struct futex_q;
+
+extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+extern struct futex_hash_bucket *futex_q_lock(struct futex_q *q);
+extern void futex_hash_put(struct futex_hash_bucket *hb);
+
+DEFINE_CLASS(hb, struct futex_hash_bucket *,
+ if (_T) futex_hash_put(_T),
+ futex_hash(key), union futex_key *key);
+
+EXTEND_CLASS(hb, _q_lock,
+ futex_q_lock(q), struct futex_q *q);
+
/*
* Priority Inheritance state:
*/
@@ -201,8 +215,6 @@ extern struct hrtimer_sleeper *
futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
-extern struct futex_hash_bucket *futex_hash(union futex_key *key);
-
/**
* futex_match - Check whether two futex keys are equal
* @key1: Pointer to key1
@@ -219,9 +231,8 @@ static inline int futex_match(union futex_key *key1, union futex_key *key2)
}
extern int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
- struct futex_q *q, struct futex_hash_bucket **hb);
-extern void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
- struct hrtimer_sleeper *timeout);
+ struct futex_q *q, union futex_key *key2);
+extern void futex_wait(struct futex_q *q, struct hrtimer_sleeper *timeout);
extern bool __futex_wake_mark(struct futex_q *q);
extern void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q);
@@ -349,7 +360,6 @@ static inline int futex_hb_waiters_pending(struct futex_hash_bucket *hb)
#endif
}
-extern struct futex_hash_bucket *futex_q_lock(struct futex_q *q);
extern void futex_q_unlock(struct futex_hash_bucket *hb);
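
The CLASS()/DEFINE_CLASS() machinery above comes from the kernel's cleanup.h, which is built on the compiler's cleanup variable attribute: the "put" runs automatically when the variable leaves scope, so every implicit get is visibly paired with a put. A minimal userspace sketch of the same pattern (toy names, GCC/Clang only; not the kernel macros):

```c
#include <assert.h>

static int put_count;	/* counts how often the scope-exit hook ran */

/* Scope-exit hook: called with a pointer to the dying variable. */
static void toy_put(int **p)
{
	if (*p)
		put_count++;
}

/* Toy CLASS(): bind acquisition and release to the variable's scope. */
#define TOY_CLASS(var, init) \
	int *var __attribute__((cleanup(toy_put))) = (init)

static int demo(void)
{
	static int bucket = 42;

	TOY_CLASS(hb, &bucket);	/* implicit "get": pointer acquired here */
	return *hb;		/* toy_put(&hb) runs as the scope exits */
}
```

no_free_ptr() in the patch above is the escape hatch for the cases, like futex_q_lock(), where the reference must outlive the scope.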
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index daea650b16f5..ccc3c5ed21c1 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -920,7 +920,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
struct hrtimer_sleeper timeout, *to;
struct task_struct *exiting = NULL;
struct rt_mutex_waiter rt_waiter;
- struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
DEFINE_WAKE_Q(wake_q);
int res, ret;
@@ -939,151 +938,166 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
goto out;
retry_private:
- hb = futex_q_lock(&q);
+ if (1) {
+ CLASS(hb_q_lock, hb)(&q);
- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
- &exiting, 0);
- if (unlikely(ret)) {
- /*
- * Atomic work succeeded and we got the lock,
- * or failed. Either way, we do _not_ block.
- */
- switch (ret) {
- case 1:
- /* We got the lock. */
- ret = 0;
- goto out_unlock_put_key;
- case -EFAULT:
- goto uaddr_faulted;
- case -EBUSY:
- case -EAGAIN:
- /*
- * Two reasons for this:
- * - EBUSY: Task is exiting and we just wait for the
- * exit to complete.
- * - EAGAIN: The user space value changed.
- */
- futex_q_unlock(hb);
+ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
+ &exiting, 0);
+ if (unlikely(ret)) {
/*
- * Handle the case where the owner is in the middle of
- * exiting. Wait for the exit to complete otherwise
- * this task might loop forever, aka. live lock.
+ * Atomic work succeeded and we got the lock,
+ * or failed. Either way, we do _not_ block.
*/
- wait_for_owner_exiting(ret, exiting);
- cond_resched();
- goto retry;
- default:
- goto out_unlock_put_key;
+ switch (ret) {
+ case 1:
+ /* We got the lock. */
+ ret = 0;
+ goto out_unlock_put_key;
+ case -EFAULT:
+ goto uaddr_faulted;
+ case -EBUSY:
+ case -EAGAIN:
+ /*
+ * Two reasons for this:
+ * - EBUSY: Task is exiting and we just wait for the
+ * exit to complete.
+ * - EAGAIN: The user space value changed.
+ */
+ futex_q_unlock(hb);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock_put_key;
+ }
}
- }
- WARN_ON(!q.pi_state);
+ WARN_ON(!q.pi_state);
- /*
- * Only actually queue now that the atomic ops are done:
- */
- __futex_queue(&q, hb);
+ /*
+ * Only actually queue now that the atomic ops are done:
+ */
+ __futex_queue(&q, hb);
- if (trylock) {
- ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
- /* Fixup the trylock return value: */
- ret = ret ? 0 : -EWOULDBLOCK;
- goto no_block;
- }
+ if (trylock) {
+ ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
+ /* Fixup the trylock return value: */
+ ret = ret ? 0 : -EWOULDBLOCK;
+ goto no_block;
+ }
- /*
- * Must be done before we enqueue the waiter, here is unfortunately
- * under the hb lock, but that *should* work because it does nothing.
- */
- rt_mutex_pre_schedule();
+ /*
+ * Must be done before we enqueue the waiter, here is unfortunately
+ * under the hb lock, but that *should* work because it does nothing.
+ */
+ rt_mutex_pre_schedule();
- rt_mutex_init_waiter(&rt_waiter);
+ rt_mutex_init_waiter(&rt_waiter);
- /*
- * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
- * hold it while doing rt_mutex_start_proxy(), because then it will
- * include hb->lock in the blocking chain, even though we'll not in
- * fact hold it while blocking. This will lead it to report -EDEADLK
- * and BUG when futex_unlock_pi() interleaves with this.
- *
- * Therefore acquire wait_lock while holding hb->lock, but drop the
- * latter before calling __rt_mutex_start_proxy_lock(). This
- * interleaves with futex_unlock_pi() -- which does a similar lock
- * handoff -- such that the latter can observe the futex_q::pi_state
- * before __rt_mutex_start_proxy_lock() is done.
- */
- raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
- spin_unlock(q.lock_ptr);
- /*
- * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
- * such that futex_unlock_pi() is guaranteed to observe the waiter when
- * it sees the futex_q::pi_state.
- */
- ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
- raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q);
+ /*
+ * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
+ * hold it while doing rt_mutex_start_proxy(), because then it will
+ * include hb->lock in the blocking chain, even though we'll not in
+ * fact hold it while blocking. This will lead it to report -EDEADLK
+ * and BUG when futex_unlock_pi() interleaves with this.
+ *
+ * Therefore acquire wait_lock while holding hb->lock, but drop the
+ * latter before calling __rt_mutex_start_proxy_lock(). This
+ * interleaves with futex_unlock_pi() -- which does a similar lock
+ * handoff -- such that the latter can observe the futex_q::pi_state
+ * before __rt_mutex_start_proxy_lock() is done.
+ */
+ raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
+ spin_unlock(q.lock_ptr);
+ /*
+ * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+ * such that futex_unlock_pi() is guaranteed to observe the waiter when
+ * it sees the futex_q::pi_state.
+ */
+ ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current, &wake_q);
+ raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q);
- if (ret) {
- if (ret == 1)
- ret = 0;
- goto cleanup;
- }
+ if (ret) {
+ if (ret == 1)
+ ret = 0;
+ goto cleanup;
+ }
- if (unlikely(to))
- hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
+ if (unlikely(to))
+ hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS);
- ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+ ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
cleanup:
- /*
- * If we failed to acquire the lock (deadlock/signal/timeout), we
- * must unwind the above; however we cannot lock hb->lock because
- * rt_mutex already has a waiter enqueued and hb->lock can itself try
- * and enqueue an rt_waiter through rtlock.
- *
- * Doing the cleanup without holding hb->lock can cause inconsistent
- * state between hb and pi_state, but only in the direction of not
- * seeing a waiter that is leaving.
- *
- * See futex_unlock_pi(), it deals with this inconsistency.
- *
- * There be dragons here, since we must deal with the inconsistency on
- * the way out (here), it is impossible to detect/warn about the race
- * the other way around (missing an incoming waiter).
- *
- * What could possibly go wrong...
- */
- if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
- ret = 0;
+ /*
+ * If we failed to acquire the lock (deadlock/signal/timeout), we
+ * must unwind the above; however we cannot lock hb->lock because
+ * rt_mutex already has a waiter enqueued and hb->lock can itself try
+ * and enqueue an rt_waiter through rtlock.
+ *
+ * Doing the cleanup without holding hb->lock can cause inconsistent
+ * state between hb and pi_state, but only in the direction of not
+ * seeing a waiter that is leaving.
+ *
+ * See futex_unlock_pi(), it deals with this inconsistency.
+ *
+ * There be dragons here, since we must deal with the inconsistency on
+ * the way out (here), it is impossible to detect/warn about the race
+ * the other way around (missing an incoming waiter).
+ *
+ * What could possibly go wrong...
+ */
+ if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
+ ret = 0;
- /*
- * Now that the rt_waiter has been dequeued, it is safe to use
- * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
- * the
- */
- spin_lock(q.lock_ptr);
- /*
- * Waiter is unqueued.
- */
- rt_mutex_post_schedule();
+ /*
+ * Now that the rt_waiter has been dequeued, it is safe to use
+ * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
+ * the
+ */
+ spin_lock(q.lock_ptr);
+ /*
+ * Waiter is unqueued.
+ */
+ rt_mutex_post_schedule();
no_block:
- /*
- * Fixup the pi_state owner and possibly acquire the lock if we
- * haven't already.
- */
- res = fixup_pi_owner(uaddr, &q, !ret);
- /*
- * If fixup_pi_owner() returned an error, propagate that. If it acquired
- * the lock, clear our -ETIMEDOUT or -EINTR.
- */
- if (res)
- ret = (res < 0) ? res : 0;
+ /*
+ * Fixup the pi_state owner and possibly acquire the lock if we
+ * haven't already.
+ */
+ res = fixup_pi_owner(uaddr, &q, !ret);
+ /*
+ * If fixup_pi_owner() returned an error, propagate that. If it acquired
+ * the lock, clear our -ETIMEDOUT or -EINTR.
+ */
+ if (res)
+ ret = (res < 0) ? res : 0;
- futex_unqueue_pi(&q);
- spin_unlock(q.lock_ptr);
- goto out;
+ futex_unqueue_pi(&q);
+ spin_unlock(q.lock_ptr);
+ goto out;
out_unlock_put_key:
- futex_q_unlock(hb);
+ futex_q_unlock(hb);
+ goto out;
+
+uaddr_faulted:
+ futex_q_unlock(hb);
+
+ ret = fault_in_user_writeable(uaddr);
+ if (ret)
+ goto out;
+
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
+
+ goto retry;
+ }
out:
if (to) {
@@ -1092,17 +1106,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
}
return ret != -EINTR ? ret : -ERESTARTNOINTR;
-uaddr_faulted:
- futex_q_unlock(hb);
-
- ret = fault_in_user_writeable(uaddr);
- if (ret)
- goto out;
-
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
-
- goto retry;
}
/*
@@ -1114,7 +1117,6 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
u32 curval, uval, vpid = task_pid_vnr(current);
union futex_key key = FUTEX_KEY_INIT;
- struct futex_hash_bucket *hb;
struct futex_q *top_waiter;
int ret;
@@ -1134,7 +1136,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
if (ret)
return ret;
- hb = futex_hash(&key);
+ CLASS(hb, hb)(&key);
spin_lock(&hb->lock);
retry_hb:
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b47bb764b352..01ba7d09036a 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -371,7 +371,6 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
int task_count = 0, ret;
struct futex_pi_state *pi_state = NULL;
- struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
DEFINE_WAKE_Q(wake_q);
@@ -443,240 +442,242 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
if (requeue_pi && futex_match(&key1, &key2))
return -EINVAL;
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
-
retry_private:
- futex_hb_waiters_inc(hb2);
- double_lock_hb(hb1, hb2);
+ if (1) {
+ CLASS(hb, hb1)(&key1);
+ CLASS(hb, hb2)(&key2);
- if (likely(cmpval != NULL)) {
- u32 curval;
+ futex_hb_waiters_inc(hb2);
+ double_lock_hb(hb1, hb2);
- ret = futex_get_value_locked(&curval, uaddr1);
+ if (likely(cmpval != NULL)) {
+ u32 curval;
- if (unlikely(ret)) {
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
+ ret = futex_get_value_locked(&curval, uaddr1);
- ret = get_user(curval, uaddr1);
- if (ret)
- return ret;
+ if (unlikely(ret)) {
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
- if (!(flags1 & FLAGS_SHARED))
- goto retry_private;
+ ret = get_user(curval, uaddr1);
+ if (ret)
+ return ret;
- goto retry;
- }
- if (curval != *cmpval) {
- ret = -EAGAIN;
- goto out_unlock;
- }
- }
+ if (!(flags1 & FLAGS_SHARED))
+ goto retry_private;
- if (requeue_pi) {
- struct task_struct *exiting = NULL;
+ goto retry;
+ }
+ if (curval != *cmpval) {
+ ret = -EAGAIN;
+ goto out_unlock;
+ }
+ }
- /*
- * Attempt to acquire uaddr2 and wake the top waiter. If we
- * intend to requeue waiters, force setting the FUTEX_WAITERS
- * bit. We force this here where we are able to easily handle
- * faults rather than in the requeue loop below.
- *
- * Updates topwaiter::requeue_state if a top waiter exists.
- */
- ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
- &key2, &pi_state,
- &exiting, nr_requeue);
+ if (requeue_pi) {
+ struct task_struct *exiting = NULL;
- /*
- * At this point the top_waiter has either taken uaddr2 or
- * is waiting on it. In both cases pi_state has been
- * established and an initial refcount on it. In case of an
- * error there's nothing.
- *
- * The top waiter's requeue_state is up to date:
- *
- * - If the lock was acquired atomically (ret == 1), then
- * the state is Q_REQUEUE_PI_LOCKED.
- *
- * The top waiter has been dequeued and woken up and can
- * return to user space immediately. The kernel/user
- * space state is consistent. In case that there must be
- * more waiters requeued the WAITERS bit in the user
- * space futex is set so the top waiter task has to go
- * into the syscall slowpath to unlock the futex. This
- * will block until this requeue operation has been
- * completed and the hash bucket locks have been
- * dropped.
- *
- * - If the trylock failed with an error (ret < 0) then
- * the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
- * happened", or Q_REQUEUE_PI_IGNORE when there was an
- * interleaved early wakeup.
- *
- * - If the trylock did not succeed (ret == 0) then the
- * state is either Q_REQUEUE_PI_IN_PROGRESS or
- * Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
- * This will be cleaned up in the loop below, which
- * cannot fail because futex_proxy_trylock_atomic() did
- * the same sanity checks for requeue_pi as the loop
- * below does.
- */
- switch (ret) {
- case 0:
- /* We hold a reference on the pi state. */
- break;
-
- case 1:
/*
- * futex_proxy_trylock_atomic() acquired the user space
- * futex. Adjust task_count.
+ * Attempt to acquire uaddr2 and wake the top waiter. If we
+ * intend to requeue waiters, force setting the FUTEX_WAITERS
+ * bit. We force this here where we are able to easily handle
+ * faults rather than in the requeue loop below.
+ *
+ * Updates topwaiter::requeue_state if a top waiter exists.
*/
- task_count++;
- ret = 0;
- break;
+ ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
+ &key2, &pi_state,
+ &exiting, nr_requeue);
- /*
- * If the above failed, then pi_state is NULL and
- * waiter::requeue_state is correct.
- */
- case -EFAULT:
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
- ret = fault_in_user_writeable(uaddr2);
- if (!ret)
- goto retry;
- return ret;
- case -EBUSY:
- case -EAGAIN:
/*
- * Two reasons for this:
- * - EBUSY: Owner is exiting and we just wait for the
- * exit to complete.
- * - EAGAIN: The user space value changed.
+ * At this point the top_waiter has either taken uaddr2 or
+ * is waiting on it. In both cases pi_state has been
+ * established and an initial refcount on it. In case of an
+ * error there's nothing.
+ *
+ * The top waiter's requeue_state is up to date:
+ *
+ * - If the lock was acquired atomically (ret == 1), then
+ * the state is Q_REQUEUE_PI_LOCKED.
+ *
+ * The top waiter has been dequeued and woken up and can
+ * return to user space immediately. The kernel/user
+ * space state is consistent. In case that there must be
+ * more waiters requeued the WAITERS bit in the user
+ * space futex is set so the top waiter task has to go
+ * into the syscall slowpath to unlock the futex. This
+ * will block until this requeue operation has been
+ * completed and the hash bucket locks have been
+ * dropped.
+ *
+ * - If the trylock failed with an error (ret < 0) then
+ * the state is either Q_REQUEUE_PI_NONE, i.e. "nothing
+ * happened", or Q_REQUEUE_PI_IGNORE when there was an
+ * interleaved early wakeup.
+ *
+ * - If the trylock did not succeed (ret == 0) then the
+ * state is either Q_REQUEUE_PI_IN_PROGRESS or
+ * Q_REQUEUE_PI_WAIT if an early wakeup interleaved.
+ * This will be cleaned up in the loop below, which
+ * cannot fail because futex_proxy_trylock_atomic() did
+ * the same sanity checks for requeue_pi as the loop
+ * below does.
*/
- double_unlock_hb(hb1, hb2);
- futex_hb_waiters_dec(hb2);
- /*
- * Handle the case where the owner is in the middle of
- * exiting. Wait for the exit to complete otherwise
- * this task might loop forever, aka. live lock.
- */
- wait_for_owner_exiting(ret, exiting);
- cond_resched();
- goto retry;
- default:
- goto out_unlock;
- }
- }
-
- plist_for_each_entry_safe(this, next, &hb1->chain, list) {
- if (task_count - nr_wake >= nr_requeue)
- break;
-
- if (!futex_match(&this->key, &key1))
- continue;
-
- /*
- * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
- * be paired with each other and no other futex ops.
- *
- * We should never be requeueing a futex_q with a pi_state,
- * which is awaiting a futex_unlock_pi().
- */
- if ((requeue_pi && !this->rt_waiter) ||
- (!requeue_pi && this->rt_waiter) ||
- this->pi_state) {
- ret = -EINVAL;
- break;
- }
-
- /* Plain futexes just wake or requeue and are done */
- if (!requeue_pi) {
- if (++task_count <= nr_wake)
- this->wake(&wake_q, this);
- else
- requeue_futex(this, hb1, hb2, &key2);
- continue;
+ switch (ret) {
+ case 0:
+ /* We hold a reference on the pi state. */
+ break;
+
+ case 1:
+ /*
+ * futex_proxy_trylock_atomic() acquired the user space
+ * futex. Adjust task_count.
+ */
+ task_count++;
+ ret = 0;
+ break;
+
+ /*
+ * If the above failed, then pi_state is NULL and
+ * waiter::requeue_state is correct.
+ */
+ case -EFAULT:
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ ret = fault_in_user_writeable(uaddr2);
+ if (!ret)
+ goto retry;
+ return ret;
+ case -EBUSY:
+ case -EAGAIN:
+ /*
+ * Two reasons for this:
+ * - EBUSY: Owner is exiting and we just wait for the
+ * exit to complete.
+ * - EAGAIN: The user space value changed.
+ */
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ /*
+ * Handle the case where the owner is in the middle of
+ * exiting. Wait for the exit to complete otherwise
+ * this task might loop forever, aka. live lock.
+ */
+ wait_for_owner_exiting(ret, exiting);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock;
+ }
}
- /* Ensure we requeue to the expected futex for requeue_pi. */
- if (!futex_match(this->requeue_pi_key, &key2)) {
- ret = -EINVAL;
- break;
- }
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
+ if (task_count - nr_wake >= nr_requeue)
+ break;
- /*
- * Requeue nr_requeue waiters and possibly one more in the case
- * of requeue_pi if we couldn't acquire the lock atomically.
- *
- * Prepare the waiter to take the rt_mutex. Take a refcount
- * on the pi_state and store the pointer in the futex_q
- * object of the waiter.
- */
- get_pi_state(pi_state);
+ if (!futex_match(&this->key, &key1))
+ continue;
- /* Don't requeue when the waiter is already on the way out. */
- if (!futex_requeue_pi_prepare(this, pi_state)) {
/*
- * Early woken waiter signaled that it is on the
- * way out. Drop the pi_state reference and try the
- * next waiter. @this->pi_state is still NULL.
+ * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always
+ * be paired with each other and no other futex ops.
+ *
+ * We should never be requeueing a futex_q with a pi_state,
+ * which is awaiting a futex_unlock_pi().
*/
- put_pi_state(pi_state);
- continue;
- }
+ if ((requeue_pi && !this->rt_waiter) ||
+ (!requeue_pi && this->rt_waiter) ||
+ this->pi_state) {
+ ret = -EINVAL;
+ break;
+ }
+
+ /* Plain futexes just wake or requeue and are done */
+ if (!requeue_pi) {
+ if (++task_count <= nr_wake)
+ this->wake(&wake_q, this);
+ else
+ requeue_futex(this, hb1, hb2, &key2);
+ continue;
+ }
+
+ /* Ensure we requeue to the expected futex for requeue_pi. */
+ if (!futex_match(this->requeue_pi_key, &key2)) {
+ ret = -EINVAL;
+ break;
+ }
- ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
- this->rt_waiter,
- this->task);
-
- if (ret == 1) {
- /*
- * We got the lock. We do neither drop the refcount
- * on pi_state nor clear this->pi_state because the
- * waiter needs the pi_state for cleaning up the
- * user space value. It will drop the refcount
- * after doing so. this::requeue_state is updated
- * in the wakeup as well.
- */
- requeue_pi_wake_futex(this, &key2, hb2);
- task_count++;
- } else if (!ret) {
- /* Waiter is queued, move it to hb2 */
- requeue_futex(this, hb1, hb2, &key2);
- futex_requeue_pi_complete(this, 0);
- task_count++;
- } else {
/*
- * rt_mutex_start_proxy_lock() detected a potential
- * deadlock when we tried to queue that waiter.
- * Drop the pi_state reference which we took above
- * and remove the pointer to the state from the
- * waiters futex_q object.
+ * Requeue nr_requeue waiters and possibly one more in the case
+ * of requeue_pi if we couldn't acquire the lock atomically.
+ *
+ * Prepare the waiter to take the rt_mutex. Take a refcount
+ * on the pi_state and store the pointer in the futex_q
+ * object of the waiter.
*/
- this->pi_state = NULL;
- put_pi_state(pi_state);
- futex_requeue_pi_complete(this, ret);
- /*
- * We stop queueing more waiters and let user space
- * deal with the mess.
- */
- break;
+ get_pi_state(pi_state);
+
+ /* Don't requeue when the waiter is already on the way out. */
+ if (!futex_requeue_pi_prepare(this, pi_state)) {
+ /*
+ * Early woken waiter signaled that it is on the
+ * way out. Drop the pi_state reference and try the
+ * next waiter. @this->pi_state is still NULL.
+ */
+ put_pi_state(pi_state);
+ continue;
+ }
+
+ ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
+ this->rt_waiter,
+ this->task);
+
+ if (ret == 1) {
+ /*
+ * We got the lock. We do neither drop the refcount
+ * on pi_state nor clear this->pi_state because the
+ * waiter needs the pi_state for cleaning up the
+ * user space value. It will drop the refcount
+ * after doing so. this::requeue_state is updated
+ * in the wakeup as well.
+ */
+ requeue_pi_wake_futex(this, &key2, hb2);
+ task_count++;
+ } else if (!ret) {
+ /* Waiter is queued, move it to hb2 */
+ requeue_futex(this, hb1, hb2, &key2);
+ futex_requeue_pi_complete(this, 0);
+ task_count++;
+ } else {
+ /*
+ * rt_mutex_start_proxy_lock() detected a potential
+ * deadlock when we tried to queue that waiter.
+ * Drop the pi_state reference which we took above
+ * and remove the pointer to the state from the
+ * waiters futex_q object.
+ */
+ this->pi_state = NULL;
+ put_pi_state(pi_state);
+ futex_requeue_pi_complete(this, ret);
+ /*
+ * We stop queueing more waiters and let user space
+ * deal with the mess.
+ */
+ break;
+ }
}
- }
- /*
- * We took an extra initial reference to the pi_state in
- * futex_proxy_trylock_atomic(). We need to drop it here again.
- */
- put_pi_state(pi_state);
+ /*
+ * We took an extra initial reference to the pi_state in
+ * futex_proxy_trylock_atomic(). We need to drop it here again.
+ */
+ put_pi_state(pi_state);
out_unlock:
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb(hb1, hb2);
+ futex_hb_waiters_dec(hb2);
+ }
wake_up_q(&wake_q);
- futex_hb_waiters_dec(hb2);
return ret ? ret : task_count;
}
@@ -805,22 +806,12 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
- ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+ ret = futex_wait_setup(uaddr, val, flags, &q, &key2);
if (ret)
goto out;
- /*
- * The check above which compares uaddrs is not sufficient for
- * shared futexes. We need to compare the keys:
- */
- if (futex_match(&q.key, &key2)) {
- futex_q_unlock(hb);
- ret = -EINVAL;
- goto out;
- }
-
/* Queue the futex_q, drop the hb lock, wait for wakeup. */
- futex_wait_queue(hb, &q, to);
+ futex_wait(&q, to);
switch (futex_requeue_pi_wakeup_sync(&q)) {
case Q_REQUEUE_PI_IGNORE:
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index eb86a7ade06a..daf9ab07a77f 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -154,7 +154,6 @@ void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q)
*/
int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
- struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
DEFINE_WAKE_Q(wake_q);
@@ -170,7 +169,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if ((flags & FLAGS_STRICT) && !nr_wake)
return 0;
- hb = futex_hash(&key);
+ CLASS(hb, hb)(&key);
/* Make sure we really have tasks to wakeup */
if (!futex_hb_waiters_pending(hb))
@@ -253,7 +252,6 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
int nr_wake, int nr_wake2, int op)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
- struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
int ret, op_ret;
DEFINE_WAKE_Q(wake_q);
@@ -266,67 +264,69 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
if (unlikely(ret != 0))
return ret;
- hb1 = futex_hash(&key1);
- hb2 = futex_hash(&key2);
-
retry_private:
- double_lock_hb(hb1, hb2);
- op_ret = futex_atomic_op_inuser(op, uaddr2);
- if (unlikely(op_ret < 0)) {
- double_unlock_hb(hb1, hb2);
-
- if (!IS_ENABLED(CONFIG_MMU) ||
- unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
- /*
- * we don't get EFAULT from MMU faults if we don't have
- * an MMU, but we might get them from range checking
- */
- ret = op_ret;
- return ret;
- }
-
- if (op_ret == -EFAULT) {
- ret = fault_in_user_writeable(uaddr2);
- if (ret)
+ if (1) {
+ CLASS(hb, hb1)(&key1);
+ CLASS(hb, hb2)(&key2);
+
+ double_lock_hb(hb1, hb2);
+ op_ret = futex_atomic_op_inuser(op, uaddr2);
+ if (unlikely(op_ret < 0)) {
+ double_unlock_hb(hb1, hb2);
+
+ if (!IS_ENABLED(CONFIG_MMU) ||
+ unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
+ /*
+ * we don't get EFAULT from MMU faults if we don't have
+ * an MMU, but we might get them from range checking
+ */
+ ret = op_ret;
return ret;
- }
-
- cond_resched();
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
- goto retry;
- }
+ }
- plist_for_each_entry_safe(this, next, &hb1->chain, list) {
- if (futex_match (&this->key, &key1)) {
- if (this->pi_state || this->rt_waiter) {
- ret = -EINVAL;
- goto out_unlock;
+ if (op_ret == -EFAULT) {
+ ret = fault_in_user_writeable(uaddr2);
+ if (ret)
+ return ret;
}
- this->wake(&wake_q, this);
- if (++ret >= nr_wake)
- break;
+
+ cond_resched();
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
+ goto retry;
}
- }
- if (op_ret > 0) {
- op_ret = 0;
- plist_for_each_entry_safe(this, next, &hb2->chain, list) {
- if (futex_match (&this->key, &key2)) {
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
+ if (futex_match (&this->key, &key1)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
goto out_unlock;
}
this->wake(&wake_q, this);
- if (++op_ret >= nr_wake2)
+ if (++ret >= nr_wake)
break;
}
}
- ret += op_ret;
- }
+
+ if (op_ret > 0) {
+ op_ret = 0;
+ plist_for_each_entry_safe(this, next, &hb2->chain, list) {
+ if (futex_match (&this->key, &key2)) {
+ if (this->pi_state || this->rt_waiter) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+ this->wake(&wake_q, this);
+ if (++op_ret >= nr_wake2)
+ break;
+ }
+ }
+ ret += op_ret;
+ }
out_unlock:
- double_unlock_hb(hb1, hb2);
+ double_unlock_hb(hb1, hb2);
+ }
wake_up_q(&wake_q);
return ret;
}
@@ -339,17 +339,8 @@ static long futex_wait_restart(struct restart_block *restart);
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
-void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
- struct hrtimer_sleeper *timeout)
+void futex_wait(struct futex_q *q, struct hrtimer_sleeper *timeout)
{
- /*
- * The task state is guaranteed to be set before another task can
- * wake it. set_current_state() is implemented using smp_store_mb() and
- * futex_queue() calls spin_unlock() upon completion, both serializing
- * access to the hash list and forcing another memory barrier.
- */
- set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
- futex_queue(q, hb);
/* Arm the timer */
if (timeout)
@@ -451,20 +442,22 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
struct futex_q *q = &vs[i].q;
u32 val = vs[i].w.val;
- hb = futex_q_lock(q);
- ret = futex_get_value_locked(&uval, uaddr);
+ if (1) {
+ CLASS(hb_q_lock, hb)(q);
+ ret = futex_get_value_locked(&uval, uaddr);
+
+ if (!ret && uval == val) {
+ /*
+ * The bucket lock can't be held while dealing with the
+ * next futex. Queue each futex at this moment so hb can
+ * be unlocked.
+ */
+ futex_queue(q, hb);
+ continue;
+ }
- if (!ret && uval == val) {
- /*
- * The bucket lock can't be held while dealing with the
- * next futex. Queue each futex at this moment so hb can
- * be unlocked.
- */
- futex_queue(q, hb);
- continue;
+ futex_q_unlock(hb);
}
-
- futex_q_unlock(hb);
__set_current_state(TASK_RUNNING);
/*
@@ -589,7 +582,7 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
* - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
- struct futex_q *q, struct futex_hash_bucket **hb)
+ struct futex_q *q, union futex_key *key2)
{
u32 uval;
int ret;
@@ -618,26 +611,42 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
return ret;
retry_private:
- *hb = futex_q_lock(q);
+ if (1) {
+ CLASS(hb_q_lock, hb)(q);
- ret = futex_get_value_locked(&uval, uaddr);
+ ret = futex_get_value_locked(&uval, uaddr);
- if (ret) {
- futex_q_unlock(*hb);
+ if (ret) {
+ futex_q_unlock(hb);
- ret = get_user(uval, uaddr);
- if (ret)
- return ret;
+ ret = get_user(uval, uaddr);
+ if (ret)
+ return ret;
- if (!(flags & FLAGS_SHARED))
- goto retry_private;
+ if (!(flags & FLAGS_SHARED))
+ goto retry_private;
- goto retry;
- }
+ goto retry;
+ }
+
+ if (uval != val) {
+ futex_q_unlock(hb);
+ return -EWOULDBLOCK;
+ }
+
+ if (key2 && !futex_match(&q->key, key2)) {
+ futex_q_unlock(hb);
+ return -EINVAL;
+ }
- if (uval != val) {
- futex_q_unlock(*hb);
- ret = -EWOULDBLOCK;
+ /*
+ * The task state is guaranteed to be set before another task can
+ * wake it. set_current_state() is implemented using smp_store_mb() and
+ * futex_queue() calls spin_unlock() upon completion, both serializing
+ * access to the hash list and forcing another memory barrier.
+ */
+ set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+ futex_queue(q, hb);
}
return ret;
@@ -647,7 +656,6 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
- struct futex_hash_bucket *hb;
int ret;
if (!bitset)
@@ -660,12 +668,12 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
- ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+ ret = futex_wait_setup(uaddr, val, flags, &q, NULL);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
- futex_wait_queue(hb, &q, to);
+ futex_wait(&q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
^ permalink raw reply related [flat|nested] 52+ messages in thread* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-04 15:14 ` [PATCH v8 00/15] futex: Add support task local hash maps Peter Zijlstra
@ 2025-02-05 8:46 ` Sebastian Andrzej Siewior
2025-02-05 12:20 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-05 8:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-04 16:14:05 [+0100], Peter Zijlstra wrote:
> On Mon, Feb 03, 2025 at 02:59:20PM +0100, Sebastian Andrzej Siewior wrote:
> > futex: Decrease the waiter count before the unlock operation.
> > futex: Prepare for reference counting of the process private hash end
> > of operation.
> > futex: Re-evaluate the hash bucket after dropping the lock
> > futex: Introduce futex_get_locked_hb().
>
> I must admit to not being a fan of these patches. I find them very hard
> to review.
>
> In part because while you go stick _put() on various things, there is no
> corresponding _get(), things like futex_hash() become an implicit get
> -- this took a while to figure out.
There is one get in futex_hash() and multiple puts depending on the
context, usually after dropping the hash bucket lock.
> I tried tying the whole get/put to the hb variable itself, using the
> cleanup stuff. The patch is pretty massive because fairly large chunks
> of code got re-indented, but perhaps the final result is clearer, not
> sure.
… Let me find some time and dig through this.
Sebastian
* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-04 15:14 ` [PATCH v8 00/15] futex: Add support task local hash maps Peter Zijlstra
2025-02-05 8:46 ` Sebastian Andrzej Siewior
@ 2025-02-05 12:20 ` Sebastian Andrzej Siewior
2025-02-05 12:52 ` Peter Zijlstra
2025-02-20 15:12 ` Peter Zijlstra
1 sibling, 2 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-05 12:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-04 16:14:05 [+0100], Peter Zijlstra wrote:
This does not compile. Let me fix this up, a few comments…
> diff --git a/io_uring/futex.c b/io_uring/futex.c
> index 3159a2b7eeca..18cd5ccde36d 100644
> --- a/io_uring/futex.c
> +++ b/io_uring/futex.c
> @@ -332,13 +331,13 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
> ifd->q.wake = io_futex_wake_fn;
> ifd->req = req;
>
> + // XXX task->state is messed up
> ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
> - &ifd->q, &hb);
> + &ifd->q, NULL);
> if (!ret) {
> hlist_add_head(&req->hash_node, &ctx->futex_list);
> io_ring_submit_unlock(ctx, issue_flags);
>
> - futex_queue(&ifd->q, hb);
> return IOU_ISSUE_SKIP_COMPLETE;
This looks interesting. This is called from
req->io_task_work.func = io_req_task_submit
| io_req_task_submit()
| -> io_issue_sqe()
| -> def->issue() <- io_futex_wait
and
io_fallback_req_func() iterates over a list and invokes
req->io_task_work.func. This seems to be also invoked from
io_sq_thread() (via io_sq_tw() -> io_handle_tw_list()).
If this (wait and wake) is only used within kernel threads then it is
fine. If the waker and/ or waiter are in user context then we are in
trouble because one will use the private hash of the process and the
other won't because it is a kernel thread. So the messed-up task->state
is the least of the problems.
> }
…
> --- a/kernel/futex/waitwake.c
> +++ b/kernel/futex/waitwake.c
> @@ -266,67 +264,69 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
> if (unlikely(ret != 0))
> return ret;
>
> - hb1 = futex_hash(&key1);
> - hb2 = futex_hash(&key2);
> -
> retry_private:
> - double_lock_hb(hb1, hb2);
> - op_ret = futex_atomic_op_inuser(op, uaddr2);
> - if (unlikely(op_ret < 0)) {
> - double_unlock_hb(hb1, hb2);
> -
> - if (!IS_ENABLED(CONFIG_MMU) ||
> - unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
> - /*
> - * we don't get EFAULT from MMU faults if we don't have
> - * an MMU, but we might get them from range checking
> - */
> - ret = op_ret;
> - return ret;
> - }
> -
> - if (op_ret == -EFAULT) {
> - ret = fault_in_user_writeable(uaddr2);
> - if (ret)
> + if (1) {
> + CLASS(hb, hb1)(&key1);
> + CLASS(hb, hb2)(&key2);
I don't know if hiding these things makes it better because this will do
futex_hash_put() when it goes out of scope. This means we still hold the
reference while in fault_in_user_writeable() and cond_resched(). Is this
on purpose?
I guess it does not matter much. The resize will be delayed until the
task gets back and releases the reference. This will make progress. So
it is okay.
> + double_lock_hb(hb1, hb2);
> + op_ret = futex_atomic_op_inuser(op, uaddr2);
> + if (unlikely(op_ret < 0)) {
> + double_unlock_hb(hb1, hb2);
> +
> + if (!IS_ENABLED(CONFIG_MMU) ||
> + unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
> + /*
> + * we don't get EFAULT from MMU faults if we don't have
> + * an MMU, but we might get them from range checking
> + */
> + ret = op_ret;
> return ret;
…
> @@ -451,20 +442,22 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
> struct futex_q *q = &vs[i].q;
> u32 val = vs[i].w.val;
>
> - hb = futex_q_lock(q);
> - ret = futex_get_value_locked(&uval, uaddr);
> + if (1) {
> + CLASS(hb_q_lock, hb)(q);
> + ret = futex_get_value_locked(&uval, uaddr);
This confused me at the beginning because I expected hb_q_lock to have
the lock part in the constructor and also the matching unlock in the
destructor. But no, this is not the case.
> +
> @@ -618,26 +611,42 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
…
>
> + if (uval != val) {
> + futex_q_unlock(hb);
> + return -EWOULDBLOCK;
> + }
> +
> + if (key2 && !futex_match(&q->key, key2)) {
There should be no !
> + futex_q_unlock(hb);
> + return -EINVAL;
> + }
>
> - if (uval != val) {
> - futex_q_unlock(*hb);
> - ret = -EWOULDBLOCK;
> + /*
> + * The task state is guaranteed to be set before another task can
> + * wake it. set_current_state() is implemented using smp_store_mb() and
> + * futex_queue() calls spin_unlock() upon completion, both serializing
> + * access to the hash list and forcing another memory barrier.
> + */
> + set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> + futex_queue(q, hb);
> }
>
> return ret;
So the beauty of it is that you enforce a ref drop on hb once it gets
out of scope. So you can't use it by chance once the ref is dropped.
But this does not help in futex_lock_pi() where you have to drop the
reference before __rt_mutex_start_proxy_lock() (or at least before
rt_mutex_wait_proxy_lock()) but still have it if you go for the no_block
shortcut. At which point the lock is even still held.
While it makes the other cases nicer, the futex_lock_pi() function was
the only one where I was thinking about setting hb to NULL to avoid
accidental usage later on.
Sebastian
* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-05 12:20 ` Sebastian Andrzej Siewior
@ 2025-02-05 12:52 ` Peter Zijlstra
2025-02-05 16:52 ` Sebastian Andrzej Siewior
2025-02-20 15:12 ` Peter Zijlstra
1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-05 12:52 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Wed, Feb 05, 2025 at 01:20:26PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-02-04 16:14:05 [+0100], Peter Zijlstra wrote:
>
> This does not compile. Let me fix this up, a few comments…
Moo, clangd didn't complain :/ But yeah, I didn't actually compile this,
only had neovim running clangd.
> > diff --git a/io_uring/futex.c b/io_uring/futex.c
> > index 3159a2b7eeca..18cd5ccde36d 100644
> > --- a/io_uring/futex.c
> > +++ b/io_uring/futex.c
> > @@ -332,13 +331,13 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
> > ifd->q.wake = io_futex_wake_fn;
> > ifd->req = req;
> >
> > + // XXX task->state is messed up
> > ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
> > - &ifd->q, &hb);
> > + &ifd->q, NULL);
> > if (!ret) {
> > hlist_add_head(&req->hash_node, &ctx->futex_list);
> > io_ring_submit_unlock(ctx, issue_flags);
> >
> > - futex_queue(&ifd->q, hb);
> > return IOU_ISSUE_SKIP_COMPLETE;
>
> This looks interesting. This is called from
> req->io_task_work.func = io_req_task_submit
> | io_req_task_submit()
> | -> io_issue_sqe()
> | -> def->issue() <- io_futex_wait
>
> and
> io_fallback_req_func() iterates over a list and invokes
> req->io_task_work.func. This seems to be also invoked from
> io_sq_thread() (via io_sq_tw() -> io_handle_tw_list()).
>
> If this (wait and wake) is only used within kernel threads then it is
> fine. If the waker and/ or waiter are in user context then we are in
> trouble because one will use the private hash of the process and the
> other won't because it is a kernel thread. So the messed-up task->state
> is the least of the problems.
Right, so the io-uring stuff is tricky, I think this more or less does
what it used to though. I 'simply' moved the futex_queue() into
futex_wait_setup().
IIRC the io-uring threads share the process-mm but will never hit
userspace.
> > }
> …
> > --- a/kernel/futex/waitwake.c
> > +++ b/kernel/futex/waitwake.c
> > @@ -266,67 +264,69 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
> > if (unlikely(ret != 0))
> > return ret;
> >
> > - hb1 = futex_hash(&key1);
> > - hb2 = futex_hash(&key2);
> > -
> > retry_private:
> > - double_lock_hb(hb1, hb2);
> > - op_ret = futex_atomic_op_inuser(op, uaddr2);
> > - if (unlikely(op_ret < 0)) {
> > - double_unlock_hb(hb1, hb2);
> > -
> > - if (!IS_ENABLED(CONFIG_MMU) ||
> > - unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
> > - /*
> > - * we don't get EFAULT from MMU faults if we don't have
> > - * an MMU, but we might get them from range checking
> > - */
> > - ret = op_ret;
> > - return ret;
> > - }
> > -
> > - if (op_ret == -EFAULT) {
> > - ret = fault_in_user_writeable(uaddr2);
> > - if (ret)
> > + if (1) {
> > + CLASS(hb, hb1)(&key1);
> > + CLASS(hb, hb2)(&key2);
>
> I don't know if hiding these things makes it better because this will do
> futex_hash_put() if it gets out of scope. This means we still hold the
> reference while in fault_in_user_writeable() and cond_resched(). Is this
> on purpose?
Sorta, I found it very hard to figure out what your patches did exactly,
and..
> I guess it does not matter much. The resize will be delayed until the
> task gets back and releases the reference. This will make progress. So
> it is okay.
this.
> > + double_lock_hb(hb1, hb2);
> > + op_ret = futex_atomic_op_inuser(op, uaddr2);
> > + if (unlikely(op_ret < 0)) {
> > + double_unlock_hb(hb1, hb2);
> > +
> > + if (!IS_ENABLED(CONFIG_MMU) ||
> > + unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
> > + /*
> > + * we don't get EFAULT from MMU faults if we don't have
> > + * an MMU, but we might get them from range checking
> > + */
> > + ret = op_ret;
> > return ret;
> …
> > @@ -451,20 +442,22 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
> > struct futex_q *q = &vs[i].q;
> > u32 val = vs[i].w.val;
> >
> > - hb = futex_q_lock(q);
> > - ret = futex_get_value_locked(&uval, uaddr);
> > + if (1) {
> > + CLASS(hb_q_lock, hb)(q);
> > + ret = futex_get_value_locked(&uval, uaddr);
>
> This confused me at the beginning because I expected hb_q_lock to have
> the lock part in the constructor and also the matching unlock in the
> destructor. But no, this is not the case.
Agreed, that *is* rather ugly. The sane way to fix that might be to
untangle futex_q_lock() from futex_hash(). And instead do:
CLASS(hb, hb)(&q->key);
futex_q_lock(q, hb);
Or somesuch. That might be a nice cleanup either way.
> > @@ -618,26 +611,42 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> …
> >
> > + if (uval != val) {
> > + futex_q_unlock(hb);
> > + return -EWOULDBLOCK;
> > + }
> > +
> > + if (key2 && !futex_match(&q->key, key2)) {
>
> There should be no !
Duh..
> > + futex_q_unlock(hb);
> > + return -EINVAL;
> > + }
> >
> > - if (uval != val) {
> > - futex_q_unlock(*hb);
> > - ret = -EWOULDBLOCK;
> > + /*
> > + * The task state is guaranteed to be set before another task can
> > + * wake it. set_current_state() is implemented using smp_store_mb() and
> > + * futex_queue() calls spin_unlock() upon completion, both serializing
> > + * access to the hash list and forcing another memory barrier.
> > + */
> > + set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> > + futex_queue(q, hb);
> > }
> >
> > return ret;
>
> So the beauty of it is that you enforce a ref drop on hb once it gets
> out of scope. So you can't use it by chance once the ref is dropped.
Right.
> But this does not help in futex_lock_pi() where you have to drop the
> reference before __rt_mutex_start_proxy_lock() (or at least before
> rt_mutex_wait_proxy_lock()) but still have it if you go for the no_block
> shortcut. At which point the lock is even still held.
>
> While it makes the other cases nicer, the futex_lock_pi() function was
> the only one where I was thinking about setting hb to NULL to avoid
> accidental usage later on.
OK, so yeah, I got completely lost in futex_lock_pi(), and I couldn't
figure out what you did there. Let me try and untangle that again.
^ permalink raw reply [flat|nested] 52+ messages in thread* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-05 12:52 ` Peter Zijlstra
@ 2025-02-05 16:52 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-05 16:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-05 13:52:50 [+0100], Peter Zijlstra wrote:
> On Wed, Feb 05, 2025 at 01:20:26PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-02-04 16:14:05 [+0100], Peter Zijlstra wrote:
> >
> > This does not compile. Let me fix this up, a few comments…
>
> Moo, clangd didn't complain :/ But yeah, I didn't actually compile this,
> only had neovim running clangd.
don't worry. I assumed that it is a sketch :) Let me add that one to the
list of things to try…
> > > diff --git a/io_uring/futex.c b/io_uring/futex.c
> > > index 3159a2b7eeca..18cd5ccde36d 100644
> > > --- a/io_uring/futex.c
> > > +++ b/io_uring/futex.c
> > > @@ -332,13 +331,13 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
> > > ifd->q.wake = io_futex_wake_fn;
> > > ifd->req = req;
> > >
> > > + // XXX task->state is messed up
> > > ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
> > > - &ifd->q, &hb);
> > > + &ifd->q, NULL);
> > > if (!ret) {
> > > hlist_add_head(&req->hash_node, &ctx->futex_list);
> > > io_ring_submit_unlock(ctx, issue_flags);
> > >
> > > - futex_queue(&ifd->q, hb);
> > > return IOU_ISSUE_SKIP_COMPLETE;
> >
> > This looks interesting. This is called from
> > req->io_task_work.func = io_req_task_submit
> > | io_req_task_submit()
> > | -> io_issue_sqe()
> > | -> def->issue() <- io_futex_wait
> >
> > and
> > io_fallback_req_func() iterates over a list and invokes
> > req->io_task_work.func. This seems to be also invoked from
> > io_sq_thread() (via io_sq_tw() -> io_handle_tw_list()).
> >
> > If this (wait and wake) is only used within kernel threads then it is
> > fine. If the waker and/ or waiter are in user context then we are in
> > trouble because one will use the private hash of the process and the
> > other won't because it is a kernel thread. So the messed-up task->state
> > is the least of the problems.
>
> Right, so the io-uring stuff is tricky, I think this more or less does
> what it used to though. I 'simply' moved the futex_queue() into
> futex_wait_setup().
This is correct. The changed task's state may or may not be relevant.
> IIRC the io-uring threads share the process-mm but will never hit
> userspace.
One function had a kworker prototype. There are also the PF_IO_WORKER
threads. I need to clarify with the io_uring people how this works…
> > > @@ -451,20 +442,22 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
> > > struct futex_q *q = &vs[i].q;
> > > u32 val = vs[i].w.val;
> > >
> > > - hb = futex_q_lock(q);
> > > - ret = futex_get_value_locked(&uval, uaddr);
> > > + if (1) {
> > > + CLASS(hb_q_lock, hb)(q);
> > > + ret = futex_get_value_locked(&uval, uaddr);
> >
> > This confused me at the beginning because I expected hb_q_lock to have
> > the lock part in the constructor and also the matching unlock in the
> > destructor. But no, this is not the case.
>
> Agreed, that *is* rather ugly. The sane way to fix that might be to
> untangle futex_q_lock() from futex_hash(). And instead do:
>
> CLASS(hb, hb)(&q->key);
> futex_q_lock(q, hb);
>
> Or somesuch. That might be a nice cleanup either way.
If you know, you know.
We could do the lock manually in the first iteration and then try to
hide it later on. The error path always drops it early. We could unlock
manually there and tell the compiler that it is done.
> > So the beauty of it is that you enforce a ref drop on hb once it gets
> > out of scope. So you can't use it by chance once the ref is dropped.
>
> Right.
>
> > But this does not help in futex_lock_pi() where you have to drop the
> > reference before __rt_mutex_start_proxy_lock() (or at least before
> > rt_mutex_wait_proxy_lock()) but still have it if you go for the no_block
> > shortcut. At which point the lock is even still held.
> >
> > While it makes the other cases nicer, the futex_lock_pi() function was
> > the only one where I was thinking about setting hb to NULL to avoid
> > accidental usage later on.
>
> OK, so yeah, I got completely lost in futex_lock_pi(), and I couldn't
> figure out what you did there. Let me try and untangle that again.
The hash bucket reference should be dropped if the task is going to
sleep. The try-lock path there has a fast-forward to the end.
From memory, the unlock PI (and requeue PI) path is also a bit odd
because I need to grab an extra reference in a corner case. That is, to
avoid a resize, because I need a stable ::lock_ptr while the task is no
longer enqueued. Otherwise ::lock_ptr would point to a stale hash bucket.
Sebastian
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-05 12:20 ` Sebastian Andrzej Siewior
2025-02-05 12:52 ` Peter Zijlstra
@ 2025-02-20 15:12 ` Peter Zijlstra
2025-02-20 15:57 ` Sebastian Andrzej Siewior
2025-02-21 16:00 ` Sebastian Andrzej Siewior
1 sibling, 2 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-20 15:12 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Wed, Feb 05, 2025 at 01:20:26PM +0100, Sebastian Andrzej Siewior wrote:
> So the beauty of it is that you enforce a ref drop on hb once it gets
> out of scope. So you can't use it by chance once the ref is dropped.
>
> But this does not help in futex_lock_pi() where you have to drop the
> reference before __rt_mutex_start_proxy_lock() (or at least before
> rt_mutex_wait_proxy_lock()) but still have it when you go for the no_block
> shortcut. At which point the lock is even still owned.
>
> While it makes the other cases nicer, the futex_lock_pi() function was
> the only one where I was thinking about setting hb to NULL to avoid
> accidental usage later on.
Sorry for the delay.. got distracted :/
I think we can simply put:
futex_hash_put(no_free_ptr(hb));
right where we drop hb->lock in futex_lock_pi().
I've split up the patch a little and stuck them here:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=futex/scope
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-20 15:12 ` Peter Zijlstra
@ 2025-02-20 15:57 ` Sebastian Andrzej Siewior
2025-02-21 16:00 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-20 15:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-20 16:12:06 [+0100], Peter Zijlstra wrote:
> Sorry for the delay.. got distracted :/
Thank you. I will digest it tomorrow.
Sebastian
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-20 15:12 ` Peter Zijlstra
2025-02-20 15:57 ` Sebastian Andrzej Siewior
@ 2025-02-21 16:00 ` Sebastian Andrzej Siewior
2025-02-21 19:21 ` Peter Zijlstra
1 sibling, 1 reply; 52+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-21 16:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-02-20 16:12:06 [+0100], Peter Zijlstra wrote:
> I've split up the patch a little and stuck them here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=futex/scope
futex_wait_setup() has an unused task argument (current is always used).
You solved the issue I mentioned in lock_pi by throwing in a
no_free_ptr() in the middle. Well, that works; I had assumed we wanted to
close the scope somehow.
This should work. Let me zap that last argument in futex_wait_setup()
(confidently assuming I won't need it) and then rebase what I have so far
on top.
Sebastian
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v8 00/15] futex: Add support task local hash maps.
2025-02-21 16:00 ` Sebastian Andrzej Siewior
@ 2025-02-21 19:21 ` Peter Zijlstra
0 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2025-02-21 19:21 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Fri, Feb 21, 2025 at 05:00:43PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-02-20 16:12:06 [+0100], Peter Zijlstra wrote:
> > I've split up the patch a little and stuck them here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=futex/scope
>
> futex_wait_setup() has an unused task argument (current is always used).
>
> You solved the thing in lock_pi, I mentioned, by throwing in a
> no_free_ptr() in the middle. Well that works, I assumed we wanted to
> close the context somehow.
Yeah, it ain't pretty, but given the crazy code flow around there, I
couldn't come up with something sensible :/
^ permalink raw reply [flat|nested] 52+ messages in thread