* [PATCH v9 01/11] futex: fixup futex_wait_setup [fold futex: Move futex_queue() into futex_wait_setup()]
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-26 8:15 ` Thomas Gleixner
2025-02-25 17:09 ` [PATCH v9 02/11] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
` (10 subsequent siblings)
11 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
we could also make @task a bool signaling it is either NULL or current.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/waitwake.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 7655de59ab3d6..44034dee7a48c 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -571,7 +571,8 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
* @val: the expected value
* @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
- * @hb: storage for hash_bucket pointer to be returned to caller
+ * @key2: the second futex_key if used for requeue PI
+ * task: Task queueing this futex
*
* Setup the futex_q and locate the hash_bucket. Get the futex value and
* compare it with the expected value. Handle atomic faults internally.
@@ -634,7 +635,7 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
if (uval != val) {
futex_q_unlock(hb);
- ret = -EWOULDBLOCK;
+ return -EWOULDBLOCK;
}
if (key2 && futex_match(&q->key, key2)) {
@@ -648,8 +649,9 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
- set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
- futex_queue(q, hb, current);
+ if (task == current)
+ set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+ futex_queue(q, hb, task);
}
return ret;
--
2.47.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v9 01/11] futex: fixup futex_wait_setup [fold futex: Move futex_queue() into futex_wait_setup()]
2025-02-25 17:09 ` [PATCH v9 01/11] futex: fixup futex_wait_setup [fold futex: Move futex_queue() into futex_wait_setup()] Sebastian Andrzej Siewior
@ 2025-02-26 8:15 ` Thomas Gleixner
2025-02-26 8:40 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2025-02-26 8:15 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Feb 25 2025 at 18:09, Sebastian Andrzej Siewior wrote:
> we could also make @task a bool signaling it is either NULL or current.
I have no idea what this change log is trying to tell me. It gives zero
information what this patch is about and the subject line is confusing
at best.
Thanks,
tglx
* Re: [PATCH v9 01/11] futex: fixup futex_wait_setup [fold futex: Move futex_queue() into futex_wait_setup()]
2025-02-26 8:15 ` Thomas Gleixner
@ 2025-02-26 8:40 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-26 8:40 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Valentin Schneider,
Waiman Long
On 2025-02-26 09:15:02 [+0100], Thomas Gleixner wrote:
> On Tue, Feb 25 2025 at 18:09, Sebastian Andrzej Siewior wrote:
>
> > we could also make @task a bool signaling it is either NULL or current.
>
> I have no idea what this change log is trying to tell me. It gives zero
> information what this patch is about and the subject line is confusing
> at best.
This is meant for PeterZ, as this should be folded into one of his
patches on which this series is built.
The argument "task" passed to futex_wait_setup() is always current or
NULL. The latter is only used by io_uring. So instead of getting a task
passed and assuming it belongs to the current process, we could have a
bool and enforce this.
The problem would arise if things change later on and the passed task
does not belong to the current process hierarchy and therefore would not
share the same private hash.
> Thanks,
>
> tglx
Sebastian
* [PATCH v9 02/11] futex: Create helper function to initialize a hash slot.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 01/11] futex: fixup futex_wait_setup [fold futex: Move futex_queue() into futex_wait_setup()] Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 03/11] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
` (9 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Factor out the futex_hash_bucket initialisation into a helper function.
The helper function will be used in a follow-up patch implementing
process-private hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index f72f4561eb94e..69424994e7d9e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1122,6 +1122,13 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+{
+ atomic_set(&fhb->waiters, 0);
+ plist_head_init(&fhb->chain);
+ spin_lock_init(&fhb->lock);
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
@@ -1139,11 +1146,8 @@ static int __init futex_init(void)
futex_hashsize, futex_hashsize);
futex_hashsize = 1UL << futex_shift;
- for (i = 0; i < futex_hashsize; i++) {
- atomic_set(&futex_queues[i].waiters, 0);
- plist_head_init(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
+ for (i = 0; i < futex_hashsize; i++)
+ futex_hash_bucket_init(&futex_queues[i]);
return 0;
}
--
2.47.2
* [PATCH v9 03/11] futex: Add basic infrastructure for local task local hash.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 01/11] futex: fixup futex_wait_setup [fold futex: Move futex_queue() into futex_wait_setup()] Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 02/11] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 04/11] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
` (8 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The futex hashmap is system wide and shared by random tasks. Each slot
is hashed based on its address and VMA. Due to randomized VMAs (and
memory allocations) the same logical lock (pointer) can end up in a
different hash bucket on each invocation of the application. This in
turn means that different applications may share a hash bucket on the
first invocation but not on the second, and it is not always clear which
applications will be involved. This can result in high latencies when
acquiring the futex_hash_bucket::lock, especially if the lock owner is
limited to a CPU and cannot be effectively PI boosted.
Introduce a task local hash map. The hashmap can be allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 0)
The `0' argument allocates a default of 16 slots; a higher number
can be specified if desired. The current upper limit is 131072.
The allocated hashmap is used by all threads within a process.
A thread can check if the private map has been allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which returns the current number of slots.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 20 ++++++++
include/linux/mm_types.h | 6 ++-
include/uapi/linux/prctl.h | 5 ++
kernel/fork.c | 2 +
kernel/futex/core.c | 101 +++++++++++++++++++++++++++++++++++--
kernel/sys.c | 4 ++
6 files changed, 133 insertions(+), 5 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85c..943828db52234 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -77,6 +77,15 @@ void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3);
+int futex_hash_allocate_default(void);
+void futex_hash_free(struct mm_struct *mm);
+
+static inline void futex_mm_init(struct mm_struct *mm)
+{
+ mm->futex_hash_bucket = NULL;
+}
+
#else
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -88,6 +97,17 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
{
return -EINVAL;
}
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
+{
+ return -EINVAL;
+}
+static inline int futex_hash_allocate_default(void)
+{
+ return 0;
+}
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+
#endif
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b27db7f94963..c20f2310d78ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -30,6 +30,7 @@
#define INIT_PASID 0
struct address_space;
+struct futex_hash_bucket;
struct mem_cgroup;
/*
@@ -936,7 +937,10 @@ struct mm_struct {
*/
seqcount_t mm_lock_seq;
#endif
-
+#ifdef CONFIG_FUTEX
+ unsigned int futex_hash_mask;
+ struct futex_hash_bucket *futex_hash_bucket;
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 5c6080680cb27..55b843644c51a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -353,4 +353,9 @@ struct prctl_mm_map {
*/
#define PR_LOCK_SHADOW_STACK_STATUS 76
+/* FUTEX hash management */
+#define PR_FUTEX_HASH 77
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f32..80ac156adebbf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1287,6 +1287,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
+ futex_mm_init(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
mm->pmd_huge_pte = NULL;
#endif
@@ -1364,6 +1365,7 @@ static inline void __mmput(struct mm_struct *mm)
if (mm->binfmt)
module_put(mm->binfmt->module);
lru_gen_del_mm(mm);
+ futex_hash_free(mm);
mmdrop(mm);
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 69424994e7d9e..e64a5cf818414 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -39,6 +39,7 @@
#include <linux/memblock.h>
#include <linux/fault-inject.h>
#include <linux/slab.h>
+#include <linux/prctl.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -107,18 +108,40 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
+static inline bool futex_key_is_private(union futex_key *key)
+{
+ /*
+ * Relies on get_futex_key() to set either bit for shared
+ * futexes -- see comment with union futex_key.
+ */
+ return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
+}
+
/**
- * futex_hash - Return the hash bucket in the global hash
+ * futex_hash - Return the hash bucket in the global or local hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
- * corresponding hash bucket in the global hash.
+ * corresponding hash bucket in the global hash. If the FUTEX is private and
+ * a local hash table is privated then this one is used.
*/
struct futex_hash_bucket *__futex_hash(union futex_key *key)
{
- u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
- key->both.offset);
+ struct futex_hash_bucket *fhb;
+ u32 hash;
+ fhb = current->mm->futex_hash_bucket;
+ if (fhb && futex_key_is_private(key)) {
+ u32 hash_mask = current->mm->futex_hash_mask;
+
+ hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / 4,
+ key->both.offset);
+ return &fhb[hash & hash_mask];
+ }
+ hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / 4,
+ key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
@@ -1129,6 +1152,76 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
spin_lock_init(&fhb->lock);
}
+void futex_hash_free(struct mm_struct *mm)
+{
+ kvfree(mm->futex_hash_bucket);
+}
+
+static int futex_hash_allocate(unsigned int hash_slots)
+{
+ struct futex_hash_bucket *fhb;
+ int i;
+
+ if (current->mm->futex_hash_bucket)
+ return -EALREADY;
+
+ if (!thread_group_leader(current))
+ return -EINVAL;
+
+ if (hash_slots == 0)
+ hash_slots = 16;
+ if (hash_slots < 2)
+ hash_slots = 2;
+ if (hash_slots > 131072)
+ hash_slots = 131072;
+ if (!is_power_of_2(hash_slots))
+ hash_slots = rounddown_pow_of_two(hash_slots);
+
+ fhb = kvmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL_ACCOUNT);
+ if (!fhb)
+ return -ENOMEM;
+
+ current->mm->futex_hash_mask = hash_slots - 1;
+
+ for (i = 0; i < hash_slots; i++)
+ futex_hash_bucket_init(&fhb[i]);
+
+ current->mm->futex_hash_bucket = fhb;
+ return 0;
+}
+
+int futex_hash_allocate_default(void)
+{
+ return futex_hash_allocate(0);
+}
+
+static int futex_hash_get_slots(void)
+{
+ if (current->mm->futex_hash_bucket)
+ return current->mm->futex_hash_mask + 1;
+ return 0;
+}
+
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
+{
+ int ret;
+
+ switch (arg2) {
+ case PR_FUTEX_HASH_SET_SLOTS:
+ ret = futex_hash_allocate(arg3);
+ break;
+
+ case PR_FUTEX_HASH_GET_SLOTS:
+ ret = futex_hash_get_slots();
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
diff --git a/kernel/sys.c b/kernel/sys.c
index cb366ff8703af..e509ad9795103 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
#include <linux/user_namespace.h>
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
+#include <linux/futex.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2811,6 +2812,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_lock_shadow_stack_status(me, arg2);
break;
+ case PR_FUTEX_HASH:
+ error = futex_hash_prctl(arg2, arg3);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
--
2.47.2
* [PATCH v9 04/11] futex: Hash only the address for private futexes.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 03/11] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 05/11] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
` (7 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_hash() passes the whole futex_key to jhash2. The first two members
are passed as the first argument and the offset as the "initial value".
For private futexes, the mm part is always the same and it is used only
within the process. By excluding the mm part from the hash, we reduce
the number of u32 words passed to jhash2 from 4 (16 / 4) to 2 (8 / 4).
This avoids the __jhash_mix() part of jhash.
The resulting code is smaller and, based on testing, this variant
performs as well as the original or slightly better.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index e64a5cf818414..e4e0bc7722d78 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -117,6 +117,18 @@ static inline bool futex_key_is_private(union futex_key *key)
return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
}
+static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
+ struct futex_hash_bucket *fhb,
+ u32 hash_mask)
+{
+ u32 hash;
+
+ hash = jhash2((void *)&key->private.address,
+ sizeof(key->private.address) / 4,
+ key->both.offset);
+ return &fhb[hash & hash_mask];
+}
+
/**
* futex_hash - Return the hash bucket in the global or local hash
* @key: Pointer to the futex key for which the hash is calculated
@@ -131,14 +143,9 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
u32 hash;
fhb = current->mm->futex_hash_bucket;
- if (fhb && futex_key_is_private(key)) {
- u32 hash_mask = current->mm->futex_hash_mask;
+ if (fhb && futex_key_is_private(key))
+ return futex_hash_private(key, fhb, current->mm->futex_hash_mask);
- hash = jhash2((u32 *)key,
- offsetof(typeof(*key), both.offset) / 4,
- key->both.offset);
- return &fhb[hash & hash_mask];
- }
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
--
2.47.2
* [PATCH v9 05/11] futex: Allow automatic allocation of process wide futex hash.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (3 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 04/11] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 06/11] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
` (6 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Allocate a default futex hash if a task forks its first thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 12 ++++++++++++
kernel/fork.c | 24 ++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 943828db52234..bad377c30de5e 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -86,6 +86,13 @@ static inline void futex_mm_init(struct mm_struct *mm)
mm->futex_hash_bucket = NULL;
}
+static inline bool futex_hash_requires_allocation(void)
+{
+ if (current->mm->futex_hash_bucket)
+ return false;
+ return true;
+}
+
#else
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -108,6 +115,11 @@ static inline int futex_hash_allocate_default(void)
static inline void futex_hash_free(struct mm_struct *mm) { }
static inline void futex_mm_init(struct mm_struct *mm) { }
+static inline bool futex_hash_requires_allocation(void)
+{
+ return false;
+}
+
#endif
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 80ac156adebbf..824cc55d32ece 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2138,6 +2138,15 @@ static void rv_task_fork(struct task_struct *p)
#define rv_task_fork(p) do {} while (0)
#endif
+static bool need_futex_hash_allocate_default(u64 clone_flags)
+{
+ if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
+ return false;
+ if (!thread_group_empty(current))
+ return false;
+ return futex_hash_requires_allocation();
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2515,6 +2524,21 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cancel_cgroup;
+ /*
+ * Allocate a default futex hash for the user process once the first
+ * thread spawns.
+ */
+ if (need_futex_hash_allocate_default(clone_flags)) {
+ retval = futex_hash_allocate_default();
+ if (retval)
+ goto bad_fork_core_free;
+ /*
+ * If we fail beyond this point we don't free the allocated
+ * futex hash map. We assume that another thread will be created
+ * and makes use of it. The hash map will be freed once the main
+ * thread terminates.
+ */
+ }
/*
* From this point on we must avoid any synchronous user-space
* communication until we take the tasklist-lock. In particular, we do
--
2.47.2
* [PATCH v9 06/11] futex: Decrease the waiter count before the unlock operation.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (4 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 05/11] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 07/11] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
` (5 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
To support runtime resizing of the process private hash, it's required
to not use the obtained hash bucket once the reference count has been
dropped. The reference will be dropped after the unlock of the hash
bucket.
The number of waiters is decremented after the unlock operation. There
is no requirement that this needs to happen after the unlock. The
increment happens before acquiring the lock to signal early that there
will be a waiter. The waker can avoid blocking on the lock if it is
known that there is no waiter.
There is no difference in terms of ordering if the decrement happens
before or after the unlock.
Decrease the waiter count before the unlock operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 2 +-
kernel/futex/requeue.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index e4e0bc7722d78..a66623524a952 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -554,8 +554,8 @@ void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb)
void futex_q_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
- spin_unlock(&hb->lock);
futex_hb_waiters_dec(hb);
+ spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb,
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 992e3ce005c6f..023c028d2fce3 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -456,8 +456,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
ret = futex_get_value_locked(&curval, uaddr1);
if (unlikely(ret)) {
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
ret = get_user(curval, uaddr1);
if (ret)
@@ -542,8 +542,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
* waiter::requeue_state is correct.
*/
case -EFAULT:
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -556,8 +556,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
* exit to complete.
* - EAGAIN: The user space value changed.
*/
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -674,8 +674,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
put_pi_state(pi_state);
out_unlock:
- double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ double_unlock_hb(hb1, hb2);
}
wake_up_q(&wake_q);
return ret ? ret : task_count;
--
2.47.2
* [PATCH v9 07/11] futex: Introduce futex_q_lockptr_lock().
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (5 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 06/11] futex: Decrease the waiter count before the unlock operation Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 08/11] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
` (4 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_lock_pi() and __fixup_pi_state_owner() acquire the
futex_q::lock_ptr without holding a reference assuming the previously
obtained hash bucket and the assigned lock_ptr are still valid. This
isn't the case once the private hash can be resized and becomes invalid
after the reference drop.
Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in
futex_q::lock_ptr. The lock pointer is read in an RCU section to ensure
that it does not go away if the hash bucket has been replaced and the
old pointer has been observed. After locking, the pointer needs to be
compared to check if it changed. If so, then the hash bucket has been
replaced, the user has been moved to the new one, and lock_ptr has been
updated. The lock operation needs to be redone in this case.
The locked hash bucket is not returned.
A special case is an early return in futex_lock_pi() (due to signal or
timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
futex_q::lock_ptr is expected (and its matching hash bucket) but since
the waiter has been removed from the hash this can no longer be
guaranteed. Therefore before the waiter is removed and a reference is
acquired which is later dropped by the waiter to avoid a resize.
Add futex_q_lockptr_lock() and use it.
Acquire an additional reference in requeue_pi_wake_futex() and
futex_unlock_pi() while the futex_q is removed, denote this extra
reference in futex_q::drop_hb_ref and let the waiter drop the reference
in this case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 44 ++++++++++++++++++++++++++++++++++++++++++
kernel/futex/futex.h | 4 +++-
kernel/futex/pi.c | 15 ++++++++++++--
kernel/futex/requeue.c | 16 ++++++++++++---
4 files changed, 73 insertions(+), 6 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index a66623524a952..239179e9ed9d5 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -152,6 +152,17 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
return &futex_queues[hash & (futex_hashsize - 1)];
}
+/**
+ * futex_hash_get - Get an additional reference for the local hash.
+ * @hb: ptr to the private local hash.
+ *
+ * Obtain an additional reference for the already obtained hash bucket. The
+ * caller must already own an reference.
+ */
+void futex_hash_get(struct futex_hash_bucket *hb)
+{
+}
+
void futex_hash_put(struct futex_hash_bucket *hb) { }
/**
@@ -632,6 +643,39 @@ int futex_unqueue(struct futex_q *q)
return ret;
}
+void futex_q_lockptr_lock(struct futex_q *q)
+{
+#if 0
+ struct futex_hash_bucket *hb;
+#endif
+ spinlock_t *lock_ptr;
+
+ /*
+ * See futex_unqueue() why lock_ptr can change.
+ */
+ guard(rcu)();
+retry:
+ lock_ptr = READ_ONCE(q->lock_ptr);
+ spin_lock(lock_ptr);
+
+ if (unlikely(lock_ptr != q->lock_ptr)) {
+ spin_unlock(lock_ptr);
+ goto retry;
+ }
+#if 0
+ hb = container_of(lock_ptr, struct futex_hash_bucket, lock);
+ /*
+ * The caller needs to either hold a reference on the hash (to ensure
+ * that the hash is not resized) _or_ be enqueued on the hash. This
+ * ensures that futex_q::lock_ptr is updated while moved to the new
+ * hash during resize.
+ * Once the hash bucket is locked the resize operation, which might be
+ * in progress, will block on the lock.
+ */
+ return hb;
+#endif
+}
+
/*
* PI futexes can not be requeued and must remove themselves from the hash
* bucket. The hash bucket lock (i.e. lock_ptr) is held.
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index eac6de6ed563a..e6f8f2f9281aa 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -183,6 +183,7 @@ struct futex_q {
union futex_key *requeue_pi_key;
u32 bitset;
atomic_t requeue_state;
+ bool drop_hb_ref;
#ifdef CONFIG_PREEMPT_RT
struct rcuwait requeue_wait;
#endif
@@ -197,12 +198,13 @@ enum futex_access {
extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
enum futex_access rw);
-
+extern void futex_q_lockptr_lock(struct futex_q *q);
extern struct hrtimer_sleeper *
futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
+extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
DEFINE_CLASS(hb, struct futex_hash_bucket *,
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 4cee9ec5d97d6..51c69e8808152 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -806,7 +806,7 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
break;
}
- spin_lock(q->lock_ptr);
+ futex_q_lockptr_lock(q);
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
/*
@@ -1066,7 +1066,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
* spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
* the
*/
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
/*
* Waiter is unqueued.
*/
@@ -1086,6 +1086,11 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
futex_unqueue_pi(&q);
spin_unlock(q.lock_ptr);
+ if (q.drop_hb_ref) {
+ CLASS(hb, hb)(&q.key);
+ /* Additional reference from futex_unlock_pi() */
+ futex_hash_put(hb);
+ }
goto out;
out_unlock_put_key:
@@ -1194,6 +1199,12 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
if (!rt_waiter) {
+ /*
+ * Acquire a reference for the leaving waiter to ensure
+ * valid futex_q::lock_ptr.
+ */
+ futex_hash_get(hb);
+ top_waiter->drop_hb_ref = true;
__futex_unqueue(top_waiter);
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
goto retry_hb;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 023c028d2fce3..b0e64fd454d96 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -231,7 +231,12 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
WARN_ON(!q->rt_waiter);
q->rt_waiter = NULL;
-
+ /*
+ * Acquire a reference for the waiter to ensure valid
+ * futex_q::lock_ptr.
+ */
+ futex_hash_get(hb);
+ q->drop_hb_ref = true;
q->lock_ptr = &hb->lock;
/* Signal locked state to the waiter */
@@ -826,7 +831,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
case Q_REQUEUE_PI_LOCKED:
/* The requeue acquired the lock */
if (q.pi_state && (q.pi_state->owner != current)) {
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
ret = fixup_pi_owner(uaddr2, &q, true);
/*
* Drop the reference to the pi state which the
@@ -853,7 +858,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
ret = 0;
- spin_lock(q.lock_ptr);
+ futex_q_lockptr_lock(&q);
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
@@ -885,6 +890,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
default:
BUG();
}
+ if (q.drop_hb_ref) {
+ CLASS(hb, hb)(&q.key);
+ /* Additional reference from requeue_pi_wake_futex() */
+ futex_hash_put(hb);
+ }
out:
if (to) {
--
2.47.2
* [PATCH v9 08/11] futex: Acquire a hash reference in futex_wait_multiple_setup().
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (6 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 07/11] futex: Introduce futex_q_lockptr_lock() Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 09/11] futex: Allow to re-allocate the private local hash Sebastian Andrzej Siewior
` (3 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_wait_multiple_setup() changes task_struct::__state to
!TASK_RUNNING and then enqueues on multiple futexes. Every
futex_q_lock() acquires a reference on the global hash which is dropped
later.
If a rehash is in progress then the loop will block on
mm_struct::futex_hash_bucket until the rehash completes, which will
lose the previously set task_struct::__state.
Acquire a reference on the local hash to avoid blocking on
mm_struct::futex_hash_bucket.
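The pinning pattern the patch introduces can be sketched in plain userspace C. This is a hypothetical analog, not kernel code: a plain counter stands in for the rcuref-based private hash, and __futex_wait_multiple_setup() is replaced by a stub.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the private hash. */
struct fake_private_hash {
	int users;	/* reference count */
};

static struct fake_private_hash *get_private_hash(struct fake_private_hash *h)
{
	if (!h)
		return NULL;	/* no private hash: the global hash is used */
	h->users++;
	return h;
}

static void put_private_hash(struct fake_private_hash *h)
{
	h->users--;
}

/* Stand-in for __futex_wait_multiple_setup(); the real one may sleep. */
static int do_setup(void)
{
	return 0;
}

/*
 * Mirrors the wrapper added in this patch: pin the hash for the whole
 * setup so a concurrent rehash cannot make us block (and lose the
 * already changed task state) in the middle of it.
 */
static int wait_multiple_setup(struct fake_private_hash *h)
{
	struct fake_private_hash *pinned = get_private_hash(h);
	int ret = do_setup();

	if (pinned)
		put_private_hash(pinned);
	return ret;
}
```

The reference count is balanced across the call, and the NULL case (no private hash allocated) falls through to the setup unchanged.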
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 10 ++++++++++
kernel/futex/futex.h | 2 ++
kernel/futex/waitwake.c | 21 +++++++++++++++++++--
3 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 239179e9ed9d5..b08bca2ed0342 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -129,6 +129,11 @@ static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
return &fhb[hash & hash_mask];
}
+struct futex_private_hash *futex_get_private_hash(void)
+{
+ return NULL;
+}
+
/**
* futex_hash - Return the hash bucket in the global or local hash
* @key: Pointer to the futex key for which the hash is calculated
@@ -152,6 +157,11 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
return &futex_queues[hash & (futex_hashsize - 1)];
}
+bool futex_put_private_hash(struct futex_private_hash *hb_p)
+{
+ return false;
+}
+
/**
* futex_hash_get - Get an additional reference for the local hash.
* @hb: ptr to the private local hash.
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index e6f8f2f9281aa..0a76ee6e7dc10 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -206,6 +206,8 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
+extern struct futex_private_hash *futex_get_private_hash(void);
+extern bool futex_put_private_hash(struct futex_private_hash *hb_p);
DEFINE_CLASS(hb, struct futex_hash_bucket *,
if (_T) futex_hash_put(_T),
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 44034dee7a48c..67eebb5b4b212 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -385,7 +385,7 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
}
/**
- * futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
+ * __futex_wait_multiple_setup - Prepare to wait and enqueue multiple futexes
* @vs: The futex list to wait on
* @count: The size of the list
* @woken: Index of the last woken futex, if any. Used to notify the
@@ -400,7 +400,7 @@ int futex_unqueue_multiple(struct futex_vector *v, int count)
* - 0 - Success
* - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
*/
-int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
+static int __futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
{
bool retry = false;
int ret, i;
@@ -491,6 +491,23 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
return 0;
}
+int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
+{
+ struct futex_private_hash *hb_p;
+ int ret;
+
+ /*
+ * Assume a private futex is used and acquire a reference on the private
+ * hash to avoid blocking on mm_struct::futex_hash_bucket during a rehash
+ * after the task state has been changed.
+ */
+ hb_p = futex_get_private_hash();
+ ret = __futex_wait_multiple_setup(vs, count, woken);
+ if (hb_p)
+ futex_put_private_hash(hb_p);
+ return ret;
+}
+
/**
* futex_sleep_multiple - Check sleeping conditions and sleep
* @vs: List of futexes to wait for
--
2.47.2
* [PATCH v9 09/11] futex: Allow to re-allocate the private local hash.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (7 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 08/11] futex: Acquire a hash reference in futex_wait_multiple_setup() Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 10/11] futex: Resize local futex hash table based on number of threads Sebastian Andrzej Siewior
` (2 subsequent siblings)
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
replacement. The futex_hash_allocate()/PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime and resize an already existing
internal private futex_hash_bucket to another size.
The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it has enqueued itself on the hash bucket. No
reference is held while the task is scheduled out waiting for the
wakeup.
The resize allocates a new struct futex_private_hash and drops the
initial reference under the mm_struct::futex_hash_lock. If the reference
drop results in destruction of the object then users currently queued on
the local hash will be requeued on the new local hash. At the end
mm_struct::futex_phash is updated, the old pointer is RCU freed
and the mutex is dropped.
If the reference drop does not result in destruction of the object then
the new pointer is saved as mm_struct::futex_phash_new. In this case
replacement is delayed. The user dropping the last reference is not
always the best choice to perform the replacement. For instance
futex_wait_queue() drops the reference after changing its task state,
which would also be modified while the futex_hash_lock is acquired.
Therefore the replacement is delayed until a task acquires a reference
on the current local hash.
This scheme keeps the requirement that all waiters/wakers of the same
address always block on the same futex_hash_bucket::lock.
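The delayed-replacement rule described above can be sketched in userspace C. This is a hypothetical analog of the decision made under mm_struct::futex_hash_lock; it uses a plain counter and omits RCU, locking, and the rehash step, all of which the kernel code performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace analog; the kernel uses rcuref and RCU. */
struct fake_hash {
	int users;		/* reference count, initial reference included */
	unsigned int mask;	/* hash_mask == slots - 1 */
};

struct fake_mm {
	struct fake_hash *phash;	/* current private hash */
	struct fake_hash *phash_new;	/* parked replacement, if any */
};

/*
 * Replacement rule from the commit message: the prepared hash replaces
 * the current one only if dropping the current hash's reference
 * destroys it; otherwise the new hash is parked in phash_new and
 * published later by whoever next takes the lock.
 */
static void assign_new_hash(struct fake_mm *mm, struct fake_hash *repl)
{
	struct fake_hash *cur = mm->phash;

	if (cur && cur->mask >= repl->mask)
		return;			/* already at least as large */

	if (cur && --cur->users > 0) {
		mm->phash_new = repl;	/* still referenced: delay */
		return;
	}
	/* last reference gone: rehash queued waiters here, then publish */
	mm->phash = repl;
}
```

With a still-referenced current hash the replacement is parked; once the last reference is gone the new hash is published immediately.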
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 5 +-
include/linux/mm_types.h | 7 +-
kernel/futex/core.c | 248 +++++++++++++++++++++++++++++++++++----
kernel/futex/futex.h | 1 +
kernel/futex/requeue.c | 5 +
5 files changed, 237 insertions(+), 29 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index bad377c30de5e..bfb38764bac7a 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -83,12 +83,13 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
- mm->futex_hash_bucket = NULL;
+ rcu_assign_pointer(mm->futex_phash, NULL);
+ mutex_init(&mm->futex_hash_lock);
}
static inline bool futex_hash_requires_allocation(void)
{
- if (current->mm->futex_hash_bucket)
+ if (current->mm->futex_phash)
return false;
return true;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c20f2310d78ca..19abbc870e0a9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -30,7 +30,7 @@
#define INIT_PASID 0
struct address_space;
-struct futex_hash_bucket;
+struct futex_private_hash;
struct mem_cgroup;
/*
@@ -938,8 +938,9 @@ struct mm_struct {
seqcount_t mm_lock_seq;
#endif
#ifdef CONFIG_FUTEX
- unsigned int futex_hash_mask;
- struct futex_hash_bucket *futex_hash_bucket;
+ struct mutex futex_hash_lock;
+ struct futex_private_hash __rcu *futex_phash;
+ struct futex_private_hash *futex_phash_new;
#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b08bca2ed0342..4d9ee3bcaa6d0 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
+#include <linux/rcuref.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -56,6 +57,14 @@ static struct {
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
+struct futex_private_hash {
+ rcuref_t users;
+ unsigned int hash_mask;
+ struct rcu_head rcu;
+ bool initial_ref_dropped;
+ bool released;
+ struct futex_hash_bucket queues[];
+};
/*
* Fault injections for futexes.
@@ -129,9 +138,122 @@ static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
return &fhb[hash & hash_mask];
}
+static void futex_rehash_current_users(struct futex_private_hash *old,
+ struct futex_private_hash *new)
+{
+ struct futex_hash_bucket *hb_old, *hb_new;
+ unsigned int slots = old->hash_mask + 1;
+ u32 hash_mask = new->hash_mask;
+ unsigned int i;
+
+ for (i = 0; i < slots; i++) {
+ struct futex_q *this, *tmp;
+
+ hb_old = &old->queues[i];
+
+ spin_lock(&hb_old->lock);
+ plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
+
+ plist_del(&this->list, &hb_old->chain);
+ futex_hb_waiters_dec(hb_old);
+
+ WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
+
+ hb_new = futex_hash_private(&this->key, new->queues, hash_mask);
+ futex_hb_waiters_inc(hb_new);
+ /*
+ * The new pointer isn't published yet but an already
+ * moved user can be unqueued due to timeout or signal.
+ */
+ spin_lock_nested(&hb_new->lock, SINGLE_DEPTH_NESTING);
+ plist_add(&this->list, &hb_new->chain);
+ this->lock_ptr = &hb_new->lock;
+ spin_unlock(&hb_new->lock);
+ }
+ spin_unlock(&hb_old->lock);
+ }
+}
+
+static void futex_assign_new_hash(struct futex_private_hash *hb_p_new,
+ struct mm_struct *mm)
+{
+ bool drop_init_ref = hb_p_new != NULL;
+ struct futex_private_hash *hb_p;
+
+ if (!hb_p_new) {
+ hb_p_new = mm->futex_phash_new;
+ mm->futex_phash_new = NULL;
+ }
+ /* Someone was quicker, the current mask is valid */
+ if (!hb_p_new)
+ return;
+
+ hb_p = rcu_dereference_check(mm->futex_phash,
+ lockdep_is_held(&mm->futex_hash_lock));
+ if (hb_p) {
+ if (hb_p->hash_mask >= hb_p_new->hash_mask) {
+ /* It was increased again while we were waiting */
+ kvfree(hb_p_new);
+ return;
+ }
+ /*
+ * If the caller started the resize then the initial reference
+ * needs to be dropped. If the object cannot be destroyed,
+ * we save hb_p_new for later and ensure the reference counter
+ * is not dropped again.
+ */
+ if (drop_init_ref &&
+ (hb_p->initial_ref_dropped || !futex_put_private_hash(hb_p))) {
+ mm->futex_phash_new = hb_p_new;
+ hb_p->initial_ref_dropped = true;
+ return;
+ }
+ if (!READ_ONCE(hb_p->released)) {
+ mm->futex_phash_new = hb_p_new;
+ return;
+ }
+
+ futex_rehash_current_users(hb_p, hb_p_new);
+ }
+ rcu_assign_pointer(mm->futex_phash, hb_p_new);
+ kvfree_rcu(hb_p, rcu);
+}
+
struct futex_private_hash *futex_get_private_hash(void)
{
- return NULL;
+ struct mm_struct *mm = current->mm;
+ /*
+ * Ideally we don't loop. If there is a replacement in progress
+ * then a new private hash is already prepared and a reference can't be
+ * obtained once the last user dropped its reference.
+ * In that case we block on mm_struct::futex_hash_lock and either have
+ * to perform the replacement or wait while someone else is doing the
+ * job. Either way, on the second iteration we acquire a reference on the
+ * new private hash or loop again because a new replacement has been
+ * requested.
+ */
+again:
+ scoped_guard(rcu) {
+ struct futex_private_hash *hb_p;
+
+ hb_p = rcu_dereference(mm->futex_phash);
+ if (!hb_p)
+ return NULL;
+
+ if (rcuref_get(&hb_p->users))
+ return hb_p;
+ }
+ scoped_guard(mutex, &current->mm->futex_hash_lock)
+ futex_assign_new_hash(NULL, mm);
+ goto again;
+}
+
+static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
+{
+ if (!futex_key_is_private(key))
+ return NULL;
+
+ return futex_get_private_hash();
}
/**
@@ -144,12 +266,12 @@ struct futex_private_hash *futex_get_private_hash(void)
*/
struct futex_hash_bucket *__futex_hash(union futex_key *key)
{
- struct futex_hash_bucket *fhb;
+ struct futex_private_hash *hb_p;
u32 hash;
- fhb = current->mm->futex_hash_bucket;
- if (fhb && futex_key_is_private(key))
- return futex_hash_private(key, fhb, current->mm->futex_hash_mask);
+ hb_p = futex_get_private_hb(key);
+ if (hb_p)
+ return futex_hash_private(key, hb_p->queues, hb_p->hash_mask);
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / 4,
@@ -159,7 +281,13 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
bool futex_put_private_hash(struct futex_private_hash *hb_p)
{
- return false;
+ bool released;
+
+ guard(preempt)();
+ released = rcuref_put_rcusafe(&hb_p->users);
+ if (released)
+ WRITE_ONCE(hb_p->released, true);
+ return released;
}
/**
@@ -171,9 +299,22 @@ bool futex_put_private_hash(struct futex_private_hash *hb_p)
*/
void futex_hash_get(struct futex_hash_bucket *hb)
{
+ struct futex_private_hash *hb_p = hb->hb_p;
+
+ if (!hb_p)
+ return;
+
+ WARN_ON_ONCE(!rcuref_get(&hb_p->users));
}
-void futex_hash_put(struct futex_hash_bucket *hb) { }
+void futex_hash_put(struct futex_hash_bucket *hb)
+{
+ struct futex_private_hash *hb_p = hb->hb_p;
+
+ if (!hb_p)
+ return;
+ futex_put_private_hash(hb_p);
+}
/**
* futex_setup_timer - set up the sleeping hrtimer.
@@ -615,6 +756,8 @@ int futex_unqueue(struct futex_q *q)
spinlock_t *lock_ptr;
int ret = 0;
+ /* RCU so lock_ptr is not going away during locking. */
+ guard(rcu)();
/* In the common case we don't take the spinlock, which is nice. */
retry:
/*
@@ -1028,9 +1171,21 @@ static void compat_exit_robust_list(struct task_struct *curr)
static void exit_pi_state_list(struct task_struct *curr)
{
struct list_head *next, *head = &curr->pi_state_list;
+ struct futex_private_hash *hb_p;
struct futex_pi_state *pi_state;
union futex_key key = FUTEX_KEY_INIT;
+ /*
+ * The mutex mm_struct::futex_hash_lock might be acquired.
+ */
+ might_sleep();
+ /*
+ * Ensure the hash remains stable (no resize) during the while loop
+ * below. The hb pointer is acquired under the pi_lock so we can't block
+ * on the mutex.
+ */
+ WARN_ON(curr != current);
+ hb_p = futex_get_private_hash();
/*
* We are a ZOMBIE and nobody can enqueue itself on
* pi_state_list anymore, but we have to be careful
@@ -1093,6 +1248,8 @@ static void exit_pi_state_list(struct task_struct *curr)
raw_spin_lock_irq(&curr->pi_lock);
}
raw_spin_unlock_irq(&curr->pi_lock);
+ if (hb_p)
+ futex_put_private_hash(hb_p);
}
#else
static inline void exit_pi_state_list(struct task_struct *curr) { }
@@ -1206,8 +1363,10 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
-static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
+ struct futex_private_hash *hb_p)
{
+ fhb->hb_p = hb_p;
atomic_set(&fhb->waiters, 0);
plist_head_init(&fhb->chain);
spin_lock_init(&fhb->lock);
@@ -1215,20 +1374,34 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
void futex_hash_free(struct mm_struct *mm)
{
- kvfree(mm->futex_hash_bucket);
+ struct futex_private_hash *hb_p;
+
+ kvfree(mm->futex_phash_new);
+ /*
+ * The mm_struct belonging to the task is about to be removed so all
+ * threads that ever accessed the private hash are gone and the
+ * pointer can be accessed directly (omitting an RCU read section or
+ * lock).
+ * Since there cannot be a thread holding a reference to the private
+ * hash, we free it immediately.
+ */
+ hb_p = rcu_dereference_raw(mm->futex_phash);
+ if (!hb_p)
+ return;
+
+ if (!hb_p->initial_ref_dropped && WARN_ON(!futex_put_private_hash(hb_p)))
+ return;
+
+ kvfree(hb_p);
}
static int futex_hash_allocate(unsigned int hash_slots)
{
- struct futex_hash_bucket *fhb;
+ struct futex_private_hash *hb_p, *hb_tofree = NULL;
+ struct mm_struct *mm = current->mm;
+ size_t alloc_size;
int i;
- if (current->mm->futex_hash_bucket)
- return -EALREADY;
-
- if (!thread_group_leader(current))
- return -EINVAL;
-
if (hash_slots == 0)
hash_slots = 16;
if (hash_slots < 2)
@@ -1238,16 +1411,39 @@ static int futex_hash_allocate(unsigned int hash_slots)
if (!is_power_of_2(hash_slots))
hash_slots = rounddown_pow_of_two(hash_slots);
- fhb = kvmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL_ACCOUNT);
- if (!fhb)
+ if (unlikely(check_mul_overflow(hash_slots, sizeof(struct futex_hash_bucket),
+ &alloc_size)))
return -ENOMEM;
- current->mm->futex_hash_mask = hash_slots - 1;
+ if (unlikely(check_add_overflow(alloc_size, sizeof(struct futex_private_hash),
+ &alloc_size)))
+ return -ENOMEM;
+
+ hb_p = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
+ if (!hb_p)
+ return -ENOMEM;
+
+ rcuref_init(&hb_p->users, 1);
+ hb_p->initial_ref_dropped = false;
+ hb_p->released = false;
+ hb_p->hash_mask = hash_slots - 1;
for (i = 0; i < hash_slots; i++)
- futex_hash_bucket_init(&fhb[i]);
+ futex_hash_bucket_init(&hb_p->queues[i], hb_p);
- current->mm->futex_hash_bucket = fhb;
+ scoped_guard(mutex, &mm->futex_hash_lock) {
+ if (mm->futex_phash_new) {
+ if (mm->futex_phash_new->hash_mask <= hb_p->hash_mask) {
+ hb_tofree = mm->futex_phash_new;
+ } else {
+ hb_tofree = hb_p;
+ hb_p = mm->futex_phash_new;
+ }
+ mm->futex_phash_new = NULL;
+ }
+ futex_assign_new_hash(hb_p, mm);
+ }
+ kvfree(hb_tofree);
return 0;
}
@@ -1258,8 +1454,12 @@ int futex_hash_allocate_default(void)
static int futex_hash_get_slots(void)
{
- if (current->mm->futex_hash_bucket)
- return current->mm->futex_hash_mask + 1;
+ struct futex_private_hash *hb_p;
+
+ guard(rcu)();
+ hb_p = rcu_dereference(current->mm->futex_phash);
+ if (hb_p)
+ return hb_p->hash_mask + 1;
return 0;
}
@@ -1301,7 +1501,7 @@ static int __init futex_init(void)
futex_hashsize = 1UL << futex_shift;
for (i = 0; i < futex_hashsize; i++)
- futex_hash_bucket_init(&futex_queues[i]);
+ futex_hash_bucket_init(&futex_queues[i], 0);
return 0;
}
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 0a76ee6e7dc10..973efcca2e01b 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -118,6 +118,7 @@ struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
+ struct futex_private_hash *hb_p;
} ____cacheline_aligned_in_smp;
/*
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b0e64fd454d96..c716a66f86929 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -87,6 +87,11 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
futex_hb_waiters_inc(hb2);
plist_add(&q->list, &hb2->chain);
q->lock_ptr = &hb2->lock;
+ /*
+ * hb1 and hb2 belong to the same futex_private_hash
+ * because if we managed to get a reference on hb1 then it can't be
+ * replaced. Therefore we avoid put(hb1)+get(hb2) here.
+ */
}
q->key = *key2;
}
--
2.47.2
* [PATCH v9 10/11] futex: Resize local futex hash table based on number of threads.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (8 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 09/11] futex: Allow to re-allocate the private local hash Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-25 17:09 ` [PATCH v9 11/11] futex: Use a hashmask instead of hashsize Sebastian Andrzej Siewior
2025-03-03 10:54 ` [PATCH v9 00/11] futex: Add support task local hash maps Peter Zijlstra
11 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Automatically size the local hash based on the number of threads, but
don't let the thread count exceed the number of online CPUs. The logic
allocates 4 * number-of-threads buckets, clamped between 16 and
futex_hashsize (the size of the system wide hash).
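The sizing rule can be expressed as a small pure function. This is a userspace sketch of the logic, not the kernel code: the power-of-two helper is a plain loop standing in for the kernel's roundup_pow_of_two(), and the thread count is assumed to be already capped at the number of online CPUs by the caller.

```c
#include <assert.h>

/* Round n up to the next power of two; stand-in for the kernel's
 * roundup_pow_of_two(). Assumes n >= 1. */
static unsigned int roundup_p2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Default sizing as described above: 4 buckets per thread, rounded up
 * to a power of two and clamped to [16, futex_hashsize].
 */
static unsigned int default_buckets(unsigned int threads,
				    unsigned int futex_hashsize)
{
	unsigned int buckets = roundup_p2(4 * threads);

	if (buckets < 16)
		buckets = 16;
	if (buckets > futex_hashsize)
		buckets = futex_hashsize;
	return buckets;
}
```

A single thread thus gets the 16-bucket floor, while heavily threaded processes are capped at the global hash size.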
On CONFIG_BASE_SMALL configs, the additional members for private hash
resize have been removed in order to save memory on mm_struct and avoid
any additional memory consumption. If we really do this, then I would
re-arrange the code structure in the previous patches to limit the
ifdefs.
The alternatives would be to limit the buckets allocated in
futex_hash_allocate_default() to 2. Avoiding
futex_hash_allocate_default() but allowing PR_FUTEX_HASH_SET_SLOTS to
work would require holding mm_struct::futex_hash_lock in
exit_pi_state_list() and futex_wait_multiple_setup() so that the
private hash does not disappear during these operations (which is
currently ensured by holding a reference).
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 21 +++++++--------
include/linux/mm_types.h | 2 +-
kernel/fork.c | 4 +--
kernel/futex/core.c | 57 +++++++++++++++++++++++++++++++++++++---
kernel/futex/futex.h | 8 ++++++
5 files changed, 73 insertions(+), 19 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index bfb38764bac7a..77821a78059f2 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -78,6 +78,13 @@ void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
int futex_hash_prctl(unsigned long arg2, unsigned long arg3);
+
+#ifdef CONFIG_BASE_SMALL
+static inline int futex_hash_allocate_default(void) { return 0; }
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+#else /* !CONFIG_BASE_SMALL */
+
int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
@@ -87,14 +94,9 @@ static inline void futex_mm_init(struct mm_struct *mm)
mutex_init(&mm->futex_hash_lock);
}
-static inline bool futex_hash_requires_allocation(void)
-{
- if (current->mm->futex_phash)
- return false;
- return true;
-}
+#endif /* CONFIG_BASE_SMALL */
-#else
+#else /* !CONFIG_FUTEX */
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
static inline void futex_exit_release(struct task_struct *tsk) { }
@@ -116,11 +118,6 @@ static inline int futex_hash_allocate_default(void)
static inline void futex_hash_free(struct mm_struct *mm) { }
static inline void futex_mm_init(struct mm_struct *mm) { }
-static inline bool futex_hash_requires_allocation(void)
-{
- return false;
-}
-
#endif
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 19abbc870e0a9..72e68de850745 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -937,7 +937,7 @@ struct mm_struct {
*/
seqcount_t mm_lock_seq;
#endif
-#ifdef CONFIG_FUTEX
+#if defined(CONFIG_FUTEX) && !defined(CONFIG_BASE_SMALL)
struct mutex futex_hash_lock;
struct futex_private_hash __rcu *futex_phash;
struct futex_private_hash *futex_phash_new;
diff --git a/kernel/fork.c b/kernel/fork.c
index 824cc55d32ece..5e15e5b24f289 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2142,9 +2142,7 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
{
if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
return false;
- if (!thread_group_empty(current))
- return false;
- return futex_hash_requires_allocation();
+ return true;
}
/*
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 4d9ee3bcaa6d0..6d375b9407c85 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -138,6 +138,7 @@ static struct futex_hash_bucket *futex_hash_private(union futex_key *key,
return &fhb[hash & hash_mask];
}
+#ifndef CONFIG_BASE_SMALL
static void futex_rehash_current_users(struct futex_private_hash *old,
struct futex_private_hash *new)
{
@@ -256,6 +257,14 @@ static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
return futex_get_private_hash();
}
+#else
+
+static struct futex_private_hash *futex_get_private_hb(union futex_key *key)
+{
+ return NULL;
+}
+#endif
+
/**
* futex_hash - Return the hash bucket in the global or local hash
* @key: Pointer to the futex key for which the hash is calculated
@@ -279,6 +288,7 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
return &futex_queues[hash & (futex_hashsize - 1)];
}
+#ifndef CONFIG_BASE_SMALL
bool futex_put_private_hash(struct futex_private_hash *hb_p)
{
bool released;
@@ -315,6 +325,7 @@ void futex_hash_put(struct futex_hash_bucket *hb)
return;
futex_put_private_hash(hb_p);
}
+#endif
/**
* futex_setup_timer - set up the sleeping hrtimer.
@@ -1366,12 +1377,15 @@ void futex_exit_release(struct task_struct *tsk)
static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
struct futex_private_hash *hb_p)
{
+#ifndef CONFIG_BASE_SMALL
fhb->hb_p = hb_p;
+#endif
atomic_set(&fhb->waiters, 0);
plist_head_init(&fhb->chain);
spin_lock_init(&fhb->lock);
}
+#ifndef CONFIG_BASE_SMALL
void futex_hash_free(struct mm_struct *mm)
{
struct futex_private_hash *hb_p;
@@ -1406,8 +1420,8 @@ static int futex_hash_allocate(unsigned int hash_slots)
hash_slots = 16;
if (hash_slots < 2)
hash_slots = 2;
- if (hash_slots > 131072)
- hash_slots = 131072;
+ if (hash_slots > futex_hashsize)
+ hash_slots = futex_hashsize;
if (!is_power_of_2(hash_slots))
hash_slots = rounddown_pow_of_two(hash_slots);
@@ -1449,7 +1463,31 @@ static int futex_hash_allocate(unsigned int hash_slots)
int futex_hash_allocate_default(void)
{
- return futex_hash_allocate(0);
+ unsigned int threads, buckets, current_buckets = 0;
+ struct futex_private_hash *hb_p;
+
+ if (!current->mm)
+ return 0;
+
+ scoped_guard(rcu) {
+ threads = min_t(unsigned int, get_nr_threads(current), num_online_cpus());
+ hb_p = rcu_dereference(current->mm->futex_phash);
+ if (hb_p)
+ current_buckets = hb_p->hash_mask + 1;
+ }
+
+ /*
+ * The default allocation will remain within
+ * 16 <= threads * 4 <= global hash size
+ */
+ buckets = roundup_pow_of_two(4 * threads);
+ buckets = max(buckets, 16);
+ buckets = min(buckets, futex_hashsize);
+
+ if (current_buckets >= buckets)
+ return 0;
+
+ return futex_hash_allocate(buckets);
}
static int futex_hash_get_slots(void)
@@ -1463,6 +1501,19 @@ static int futex_hash_get_slots(void)
return 0;
}
+#else
+
+static int futex_hash_allocate(unsigned int hash_slots)
+{
+ return -EINVAL;
+}
+
+static int futex_hash_get_slots(void)
+{
+ return 0;
+}
+#endif
+
int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
{
int ret;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 973efcca2e01b..d1149739f3110 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -205,10 +205,18 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *__futex_hash(union futex_key *key);
+#ifdef CONFIG_BASE_SMALL
+static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
+static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
+static inline struct futex_private_hash *futex_get_private_hash(void) { return NULL; }
+static inline bool futex_put_private_hash(struct futex_private_hash *hb_p) { return false; }
+
+#else /* !CONFIG_BASE_SMALL */
extern void futex_hash_get(struct futex_hash_bucket *hb);
extern void futex_hash_put(struct futex_hash_bucket *hb);
extern struct futex_private_hash *futex_get_private_hash(void);
extern bool futex_put_private_hash(struct futex_private_hash *hb_p);
+#endif
DEFINE_CLASS(hb, struct futex_hash_bucket *,
if (_T) futex_hash_put(_T),
--
2.47.2
* [PATCH v9 11/11] futex: Use a hashmask instead of hashsize.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (9 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 10/11] futex: Resize local futex hash table based on number of threads Sebastian Andrzej Siewior
@ 2025-02-25 17:09 ` Sebastian Andrzej Siewior
2025-02-26 8:17 ` Thomas Gleixner
2025-03-03 10:54 ` [PATCH v9 00/11] futex: Add support task local hash maps Peter Zijlstra
11 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-25 17:09 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The global hash uses futex_hashsize to store the number of hash
buckets that have been allocated during system boot. On each
futex_hash() invocation one is subtracted from this number to get the
mask. This can be optimized by storing the mask directly, avoiding the
subtraction on each futex_hash() invocation.
Rename futex_hashsize to futex_hashmask and save the mask of the
allocated hash map.
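For a power-of-two table the two forms select the same bucket; storing the mask merely drops the per-lookup subtraction. A minimal userspace sketch (not the kernel code) of the equivalence:

```c
#include <assert.h>
#include <stdint.h>

/* Bucket selection with the size: one subtraction per lookup. */
static uint32_t bucket_by_size(uint32_t hash, uint32_t hashsize)
{
	return hash & (hashsize - 1);
}

/* Bucket selection with the precomputed mask, as after this patch. */
static uint32_t bucket_by_mask(uint32_t hash, uint32_t hashmask)
{
	return hash & hashmask;
}
```

Since hashsize is a power of two, hashsize - 1 is exactly the mask, so both functions return identical indices for every hash value.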
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 6d375b9407c85..283e6644c05f9 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -52,10 +52,10 @@
*/
static struct {
struct futex_hash_bucket *queues;
- unsigned long hashsize;
+ unsigned long hashmask;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues (__futex_data.queues)
-#define futex_hashsize (__futex_data.hashsize)
+#define futex_hashmask (__futex_data.hashmask)
struct futex_private_hash {
rcuref_t users;
@@ -285,7 +285,7 @@ struct futex_hash_bucket *__futex_hash(union futex_key *key)
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
- return &futex_queues[hash & (futex_hashsize - 1)];
+ return &futex_queues[hash & futex_hashmask];
}
#ifndef CONFIG_BASE_SMALL
@@ -1420,8 +1420,8 @@ static int futex_hash_allocate(unsigned int hash_slots)
hash_slots = 16;
if (hash_slots < 2)
hash_slots = 2;
- if (hash_slots > futex_hashsize)
- hash_slots = futex_hashsize;
+ if (hash_slots > futex_hashmask + 1)
+ hash_slots = futex_hashmask + 1;
if (!is_power_of_2(hash_slots))
hash_slots = rounddown_pow_of_two(hash_slots);
@@ -1482,7 +1482,7 @@ int futex_hash_allocate_default(void)
*/
buckets = roundup_pow_of_two(4 * threads);
buckets = max(buckets, 16);
- buckets = min(buckets, futex_hashsize);
+ buckets = min(buckets, futex_hashmask + 1);
if (current_buckets >= buckets)
return 0;
@@ -1536,24 +1536,25 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3)
static int __init futex_init(void)
{
+ unsigned long i, hashsize;
unsigned int futex_shift;
- unsigned long i;
#ifdef CONFIG_BASE_SMALL
- futex_hashsize = 16;
+ hashsize = 16;
#else
- futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+ hashsize = roundup_pow_of_two(256 * num_possible_cpus());
#endif
futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
- futex_hashsize, 0, 0,
+ hashsize, 0, 0,
&futex_shift, NULL,
- futex_hashsize, futex_hashsize);
- futex_hashsize = 1UL << futex_shift;
+ hashsize, hashsize);
+ hashsize = 1UL << futex_shift;
- for (i = 0; i < futex_hashsize; i++)
+ for (i = 0; i < hashsize; i++)
futex_hash_bucket_init(&futex_queues[i], 0);
+ futex_hashmask = hashsize - 1;
return 0;
}
core_initcall(futex_init);
--
2.47.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v9 11/11] futex: Use a hashmask instead of hashsize.
2025-02-25 17:09 ` [PATCH v9 11/11] futex: Use a hashmask instead of hashsize Sebastian Andrzej Siewior
@ 2025-02-26 8:17 ` Thomas Gleixner
0 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2025-02-26 8:17 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Feb 25 2025 at 18:09, Sebastian Andrzej Siewior wrote:
> The global hash uses futex_hashsize to store the number of hash
> buckets allocated during system boot. On each
> futex_hash() invocation, one is subtracted from this number to get the
> mask. This can be optimized by storing the mask directly, avoiding the
> subtraction on each futex_hash() invocation.
As this is true independent of the private hash muck, this should go to
the top of the series, so it can be applied right away. Aside of that it
spares the churn in the new code ....
Thanks,
tglx
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-02-25 17:09 [PATCH v9 00/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (10 preceding siblings ...)
2025-02-25 17:09 ` [PATCH v9 11/11] futex: Use a hashmask instead of hashsize Sebastian Andrzej Siewior
@ 2025-03-03 10:54 ` Peter Zijlstra
2025-03-03 14:17 ` Sebastian Andrzej Siewior
2025-03-11 15:20 ` Sebastian Andrzej Siewior
11 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2025-03-03 10:54 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Tue, Feb 25, 2025 at 06:09:03PM +0100, Sebastian Andrzej Siewior wrote:
> Sebastian Andrzej Siewior (11):
> futex: fixup futex_wait_setup [fold futex: Move futex_queue() into
> futex_wait_setup()]
> futex: Create helper function to initialize a hash slot.
> futex: Add basic infrastructure for local task local hash.
> futex: Hash only the address for private futexes.
> futex: Allow automatic allocation of process wide futex hash.
> futex: Decrease the waiter count before the unlock operation.
> futex: Introduce futex_q_lockptr_lock().
> futex: Acquire a hash reference in futex_wait_multiple_setup().
> futex: Allow to re-allocate the private local hash.
> futex: Resize local futex hash table based on number of threads.
Right, I've been going over this and been poking at the patches for the
past few days, and I'm not quite sure where to start.
There's a bunch of simple things, that can be trivially fixed, but
there's also some more fundamental things.
I've written a pile of patches on top of this while playing around with
things. The latest pile sits in:
queue/locking/futex
I'm not sure I should post the patches as a reply to this email (I can,
if people want), but let me try and summarize what I did and why.
Primarily, the reason I started poking at it is that I think the prctl()
as implemented is completely useless. Notably its effect is entirely
ephemeral, one pthread_create() call can re-size the hash, destroying
the user requested size. Also, I still feel one should be able to set
the hash size to 0 and have it revert to global hash.
Finally prctl() should not return until the rehash is complete.
I think my implementation now does all that -- but I've not tested it
yet -- I've to write a prctl() testcase and it was too nice outside :-)
So, on the way to reworking the prctl(), I ran into:
- naming; hb_p is a terrible name, the way I read that is
hash-bucket-private, or hash-bucket pointer, neither make much sense,
because they're a pointer to struct futex_private_hash, which is a
hash-table.
Very uninspired, I've done s/hb_p/fph/g, with the exception of
hb->hb_p, which is now hb->priv.
- more naming; you had:
hb = __futex_hash(key);
futex_hash_get(hb);
futex_hash_put(hb);
fph = futex_get_private_hash();
futex_put_private_hash();
which is all sorts of inconsistent, and I've made that:
hb = __futex_hash(key); /* hash, no get */
hb = futex_hash(key) /* hash and get */
futex_hash_get(hb); /* get */
futex_hash_put(hb); /* put */
fph = futex_private_hash();
futex_private_hash_get(fph);
futex_private_hash_put(fph);
- There was some superfluous state; notably, AFAICT
futex_private_hash::{initial_ref_dropped,released} are unneeded and
made the code unnecessarily complicated.
You can drop the initial ref when phash && !phash_new, eg on the
first time around when you allocate a new hash-table.
We don't need to track released because we can simply check for that
state using rcuref_read() == 0.
- As alluded to in a previous point, there was no means of only
hashing; the fph get was both non-obviously hidden inside the private
hash and unconditional. Untangled that.
My current prctl() thing does:
- reject !power-of-two and 1
- accepts 0
- returns once rehash is done
Notably, having done a prctl() disables the auto-sizing.
When allocating a new private hash table and there is already one
pending, it compares the tables. The compare function checks in order:
- custom (user provided / prctl())
- zero size
- biggest size
IOW, any user requested size always wins, a 0 size is final otherwise
go with the largest.
After that I rebased my FUTEX2_NUMA patch on top of all this and added
a new FUTEX2_MPOL, which is something Christoph Lameter asked for a
while back, and something we can now actually do sanely, since we have
lockless vma lookups working.
Anyway, the entire stack builds and boots, but is otherwise very much
untested.
WDYT?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-03 10:54 ` [PATCH v9 00/11] futex: Add support task local hash maps Peter Zijlstra
@ 2025-03-03 14:17 ` Sebastian Andrzej Siewior
2025-03-03 16:40 ` Sebastian Andrzej Siewior
2025-03-11 15:20 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-03 14:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-03-03 11:54:16 [+0100], Peter Zijlstra wrote:
> Right, I've been going over this and been poking at the patches for the
> past few days, and I'm not quite sure where to start.
>
> There's a bunch of simple things, that can be trivially fixed, but
> there's also some more fundamental things.
>
> I've written a pile of patches on top of this while playing around with
> things. The latest pile sits in:
>
> queue/locking/futex
>
> I'm not sure I should post the patches as a reply to this email (I can,
> if people want), but let me try and summarize what I did and why.
>
>
> Primarily, the reason I started poking at it is that I think the prctl()
> as implemented is completely useless. Notably its effect is entirely
> ephemeral, one pthread_create() call can re-size the hash, destroying
> the user requested size. Also, I still feel one should be able to set
> the hash size to 0 and have it revert to global hash.
>
> Finally prctl() should not return until the rehash is complete.
>
> I think my implementation now does all that -- but I've not tested it
> yet -- I've to write a prctl() testcase and it was too nice outside :-)
I kept prctl() mostly around for testing, with a few hacks to be able to
always resize it, even if the size is the same or smaller. tglx asked to
have it only ever increase. However, let me take this and do some testing.
…
> Anyway, the entire stack builds and boots, but is otherwise very much
> untested.
>
> WDYT?
Well. Let me take a look and do a bit of hammering.
Sebastian
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-03 14:17 ` Sebastian Andrzej Siewior
@ 2025-03-03 16:40 ` Sebastian Andrzej Siewior
2025-03-04 14:58 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-03 16:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-03-03 15:17:55 [+0100], To Peter Zijlstra wrote:
> > Anyway, the entire stack builds and boots, but is otherwise very much
> > untested.
> >
> > WDYT?
>
> well. Let take a look and do a bit of hammering.
So you kept the q.drop_hb_ref logic and the reference get. You kept the
private reference but renamed it and hid it behind the CLASS. I meant to
do that; I just wanted to check first whether you had another idea
regarding it. But okay.
You avoided the two states by dropping the refcount only when there is no
new pointer. That should work.
There is no refcount check in futex_hash_free(). It wouldn't hurt to
check futex_phash for 0/1, right?
My first few tests succeeded. And I have a few RCU annotations, which I
will post once I complete them and finish my requeue-pi tests.
Sebastian
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-03 16:40 ` Sebastian Andrzej Siewior
@ 2025-03-04 14:58 ` Sebastian Andrzej Siewior
2025-03-05 9:02 ` Sebastian Andrzej Siewior
2025-03-10 15:57 ` Peter Zijlstra
0 siblings, 2 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-04 14:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-03-03 17:40:16 [+0100], To Peter Zijlstra wrote:
…
> You avoided the two states by dropping the refcount only when there is no
> new pointer. That should work.
…
> My first few tests succeeded. And I have a few RCU annotations, which I
> post once I complete them and finish my requeue-pi tests.
get_futex_key() has this:
|…
| if (!fshared) {
|…
| if (IS_ENABLED(CONFIG_MMU))
| key->private.mm = mm;
| else
| key->private.mm = NULL;
|
| key->private.address = address;
|
and now __futex_hash_private() has this:
| {
| if (!futex_key_is_private(key))
| return NULL;
|
| if (!fph)
| fph = rcu_dereference(key->private.mm->futex_phash);
Dereferencing mm won't work on !CONFIG_MMU. We could limit private hash
to !CONFIG_BASE_SMALL && CONFIG_MMU.
Ignoring this, I managed to crash the box on top of 49fd6b8f5d59
("futex: Implement FUTEX2_MPOL"). I had one commit on top to make the
prctl not blocking (make futex_hash_allocate(, false)). This is to simulate
the fork resize. The backtrace:
| [ T8658] BUG: unable to handle page fault for address: fffffffffffffff0
| [ T8658] #PF: supervisor read access in kernel mode
| [ T8658] #PF: error_code(0x0000) - not-present page
| [ T8658] PGD 2c5a067 P4D 2c5a067 PUD 2c5c067 PMD 0
| [ T8658] Oops: Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
| [ T8658] CPU: 6 UID: 1001 PID: 8658 Comm: thread-create-l Not tainted 6.14.0-rc4+ #188 676565269ee73396c27dead3a66b3f774bd9af57
| [ T8658] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
| [ T8658] RIP: 0010:plist_check_list+0xb/0xa0
| [ T8658] Code: cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 54 49 89 fc 55 53 48 83 ec 10 <48> 8b 1f 48 8b 43 08 48 39 c7 74 27 48 8b 4f 08 50 49 89 f8 48 89
| [ T8658] RSP: 0018:ffffc90022e27c90 EFLAGS: 00010286
| [ T8658] RAX: 0000000000000000 RBX: ffffc90022e27e00 RCX: 0000000000000000
| [ T8658] RDX: ffff888558da02a8 RSI: ffff888558da02a8 RDI: fffffffffffffff0
| [ T8658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8885680dc980
| [ T8658] R10: 0000031e8e1a7200 R11: ffff888574990028 R12: fffffffffffffff0
| [ T8658] R13: ffff888558da02a8 R14: ffffc90022e27e48 R15: ffffc90022e27d38
| [ T8658] FS: 00007f741af9e6c0(0000) GS:ffff8885a7c2b000(0000) knlGS:0000000000000000
| [ T8658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| [ T8658] CR2: fffffffffffffff0 CR3: 00000006d7aca005 CR4: 00000000000626f0
| [ T8658] Call Trace:
| [ T8658] <TASK>
| [ T8658] plist_del+0x28/0x100
| [ T8658] __futex_unqueue+0x29/0x40
| [ T8658] futex_unqueue_pi+0x1f/0x40
| [ T8658] futex_lock_pi+0x24d/0x420
| [ T8658] do_futex+0x57/0x190
| [ T8658] __x64_sys_futex+0xfe/0x1a0
It takes about 1h+ to reproduce. And only on one particular stubborn
box. This originates from futex_unqueue_pi() after
futex_q_lockptr_lock(). I have another crash within
futex_q_lockptr_lock() (in spin_lock()).
This looks like the locking task was not enqueued in the hash bucket
during the resize. This means there was a timeout and the unlocking task
removed it while looking for the next owner. But the unlocking part
acquired an additional reference to avoid a resize in that case. So,
confused I am.
I reverted to 50ca0ec83226 ("futex: Resize local futex hash table based
on number of threads."), have the another "always resize hack" and so
far it looks good.
Looking at __futex_pivot_hash() there is this:
| if (fph) {
| if (rcuref_read(&fph->users) != 0) {
| mm->futex_phash_new = new;
| return false;
| }
|
| futex_rehash_private(fph, new);
| }
So we stash the new pointer as long as rcuref_read() does not return 0.
How stable is rcuref_read()'s 0 return actually? The code says:
| static inline unsigned int rcuref_read(rcuref_t *ref)
| {
| unsigned int c = atomic_read(&ref->refcnt);
|
| /* Return 0 if within the DEAD zone. */
| return c >= RCUREF_RELEASED ? 0 : c + 1;
| }
so if it goes negative on its final put, c becomes -1/0xff…ff,
rcuref_read() returns 0 and we do a resize. But the counter is negative and
did not reach RCUREF_DEAD yet, so it can be bumped back to positive. It
will not be deconstructed because the cmpxchg in rcuref_put_slowpath()
fails, so it remains active. But we do a resize here and end up with two
private hashes. That is why I had the `released' member.
Sebastian
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-04 14:58 ` Sebastian Andrzej Siewior
@ 2025-03-05 9:02 ` Sebastian Andrzej Siewior
2025-03-10 16:01 ` Peter Zijlstra
2025-03-10 15:57 ` Peter Zijlstra
1 sibling, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-05 9:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-03-04 15:58:39 [+0100], To Peter Zijlstra wrote:
> hash. That is why I had the `released' member.
The box was still alive this morning so it did survive >12h testing. I
would bring the `released' member back unless you have other
preferences.
Depending on those I could fold the fixes directly into the patches and
repost the whole thing or prepare you patches that can be folded back
and send those.
Sebastian
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-05 9:02 ` Sebastian Andrzej Siewior
@ 2025-03-10 16:01 ` Peter Zijlstra
2025-03-10 16:27 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2025-03-10 16:01 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Wed, Mar 05, 2025 at 10:02:37AM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-04 15:58:39 [+0100], To Peter Zijlstra wrote:
> > hash. That is why I had the `released' member.
>
> The box was still alive this morning so it did survive >12h testing. I
> > would bring the `released' member back unless you have other
> preferences.
Like I just wrote in that other email; I'm a bit confused as to how this
can happen. If rcuref_put() returns success, then the value is DEAD. It must
then either be decremented below RELEASED or incremented past NOREF in
order for rcuref_read() to no longer return 0.
> Depending on those I could fold the fixes directly into the patches and
> repost the whole thing or prepare you patches that can be folded back
> and send those.
Please, it appears I don't have as much time as I would like :/
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-10 16:01 ` Peter Zijlstra
@ 2025-03-10 16:27 ` Sebastian Andrzej Siewior
2025-03-11 10:17 ` Peter Zijlstra
0 siblings, 1 reply; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-10 16:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-03-10 17:01:02 [+0100], Peter Zijlstra wrote:
> On Wed, Mar 05, 2025 at 10:02:37AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-04 15:58:39 [+0100], To Peter Zijlstra wrote:
> > > hash. That is why I had the `released' member.
> >
> > The box was still alive this morning so it did survive >12h testing. I
> > > would bring the `released' member back unless you have other
> > preferences.
>
> Like I just wrote in that other email; I'm a bit confused as to how this
> can happen. If rcuref_put() returns success, then the value is DEAD. It must
> then either be decremented below RELEASED or incremented past NOREF in
> order for rcuref_read() to no longer return 0.
We can't rely on 0 meaning released, as the counter might become active
again. We could change rcuref_read() to return 0 if the reference can
still be obtained and -1 if it cannot.
We don't have many users atm so an audit should be quick.
> > Depending on those I could fold the fixes directly into the patches and
> > repost the whole thing or prepare you patches that can be folded back
> > and send those.
>
> Please, it appears I don't have as much time as I would like :/
Sebastian
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-10 16:27 ` Sebastian Andrzej Siewior
@ 2025-03-11 10:17 ` Peter Zijlstra
2025-03-11 10:33 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2025-03-11 10:17 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Mon, Mar 10, 2025 at 05:27:10PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-10 17:01:02 [+0100], Peter Zijlstra wrote:
> > On Wed, Mar 05, 2025 at 10:02:37AM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2025-03-04 15:58:39 [+0100], To Peter Zijlstra wrote:
> > > > hash. That is why I had the `released' member.
> > >
> > > The box was still alive this morning so it did survive >12h testing. I
> > > > would bring the `released' member back unless you have other
> > > preferences.
> >
> > Like I just wrote in that other email; I'm a bit confused as to how this
> > can happen. If rcuref_put() returns success, then the value is DEAD. It must
> > then either be decremented below RELEASED or incremented past NOREF in
> > order for rcuref_read() to no longer return 0.
>
> We can't rely on 0 meaning released, as the counter might become active
> again. We could change rcuref_read() to return 0 if the reference can
> still be obtained and -1 if it cannot.
> We don't have many users atm so an audit should be quick.
Right, so I failed to understand initially. When DEAD it stays 0, but
there is indeed the one case where it isn't yet DEAD but still returns
0.
Making the DEAD return -1 seems like a good solution.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-11 10:17 ` Peter Zijlstra
@ 2025-03-11 10:33 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-11 10:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-03-11 11:17:14 [+0100], Peter Zijlstra wrote:
> Right, so I failed to understand initially. When DEAD it stays 0, but
> there is indeed the one case where it isn't yet DEAD but still returns
> 0.
>
> Making the DEAD return -1 seems like a good solution.
The patch below is what I have / what tglx asked for. I intend to use it
in the series and repost it once I have fixed it up.
-------------->8--------------
Subject: [PATCH] rcuref: Provide rcuref_is_dead().
rcuref_read() returns the number of references that are currently held.
If 0 is returned then it is not safe to assume that the object ca be
scheduled for deconstruction because it is marked DEAD. This happens if
the return value of rcuref_put() is ignored and assumptions are made.
If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
If rcuref_put() did not return to the caller then the counter did not
yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
is still a chance that the counter counter will transition from
RCUREF_NOREF to 0 meaning it is still valid and must not be
deconstructed. In this brief window rcuref_read() will return 0.
Provide rcuref_is_dead() to determine if the counter is marked as
RCUREF_DEAD.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/rcuref.h | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c1c6b42..2fb2af6d98249 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
* rcuref_read - Read the number of held reference counts of a rcuref
* @ref: Pointer to the reference count
*
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuref_is_dead().
*/
static inline unsigned int rcuref_read(rcuref_t *ref)
{
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
return c >= RCUREF_RELEASED ? 0 : c + 1;
}
+/**
+ * rcuref_is_dead - Check if the rcuref has been already marked dead
+ * @ref: Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+ unsigned int c = atomic_read(&ref->refcnt);
+
+ return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
/**
--
2.47.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-04 14:58 ` Sebastian Andrzej Siewior
2025-03-05 9:02 ` Sebastian Andrzej Siewior
@ 2025-03-10 15:57 ` Peter Zijlstra
1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2025-03-10 15:57 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On Tue, Mar 04, 2025 at 03:58:37PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-03 17:40:16 [+0100], To Peter Zijlstra wrote:
> …
> > You avoided the two states by dropping refcount only there is no !new
> > pointer. That should work.
> …
> > My first few tests succeeded. And I have a few RCU annotations, which I
> > post once I complete them and finish my requeue-pi tests.
>
> get_futex_key() has this:
> |…
> | if (!fshared) {
> |…
> | if (IS_ENABLED(CONFIG_MMU))
> | key->private.mm = mm;
> | else
> | key->private.mm = NULL;
> |
> | key->private.address = address;
> |
>
> and now __futex_hash_private() has this:
> | {
> | if (!futex_key_is_private(key))
> | return NULL;
> |
> | if (!fph)
> | fph = rcu_dereference(key->private.mm->futex_phash);
>
> Dereferencing mm won't work on !CONFIG_MMU. We could limit private hash
> to !CONFIG_BASE_SMALL && CONFIG_MMU.
Humph, yeah, not sure we should care about !MMU.
> Ignoring this, I managed to crash the box on top of 49fd6b8f5d59
> ("futex: Implement FUTEX2_MPOL"). I had one commit on top to make the
> prctl not blocking (make futex_hash_allocate(, false)). This is simulate
> the fork resize. The backtrace:
> | [ T8658] BUG: unable to handle page fault for address: fffffffffffffff0
> | [ T8658] #PF: supervisor read access in kernel mode
> | [ T8658] #PF: error_code(0x0000) - not-present page
> | [ T8658] PGD 2c5a067 P4D 2c5a067 PUD 2c5c067 PMD 0
> | [ T8658] Oops: Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
> | [ T8658] CPU: 6 UID: 1001 PID: 8658 Comm: thread-create-l Not tainted 6.14.0-rc4+ #188 676565269ee73396c27dead3a66b3f774bd9af57
> | [ T8658] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
> | [ T8658] RIP: 0010:plist_check_list+0xb/0xa0
> | [ T8658] Code: cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 54 49 89 fc 55 53 48 83 ec 10 <48> 8b 1f 48 8b 43 08 48 39 c7 74 27 48 8b 4f 08 50 49 89 f8 48 89
> | [ T8658] RSP: 0018:ffffc90022e27c90 EFLAGS: 00010286
> | [ T8658] RAX: 0000000000000000 RBX: ffffc90022e27e00 RCX: 0000000000000000
> | [ T8658] RDX: ffff888558da02a8 RSI: ffff888558da02a8 RDI: fffffffffffffff0
> | [ T8658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8885680dc980
> | [ T8658] R10: 0000031e8e1a7200 R11: ffff888574990028 R12: fffffffffffffff0
> | [ T8658] R13: ffff888558da02a8 R14: ffffc90022e27e48 R15: ffffc90022e27d38
> | [ T8658] FS: 00007f741af9e6c0(0000) GS:ffff8885a7c2b000(0000) knlGS:0000000000000000
> | [ T8658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> | [ T8658] CR2: fffffffffffffff0 CR3: 00000006d7aca005 CR4: 00000000000626f0
> | [ T8658] Call Trace:
> | [ T8658] <TASK>
> | [ T8658] plist_del+0x28/0x100
> | [ T8658] __futex_unqueue+0x29/0x40
> | [ T8658] futex_unqueue_pi+0x1f/0x40
> | [ T8658] futex_lock_pi+0x24d/0x420
> | [ T8658] do_futex+0x57/0x190
> | [ T8658] __x64_sys_futex+0xfe/0x1a0
>
> It takes about 1h+ to reproduce. And only on one particular stubborn
> box. This originates from futex_unqueue_pi() after
> futex_q_lockptr_lock(). I have another crash within
> futex_q_lockptr_lock() (in spin_lock()).
>
> This looks like the locking task was not enqueued in the hash bucket
> during the resize. This means there was a timeout and the unlocking task
> removed it while looking for the next owner. But the unlocking part
> acquired an additional reference to avoid a resize in that case. So,
> confused I am.
Yeah, weird that.
> I reverted to 50ca0ec83226 ("futex: Resize local futex hash table based
> on number of threads."), have the another "always resize hack" and so
> far it looks good.
> Looking at __futex_pivot_hash() there is this:
> | if (fph) {
> | if (rcuref_read(&fph->users) != 0) {
> | mm->futex_phash_new = new;
> | return false;
> | }
> |
> | futex_rehash_private(fph, new);
> | }
>
> So we stash the new pointer as long as rcuref_read() does not return 0.
> How stable is rcuref_read()'s 0 return actually? The code says:
>
> | static inline unsigned int rcuref_read(rcuref_t *ref)
> | {
> | unsigned int c = atomic_read(&ref->refcnt);
> |
> | /* Return 0 if within the DEAD zone. */
> | return c >= RCUREF_RELEASED ? 0 : c + 1;
> | }
>
> so if it goes negative on its final put, c becomes -1/0xff…ff,
> rcuref_read() returns 0 and we do a resize. But the counter is negative and
> did not reach RCUREF_DEAD yet, so it can be bumped back to positive. It
> will not be deconstructed because the cmpxchg in rcuref_put_slowpath()
> fails, so it remains active. But we do a resize here and end up with two
> private hashes. That is why I had the `released' member.
I am not quite sure I follow. If rcuref_put_slowpath() returns true,
then the value has been set to DEAD (high nibble E); any concurrent
inc/dec will move it away from that a little, but it will always be set
back to DEAD (IOW, you need 1<<29 concurrent modifications in the same
direction to push it out of the DEAD range).
As long as it is within those 29 bits of DEAD, rcuref_read() should
return 0.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v9 00/11] futex: Add support task local hash maps.
2025-03-03 10:54 ` [PATCH v9 00/11] futex: Add support task local hash maps Peter Zijlstra
2025-03-03 14:17 ` Sebastian Andrzej Siewior
@ 2025-03-11 15:20 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-11 15:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Thomas Gleixner, Valentin Schneider,
Waiman Long
On 2025-03-03 11:54:16 [+0100], Peter Zijlstra wrote:
> After that I rebased my FUTEX2_NUMA patch on top of all this and added
> a new FUTEX2_MPOL, which is something Christoph Lameter asked for a
> while back, and something we can now actually do sanely, since we have
> lockless vma lookups working.
I'm going to keep the changes mostly as-is (except for the few
compile fallouts). One thing I wanted to mention in case someone has a
simple idea: we have this now:
|struct {
| unsigned long hashmask;
| unsigned int hashshift;
| struct futex_hash_bucket *queues[MAX_NUMNODES];
| } __futex_data __read_mostly __aligned(2*sizeof(long));
This MAX_NUMNODES will be set to 1 << 10 due to MAXSMP, for instance on
Debian. This in turn leads to an 8KiB queues array which will be
largely unused on a simple machine with zero or one node. I don't have
access to a machine with more than 4 nodes, so I _assumed_ this is the limit.
Anyway, I'm also not sure about the corner cases, say we have that many
nodes (1024) but just two CPUs. That would lead to roundup_pow_of_two(0) in
futex_init().
> WDYT?
Sebastian
^ permalink raw reply [flat|nested] 26+ messages in thread