* [PATCH v4 0/11] futex: Add support task local hash maps.
@ 2024-12-03 16:42 Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 01/11] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
` (10 more replies)
0 siblings, 11 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long
Hi,
this is a follow-up to
https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
and adds support for a task local futex_hash_bucket. It can be created
via prctl().
This version supports resizing at runtime and auto resizing while
creating threads. The upper limit is 256 * num_possible_cpus() but I
guess we can lower that.
I posted performance numbers of "perf bench futex hash"
https://lore.kernel.org/all/20241101110810.R3AnEqdu@linutronix.de/
I didn't run any new ones. While the performance of the default 16
buckets looks worse than 512 (beyond 512 the performance hardly changes,
while before that it doubles with each step), be aware that those buckets
are now task local (and not shared with other tasks), which seems to be
sufficient in general.
For systems with 512 CPUs and a single database application there is the
resize option. So either the application resizes the hash itself or we
offer auto resize based on the number of threads and CPUs. But be aware
that workloads like "xz huge_file.tar" will happily occupy all CPUs in
the system while using only a few locks in total, and those not very
often. In such a scenario two hash buckets would probably perform as
well as 512.
v3…v4: https://lore.kernel.org/all/20241115172035.795842-1-bigeasy@linutronix.de/
- Completed resize. Tested with wait/wake, lock_pi, requeue and
requeue_pi.
- Added auto resize during thread creation.
- Fixed bucket initialisation of the global hash bucket, resulting in an
occasional crash.
v2…v3 https://lore.kernel.org/all/20241028121921.1264150-1-bigeasy@linutronix.de/
- The default auto size for auto creation is 16.
- For the private hash, jhash2 is used on the address only.
- My "perf bench futex hash" hacks have been added.
- The structure moved from signal's struct to mm.
- It is possible to resize it at runtime.
v1…v2 https://lore.kernel.org/all/20241026224306.982896-1-bigeasy@linutronix.de/:
- Moved to struct signal_struct and is used process wide.
- Automatically allocated once the first thread is created.
Sebastian
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 01/11] futex: Create helper function to initialize a hash slot.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-10 15:13 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 02/11] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
` (9 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Factor out the futex_hash_bucket initialisation into a helper function.
The helper function will be used in a follow up patch.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index ebdd76b4ecbba..d1d3c7b358b23 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1124,6 +1124,13 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+{
+ atomic_set(&fhb->waiters, 0);
+ plist_head_init(&fhb->chain);
+ spin_lock_init(&fhb->lock);
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
@@ -1141,11 +1148,8 @@ static int __init futex_init(void)
futex_hashsize, futex_hashsize);
futex_hashsize = 1UL << futex_shift;
- for (i = 0; i < futex_hashsize; i++) {
- atomic_set(&futex_queues[i].waiters, 0);
- plist_head_init(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
+ for (i = 0; i < futex_hashsize; i++)
+ futex_hash_bucket_init(&futex_queues[i]);
return 0;
}
--
2.45.2
* [PATCH v4 02/11] futex: Add basic infrastructure for local task local hash.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 01/11] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-10 15:23 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 03/11] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
` (8 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The futex hashmap is system wide and shared by random tasks. Each slot
is hashed based on its address and VMA. Due to randomized VMAs (and
memory allocations) the same logical lock (pointer) can end up in a
different hash bucket on each invocation of the application. This in
turn means that different applications may share a hash bucket on the
first invocation but not on the second, and it is not always clear which
applications will be involved. This can result in high latencies when
acquiring the futex_hash_bucket::lock, especially if the lock owner is
limited to a CPU and cannot be effectively PI boosted.
Introduce a task local hash map. The hashmap can be allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 0)
The `0' argument allocates a default number of 16 slots; a higher number
can be specified if desired. The current upper limit is 131072.
The allocated hashmap is used by all threads within a process.
A thread can check whether the private map has been allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which returns the current number of slots.
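As a userspace sketch of the slot-count handling described above,
mirroring the clamping that futex_hash_allocate() applies in this patch
(0 selects the default of 16, the value is clamped to [2, 131072] and
rounded down to a power of two; the function name is hypothetical):

```c
/* Sketch of the slot-count normalisation in futex_hash_allocate():
 * 0 means "default of 16", the result is clamped to [2, 131072] and
 * rounded down to a power of two so it can serve as a hash mask + 1. */
unsigned int futex_normalize_slots(unsigned int hash_slots)
{
	if (hash_slots == 0)
		hash_slots = 16;
	if (hash_slots < 2)
		hash_slots = 2;
	if (hash_slots > 131072)
		hash_slots = 131072;
	/* clear low set bits until only the highest one remains,
	 * i.e. rounddown_pow_of_two() */
	while (hash_slots & (hash_slots - 1))
		hash_slots &= hash_slots - 1;
	return hash_slots;
}
```

So a request for 100 slots, for example, ends up as a 64-slot hash.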
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 22 ++++++++
include/linux/mm_types.h | 3 ++
include/uapi/linux/prctl.h | 5 ++
kernel/fork.c | 2 +
kernel/futex/core.c | 100 +++++++++++++++++++++++++++++++++++--
kernel/sys.c | 4 ++
6 files changed, 133 insertions(+), 3 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85c..61e81b866d34e 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -77,6 +77,16 @@ void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+int futex_hash_allocate_default(void);
+void futex_hash_free(struct mm_struct *mm);
+
+static inline void futex_mm_init(struct mm_struct *mm)
+{
+ mm->futex_hash_bucket = NULL;
+}
+
#else
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -88,6 +98,18 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
{
return -EINVAL;
}
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ return -EINVAL;
+}
+static inline int futex_hash_allocate_default(void)
+{
+ return 0;
+}
+static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline void futex_mm_init(struct mm_struct *mm) { }
+
#endif
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68e..b16b97ab8fb2a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -30,6 +30,7 @@
#define INIT_PASID 0
struct address_space;
+struct futex_hash_bucket;
struct mem_cgroup;
/*
@@ -902,6 +903,8 @@ struct mm_struct {
int mm_lock_seq;
#endif
+ unsigned int futex_hash_mask;
+ struct futex_hash_bucket *futex_hash_bucket;
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 5c6080680cb27..55b843644c51a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -353,4 +353,9 @@ struct prctl_mm_map {
*/
#define PR_LOCK_SHADOW_STACK_STATUS 76
+/* FUTEX hash management */
+#define PR_FUTEX_HASH 77
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1450b461d196a..cda8886f3a1d7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1284,6 +1284,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
+ futex_mm_init(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS)
mm->pmd_huge_pte = NULL;
#endif
@@ -1361,6 +1362,7 @@ static inline void __mmput(struct mm_struct *mm)
if (mm->binfmt)
module_put(mm->binfmt->module);
lru_gen_del_mm(mm);
+ futex_hash_free(mm);
mmdrop(mm);
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d1d3c7b358b23..fbfe1f1e94505 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -39,6 +39,7 @@
#include <linux/memblock.h>
#include <linux/fault-inject.h>
#include <linux/slab.h>
+#include <linux/prctl.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -107,18 +108,40 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
+static inline bool futex_key_is_private(union futex_key *key)
+{
+ /*
+ * Relies on get_futex_key() to set either bit for shared
+ * futexes -- see comment with union futex_key.
+ */
+ return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
+}
+
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
- * corresponding hash bucket in the global hash.
+ * corresponding hash bucket in the global hash. If the futex is private and
+ * a local hash table has been allocated then that one is used.
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
- u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
- key->both.offset);
+ struct futex_hash_bucket *fhb;
+ u32 hash;
+ fhb = current->mm->futex_hash_bucket;
+ if (fhb && futex_key_is_private(key)) {
+ u32 hash_mask = current->mm->futex_hash_mask;
+
+ hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / 4,
+ key->both.offset);
+ return &fhb[hash & hash_mask];
+ }
+ hash = jhash2((u32 *)key,
+ offsetof(typeof(*key), both.offset) / 4,
+ key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
@@ -1131,6 +1154,77 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
spin_lock_init(&fhb->lock);
}
+void futex_hash_free(struct mm_struct *mm)
+{
+ kvfree(mm->futex_hash_bucket);
+}
+
+static int futex_hash_allocate(unsigned int hash_slots)
+{
+ struct futex_hash_bucket *fhb;
+ int i;
+
+ if (current->mm->futex_hash_bucket)
+ return -EALREADY;
+
+ if (!thread_group_leader(current))
+ return -EINVAL;
+
+ if (hash_slots == 0)
+ hash_slots = 16;
+ if (hash_slots < 2)
+ hash_slots = 2;
+ if (hash_slots > 131072)
+ hash_slots = 131072;
+ if (!is_power_of_2(hash_slots))
+ hash_slots = rounddown_pow_of_two(hash_slots);
+
+ fhb = kvmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL_ACCOUNT);
+ if (!fhb)
+ return -ENOMEM;
+
+ current->mm->futex_hash_mask = hash_slots - 1;
+
+ for (i = 0; i < hash_slots; i++)
+ futex_hash_bucket_init(&fhb[i]);
+
+ current->mm->futex_hash_bucket = fhb;
+ return 0;
+}
+
+int futex_hash_allocate_default(void)
+{
+ return futex_hash_allocate(0);
+}
+
+static int futex_hash_get_slots(void)
+{
+ if (current->mm->futex_hash_bucket)
+ return current->mm->futex_hash_mask + 1;
+ return 0;
+}
+
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ int ret;
+
+ switch (arg2) {
+ case PR_FUTEX_HASH_SET_SLOTS:
+ ret = futex_hash_allocate(arg3);
+ break;
+
+ case PR_FUTEX_HASH_GET_SLOTS:
+ ret = futex_hash_get_slots();
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
diff --git a/kernel/sys.c b/kernel/sys.c
index c4c701c6f0b4d..dfa8b1b344edb 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
#include <linux/user_namespace.h>
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
+#include <linux/futex.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2809,6 +2810,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_lock_shadow_stack_status(me, arg2);
break;
+ case PR_FUTEX_HASH:
+ error = futex_hash_prctl(arg2, arg3, arg4, arg5);
+ break;
default:
error = -EINVAL;
break;
--
2.45.2
* [PATCH v4 03/11] futex: Allow automatic allocation of process wide futex hash.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 01/11] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 02/11] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-10 16:07 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 04/11] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
` (7 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Allocate a default futex hash once a task creates its first thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/fork.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index cda8886f3a1d7..6267d600af991 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2130,6 +2130,17 @@ static void rv_task_fork(struct task_struct *p)
#define rv_task_fork(p) do {} while (0)
#endif
+static bool need_futex_hash_allocate_default(u64 clone_flags)
+{
+ if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
+ return false;
+ if (!thread_group_empty(current))
+ return false;
+ if (current->mm->futex_hash_bucket)
+ return false;
+ return true;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2507,6 +2518,21 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cancel_cgroup;
+ /*
+ * Allocate a default futex hash for the user process once the first
+ * thread spawns.
+ */
+ if (need_futex_hash_allocate_default(clone_flags)) {
+ retval = futex_hash_allocate_default();
+ if (retval)
+ goto bad_fork_core_free;
+ /*
+ * If we fail beyond this point we don't free the allocated
+ * futex hash map. We assume that another thread will be created
+ * and will make use of it. The hash map will be freed once the main
+ * thread terminates.
+ */
+ }
/*
* From this point on we must avoid any synchronous user-space
* communication until we take the tasklist-lock. In particular, we do
--
2.45.2
* [PATCH v4 04/11] futex: Hash only the address for private futexes.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2024-12-03 16:42 ` [PATCH v4 03/11] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 05/11] futex: Track the futex hash bucket Sebastian Andrzej Siewior
` (6 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
futex_hash() passes the whole futex_key to jhash2. The first two members
are passed as the first argument and the offset as the "initial value".
For private futexes, the mm part is always the same and the key is used
only within the process. By excluding the mm part from the hash, we
reduce the length passed to jhash2 from 4 words (16 / 4) to 2 (8 / 4).
This avoids the __jhash_mix() part of jhash.
The resulting code is smaller and, based on testing, this variant
performs as well as the original or slightly better.
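The layout argument can be illustrated with a userspace mock of union
futex_key (field sizes assume a 64-bit kernel; the real definition lives
in include/linux/futex.h, and the struct below is a simplification of
it):

```c
/* Mock of the private part of union futex_key on a 64-bit kernel:
 * hashing everything up to the offset covers 16 bytes (4 u32 words),
 * while the address alone is 8 bytes (2 words), which skips the
 * __jhash_mix() round inside jhash2(). */
#include <stddef.h>
#include <stdint.h>

union futex_key_sketch {
	struct {
		uint64_t mm;      /* struct mm_struct *: identical for all private futexes */
		uint64_t address; /* user space address: the distinguishing part */
		uint32_t offset;
	} private_part;
};

/* u32 words jhash2() sees when the whole key up to the offset is hashed */
size_t full_key_u32_words(void)
{
	return offsetof(union futex_key_sketch, private_part.offset) / 4;
}

/* u32 words jhash2() sees when only the address is hashed */
size_t private_key_u32_words(void)
{
	return sizeof(((union futex_key_sketch *)0)->private_part.address) / 4;
}
```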
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index fbfe1f1e94505..14251bbafaffb 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -134,8 +134,8 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
if (fhb && futex_key_is_private(key)) {
u32 hash_mask = current->mm->futex_hash_mask;
- hash = jhash2((u32 *)key,
- offsetof(typeof(*key), both.offset) / 4,
+ hash = jhash2((void *)&key->private.address,
+ sizeof(key->private.address) / 4,
key->both.offset);
return &fhb[hash & hash_mask];
}
--
2.45.2
* [PATCH v4 05/11] futex: Track the futex hash bucket.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (3 preceding siblings ...)
2024-12-03 16:42 ` [PATCH v4 04/11] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-10 18:45 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 06/11] futex: Allow to re-allocate the private " Sebastian Andrzej Siewior
` (5 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Add futex_hash_get/put() to keep the assigned hash bucket around while a
futex operation is performed. Provide an RCU lifetime guarantee for
futex_hash_bucket_private.
This should have the right number of gets/puts so that the private hash
bucket is released on exit. This is preparatory work to allow changing
the hash bucket at runtime.
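The lifetime rule can be sketched in userspace with plain C11 atomics
standing in for rcuref_t and a flag standing in for kvfree_rcu() (names
hypothetical, heavily simplified, no RCU):

```c
/* Userspace analogue of the scheme in this patch: futex_hash() takes a
 * reference before handing out a bucket, futex_hash_put() drops it, and
 * the backing memory is released only when the last reference is gone. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct priv_hash {
	atomic_int users;
	bool freed; /* stands in for kvfree_rcu() in this sketch */
};

struct priv_hash *priv_hash_alloc(void)
{
	struct priv_hash *p = calloc(1, sizeof(*p));

	atomic_store(&p->users, 1); /* reference owned by the mm */
	return p;
}

/* Like rcuref_get(): fails once the count already dropped to zero. */
bool priv_hash_get(struct priv_hash *p)
{
	int old = atomic_load(&p->users);

	while (old > 0)
		if (atomic_compare_exchange_weak(&p->users, &old, old + 1))
			return true;
	return false;
}

void priv_hash_put(struct priv_hash *p)
{
	if (atomic_fetch_sub(&p->users, 1) == 1)
		p->freed = true; /* last user: release the memory */
}
```

A futex operation maps to get() on entry and put() on exit; the mm's own
reference is dropped in futex_hash_free() on __mmput().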
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 2 +-
include/linux/mm_types.h | 5 +-
kernel/futex/core.c | 104 +++++++++++++++++++++++++++++++++------
kernel/futex/futex.h | 8 +++
kernel/futex/pi.c | 7 +++
kernel/futex/requeue.c | 16 ++++++
kernel/futex/waitwake.c | 15 +++++-
7 files changed, 136 insertions(+), 21 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 61e81b866d34e..359fc24eb37ff 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -84,7 +84,7 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
- mm->futex_hash_bucket = NULL;
+ rcu_assign_pointer(mm->futex_hash_bucket, NULL);
}
#else
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b16b97ab8fb2a..4f39928631042 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -30,7 +30,7 @@
#define INIT_PASID 0
struct address_space;
-struct futex_hash_bucket;
+struct futex_hash_bucket_private;
struct mem_cgroup;
/*
@@ -903,8 +903,7 @@ struct mm_struct {
int mm_lock_seq;
#endif
- unsigned int futex_hash_mask;
- struct futex_hash_bucket *futex_hash_bucket;
+ struct futex_hash_bucket_private __rcu *futex_hash_bucket;
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 14251bbafaffb..464918d85395e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
+#include <linux/rcuref.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -56,6 +57,12 @@ static struct {
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
+struct futex_hash_bucket_private {
+ rcuref_t users;
+ unsigned int hash_mask;
+ struct rcu_head rcu;
+ struct futex_hash_bucket queues[];
+};
/*
* Fault injections for futexes.
@@ -127,17 +134,24 @@ static inline bool futex_key_is_private(union futex_key *key)
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
- struct futex_hash_bucket *fhb;
+ struct futex_hash_bucket_private *hb_p = NULL;
u32 hash;
- fhb = current->mm->futex_hash_bucket;
- if (fhb && futex_key_is_private(key)) {
- u32 hash_mask = current->mm->futex_hash_mask;
+ if (futex_key_is_private(key)) {
+ guard(rcu)();
+
+ do {
+ hb_p = rcu_dereference(current->mm->futex_hash_bucket);
+ } while (hb_p && !rcuref_get(&hb_p->users));
+ }
+
+ if (hb_p) {
+ u32 hash_mask = hb_p->hash_mask;
hash = jhash2((void *)&key->private.address,
sizeof(key->private.address) / 4,
key->both.offset);
- return &fhb[hash & hash_mask];
+ return &hb_p->queues[hash & hash_mask];
}
hash = jhash2((u32 *)key,
offsetof(typeof(*key), both.offset) / 4,
@@ -145,6 +159,35 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
return &futex_queues[hash & (futex_hashsize - 1)];
}
+static void futex_hash_priv_put(struct futex_hash_bucket_private *hb_p)
+{
+ if (rcuref_put(&hb_p->users))
+ kvfree_rcu(hb_p, rcu);
+}
+
+void futex_hash_put(struct futex_hash_bucket *hb)
+{
+ struct futex_hash_bucket_private *hb_p;
+
+ if (hb->hb_slot == 0)
+ return;
+ hb_p = container_of(hb, struct futex_hash_bucket_private,
+ queues[hb->hb_slot - 1]);
+ futex_hash_priv_put(hb_p);
+}
+
+void futex_hash_get(struct futex_hash_bucket *hb)
+{
+ struct futex_hash_bucket_private *hb_p;
+
+ if (hb->hb_slot == 0)
+ return;
+
+ hb_p = container_of(hb, struct futex_hash_bucket_private,
+ queues[hb->hb_slot - 1]);
+ /* The ref needs to be owned by the caller so this can't fail */
+ WARN_ON_ONCE(!rcuref_get(&hb_p->users));
+}
/**
* futex_setup_timer - set up the sleeping hrtimer.
@@ -599,7 +642,10 @@ int futex_unqueue(struct futex_q *q)
*/
lock_ptr = READ_ONCE(q->lock_ptr);
if (lock_ptr != NULL) {
+ struct futex_hash_bucket *hb;
+
spin_lock(lock_ptr);
+ hb = futex_hb_from_futex_q(q);
/*
* q->lock_ptr can change between reading it and
* spin_lock(), causing us to take the wrong lock. This
@@ -622,6 +668,7 @@ int futex_unqueue(struct futex_q *q)
BUG_ON(q->pi_state);
spin_unlock(lock_ptr);
+ futex_hash_put(hb);
ret = 1;
}
@@ -999,6 +1046,7 @@ static void exit_pi_state_list(struct task_struct *curr)
if (!refcount_inc_not_zero(&pi_state->refcount)) {
raw_spin_unlock_irq(&curr->pi_lock);
cpu_relax();
+ futex_hash_put(hb);
raw_spin_lock_irq(&curr->pi_lock);
continue;
}
@@ -1015,6 +1063,7 @@ static void exit_pi_state_list(struct task_struct *curr)
/* retain curr->pi_lock for the loop invariant */
raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
+ futex_hash_put(hb);
put_pi_state(pi_state);
continue;
}
@@ -1027,6 +1076,7 @@ static void exit_pi_state_list(struct task_struct *curr)
raw_spin_unlock(&curr->pi_lock);
raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
+ futex_hash_put(hb);
rt_mutex_futex_unlock(&pi_state->pi_mutex);
put_pi_state(pi_state);
@@ -1147,8 +1197,9 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
-static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb, unsigned int slot)
{
+ fhb->hb_slot = slot;
atomic_set(&fhb->waiters, 0);
plist_head_init(&fhb->chain);
spin_lock_init(&fhb->lock);
@@ -1156,12 +1207,20 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
void futex_hash_free(struct mm_struct *mm)
{
- kvfree(mm->futex_hash_bucket);
+ struct futex_hash_bucket_private *hb_p;
+
+ /* own a reference */
+ hb_p = rcu_dereference_check(mm->futex_hash_bucket, true);
+ if (!hb_p)
+ return;
+ WARN_ON(rcuref_read(&hb_p->users) != 1);
+ futex_hash_priv_put(hb_p);
}
static int futex_hash_allocate(unsigned int hash_slots)
{
- struct futex_hash_bucket *fhb;
+ struct futex_hash_bucket_private *hb_p;
+ size_t alloc_size;
int i;
if (current->mm->futex_hash_bucket)
@@ -1179,16 +1238,25 @@ static int futex_hash_allocate(unsigned int hash_slots)
if (!is_power_of_2(hash_slots))
hash_slots = rounddown_pow_of_two(hash_slots);
- fhb = kvmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL_ACCOUNT);
- if (!fhb)
+ if (unlikely(check_mul_overflow(hash_slots, sizeof(struct futex_hash_bucket),
+ &alloc_size)))
return -ENOMEM;
- current->mm->futex_hash_mask = hash_slots - 1;
+ if (unlikely(check_add_overflow(alloc_size, sizeof(struct futex_hash_bucket_private),
+ &alloc_size)))
+ return -ENOMEM;
+
+ hb_p = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
+ if (!hb_p)
+ return -ENOMEM;
+
+ rcuref_init(&hb_p->users, 1);
+ hb_p->hash_mask = hash_slots - 1;
for (i = 0; i < hash_slots; i++)
- futex_hash_bucket_init(&fhb[i]);
+ futex_hash_bucket_init(&hb_p->queues[i], i + 1);
- current->mm->futex_hash_bucket = fhb;
+ rcu_assign_pointer(current->mm->futex_hash_bucket, hb_p);
return 0;
}
@@ -1199,8 +1267,12 @@ int futex_hash_allocate_default(void)
static int futex_hash_get_slots(void)
{
- if (current->mm->futex_hash_bucket)
- return current->mm->futex_hash_mask + 1;
+ struct futex_hash_bucket_private *hb_p;
+
+ guard(rcu)();
+ hb_p = rcu_dereference(current->mm->futex_hash_bucket);
+ if (hb_p)
+ return hb_p->hash_mask + 1;
return 0;
}
@@ -1243,7 +1315,7 @@ static int __init futex_init(void)
futex_hashsize = 1UL << futex_shift;
for (i = 0; i < futex_hashsize; i++)
- futex_hash_bucket_init(&futex_queues[i]);
+ futex_hash_bucket_init(&futex_queues[i], 0);
return 0;
}
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 618ce1fe870e9..ceea260ad9e80 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -115,6 +115,7 @@ static inline bool should_fail_futex(bool fshared)
*/
struct futex_hash_bucket {
atomic_t waiters;
+ unsigned int hb_slot;
spinlock_t lock;
struct plist_head chain;
} ____cacheline_aligned_in_smp;
@@ -202,6 +203,13 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
int flags, u64 range_ns);
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
+extern void futex_hash_put(struct futex_hash_bucket *hb);
+extern void futex_hash_get(struct futex_hash_bucket *hb);
+
+static inline struct futex_hash_bucket *futex_hb_from_futex_q(struct futex_q *q)
+{
+ return container_of(q->lock_ptr, struct futex_hash_bucket, lock);
+}
/**
* futex_match - Check whether two futex keys are equal
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index d62cca5ed8f4c..60a62ab250b08 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -964,6 +964,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
* - EAGAIN: The user space value changed.
*/
futex_q_unlock(hb);
+ futex_hash_put(hb);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -1083,10 +1084,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
futex_unqueue_pi(&q);
spin_unlock(q.lock_ptr);
+ futex_hash_put(hb);
goto out;
out_unlock_put_key:
futex_q_unlock(hb);
+ futex_hash_put(hb);
out:
if (to) {
@@ -1097,6 +1100,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
uaddr_faulted:
futex_q_unlock(hb);
+ futex_hash_put(hb);
ret = fault_in_user_writeable(uaddr);
if (ret)
@@ -1197,6 +1201,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
get_pi_state(pi_state);
spin_unlock(&hb->lock);
+ futex_hash_put(hb);
/* drops pi_state->pi_mutex.wait_lock */
ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
@@ -1236,6 +1241,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
if ((ret = futex_cmpxchg_value_locked(&curval, uaddr, uval, 0))) {
spin_unlock(&hb->lock);
+ futex_hash_put(hb);
switch (ret) {
case -EFAULT:
goto pi_faulted;
@@ -1256,6 +1262,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
out_unlock:
spin_unlock(&hb->lock);
+ futex_hash_put(hb);
return ret;
pi_retry:
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b47bb764b3520..39e96f1bef8ce 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -87,6 +87,8 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
futex_hb_waiters_inc(hb2);
plist_add(&q->list, &hb2->chain);
q->lock_ptr = &hb2->lock;
+ futex_hash_put(hb1);
+ futex_hash_get(hb2);
}
q->key = *key2;
}
@@ -231,8 +233,10 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
WARN_ON(!q->rt_waiter);
q->rt_waiter = NULL;
+ futex_hash_put(futex_hb_from_futex_q(q));
q->lock_ptr = &hb->lock;
+ futex_hash_get(hb);
/* Signal locked state to the waiter */
futex_requeue_pi_complete(q, 1);
@@ -458,6 +462,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
if (unlikely(ret)) {
double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
ret = get_user(curval, uaddr1);
if (ret)
@@ -544,6 +550,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
case -EFAULT:
double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -558,6 +566,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
*/
double_unlock_hb(hb1, hb2);
futex_hb_waiters_dec(hb2);
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -677,6 +687,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
double_unlock_hb(hb1, hb2);
wake_up_q(&wake_q);
futex_hb_waiters_dec(hb2);
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
return ret ? ret : task_count;
}
@@ -815,6 +827,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
*/
if (futex_match(&q.key, &key2)) {
futex_q_unlock(hb);
+ futex_hash_put(hb);
ret = -EINVAL;
goto out;
}
@@ -828,6 +841,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, to);
spin_unlock(&hb->lock);
+ futex_hash_put(hb);
break;
case Q_REQUEUE_PI_LOCKED:
@@ -847,6 +861,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
*/
ret = ret < 0 ? ret : 0;
}
+ futex_hash_put(futex_hb_from_futex_q(&q));
break;
case Q_REQUEUE_PI_DONE:
@@ -876,6 +891,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
futex_unqueue_pi(&q);
spin_unlock(q.lock_ptr);
+ futex_hash_put(futex_hb_from_futex_q(&q));
if (ret == -EINTR) {
/*
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 3a10375d95218..1f2d11eb7f89f 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -113,6 +113,8 @@ bool __futex_wake_mark(struct futex_q *q)
return false;
__futex_unqueue(q);
+ /* Waiters reference */
+ futex_hash_put(futex_hb_from_futex_q(q));
/*
* The waiting task can free the futex_q as soon as q->lock_ptr = NULL
* is written, without taking any locks. This is possible in the event
@@ -173,8 +175,10 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
hb = futex_hash(&key);
/* Make sure we really have tasks to wakeup */
- if (!futex_hb_waiters_pending(hb))
+ if (!futex_hb_waiters_pending(hb)) {
+ futex_hash_put(hb);
return ret;
+ }
spin_lock(&hb->lock);
@@ -196,6 +200,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
}
spin_unlock(&hb->lock);
+ futex_hash_put(hb);
wake_up_q(&wake_q);
return ret;
}
@@ -275,6 +280,8 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
op_ret = futex_atomic_op_inuser(op, uaddr2);
if (unlikely(op_ret < 0)) {
double_unlock_hb(hb1, hb2);
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
if (!IS_ENABLED(CONFIG_MMU) ||
unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
@@ -329,6 +336,8 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
out_unlock:
double_unlock_hb(hb1, hb2);
wake_up_q(&wake_q);
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
return ret;
}
@@ -466,6 +475,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
}
futex_q_unlock(hb);
+ futex_hash_put(hb);
+
__set_current_state(TASK_RUNNING);
/*
@@ -625,6 +636,7 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
if (ret) {
futex_q_unlock(*hb);
+ futex_hash_put(*hb);
ret = get_user(uval, uaddr);
if (ret)
@@ -638,6 +650,7 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
if (uval != val) {
futex_q_unlock(*hb);
+ futex_hash_put(*hb);
ret = -EWOULDBLOCK;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v4 06/11] futex: Allow to re-allocate the private hash bucket.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (4 preceding siblings ...)
2024-12-03 16:42 ` [PATCH v4 05/11] futex: Track the futex hash bucket Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-10 22:27 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 07/11] futex: Allow to make the number of slots invariant Sebastian Andrzej Siewior
` (4 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
The mm_struct::futex_hash_lock guards the futex_hash_bucket
assignment/replacement. The futex_hash_allocate()/PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime to resize the internal private
futex_hash_bucket.
The idea is to use the recently introduced ref counting to keep a
currently used HB around. On resize/replacement a new HB (array) is
assigned to the process. All users on the old HB will receive a wakeup
so they can dequeue themselves from the old hb and enqueue on the new one.
In the WAIT case, after a wakeup the waiter needs to check whether a
successful wakeup occurred. If it did not and the HB changed, the waiter
simply dequeues itself, enqueues on the new HB and waits again.
In the WAKE case, the waker needs to iterate over all waiters. If the HB
changed, the existing waiters cannot disappear, but new waiters will use
the new HB and would therefore be missed. Hence the operation is retried
if the HB changed, and it may wake more tasks.
The same logic applies to REQUEUE.
LOCK_PI, UNLOCK_PI and the REQUEUE_PI variants are slightly more
complicated due to the internal kernel state. If the HB changes then we
have the old PI state created by the first waiter and possibly a new PI
state created by a waiter on the new HB.
On LOCK_PI, if the HB changed, the waiter needs to abandon its PI state:
it may have acquired the lock on that PI state, but everyone else might
use the "new" PI state. The old PI state won't be used anymore because
every waiter will requeue. UADDR needs to be checked to determine whether
the lock was passed on by UNLOCK_PI prior to the HB change, or whether we
were merely woken up due to the HB change. If we own the lock according
to UADDR, we own it; otherwise we retry.
UNLOCK_PI takes the first waiter and passes the lock on. If there is no
waiter, it updates UADDR to 0. Before the update succeeds the HB can
change and a waiter can set up a new PI state on behalf of the UNLOCK_PI
caller and wait on it. To complicate matters further, userland can
acquire the lock at this point, because new waiters no longer block on
the hb lock. To avoid this race, futex_hash_lock is acquired around the
update to 0 to ensure the HB can't change and all waiters will block.
The same logic applies to REQUEUE_PI. WAIT_REQUEUE_PI tries to recover
from a HB change in a similar way to LOCK_PI. If the requeue occurred and
it is already waiting on UADDR2, then as the last step it simply invokes
futex_lock_pi().
CMP_REQUEUE_PI follows the UNLOCK_PI logic and acquires futex_hash_lock
for the whole operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 1 +
include/linux/mm_types.h | 1 +
kernel/futex/core.c | 65 ++++++++++++++++---
kernel/futex/futex.h | 3 +
kernel/futex/pi.c | 110 +++++++++++++++++++++++++++++++-
kernel/futex/requeue.c | 74 ++++++++++++++++++++-
kernel/futex/waitwake.c | 42 ++++++++++--
kernel/locking/rtmutex.c | 26 ++++++++
kernel/locking/rtmutex_common.h | 5 +-
9 files changed, 308 insertions(+), 19 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 359fc24eb37ff..ce9e284bbeb09 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -85,6 +85,7 @@ void futex_hash_free(struct mm_struct *mm);
static inline void futex_mm_init(struct mm_struct *mm)
{
rcu_assign_pointer(mm->futex_hash_bucket, NULL);
+ init_rwsem(&mm->futex_hash_lock);
}
#else
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f39928631042..07f1567f2b51f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -903,6 +903,7 @@ struct mm_struct {
int mm_lock_seq;
#endif
+ struct rw_semaphore futex_hash_lock;
struct futex_hash_bucket_private __rcu *futex_hash_bucket;
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 464918d85395e..0dd7100e36419 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -573,6 +573,7 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
{
struct futex_hash_bucket *hb;
+try_again:
hb = futex_hash(&q->key);
/*
@@ -588,7 +589,13 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
- return hb;
+ if (futex_check_hb_valid(hb))
+ return hb;
+
+ futex_hb_waiters_dec(hb);
+ spin_unlock(&hb->lock);
+ futex_hash_put(hb);
+ goto try_again;
}
void futex_q_unlock(struct futex_hash_bucket *hb)
@@ -1217,18 +1224,50 @@ void futex_hash_free(struct mm_struct *mm)
futex_hash_priv_put(hb_p);
}
+static void futex_put_old_hb_p(struct futex_hash_bucket_private *hb_p)
+{
+ unsigned int slots = hb_p->hash_mask + 1;
+ struct futex_hash_bucket *hb;
+ DEFINE_WAKE_Q(wake_q);
+ unsigned int i;
+
+ for (i = 0; i < slots; i++) {
+ struct futex_q *this;
+
+ hb = &hb_p->queues[i];
+
+ spin_lock(&hb->lock);
+ plist_for_each_entry(this, &hb->chain, list)
+ wake_q_add(&wake_q, this->task);
+ spin_unlock(&hb->lock);
+ }
+ futex_hash_priv_put(hb_p);
+
+ wake_up_q(&wake_q);
+}
+
+bool futex_check_hb_valid(struct futex_hash_bucket *hb)
+{
+ struct futex_hash_bucket_private *hb_p_now;
+ struct futex_hash_bucket_private *hb_p;
+
+ if (hb->hb_slot == 0)
+ return true;
+ guard(rcu)();
+ hb_p_now = rcu_dereference(current->mm->futex_hash_bucket);
+ hb_p = container_of(hb, struct futex_hash_bucket_private,
+ queues[hb->hb_slot - 1]);
+
+ return hb_p_now == hb_p;
+}
+
static int futex_hash_allocate(unsigned int hash_slots)
{
- struct futex_hash_bucket_private *hb_p;
+ struct futex_hash_bucket_private *hb_p, *hb_p_old = NULL;
+ struct mm_struct *mm;
size_t alloc_size;
int i;
- if (current->mm->futex_hash_bucket)
- return -EALREADY;
-
- if (!thread_group_leader(current))
- return -EINVAL;
-
if (hash_slots == 0)
hash_slots = 16;
if (hash_slots < 2)
@@ -1256,7 +1295,15 @@ static int futex_hash_allocate(unsigned int hash_slots)
for (i = 0; i < hash_slots; i++)
futex_hash_bucket_init(&hb_p->queues[i], i + 1);
- rcu_assign_pointer(current->mm->futex_hash_bucket, hb_p);
+ mm = current->mm;
+ scoped_guard(rwsem_write, &mm->futex_hash_lock) {
+ hb_p_old = rcu_dereference_check(mm->futex_hash_bucket,
+ lockdep_is_held(&mm->futex_hash_lock));
+ rcu_assign_pointer(mm->futex_hash_bucket, hb_p);
+ }
+ if (hb_p_old)
+ futex_put_old_hb_p(hb_p_old);
+
return 0;
}
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index ceea260ad9e80..503f56643a966 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -205,6 +205,9 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout,
extern struct futex_hash_bucket *futex_hash(union futex_key *key);
extern void futex_hash_put(struct futex_hash_bucket *hb);
extern void futex_hash_get(struct futex_hash_bucket *hb);
+extern bool futex_check_hb_valid(struct futex_hash_bucket *hb);
+extern bool check_pi_lock_owner(u32 __user *uaddr);
+extern void reset_pi_state_owner(struct futex_pi_state *pi_state);
static inline struct futex_hash_bucket *futex_hb_from_futex_q(struct futex_q *q)
{
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 60a62ab250b08..b4156d1cc6608 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -43,8 +43,8 @@ static struct futex_pi_state *alloc_pi_state(void)
return pi_state;
}
-static void pi_state_update_owner(struct futex_pi_state *pi_state,
- struct task_struct *new_owner)
+void pi_state_update_owner(struct futex_pi_state *pi_state,
+ struct task_struct *new_owner)
{
struct task_struct *old_owner = pi_state->owner;
@@ -854,6 +854,47 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
return ret;
}
+bool check_pi_lock_owner(u32 __user *uaddr)
+{
+ u32 our_tid, uval;
+ int ret;
+
+ our_tid = task_pid_vnr(current);
+ do {
+ ret = futex_get_value_locked(&uval, uaddr);
+ switch (ret) {
+ case 0:
+ if ((uval & FUTEX_TID_MASK) == our_tid)
+ return true;
+ return false;
+ break;
+
+ case -EFAULT:
+ ret = fault_in_user_writeable(uaddr);
+ if (ret < 0)
+ return false;
+ break;
+
+ case -EAGAIN:
+ cond_resched();
+ break;
+
+ default:
+ WARN_ON_ONCE(1);
+ return false;
+ }
+
+ } while (1);
+}
+
+void reset_pi_state_owner(struct futex_pi_state *pi_state)
+{
+ raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+ pi_state_update_owner(pi_state, NULL);
+ pi_state->owner = NULL;
+ raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+}
+
/**
* fixup_pi_owner() - Post lock pi_state and corner case management
* @uaddr: user address of the futex
@@ -999,6 +1040,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
rt_mutex_pre_schedule();
rt_mutex_init_waiter(&rt_waiter);
+ rt_waiter.hb = hb;
/*
* On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not
@@ -1070,6 +1112,37 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
*/
rt_mutex_post_schedule();
no_block:
+ if (!futex_check_hb_valid(hb)) {
+ bool uaddr_owner;
+ /*
+ * We might have got the lock, or we might not. We own outdated
+ * internal state because the HB changed under us, so it might
+ * have been all for nothing.
+ * We need to reset the pi_state and its owner because it points
+ * to the current owner of the lock, which is not what a new
+ * lock/unlock caller will see, so it needs a clean up. If we own
+ * the lock according to uaddr then it has been passed to us by an
+ * unlock and we got it before the HB changed. Lucky us, we keep
+ * it. If we were able to steal it or did not get it in the
+ * first place then we need to try again with the HB in place.
+ */
+ reset_pi_state_owner(q.pi_state);
+ futex_unqueue_pi(&q);
+ spin_unlock(q.lock_ptr);
+ futex_hash_put(hb);
+
+ uaddr_owner = check_pi_lock_owner(uaddr);
+ if (uaddr_owner) {
+ ret = 0;
+ goto out;
+ }
+
+ if (refill_pi_state_cache()) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ goto retry_private;
+ }
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
@@ -1121,6 +1194,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
u32 curval, uval, vpid = task_pid_vnr(current);
union futex_key key = FUTEX_KEY_INIT;
+ struct rw_semaphore *futex_hash_lock = NULL;
struct futex_hash_bucket *hb;
struct futex_q *top_waiter;
int ret;
@@ -1128,6 +1202,8 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
if (!IS_ENABLED(CONFIG_FUTEX_PI))
return -ENOSYS;
+ if (!(flags & FLAGS_SHARED))
+ futex_hash_lock = &current->mm->futex_hash_lock;
retry:
if (get_user(uval, uaddr))
return -EFAULT;
@@ -1232,6 +1308,32 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
return ret;
}
+ /*
+ * If the hb changed before the following cmpxchg finished then the
+ * situation gets complicated as we don't own the lock anymore but
+ * there could be an internal state recorded under our name by the
+ * waiter under a different hb->lock. Also the PI-lock could be snuck in
+ * userland so there is no guarantee that we get it back.
+ * To avoid the mess due to this tiny race, ensure that the HB can not
+ * be resized while the PI lock with no owner is unlocked.
+ */
+ if (futex_hash_lock) {
+ spin_unlock(&hb->lock);
+ down_read(futex_hash_lock);
+ spin_lock(&hb->lock);
+
+ if (!futex_check_hb_valid(hb)) {
+ spin_unlock(&hb->lock);
+ up_read(futex_hash_lock);
+ futex_hash_put(hb);
+ goto retry;
+ }
+ if (futex_top_waiter(hb, &key)) {
+ up_read(futex_hash_lock);
+ goto retry_hb;
+ }
+ }
+
/*
* We have no kernel internal state, i.e. no waiters in the
* kernel. Waiters which are about to queue themselves are stuck
@@ -1241,6 +1343,8 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
if ((ret = futex_cmpxchg_value_locked(&curval, uaddr, uval, 0))) {
spin_unlock(&hb->lock);
+ if (futex_hash_lock)
+ up_read(futex_hash_lock);
futex_hash_put(hb);
switch (ret) {
case -EFAULT:
@@ -1254,6 +1358,8 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
return ret;
}
}
+ if (futex_hash_lock)
+ up_read(futex_hash_lock);
/*
* If uval has changed, let user space handle it.
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 39e96f1bef8ce..6b3c4413fbf47 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -378,6 +378,7 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
DEFINE_WAKE_Q(wake_q);
+ struct rw_semaphore *futex_hash_lock = NULL;
if (nr_wake < 0 || nr_requeue < 0)
return -EINVAL;
@@ -429,6 +430,9 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
*/
if (refill_pi_state_cache())
return -ENOMEM;
+
+ if (!(flags1 & FLAGS_SHARED) || !(flags2 & FLAGS_SHARED))
+ futex_hash_lock = &current->mm->futex_hash_lock;
}
retry:
@@ -447,10 +451,12 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
if (requeue_pi && futex_match(&key1, &key2))
return -EINVAL;
+retry_private:
+ if (futex_hash_lock)
+ down_read(futex_hash_lock);
hb1 = futex_hash(&key1);
hb2 = futex_hash(&key2);
-retry_private:
futex_hb_waiters_inc(hb2);
double_lock_hb(hb1, hb2);
@@ -465,6 +471,9 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
futex_hash_put(hb1);
futex_hash_put(hb2);
+ if (futex_hash_lock)
+ up_read(futex_hash_lock);
+
ret = get_user(curval, uaddr1);
if (ret)
return ret;
@@ -552,6 +561,9 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
futex_hb_waiters_dec(hb2);
futex_hash_put(hb1);
futex_hash_put(hb2);
+ if (futex_hash_lock)
+ up_read(futex_hash_lock);
+
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -568,6 +580,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
futex_hb_waiters_dec(hb2);
futex_hash_put(hb1);
futex_hash_put(hb2);
+ if (futex_hash_lock)
+ up_read(futex_hash_lock);
/*
* Handle the case where the owner is in the middle of
* exiting. Wait for the exit to complete otherwise
@@ -687,6 +701,23 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
double_unlock_hb(hb1, hb2);
wake_up_q(&wake_q);
futex_hb_waiters_dec(hb2);
+
+ /*
+ * If there was no error in the process so far and we woke less than we
+ * could have and hb changed then we try again in case we missed
+ * someone.
+ */
+ if (ret >= 0 &&
+ !(task_count - nr_wake >= nr_requeue) &&
+ (!futex_check_hb_valid(hb1) || !futex_check_hb_valid(hb2))) {
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
+ wake_q_init(&wake_q);
+ goto retry_private;
+ }
+ if (futex_hash_lock)
+ up_read(futex_hash_lock);
+
futex_hash_put(hb1);
futex_hash_put(hb2);
return ret ? ret : task_count;
@@ -783,8 +814,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
struct rt_mutex_waiter rt_waiter;
struct futex_hash_bucket *hb;
union futex_key key2 = FUTEX_KEY_INIT;
- struct futex_q q = futex_q_init;
struct rt_mutex_base *pi_mutex;
+ struct futex_q q;
int res, ret;
if (!IS_ENABLED(CONFIG_FUTEX_PI))
@@ -799,6 +830,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
+hb_changed_again:
+ q = futex_q_init;
/*
* The waiter is allocated on our stack, manipulated by the requeue
* code while we sleep on uaddr.
@@ -841,6 +874,12 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, to);
spin_unlock(&hb->lock);
+
+ if (ret == -EWOULDBLOCK && !futex_check_hb_valid(hb)) {
+ futex_hash_put(hb);
+ goto hb_changed_again;
+ }
+
futex_hash_put(hb);
break;
@@ -865,6 +904,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
break;
case Q_REQUEUE_PI_DONE:
+ rt_waiter.hb = futex_hb_from_futex_q(&q);
+
/* Requeue completed. Current is 'pi_blocked_on' the rtmutex */
pi_mutex = &q.pi_state->pi_mutex;
ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
@@ -876,6 +917,35 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
ret = 0;
spin_lock(q.lock_ptr);
+ if (!futex_check_hb_valid(rt_waiter.hb)) {
+ bool uaddr_owner;
+
+ debug_rt_mutex_free_waiter(&rt_waiter);
+ /*
+ * The HB changed under us after we were requeued on
+ * uaddr2. We may have acquired the lock on the pi_state
+ * but that is not the state seen on the current HB.
+ * However, there could also be an UNLOCK_PI event
+ * before and we own the lock based on uaddr2.
+ * Unlock so the next waiter can do the same and
+ * acquire the PI lock on uaddr2.
+ */
+ reset_pi_state_owner(q.pi_state);
+
+ futex_unqueue_pi(&q);
+ spin_unlock(q.lock_ptr);
+ futex_hash_put(futex_hb_from_futex_q(&q));
+
+ if (to) {
+ hrtimer_cancel(&to->timer);
+ destroy_hrtimer_on_stack(&to->timer);
+ }
+ uaddr_owner = check_pi_lock_owner(uaddr2);
+ if (uaddr_owner)
+ return 0;
+
+ return futex_lock_pi(uaddr2, flags, abs_time, 0);
+ }
debug_rt_mutex_free_waiter(&rt_waiter);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 1f2d11eb7f89f..0179b61877529 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -180,6 +180,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
return ret;
}
+again_hb_change:
spin_lock(&hb->lock);
plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -200,6 +201,16 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
}
spin_unlock(&hb->lock);
+ /*
+ * If there was no error, we woke less than we could have and the hb
+ * changed then we try again.
+ */
+ if (ret > 0 && ret < nr_wake && !futex_check_hb_valid(hb)) {
+ futex_hash_put(hb);
+ hb = futex_hash(&key);
+ if (futex_hb_waiters_pending(hb))
+ goto again_hb_change;
+ }
futex_hash_put(hb);
wake_up_q(&wake_q);
return ret;
@@ -261,7 +272,7 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
- int ret, op_ret;
+ int ret, op_ret, op_woke;
DEFINE_WAKE_Q(wake_q);
retry:
@@ -272,11 +283,19 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
if (unlikely(ret != 0))
return ret;
+retry_hash:
hb1 = futex_hash(&key1);
hb2 = futex_hash(&key2);
retry_private:
double_lock_hb(hb1, hb2);
+ if (!futex_check_hb_valid(hb1) || !futex_check_hb_valid(hb2)) {
+ double_unlock_hb(hb1, hb2);
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
+ goto retry_hash;
+ }
+
op_ret = futex_atomic_op_inuser(op, uaddr2);
if (unlikely(op_ret < 0)) {
double_unlock_hb(hb1, hb2);
@@ -305,6 +324,8 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
goto retry;
}
+ op_woke = 0;
+retry_wake:
plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (futex_match (&this->key, &key1)) {
if (this->pi_state || this->rt_waiter) {
@@ -318,7 +339,6 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
}
if (op_ret > 0) {
- op_ret = 0;
plist_for_each_entry_safe(this, next, &hb2->chain, list) {
if (futex_match (&this->key, &key2)) {
if (this->pi_state || this->rt_waiter) {
@@ -326,19 +346,31 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
goto out_unlock;
}
this->wake(&wake_q, this);
- if (++op_ret >= nr_wake2)
+ if (++op_woke >= nr_wake2)
break;
}
}
- ret += op_ret;
}
out_unlock:
double_unlock_hb(hb1, hb2);
+ if (ret >= 0 &&
+ (!(ret >= nr_wake) || !(op_woke >= nr_wake2)) &&
+ (!futex_check_hb_valid(hb1) || !futex_check_hb_valid(hb2))) {
+
+ futex_hash_put(hb1);
+ futex_hash_put(hb2);
+ hb1 = futex_hash(&key1);
+ hb2 = futex_hash(&key2);
+ double_lock_hb(hb1, hb2);
+ goto retry_wake;
+ }
+
wake_up_q(&wake_q);
+
futex_hash_put(hb1);
futex_hash_put(hb2);
- return ret;
+ return ret < 0 ? ret : ret + op_woke;
}
static long futex_wait_restart(struct restart_block *restart);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index ac1365afcc4a5..ce1cf32dc7ed0 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -58,10 +58,29 @@ static inline int __ww_mutex_check_kill(struct rt_mutex *lock,
return 0;
}
+extern bool futex_check_hb_valid(struct futex_hash_bucket *hb);
+
+static inline bool __internal_retry_reason(struct rt_mutex_waiter *waiter)
+{
+ if (!IS_ENABLED(CONFIG_FUTEX))
+ return false;
+
+ if (!waiter->hb)
+ return false;
+ if (futex_check_hb_valid(waiter->hb))
+ return false;
+ return true;
+}
+
#else
# define build_ww_mutex() (true)
# define ww_container_of(rtm) container_of(rtm, struct ww_mutex, base)
# include "ww_mutex.h"
+
+static inline bool __internal_retry_reason(struct rt_mutex_waiter *waiter)
+{
+ return false;
+}
#endif
/*
@@ -1633,6 +1652,13 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
break;
}
+ if (!build_ww_mutex()) {
+ if (__internal_retry_reason(waiter)) {
+ ret = -EAGAIN;
+ break;
+ }
+ }
+
if (waiter == rt_mutex_top_waiter(lock))
owner = rt_mutex_owner(lock);
else
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index c38a2d2d4a7ee..3bd0925a73a6a 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -56,6 +56,7 @@ struct rt_mutex_waiter {
struct rt_mutex_base *lock;
unsigned int wake_state;
struct ww_acquire_ctx *ww_ctx;
+ struct futex_hash_bucket *hb;
};
/**
@@ -100,7 +101,8 @@ extern int __rt_mutex_futex_trylock(struct rt_mutex_base *l);
extern void rt_mutex_futex_unlock(struct rt_mutex_base *lock);
extern bool __rt_mutex_futex_unlock(struct rt_mutex_base *lock,
struct rt_wake_q_head *wqh);
-
+extern void pi_state_update_owner(struct futex_pi_state *pi_state,
+ struct task_struct *new_owner);
extern void rt_mutex_postunlock(struct rt_wake_q_head *wqh);
/*
@@ -216,6 +218,7 @@ static inline void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
RB_CLEAR_NODE(&waiter->tree.entry);
waiter->wake_state = TASK_NORMAL;
waiter->task = NULL;
+ waiter->hb = NULL;
}
static inline void rt_mutex_init_rtlock_waiter(struct rt_mutex_waiter *waiter)
--
2.45.2
* [PATCH v4 07/11] futex: Allow to make the number of slots invariant.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (5 preceding siblings ...)
2024-12-03 16:42 ` [PATCH v4 06/11] futex: Allow to re-allocate the private " Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-06 8:19 ` Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 08/11] futex: Resize futex hash table based on number of threads Sebastian Andrzej Siewior
` (3 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Add an option to freeze the number of hash buckets. The idea is to make
the size fixed once an acceptable size has been reached, so that
acquiring mm_struct::futex_hash_lock can be avoided for certain
operations.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/uapi/linux/prctl.h | 2 ++
kernel/futex/core.c | 54 ++++++++++++++++++++++++++++++++++++--
kernel/futex/futex.h | 1 +
kernel/futex/pi.c | 2 +-
kernel/futex/requeue.c | 3 ++-
5 files changed, 58 insertions(+), 4 deletions(-)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 55b843644c51a..d1f4b3dea565c 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -357,5 +357,7 @@ struct prctl_mm_map {
#define PR_FUTEX_HASH 77
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
+# define PR_FUTEX_HASH_SET_INVARIANT 3
+# define PR_FUTEX_HASH_GET_INVARIANT 4
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 0dd7100e36419..1abea8f9abd22 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -61,6 +61,7 @@ struct futex_hash_bucket_private {
rcuref_t users;
unsigned int hash_mask;
struct rcu_head rcu;
+ bool slots_invariant;
struct futex_hash_bucket queues[];
};
@@ -1266,6 +1267,7 @@ static int futex_hash_allocate(unsigned int hash_slots)
struct futex_hash_bucket_private *hb_p, *hb_p_old = NULL;
struct mm_struct *mm;
size_t alloc_size;
+ int ret = 0;
int i;
if (hash_slots == 0)
@@ -1291,20 +1293,30 @@ static int futex_hash_allocate(unsigned int hash_slots)
rcuref_init(&hb_p->users, 1);
hb_p->hash_mask = hash_slots - 1;
+ hb_p->slots_invariant = false;
for (i = 0; i < hash_slots; i++)
futex_hash_bucket_init(&hb_p->queues[i], i + 1);
mm = current->mm;
scoped_guard(rwsem_write, &mm->futex_hash_lock) {
+
hb_p_old = rcu_dereference_check(mm->futex_hash_bucket,
lockdep_is_held(&mm->futex_hash_lock));
- rcu_assign_pointer(mm->futex_hash_bucket, hb_p);
+ if (hb_p_old && hb_p_old->slots_invariant)
+ ret = -EINVAL;
+ else
+ rcu_assign_pointer(mm->futex_hash_bucket, hb_p);
}
+ if (ret) {
+ kvfree(hb_p);
+ return ret;
+ }
+
if (hb_p_old)
futex_put_old_hb_p(hb_p_old);
- return 0;
+ return ret;
}
int futex_hash_allocate_default(void)
@@ -1323,6 +1335,36 @@ static int futex_hash_get_slots(void)
return 0;
}
+static int futex_hash_set_invariant(void)
+{
+ struct futex_hash_bucket_private *hb_p;
+ struct mm_struct *mm;
+
+ mm = current->mm;
+ guard(rwsem_write)(&mm->futex_hash_lock);
+ hb_p = rcu_dereference_check(mm->futex_hash_bucket,
+ lockdep_is_held(&mm->futex_hash_lock));
+ if (!hb_p)
+ return -EINVAL;
+ if (hb_p->slots_invariant)
+ return -EALREADY;
+ hb_p->slots_invariant = true;
+ return 0;
+}
+
+bool futex_hash_is_invariant(void)
+{
+ struct futex_hash_bucket_private *hb_p;
+ struct mm_struct *mm;
+
+ mm = current->mm;
+ guard(rcu)();
+ hb_p = rcu_dereference(mm->futex_hash_bucket);
+ if (!hb_p)
+ return -EINVAL;
+ return hb_p->slots_invariant;
+}
+
int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5)
{
@@ -1337,6 +1379,14 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
ret = futex_hash_get_slots();
break;
+ case PR_FUTEX_HASH_SET_INVARIANT:
+ ret = futex_hash_set_invariant();
+ break;
+
+ case PR_FUTEX_HASH_GET_INVARIANT:
+ ret = futex_hash_is_invariant();
+ break;
+
default:
ret = -EINVAL;
break;
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 503f56643a966..e81820a393027 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -208,6 +208,7 @@ extern void futex_hash_get(struct futex_hash_bucket *hb);
extern bool futex_check_hb_valid(struct futex_hash_bucket *hb);
extern bool check_pi_lock_owner(u32 __user *uaddr);
extern void reset_pi_state_owner(struct futex_pi_state *pi_state);
+extern bool futex_hash_is_invariant(void);
static inline struct futex_hash_bucket *futex_hb_from_futex_q(struct futex_q *q)
{
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index b4156d1cc6608..9df320be750c3 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1202,7 +1202,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
if (!IS_ENABLED(CONFIG_FUTEX_PI))
return -ENOSYS;
- if (!(flags & FLAGS_SHARED))
+ if (!(flags & FLAGS_SHARED) && !futex_hash_is_invariant())
futex_hash_lock = &current->mm->futex_hash_lock;
retry:
if (get_user(uval, uaddr))
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index 6b3c4413fbf47..904c68abfb8f3 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -431,7 +431,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags1,
if (refill_pi_state_cache())
return -ENOMEM;
- if (!(flags1 & FLAGS_SHARED) || !(flags2 & FLAGS_SHARED))
+ if ((!(flags1 & FLAGS_SHARED) || !(flags2 & FLAGS_SHARED)) &&
+ !futex_hash_is_invariant())
futex_hash_lock = &current->mm->futex_hash_lock;
}
--
2.45.2
* [PATCH v4 08/11] futex: Resize futex hash table based on number of threads.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (6 preceding siblings ...)
2024-12-03 16:42 ` [PATCH v4 07/11] futex: Allow to make the number of slots invariant Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-06 9:27 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 09/11] tools/perf: Add the prctl(PR_FUTEX_HASH,…) to futex-hash Sebastian Andrzej Siewior
` (2 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Automatically size the hash bucket based on the number of threads. The
logic allocates between 16 and futex_hashsize (the default size of the
system-wide hash bucket) slots, using 4 * number-of-threads as the
target. If the upper limit is reached, the HB is made invariant.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/fork.c | 4 ----
kernel/futex/core.c | 39 +++++++++++++++++++++++++++++++++------
2 files changed, 33 insertions(+), 10 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 6267d600af991..35ec9958707c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2134,10 +2134,6 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
{
if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
return false;
- if (!thread_group_empty(current))
- return false;
- if (current->mm->futex_hash_bucket)
- return false;
return true;
}
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 1abea8f9abd22..19515aa5a6430 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -65,6 +65,8 @@ struct futex_hash_bucket_private {
struct futex_hash_bucket queues[];
};
+static unsigned int futex_default_max_buckets;
+
/*
* Fault injections for futexes.
*/
@@ -1262,7 +1264,7 @@ bool futex_check_hb_valid(struct futex_hash_bucket *hb)
return hb_p_now == hb_p;
}
-static int futex_hash_allocate(unsigned int hash_slots)
+static int futex_hash_allocate(unsigned int hash_slots, bool slots_invariant)
{
struct futex_hash_bucket_private *hb_p, *hb_p_old = NULL;
struct mm_struct *mm;
@@ -1274,8 +1276,8 @@ static int futex_hash_allocate(unsigned int hash_slots)
hash_slots = 16;
if (hash_slots < 2)
hash_slots = 2;
- if (hash_slots > 131072)
- hash_slots = 131072;
+ if (hash_slots > futex_default_max_buckets)
+ hash_slots = futex_default_max_buckets;
if (!is_power_of_2(hash_slots))
hash_slots = rounddown_pow_of_two(hash_slots);
@@ -1293,7 +1295,7 @@ static int futex_hash_allocate(unsigned int hash_slots)
rcuref_init(&hb_p->users, 1);
hb_p->hash_mask = hash_slots - 1;
- hb_p->slots_invariant = false;
+ hb_p->slots_invariant = slots_invariant;
for (i = 0; i < hash_slots; i++)
futex_hash_bucket_init(&hb_p->queues[i], i + 1);
@@ -1321,7 +1323,31 @@ static int futex_hash_allocate(unsigned int hash_slots)
int futex_hash_allocate_default(void)
{
- return futex_hash_allocate(0);
+ unsigned int threads;
+ unsigned int buckets;
+ unsigned int current_buckets = 0;
+ struct futex_hash_bucket_private *hb_p;
+
+ if (!current->mm)
+ return 0;
+
+ scoped_guard(rcu) {
+ threads = get_nr_threads(current);
+ hb_p = rcu_dereference(current->mm->futex_hash_bucket);
+ if (hb_p) {
+ if (hb_p->slots_invariant)
+ return 0;
+ current_buckets = hb_p->hash_mask + 1;
+ }
+ }
+
+ buckets = roundup_pow_of_two(4 * threads);
+ buckets = max(buckets, 16);
+ buckets = min(buckets, futex_default_max_buckets);
+ if (current_buckets > buckets)
+ return 0;
+
+ return futex_hash_allocate(buckets, buckets == futex_default_max_buckets);
}
static int futex_hash_get_slots(void)
@@ -1372,7 +1398,7 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
switch (arg2) {
case PR_FUTEX_HASH_SET_SLOTS:
- ret = futex_hash_allocate(arg3);
+ ret = futex_hash_allocate(arg3, false);
break;
case PR_FUTEX_HASH_GET_SLOTS:
@@ -1404,6 +1430,7 @@ static int __init futex_init(void)
#else
futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
#endif
+ futex_default_max_buckets = futex_hashsize;
futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
futex_hashsize, 0, 0,
--
2.45.2
* [PATCH v4 09/11] tools/perf: Add the prctl(PR_FUTEX_HASH,…) to futex-hash.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (7 preceding siblings ...)
2024-12-03 16:42 ` [PATCH v4 08/11] futex: Resize futex hash table based on number of threads Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 10/11] tools/perf: Use the current affinity for CPU pinning in futex-hash Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 11/11] tools/perf: Allocate futex locks on the local CPU-node Sebastian Andrzej Siewior
10 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Wire up PR_FUTEX_HASH in futex-hash. Use the `-b' argument to specify
the number of buckets to request. Read the resulting number back and
display it in the run summary.
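A minimal userspace sketch of this prctl() usage (the PR_FUTEX_HASH
constants are copied from the patch and are not in released uapi
headers; on a kernel without this series the call fails with EINVAL,
which the helper reports via errno):

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Values from the proposed patch; not in released uapi headers. */
#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			77
#define PR_FUTEX_HASH_SET_SLOTS		1
#define PR_FUTEX_HASH_GET_SLOTS		2
#endif

/*
 * Request a task local futex hash with the given number of slots and
 * return the number actually allocated, or -1 with errno set (e.g.
 * EINVAL on a kernel without this series).
 */
static int futex_hash_set_slots(unsigned int slots)
{
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, 0, 0))
		return -1;
	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
}
```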
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
tools/perf/bench/futex-hash.c | 19 +++++++++++++++++--
tools/perf/bench/futex.h | 1 +
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index b472eded521b1..e24e987ae213e 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -22,6 +22,7 @@
#include <sys/time.h>
#include <sys/mman.h>
#include <perf/cpumap.h>
+#include <sys/prctl.h>
#include "../util/mutex.h"
#include "../util/stat.h"
@@ -53,6 +54,7 @@ static struct bench_futex_parameters params = {
};
static const struct option options[] = {
+ OPT_UINTEGER('b', "buckets", ¶ms.nbuckets, "Task local futex buckets to allocate"),
OPT_UINTEGER('t', "threads", ¶ms.nthreads, "Specify amount of threads"),
OPT_UINTEGER('r', "runtime", ¶ms.runtime, "Specify runtime (in seconds)"),
OPT_UINTEGER('f', "futexes", ¶ms.nfutexes, "Specify amount of futexes per threads"),
@@ -120,6 +122,10 @@ static void print_summary(void)
(int)bench__runtime.tv_sec);
}
+#define PR_FUTEX_HASH 77
+# define PR_FUTEX_HASH_SET_SLOTS 1
+# define PR_FUTEX_HASH_GET_SLOTS 2
+
int bench_futex_hash(int argc, const char **argv)
{
int ret = 0;
@@ -131,6 +137,7 @@ int bench_futex_hash(int argc, const char **argv)
struct perf_cpu_map *cpu;
int nrcpus;
size_t size;
+ int num_buckets;
argc = parse_options(argc, argv, options, bench_futex_hash_usage, 0);
if (argc) {
@@ -147,6 +154,14 @@ int bench_futex_hash(int argc, const char **argv)
act.sa_sigaction = toggle_done;
sigaction(SIGINT, &act, NULL);
+ ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, params.nbuckets);
+ if (ret) {
+ printf("Allocation of %u hash buckets failed: %d/%m\n",
+ params.nbuckets, ret);
+ goto errmem;
+ }
+ num_buckets = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
+
if (params.mlockall) {
if (mlockall(MCL_CURRENT | MCL_FUTURE))
err(EXIT_FAILURE, "mlockall");
@@ -162,8 +177,8 @@ int bench_futex_hash(int argc, const char **argv)
if (!params.fshared)
futex_flag = FUTEX_PRIVATE_FLAG;
- printf("Run summary [PID %d]: %d threads, each operating on %d [%s] futexes for %d secs.\n\n",
- getpid(), params.nthreads, params.nfutexes, params.fshared ? "shared":"private", params.runtime);
+ printf("Run summary [PID %d]: %d threads, hash slots: %d each operating on %d [%s] futexes for %d secs.\n\n",
+ getpid(), params.nthreads, num_buckets, params.nfutexes, params.fshared ? "shared":"private", params.runtime);
init_stats(&throughput_stats);
mutex_init(&thread_lock);
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
index ebdc2b032afc1..abc353c63a9a4 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -20,6 +20,7 @@ struct bench_futex_parameters {
bool multi; /* lock-pi */
bool pi; /* requeue-pi */
bool broadcast; /* requeue */
+ unsigned int nbuckets;
unsigned int runtime; /* seconds*/
unsigned int nthreads;
unsigned int nfutexes;
--
2.45.2
* [PATCH v4 10/11] tools/perf: Use the current affinity for CPU pinning in futex-hash.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (8 preceding siblings ...)
2024-12-03 16:42 ` [PATCH v4 09/11] tools/perf: Add the prctl(PR_FUTEX_HASH,…) to futex-hash Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 11/11] tools/perf: Allocate futex locks on the local CPU-node Sebastian Andrzej Siewior
10 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
In order to simplify NUMA local testing, let futex-hash use the current
affinity mask and pin the individual threads based on that mask.
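The "pick the i-th CPU from the current affinity mask" idea can be
sketched with a bounded userspace helper (illustrative only; the
patch's get_cpu_bit() below relies on the caller passing a valid
index, whereas this variant returns -1 instead of looping forever):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Return the CPU number of the r_cpu-th set bit in @set, or -1 if
 * fewer bits are set.
 */
static int nth_cpu_in_set(const cpu_set_t *set, unsigned int r_cpu)
{
	int cpu;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (CPU_ISSET(cpu, set)) {
			if (!r_cpu)
				return cpu;
			r_cpu--;
		}
	}
	return -1;
}
```

In the benchmark the mask comes from pthread_getaffinity_np() and
thread i is pinned to nth_cpu_in_set(&mask, i % nrcpus).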
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
tools/perf/bench/futex-hash.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index e24e987ae213e..216b0d1301ffc 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -126,10 +126,24 @@ static void print_summary(void)
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
+static unsigned int get_cpu_bit(cpu_set_t *set, size_t set_size, unsigned int r_cpu)
+{
+ unsigned int cpu = 0;
+
+ do {
+ if (CPU_ISSET_S(cpu, set_size, set)) {
+ if (!r_cpu)
+ return cpu;
+ r_cpu--;
+ }
+ cpu++;
+ } while (1);
+}
+
int bench_futex_hash(int argc, const char **argv)
{
int ret = 0;
- cpu_set_t *cpuset;
+ cpu_set_t *cpuset, cpuset_;
struct sigaction act;
unsigned int i;
pthread_attr_t thread_attr;
@@ -167,8 +181,12 @@ int bench_futex_hash(int argc, const char **argv)
err(EXIT_FAILURE, "mlockall");
}
+ ret = pthread_getaffinity_np(pthread_self(), sizeof(cpuset_), &cpuset_);
+ BUG_ON(ret);
+ nrcpus = CPU_COUNT(&cpuset_);
+
if (!params.nthreads) /* default to the number of CPUs */
- params.nthreads = perf_cpu_map__nr(cpu);
+ params.nthreads = nrcpus;
worker = calloc(params.nthreads, sizeof(*worker));
if (!worker)
@@ -189,10 +207,9 @@ int bench_futex_hash(int argc, const char **argv)
pthread_attr_init(&thread_attr);
gettimeofday(&bench__start, NULL);
- nrcpus = cpu__max_cpu().cpu;
- cpuset = CPU_ALLOC(nrcpus);
+ cpuset = CPU_ALLOC(4096);
BUG_ON(!cpuset);
- size = CPU_ALLOC_SIZE(nrcpus);
+ size = CPU_ALLOC_SIZE(4096);
for (i = 0; i < params.nthreads; i++) {
worker[i].tid = i;
@@ -202,7 +219,8 @@ int bench_futex_hash(int argc, const char **argv)
CPU_ZERO_S(size, cpuset);
- CPU_SET_S(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu, size, cpuset);
+ CPU_SET_S(get_cpu_bit(&cpuset_, sizeof(cpuset_), i % nrcpus), size, cpuset);
+
ret = pthread_attr_setaffinity_np(&thread_attr, size, cpuset);
if (ret) {
CPU_FREE(cpuset);
--
2.45.2
* [PATCH v4 11/11] tools/perf: Allocate futex locks on the local CPU-node.
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (9 preceding siblings ...)
2024-12-03 16:42 ` [PATCH v4 10/11] tools/perf: Use the current affinity for CPU pinning in futex-hash Sebastian Andrzej Siewior
@ 2024-12-03 16:42 ` Sebastian Andrzej Siewior
10 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-03 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Sebastian Andrzej Siewior
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
tools/perf/bench/futex-hash.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index 216b0d1301ffc..4c7c6677463f8 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -122,6 +122,8 @@ static void print_summary(void)
(int)bench__runtime.tv_sec);
}
+#include <numa.h>
+
#define PR_FUTEX_HASH 77
# define PR_FUTEX_HASH_SET_SLOTS 1
# define PR_FUTEX_HASH_GET_SLOTS 2
@@ -212,14 +214,19 @@ int bench_futex_hash(int argc, const char **argv)
size = CPU_ALLOC_SIZE(4096);
for (i = 0; i < params.nthreads; i++) {
+ unsigned int cpu_num;
worker[i].tid = i;
- worker[i].futex = calloc(params.nfutexes, sizeof(*worker[i].futex));
- if (!worker[i].futex)
- goto errmem;
CPU_ZERO_S(size, cpuset);
+ cpu_num = get_cpu_bit(&cpuset_, sizeof(cpuset_), i % nrcpus);
+ //worker[i].futex = calloc(params.nfutexes, sizeof(*worker[i].futex));
- CPU_SET_S(get_cpu_bit(&cpuset_, sizeof(cpuset_), i % nrcpus), size, cpuset);
+ worker[i].futex = numa_alloc_onnode(params.nfutexes * sizeof(*worker[i].futex),
+ numa_node_of_cpu(cpu_num));
+ if (worker[i].futex == MAP_FAILED || worker[i].futex == NULL)
+ goto errmem;
+
+ CPU_SET_S(cpu_num, size, cpuset);
ret = pthread_attr_setaffinity_np(&thread_attr, size, cpuset);
if (ret) {
@@ -271,7 +278,7 @@ int bench_futex_hash(int argc, const char **argv)
&worker[i].futex[params.nfutexes-1], t);
}
- zfree(&worker[i].futex);
+ numa_free(worker[i].futex, params.nfutexes * sizeof(*worker[i].futex));
}
print_summary();
--
2.45.2
* Re: [PATCH v4 07/11] futex: Allow to make the number of slots invariant.
2024-12-03 16:42 ` [PATCH v4 07/11] futex: Allow to make the number of slots invariant Sebastian Andrzej Siewior
@ 2024-12-06 8:19 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-06 8:19 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider,
Waiman Long, Dan Carpenter
On 2024-12-03 17:42:15 [+0100], To linux-kernel@vger.kernel.org wrote:
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index 0dd7100e36419..1abea8f9abd22 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1323,6 +1335,36 @@ static int futex_hash_get_slots(void)
…
> +bool futex_hash_is_invariant(void)
> +{
> + struct futex_hash_bucket_private *hb_p;
> + struct mm_struct *mm;
> +
> + mm = current->mm;
> + guard(rcu)();
> + hb_p = rcu_dereference(mm->futex_hash_bucket);
> + if (!hb_p)
> + return -EINVAL;
Dan spotted that this should be false.
> + return hb_p->slots_invariant;
> +}
Sebastian
* Re: [PATCH v4 08/11] futex: Resize futex hash table based on number of threads.
2024-12-03 16:42 ` [PATCH v4 08/11] futex: Resize futex hash table based on number of threads Sebastian Andrzej Siewior
@ 2024-12-06 9:27 ` Thomas Gleixner
0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2024-12-06 9:27 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> int futex_hash_allocate_default(void)
> {
> - return futex_hash_allocate(0);
> + unsigned int threads;
> + unsigned int buckets;
> + unsigned int current_buckets = 0;
> + struct futex_hash_bucket_private *hb_p;
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variable-declarations
> +
> + if (!current->mm)
> + return 0;
> +
> + scoped_guard(rcu) {
> + threads = get_nr_threads(current);
> + hb_p = rcu_dereference(current->mm->futex_hash_bucket);
> + if (hb_p) {
> + if (hb_p->slots_invariant)
> + return 0;
> + current_buckets = hb_p->hash_mask + 1;
> + }
> + }
> +
> + buckets = roundup_pow_of_two(4 * threads);
> + buckets = max(buckets, 16);
> + buckets = min(buckets, futex_default_max_buckets);
> + if (current_buckets > buckets)
>= ?
Thanks,
tglx
* Re: [PATCH v4 01/11] futex: Create helper function to initialize a hash slot.
2024-12-03 16:42 ` [PATCH v4 01/11] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
@ 2024-12-10 15:13 ` Thomas Gleixner
0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2024-12-10 15:13 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> Factor out the futex_hash_bucket initialisation into a helpr function.
helpr?
> The helper function will be used in a follow up patch.
used for implementing process private hash buckets. or such, hmm?
Thanks,
tglx
* Re: [PATCH v4 02/11] futex: Add basic infrastructure for local task local hash.
2024-12-03 16:42 ` [PATCH v4 02/11] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2024-12-10 15:23 ` Thomas Gleixner
0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2024-12-10 15:23 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
> u32 __user *uaddr2, u32 val2, u32 val3);
> +int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
> + unsigned long arg4, unsigned long arg5);
> +int futex_hash_allocate_default(void);
> +void futex_hash_free(struct mm_struct *mm);
> +
> +static inline void futex_mm_init(struct mm_struct *mm)
> +{
> + mm->futex_hash_bucket = NULL;
> +}
> +
> #else
> static inline void futex_init_task(struct task_struct *tsk) { }
> static inline void futex_exit_recursive(struct task_struct *tsk) { }
> @@ -88,6 +98,18 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
> {
> return -EINVAL;
> }
> +static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
> + unsigned long arg4, unsigned long arg5)
> +{
> + return -EINVAL;
> +}
> +static inline int futex_hash_allocate_default(void)
> +{
> + return 0;
> +}
> +static inline void futex_hash_free(struct mm_struct *mm) { }
> +static inline void futex_mm_init(struct mm_struct *mm) { }
> +
> #endif
>
> #endif
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 7361a8f3ab68e..b16b97ab8fb2a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -30,6 +30,7 @@
> #define INIT_PASID 0
>
> struct address_space;
> +struct futex_hash_bucket;
> struct mem_cgroup;
>
> /*
> @@ -902,6 +903,8 @@ struct mm_struct {
> int mm_lock_seq;
> #endif
>
> + unsigned int futex_hash_mask;
> + struct futex_hash_bucket *futex_hash_bucket;
So you provided stubs for the accessors/init functions etc. Shouldn't
this be #ifdef guarded too?
Thanks,
tglx
* Re: [PATCH v4 03/11] futex: Allow automatic allocation of process wide futex hash.
2024-12-03 16:42 ` [PATCH v4 03/11] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2024-12-10 16:07 ` Thomas Gleixner
0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2024-12-10 16:07 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> +static bool need_futex_hash_allocate_default(u64 clone_flags)
> +{
> + if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
> + return false;
> + if (!thread_group_empty(current))
> + return false;
> + if (current->mm->futex_hash_bucket)
> + return false;
If you add an accessor like:
if (mm_get_futex_hash_bucket(current->mm))
then you can either #ifdef the futex muck in mm_struct or make it
struct mm_futex_hash_bucket {
#ifdef CONFIG_FUTEX
unsigned int futex_hash_mask;
struct futex_hash_bucket *futex_hash_bucket;
#endif
};
and avoid the #ifdeffery in mm_struct itself because the empty struct
occupies zero space.
The accessor and the other helpers allow the compiler to optimize
all of it out for CONFIG_FUTEX=n.
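The zero-space claim can be checked with a small sketch (struct names
are hypothetical; the zero-sized empty struct is the GNU C extension
the kernel builds with):

```c
#include <assert.h>

/* Hypothetical names; the real layout is up to the patch series. */
struct futex_hash_bucket;

/* CONFIG_FUTEX=y variant of the embedded struct. */
struct mm_futex_y {
	unsigned int		  hash_mask;
	struct futex_hash_bucket *hash_bucket;
};

/* CONFIG_FUTEX=n variant: an empty struct occupies zero bytes
 * under the GNU C extension the kernel relies on. */
struct mm_futex_n {
};

/* Minimal stand-ins for mm_struct with the member embedded. */
struct mm_demo_y { long other; struct mm_futex_y futex; };
struct mm_demo_n { long other; struct mm_futex_n futex; };
```

With CONFIG_FUTEX=n the embedded member adds no size to mm_struct, so
the #ifdef can live inside the helper struct instead of mm_struct.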
Thanks,
tglx
* Re: [PATCH v4 05/11] futex: Track the futex hash bucket.
2024-12-03 16:42 ` [PATCH v4 05/11] futex: Track the futex hash bucket Sebastian Andrzej Siewior
@ 2024-12-10 18:45 ` Thomas Gleixner
2024-12-12 16:41 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2024-12-10 18:45 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> Add futex_hash_get/put() to keep the assigned hash_bucket around while a
> futex operation is performed. Have RCU lifetime guarantee for
> futex_hash_bucket_private.
>
> This is should have the right amount of gets/ puts so that the private
This is should have? This either has or not :)
> struct futex_hash_bucket *futex_hash(union futex_key *key)
> {
> - struct futex_hash_bucket *fhb;
> + struct futex_hash_bucket_private *hb_p = NULL;
> u32 hash;
>
> - fhb = current->mm->futex_hash_bucket;
> - if (fhb && futex_key_is_private(key)) {
> - u32 hash_mask = current->mm->futex_hash_mask;
> + if (futex_key_is_private(key)) {
> + guard(rcu)();
> +
> + do {
> + hb_p = rcu_dereference(current->mm->futex_hash_bucket);
> + } while (hb_p && !rcuref_get(&hb_p->users));
This loop really wants an explanation about the potential loop
duration.
> +void futex_hash_put(struct futex_hash_bucket *hb)
> +{
> + struct futex_hash_bucket_private *hb_p;
> +
> + if (hb->hb_slot == 0)
> + return;
> + hb_p = container_of(hb, struct futex_hash_bucket_private,
> + queues[hb->hb_slot - 1]);
Duh. This off by one abuse of hb_slot is really counter intuitive. It
took me a while to wrap my head around it.
The structure has a 4 byte hole, so adding a private flag or such is
feasible without going over a cache line, unless lockdep or rt is
enabled, but in that case it expands into a second cache line anyway.
> + futex_hash_priv_put(hb_p);
> +}
> +
> +void futex_hash_get(struct futex_hash_bucket *hb)
> +{
> + struct futex_hash_bucket_private *hb_p;
> +
> + if (hb->hb_slot == 0)
> + return;
> +
> + hb_p = container_of(hb, struct futex_hash_bucket_private,
> + queues[hb->hb_slot - 1]);
> + /* The ref needs to be owned by the caller so this can't fail */
reference please. This is not twatter. But see below.
> + WARN_ON_ONCE(!rcuref_get(&hb_p->users));
> +}
>
> /**
> * futex_setup_timer - set up the sleeping hrtimer.
> @@ -599,7 +642,10 @@ int futex_unqueue(struct futex_q *q)
> */
> lock_ptr = READ_ONCE(q->lock_ptr);
> if (lock_ptr != NULL) {
> + struct futex_hash_bucket *hb;
> +
> spin_lock(lock_ptr);
> + hb = futex_hb_from_futex_q(q);
> /*
> * q->lock_ptr can change between reading it and
> * spin_lock(), causing us to take the wrong lock. This
> @@ -622,6 +668,7 @@ int futex_unqueue(struct futex_q *q)
> BUG_ON(q->pi_state);
>
> spin_unlock(lock_ptr);
> + futex_hash_put(hb);
This is invoked from futex_wait_multiple(), which means you are holding
the reference count across schedule(),
I'm not convinced that this is the right thing to do. Let me look at
your actual resize implementation...
> futex_q_unlock(hb);
> + futex_hash_put(hb);
This pattern is there in a gazillion instances. Can't we have a single
function doing all of it?
Thanks,
tglx
* Re: [PATCH v4 06/11] futex: Allow to re-allocate the private hash bucket.
2024-12-03 16:42 ` [PATCH v4 06/11] futex: Allow to re-allocate the private " Sebastian Andrzej Siewior
@ 2024-12-10 22:27 ` Thomas Gleixner
2024-12-11 14:32 ` Thomas Gleixner
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Thomas Gleixner @ 2024-12-10 22:27 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> +static void futex_put_old_hb_p(struct futex_hash_bucket_private *hb_p)
> +{
> + unsigned int slots = hb_p->hash_mask + 1;
> + struct futex_hash_bucket *hb;
> + DEFINE_WAKE_Q(wake_q);
> + unsigned int i;
> +
> + for (i = 0; i < slots; i++) {
> + struct futex_q *this;
> +
> + hb = &hb_p->queues[i];
> +
> + spin_lock(&hb->lock);
> + plist_for_each_entry(this, &hb->chain, list)
> + wake_q_add(&wake_q, this->task);
> + spin_unlock(&hb->lock);
> + }
> + futex_hash_priv_put(hb_p);
> +
> + wake_up_q(&wake_q);
So you wake up all queued waiters and let them requeue themselves on the
new hash.
How is that safe against the following situation:
CPU 0 CPU 1
hb_p_old = mm->futex_hash_bucket; hbp = mm->futex_hash_bucket;
mm->futex_hash_bucket = new;
// Reference count succeeds!
rcuref_get(&hpb->refcnt);
futex_put_old_hb_p();
// Queues on old hash and
// is lost forever
queue(hbp);
This scheme simply cannot work unless you take the rwsem for read in the
wait path, which is a horrible idea. Aside from that, taking the rwsem in
various code paths is counterproductive. For a big process where threads
operate on different locks heavily, you introduce a 'global'
serialization for no good reason.
I still think that the basic principle for any of this is to hold the
reference count only across the actual hash bucket related operations
and not keep it across schedule().
That obviously means that the hash bucket pointer is invalid after such
an operation block and needs re-evaluation if needed again, but that's
not the end of the world.
Let's look at wait() again:
wait()
{
retry:
hb = hb_get_and_lock(key); // Aqcuires a reference
ret = futex_get_value_locked();
if (ret) {
hb_unlock_and_put(hb); // Drops the reference
...
if (...)
goto retry;
queue(hb, q, ...);
hb_unlock_and_put(hb); // Drops the reference
schedule();
unqueue(q); // Does not require hb
}
Why does unqueue() work w/o a hash bucket reference?
unqueue(q)
{
retry:
lock_ptr = READ_ONCE(q->lock_ptr);
// Wake up ?
if (!lock_ptr)
return 0;
spin_lock(lock_ptr);
// This covers both requeue and rehash operations
if (lock_ptr != q->lock_ptr) {
spin_unlock(lock_ptr);
goto retry;
}
__unqueue(q);
spin_unlock(lock_ptr);
}
Nothing in unqueue() requires a reference on the hash. The lock pointer
logic covers both requeue and rehash operations. They are equivalent,
no?
wake() is not really different. It needs to change the way how the
private retry works:
wake_op()
{
retry:
get_key(key1);
get_ket(key2);
retry_private:
double_get_and_lock(&hb1, &hb2, &key1, &key2);
.....
double_unlock_and_put(&hb1, &hb2);
.....
}
Moving retry_private before the point where the hash bucket is retrieved
and locked is required in some other places too. And some places use
q.lock_ptr under the assumption that it can't change, which probably
needs reevaluation of the hash bucket. Other stuff like lock_pi() needs
a separation of unlocking the hash bucket and dropping the reference.
But that are all minor changes.
All of them can be done on a per function basis before adding the actual
private hash muck, which makes the whole thing reviewable. This patch
definitely does not qualify for reviewable.
All you need are implementations for hb_get_and_lock/unlock_and_put()
plus the double variants and a hash_put() helper. Those implementations
use the global hash until all places are mopped up and then you can add
the private magic in exactly those places.
There is not a single place where you need magic state fixups in the
middle of the functions or conditional locking, which turns out to be
not sufficient.
The required helpers are:
hb_get_and_lock(key)
{
if (private(key))
hb = private_hash(key); // Gets a reference
else
hb = hash_bucket(global_hash, key);
hb_lock(hb);
return hb;
}
hb_unlock_and_put(hb)
{
hb_unlock(hb);
if (private(hb))
hb_private_put(hb);
}
The double lock/unlock variants are equivalent.
private_hash(key)
{
scoped_guard(rcu) {
hash = rcu_deref(current->mm->futex.hash);
if (rcuref_get(hash->ref))
return hash_bucket(hash, key);
}
guard(mutex)(current->mm->futex.hash_mutex);
// Did reallocation win the race for the mutex ?
hash = current->mm->futex.hash;
if (!rcuref_get(hash->ref) {
hash = realloc_or_restore_hash();
rcuref_get(hash);
}
return hash_bucket(hash, key);
}
hb_private_put(hb)
{
hash = priv_hash(hb);
hash_putref(hash);
}
hash_putref(hash)
{
// Has fork dropped the initial reference to signal
// that reallocation is required?
//
// If so, the last user hits the dead state
if (!rcuref_put(&hash->ref))
return;
guard(mutex)(current->mm->futex.hash_mutex);
realloc_or_restore_hash();
}
realloc_or_restore_hash()
{
old_hash = current->mm->futex.hash;
new_hash = alloc_hash(current->mm->users);
if (!new_hash) {
// Make the old hash alive again
rcuref_init(old_hash->ref, 1);
return cur_hash;
}
rehash(new_hash, old_hash);
rcu_assign_pointer(current->mm->futex.hash, new_hash);
rcu_free(old_hash);
}
rehash(new_hash, old_hash)
{
// old_hash is marked dead, so new waiters cannot queue
// themselves and are stuck on the hash_mutex.
for (i = 0; i < old_hash->size; i++) {
hb1 = &old_hash->queues[i];
// Protect the old bucket against unqueue(), as it does
// not try to get a reference on the hash bucket. It
// solely relies on q->lock_ptr.
spin_lock(&hb1->lock);
plist_for_each_entry_safe(q, tmp, hb1, list) {
hb2 = hash_bucket(new_hash, &q->key);
// Nesting is safe as this is a one time operation
spin_lock_nested(&hb2->lock);
plist_del(&q->list, &hb->chain);
// Redirect the lock pointer to the new hash
// bucket. See unqueue().
q->lock_ptr = &hb2->lock;
plist_add(&q->list, &hb->chain);
}
}
}
fork()
{
if (hash_needs_resize())
hash_putref(mm->futex.hash);
}
That should just work unless I'm missing something important. The charm
of utilizing rcuref for this is that there is close to zero impact on
the hotpaths, unless there is actually a reallocation in progress, which
is a rare event and applications can work around that by allocating the
appropriate hash size upfront.
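The rcuref part of the scheme can be modeled with a toy atomic counter
(a sketch only; real rcuref additionally has RCU grace period and
saturation semantics that this ignores):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy refcount: 0 means dead, like a failed rcuref_get(). */
struct toy_ref {
	atomic_uint count;
};

/* Fast path: succeeds only while the object is alive. */
static bool toy_ref_get(struct toy_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old) {
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;	/* dead: caller must take the slow path */
}

/* Returns true when this put dropped the last reference. */
static bool toy_ref_put(struct toy_ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}

/* Slow path, normally serialized by the hash_mutex: swap in the new
 * hash or revive the old one, then make the count live again. */
static void toy_revive(struct toy_ref *r)
{
	atomic_store(&r->count, 1);
}
```

This mirrors the hotpath property above: get/put are plain atomics, and
only a get that observes the dead state falls back to the mutex-guarded
reallocation path.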
Thanks,
tglx
* Re: [PATCH v4 06/11] futex: Allow to re-allocate the private hash bucket.
2024-12-10 22:27 ` Thomas Gleixner
@ 2024-12-11 14:32 ` Thomas Gleixner
2024-12-11 15:22 ` Thomas Gleixner
2024-12-12 16:16 ` Sebastian Andrzej Siewior
2 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2024-12-11 14:32 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
On Tue, Dec 10 2024 at 23:27, Thomas Gleixner wrote:
> Why does unqueue() work w/o a hash bucket reference?
>
> unqueue(q)
> {
This actually needs a
guard(rcu);
to protect against a concurrent rehashing.
> retry:
> lock_ptr = READ_ONCE(q->lock_ptr);
> // Wake up ?
> if (!lock_ptr)
> return 0;
>
> spin_lock(lock_ptr);
>
> // This covers both requeue and rehash operations
> if (lock_ptr != q->lock_ptr) {
> spin_unlock(lock_ptr);
> goto retry;
> }
>
> __unqueue(q);
> spin_unlock(lock_ptr);
> }
>
> Nothing in unqueue() requires a reference on the hash. The lock pointer
> logic covers both requeue and rehash operations. They are equivalent,
> no?
>
> wake() is not really different. It needs to change the way how the
> private retry works:
>
> wake_op()
> {
> retry:
> get_key(key1);
> get_ket(key2);
>
> retry_private:
> double_get_and_lock(&hb1, &hb2, &key1, &key2);
> .....
> double_unlock_and_put(&hb1, &hb2);
> .....
> }
>
> Moving retry private before the point where the hash bucket is retrieved
> and locked is required in some other place too. And some places use
> q.lock_ptr under the assumption that it can't change, which probably
> needs reevaluation of the hash bucket. Other stuff like lock_pi() needs
> a seperation of unlocking the hash bucket and dropping the reference.
>
> But that are all minor changes.
>
> All of them can be done on a per function basis before adding the actual
> private hash muck, which makes the whole thing reviewable. This patch
> definitely does not qualify for reviewable.
>
> All you need are implementations for hb_get_and_lock/unlock_and_put()
> plus the double variants and a hash_put() helper. Those implementations
> use the global hash until all places are mopped up and then you can add
> the private magic in exatly those places
>
> There is not a single place where you need magic state fixups in the
> middle of the functions or conditional locking, which turns out to be
> not sufficient.
>
> The required helpers are:
>
> hb_get_and_lock(key)
> {
> 	if (private(key))
> 		hb = private_hash(key);		// Gets a reference
> 	else
> 		hb = hash_bucket(global_hash, key);
> 	hb_lock(hb);
> 	return hb;
> }
>
> hb_unlock_and_put(hb)
> {
> 	hb_unlock(hb);
> 	if (private(hb))
> 		hb_private_put(hb);
> }
>
> The double lock/unlock variants are equivalent.
>
> private_hash(key)
> {
> 	scoped_guard(rcu) {
> 		hash = rcu_deref(current->mm->futex.hash);
This actually requires:

	if (!hash)
		return global_hash;

otherwise this results in a NULL pointer dereference, aka an unprivileged
DoS when a single-threaded process invokes sys_futex(...) directly.
That raises the question whether current->mm->futex.hash should be
initialized with &global_hash in the first place, with &global_hash having
a reference count too, which can never go to zero. That would simplify the
whole logic there.
Thanks,
tglx
* Re: [PATCH v4 06/11] futex: Allow to re-allocate the private hash bucket.
2024-12-10 22:27 ` Thomas Gleixner
2024-12-11 14:32 ` Thomas Gleixner
@ 2024-12-11 15:22 ` Thomas Gleixner
2024-12-12 16:16 ` Sebastian Andrzej Siewior
2 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2024-12-11 15:22 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
On Tue, Dec 10 2024 at 23:27, Thomas Gleixner wrote:
> On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> realloc_or_restore_hash()
swap_hash()
{
	old_hash = current->mm->futex.hash;
	new_hash = current->mm->futex.new_hash;
	rehash(new_hash, old_hash);
	rcu_assign_pointer(current->mm->futex.hash, new_hash);
	rcu_free(old_hash);
}
> rehash(new_hash, old_hash)
> {
> 	// old_hash is marked dead, so new waiters cannot queue
> 	// themselves and are stuck on the hash_mutex.
>
> 	for (i = 0; i < old_hash->size; i++) {
> 		hb1 = &old_hash->queues[i];
>
> 		// Protect the old bucket against unqueue(), as it does
> 		// not try to get a reference on the hash bucket. It
> 		// solely relies on q->lock_ptr.
> 		spin_lock(&hb1->lock);
>
> 		plist_for_each_entry_safe(q, tmp, &hb1->chain, list) {
> 			hb2 = hash_bucket(new_hash, &q->key);
> 			// Nesting is safe as this is a one time operation
> 			spin_lock_nested(&hb2->lock);
>
> 			plist_del(&q->list, &hb1->chain);
>
> 			// Redirect the lock pointer to the new hash
> 			// bucket. See unqueue().
> 			q->lock_ptr = &hb2->lock;
>
> 			plist_add(&q->list, &hb2->chain);
> 			spin_unlock(&hb2->lock);
> 		}
> 		spin_unlock(&hb1->lock);
> 	}
> }
>
> fork()
> {
> 	if (hash_needs_resize())
> 		hash_putref(mm->futex.hash);

	futex_validate_hash();
futex_validate_hash()
{
	guard(mutex)(&mm->futex.hash_mutex);

	if (!hash_needs_resize(mm->futex.hash))
		return;
	if (mm->futex.new_hash && !hash_needs_resize(mm->futex.new_hash))
		return;
	new = alloc();
	if (!new)
		return;
	mm->futex.new_hash = new;
	if (!rcuref_put(&mm->futex.hash->users))
		return;
	swap_hash();
}
* Re: [PATCH v4 06/11] futex: Allow to re-allocate the private hash bucket.
2024-12-10 22:27 ` Thomas Gleixner
2024-12-11 14:32 ` Thomas Gleixner
2024-12-11 15:22 ` Thomas Gleixner
@ 2024-12-12 16:16 ` Sebastian Andrzej Siewior
2 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-12 16:16 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Valentin Schneider,
Waiman Long
On 2024-12-10 23:27:32 [+0100], Thomas Gleixner wrote:
> On Tue, Dec 03 2024 at 17:42, Sebastian Andrzej Siewior wrote:
> > +static void futex_put_old_hb_p(struct futex_hash_bucket_private *hb_p)
> > +{
> > +	unsigned int slots = hb_p->hash_mask + 1;
> > +	struct futex_hash_bucket *hb;
> > +	DEFINE_WAKE_Q(wake_q);
> > +	unsigned int i;
> > +
> > +	for (i = 0; i < slots; i++) {
> > +		struct futex_q *this;
> > +
> > +		hb = &hb_p->queues[i];
> > +
> > +		spin_lock(&hb->lock);
> > +		plist_for_each_entry(this, &hb->chain, list)
> > +			wake_q_add(&wake_q, this->task);
> > +		spin_unlock(&hb->lock);
> > +	}
> > +	futex_hash_priv_put(hb_p);
> > +
> > +	wake_up_q(&wake_q);
>
> So you wake up all queued waiters and let themself requeue on the new
> hash.
>
> How is that safe against the following situation:
>
> CPU 0					CPU 1
> hb_p_old = mm->futex_hash_bucket;
>					hbp = mm->futex_hash_bucket;
> mm->futex_hash_bucket = new;
>					// Reference count succeeds!
>					rcuref_get(&hbp->refcnt);
> futex_put_old_hb_p();
>					// Queues on old hash and
>					// is lost forever
>					queue(hbp);
This does not happen. futex_q_lock() checks whether the hb is still valid
after locking the hb it obtained. So if the hb is still valid then
futex_put_old_hb_p() will see and wake it. If it is not, it will drop the
lock and try again.
However, your proposal below has some benefits. Let me implement that
instead.
Sebastian
* Re: [PATCH v4 05/11] futex: Track the futex hash bucket.
2024-12-10 18:45 ` Thomas Gleixner
@ 2024-12-12 16:41 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-12-12 16:41 UTC (permalink / raw)
To: Thomas Gleixner
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Valentin Schneider,
Waiman Long
On 2024-12-10 19:45:56 [+0100], Thomas Gleixner wrote:
> > +void futex_hash_put(struct futex_hash_bucket *hb)
> > +{
> > +	struct futex_hash_bucket_private *hb_p;
> > +
> > +	if (hb->hb_slot == 0)
> > +		return;
> > +	hb_p = container_of(hb, struct futex_hash_bucket_private,
> > +			    queues[hb->hb_slot - 1]);
>
> Duh. This off by one abuse of hb_slot is really counter intuitive. It
> took me a while to wrap my head around it.
The really cute part is that this -1 gets optimized away and this
container_of becomes just hb_p = hb - "hb_slot << 6" :)
> The structure has a 4 byte hole, so adding a private flag or such is
> feasible without going over a cache line, unless lockdep or rt is
> enabled, but in that case it expands into a second cache line anyway.
I have a whole cacheline to stash things because of futex_hash_bucket
alignment:
| struct futex_hash_bucket_private {
| rcuref_t users; /* 0 4 */
| unsigned int hash_mask; /* 4 4 */
| struct callback_head rcu __attribute__((__aligned__(8))); /* 8 16 */
| bool slots_invariant; /* 24 1 */
|
| /* XXX 39 bytes hole, try to pack */
|
| /* --- cacheline 1 boundary (64 bytes) --- */
| struct futex_hash_bucket queues[] __attribute__((__aligned__(64))); /* 64 0 */
|
| /* size: 64, cachelines: 1, members: 5 */
| /* sum members: 25, holes: 1, sum holes: 39 */
| /* forced alignments: 2, forced holes: 1, sum forced holes: 39 */
| } __attribute__((__aligned__(64)));
The hash bucket itself on RT is:
| struct futex_hash_bucket {
| atomic_t waiters; /* 0 4 */
| unsigned int hb_slot; /* 4 4 */
| spinlock_t lock; /* 8 32 */
| struct plist_head chain; /* 40 16 */
|
| /* size: 64, cachelines: 1, members: 4 */
| /* padding: 8 */
| } __attribute__((__aligned__(64)));
so it still fits. However, with lockdep enabled it occupies two cache
lines anyway, so the additional 4 bytes make no difference. On RT it
doesn't matter either because the spinlock_t is so huge.
But for !RT struct futex_hash_bucket shrinks to 32 bytes, so we could fit
two buckets into one cache line _if_ memory becomes a concern and cache
line bouncing is not an issue because the hash is only process wide.
> > +	futex_hash_priv_put(hb_p);
> > +}
Sebastian
2024-12-03 16:42 [PATCH v4 0/11] futex: Add support task local hash maps Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 01/11] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
2024-12-10 15:13 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 02/11] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
2024-12-10 15:23 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 03/11] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
2024-12-10 16:07 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 04/11] futex: Hash only the address for private futexes Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 05/11] futex: Track the futex hash bucket Sebastian Andrzej Siewior
2024-12-10 18:45 ` Thomas Gleixner
2024-12-12 16:41 ` Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 06/11] futex: Allow to re-allocate the private " Sebastian Andrzej Siewior
2024-12-10 22:27 ` Thomas Gleixner
2024-12-11 14:32 ` Thomas Gleixner
2024-12-11 15:22 ` Thomas Gleixner
2024-12-12 16:16 ` Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 07/11] futex: Allow to make the number of slots invariant Sebastian Andrzej Siewior
2024-12-06 8:19 ` Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 08/11] futex: Resize futex hash table based on number of threads Sebastian Andrzej Siewior
2024-12-06 9:27 ` Thomas Gleixner
2024-12-03 16:42 ` [PATCH v4 09/11] tools/perf: Add the prctl(PR_FUTEX_HASH,…) to futex-hash Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 10/11] tools/perf: The the current affinity for CPU pinning in futex-hash Sebastian Andrzej Siewior
2024-12-03 16:42 ` [PATCH v4 11/11] tools/perf: Allocate futex locks on the local CPU-node Sebastian Andrzej Siewior