* [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
@ 2024-10-28 12:13 Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 1/4] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
` (6 more replies)
0 siblings, 7 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-28 12:13 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long
Hi,
this is a follow up on
https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
and adds support for a task local futex_hash_bucket. It can be created via
prctl(). The last patch in the series enables it once the first thread is
created.
I've been watching how this auto-create behaves: so far dpkg creates
threads and uses the local hashmap. systemd-journal, on the other hand,
forks a thread from time to time and I haven't seen it using the hashmap.
Need to do more testing.
v1…v2 https://lore.kernel.org/all/20241026224306.982896-1-bigeasy@linutronix.de/:
- Moved to struct signal_struct and is used process wide.
- Automatically allocated once the first thread is created.
Sebastian
^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC PATCH v2 1/4] futex: Create helper function to initialize a hash slot.
2024-10-28 12:13 [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
@ 2024-10-28 12:13 ` Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 2/4] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
` (5 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-28 12:13 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
Factor out the futex_hash_bucket initialisation into a helper function.
The helper function will be used in a follow up patch.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 136768ae26375..de6d7f71961eb 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1146,6 +1146,13 @@ void futex_exit_release(struct task_struct *tsk)
futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
}
+static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
+{
+ atomic_set(&fhb->waiters, 0);
+ plist_head_init(&fhb->chain);
+ spin_lock_init(&fhb->lock);
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
@@ -1163,11 +1170,8 @@ static int __init futex_init(void)
futex_hashsize, futex_hashsize);
futex_hashsize = 1UL << futex_shift;
- for (i = 0; i < futex_hashsize; i++) {
- atomic_set(&futex_queues[i].waiters, 0);
- plist_head_init(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
+ for (i = 0; i < futex_hashsize; i++)
+ futex_hash_bucket_init(&futex_queues[i]);
return 0;
}
--
2.45.2
* [RFC PATCH v2 2/4] futex: Add basic infrastructure for local task local hash.
2024-10-28 12:13 [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 1/4] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
@ 2024-10-28 12:13 ` Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 3/4] futex: Use the task local hashmap Sebastian Andrzej Siewior
` (4 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-28 12:13 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
The futex hashmap is system wide and shared by random tasks. Each futex
is hashed based on its address and VMA. Due to randomized VMAs the same
logical lock (pointer) can end up in a different hash bucket on each
invocation of the application. This in turn means that different
applications may share a hash bucket on each invocation and it is not
always clear which applications will be involved. This can result in
high latencies when acquiring the futex_hash_bucket::lock, especially if
the lock owner is limited to a CPU and cannot be effectively PI boosted.
Introduce a task local hash map. The hashmap can be allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_ALLOCATE, 0)
The `0' argument allocates a default number of 4 slots; a higher number
can be specified if desired. The current upper limit is 16.
The allocated hashmap is used by all threads within a process.
A thread can check whether the private map has been allocated via
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_IS_SHARED);
which returns the current number of slots.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 7 ++++
include/linux/sched/signal.h | 4 +++
include/uapi/linux/prctl.h | 5 +++
kernel/fork.c | 1 +
kernel/futex/core.c | 65 ++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +++
6 files changed, 86 insertions(+)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85c..dad50173f70c4 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -77,6 +77,8 @@ void futex_exec_release(struct task_struct *tsk);
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
#else
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -88,6 +90,11 @@ static inline long do_futex(u32 __user *uaddr, int op, u32 val,
{
return -EINVAL;
}
+static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ return -EINVAL;
+}
#endif
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index c8ed09ac29ac5..3b8c8975cd493 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -14,6 +14,8 @@
#include <linux/mm_types.h>
#include <asm/ptrace.h>
+struct futex_hash_bucket;
+
/*
* Types defining task->signal and task->sighand and APIs using them:
*/
@@ -246,6 +248,8 @@ struct signal_struct {
* and may have inconsistent
* permissions.
*/
+ unsigned int futex_hash_mask;
+ struct futex_hash_bucket *futex_hash_bucket;
} __randomize_layout;
/*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 35791791a879b..e912ce82de41f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -328,4 +328,9 @@ struct prctl_mm_map {
# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
# define PR_PPC_DEXCR_CTRL_MASK 0x1f
+/* FUTEX hash management */
+#define PR_FUTEX_HASH 74
+# define PR_FUTEX_HASH_ALLOCATE 1
+# define PR_FUTEX_HASH_IS_SHARED 2
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af25..0d2b0a5299bbc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -949,6 +949,7 @@ static inline void free_signal_struct(struct signal_struct *sig)
{
taskstats_tgid_free(sig);
sched_autogroup_exit(sig);
+ kfree(sig->futex_hash_bucket);
/*
* __mmdrop is not safe to call from softirq context on x86 due to
* pgd_dtor so postpone it to the async context
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index de6d7f71961eb..14e4cb5ccd722 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -39,6 +39,7 @@
#include <linux/memblock.h>
#include <linux/fault-inject.h>
#include <linux/slab.h>
+#include <linux/prctl.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -1153,6 +1154,70 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb)
spin_lock_init(&fhb->lock);
}
+static int futex_hash_allocate(unsigned long arg3, unsigned long arg4,
+ unsigned long arg5)
+{
+ unsigned int hash_slots = arg3;
+ struct futex_hash_bucket *fhb;
+ int i;
+
+ if (!thread_group_leader(current))
+ return -EINVAL;
+
+ if (current->signal->futex_hash_bucket)
+ return -EALREADY;
+
+ if (hash_slots == 0)
+ hash_slots = 4;
+ if (hash_slots < 2)
+ hash_slots = 2;
+ if (hash_slots > 16)
+ hash_slots = 16;
+ if (!is_power_of_2(hash_slots))
+ hash_slots = rounddown_pow_of_two(hash_slots);
+
+ fhb = kmalloc_array(hash_slots, sizeof(struct futex_hash_bucket), GFP_KERNEL);
+ if (!fhb)
+ return -ENOMEM;
+
+ current->signal->futex_hash_mask = hash_slots - 1;
+
+ for (i = 0; i < hash_slots; i++)
+ futex_hash_bucket_init(&fhb[i]);
+
+ current->signal->futex_hash_bucket = fhb;
+ return 0;
+}
+
+static int futex_hash_is_shared(unsigned long arg3, unsigned long arg4,
+ unsigned long arg5)
+{
+ if (current->signal->futex_hash_bucket)
+ return current->signal->futex_hash_mask + 1;
+ return 0;
+}
+
+int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ int ret;
+
+ switch (arg2) {
+ case PR_FUTEX_HASH_ALLOCATE:
+ ret = futex_hash_allocate(arg3, arg4, arg5);
+ break;
+
+ case PR_FUTEX_HASH_IS_SHARED:
+ ret = futex_hash_is_shared(arg3, arg4, arg5);
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
diff --git a/kernel/sys.c b/kernel/sys.c
index 4da31f28fda81..0dcbb8ce9f19d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
#include <linux/user_namespace.h>
#include <linux/time_namespace.h>
#include <linux/binfmts.h>
+#include <linux/futex.h>
#include <linux/sched.h>
#include <linux/sched/autogroup.h>
@@ -2784,6 +2785,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_RISCV_SET_ICACHE_FLUSH_CTX:
error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
break;
+ case PR_FUTEX_HASH:
+ error = futex_hash_prctl(arg2, arg3, arg4, arg5);
+ break;
default:
error = -EINVAL;
break;
--
2.45.2
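The slot-count normalisation in futex_hash_allocate() above can be followed in isolation. This standalone sketch mirrors the clamping logic; the helper names are invented for the illustration, and the open-coded loop stands in for the kernel's rounddown_pow_of_two():

```c
/* Round down to the previous power of two (for v >= 1);
 * stand-in for the kernel's rounddown_pow_of_two(). */
static unsigned int rounddown_pow2(unsigned int v)
{
	unsigned int p = 1;

	while (p * 2 <= v)
		p *= 2;
	return p;
}

/* Mirrors the normalisation in futex_hash_allocate(): 0 means
 * "use the default", and the result is a power of two in [2, 16]. */
static unsigned int normalize_hash_slots(unsigned int hash_slots)
{
	if (hash_slots == 0)
		hash_slots = 4;
	if (hash_slots < 2)
		hash_slots = 2;
	if (hash_slots > 16)
		hash_slots = 16;
	return rounddown_pow2(hash_slots);
}
```

Rounding down (rather than up) means e.g. a request for 15 slots yields 8, and anything above 16 is capped at 16.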
* [RFC PATCH v2 3/4] futex: Use the task local hashmap.
2024-10-28 12:13 [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 1/4] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 2/4] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
@ 2024-10-28 12:13 ` Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 4/4] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
` (3 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-28 12:13 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
Use the task local hashmap if provided.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/futex/core.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 14e4cb5ccd722..3ef4cbd5cfa72 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -108,18 +108,33 @@ late_initcall(fail_futex_debugfs);
#endif /* CONFIG_FAIL_FUTEX */
+static inline bool futex_key_is_private(union futex_key *key)
+{
+ /*
+ * Relies on get_futex_key() to set either bit for shared
+ * futexes -- see comment with union futex_key.
+ */
+ return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED));
+}
+
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
- * corresponding hash bucket in the global hash.
+ * corresponding hash bucket in the global hash. If the futex is private and
+ * a local hash table has been allocated then that one is used.
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
+ struct futex_hash_bucket *fhb;
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
+ fhb = current->signal->futex_hash_bucket;
+ if (fhb && futex_key_is_private(key))
+ return &fhb[hash & current->signal->futex_hash_mask];
+
return &futex_queues[hash & (futex_hashsize - 1)];
}
--
2.45.2
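Since both the global and the private table sizes are powers of two, the bucket selection in futex_hash() reduces to a mask. A standalone sketch with an invented helper name, and an arbitrary hash value standing in for the jhash2() result:

```c
#include <stdint.h>

/* Illustrative stand-in for the indexing in futex_hash(): with a
 * power-of-two slot count, "hash % slots" is just a bitwise AND with
 * (slots - 1), i.e. the futex_hash_mask stored in signal_struct. */
static unsigned int bucket_index(uint32_t hash, unsigned int slots)
{
	unsigned int mask = slots - 1;	/* valid because slots is 2^n */

	return hash & mask;
}
```

This is also why futex_hash_allocate() rounds the requested slot count to a power of two before storing hash_slots - 1 as the mask.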
* [RFC PATCH v2 4/4] futex: Allow automatic allocation of process wide futex hash.
2024-10-28 12:13 [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2024-10-28 12:13 ` [RFC PATCH v2 3/4] futex: Use the task local hashmap Sebastian Andrzej Siewior
@ 2024-10-28 12:13 ` Sebastian Andrzej Siewior
2024-10-28 17:50 ` [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (2 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-28 12:13 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long,
Sebastian Andrzej Siewior
Allocate a default futex hash when a task creates its first thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/futex.h | 6 ++++++
kernel/fork.c | 28 ++++++++++++++++++++++++++++
kernel/futex/core.c | 5 +++++
3 files changed, 39 insertions(+)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index dad50173f70c4..c0f90dda6a295 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -79,6 +79,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
+int futex_hash_allocate_default(void);
#else
static inline void futex_init_task(struct task_struct *tsk) { }
static inline void futex_exit_recursive(struct task_struct *tsk) { }
@@ -95,6 +96,11 @@ static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
{
return -EINVAL;
}
+static inline int futex_hash_allocate_default(void)
+{
+ return 0;
+}
+
#endif
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d2b0a5299bbc..21dccdc8a1f6c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2107,6 +2107,19 @@ static void rv_task_fork(struct task_struct *p)
#define rv_task_fork(p) do {} while (0)
#endif
+static bool need_futex_hash_allocate_default(u64 clone_flags)
+{
+ if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
+ return false;
+ if (!thread_group_leader(current))
+ return false;
+ if (current->signal->nr_threads != 1)
+ return false;
+ if (current->signal->futex_hash_bucket)
+ return false;
+ return true;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -2483,6 +2496,21 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cancel_cgroup;
+ /*
+ * Allocate a default futex hash for the user process once the first
+ * thread spawns.
+ */
+ if (need_futex_hash_allocate_default(clone_flags)) {
+ retval = futex_hash_allocate_default();
+ if (retval)
+ goto bad_fork_core_free;
+ /*
+ * If we fail beyond this point we don't free the allocated
+ * futex hash map. We assume that another thread will be created
+ * and will make use of it. The hash map will be freed once the
+ * main thread terminates.
+ */
+ }
/*
* From this point on we must avoid any synchronous user-space
* communication until we take the tasklist-lock. In particular, we do
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 3ef4cbd5cfa72..8896ade418b4a 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1204,6 +1204,11 @@ static int futex_hash_allocate(unsigned long arg3, unsigned long arg4,
return 0;
}
+int futex_hash_allocate_default(void)
+{
+ return futex_hash_allocate(0, 0, 0);
+}
+
static int futex_hash_is_shared(unsigned long arg3, unsigned long arg4,
unsigned long arg5)
{
--
2.45.2
* Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
2024-10-28 12:13 [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (3 preceding siblings ...)
2024-10-28 12:13 ` [RFC PATCH v2 4/4] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
@ 2024-10-28 17:50 ` Sebastian Andrzej Siewior
2024-10-29 11:10 ` Juri Lelli
2024-10-31 15:56 ` Sebastian Andrzej Siewior
6 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-28 17:50 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long
On 2024-10-28 13:13:54 [+0100], To linux-kernel@vger.kernel.org wrote:
> from time to time and I haven't seen it using the hashmap. Need to do
> more testing.
booted gnome and did a few things:
- Total allocations: 632
- tasks which never used their allocated futex hash: 2
gpg-agent and systemd-journal
- Tasks which did not terminate within the measurement: 85
This includes gpg-agent and systemd-journal
- Top5 users of the private hash:
- firefox-esr-3786 used 215985
- gnome-software-2343 used 121296
- chromium-3369 used 65796
- chromium-3209 used 34639
- Isolated used 34211
This looks like we could attach the private futex hashmap directly on
fork instead of delaying it to the first usage.
Side note: if someone is waiting for a thread to exit via pthread_join()
then glibc uses futex() with op 0x109 here. I would have expected a
private flag.
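For reference, that op value decodes with the uapi <linux/futex.h> constants as FUTEX_WAIT_BITSET combined with FUTEX_CLOCK_REALTIME and, indeed, no FUTEX_PRIVATE_FLAG, which is why such a wait always lands in the shared path and thus the global hash. The decode helpers below are invented for the illustration:

```c
#include <linux/futex.h>

/* Extract the command from a futex op word (masks off the
 * FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME bits). */
static unsigned int futex_op_cmd(unsigned int op)
{
	return op & FUTEX_CMD_MASK;
}

static int futex_op_is_private(unsigned int op)
{
	return !!(op & FUTEX_PRIVATE_FLAG);
}

static int futex_op_clock_realtime(unsigned int op)
{
	return !!(op & FUTEX_CLOCK_REALTIME);
}

/* 0x109 = FUTEX_WAIT_BITSET (9) | FUTEX_CLOCK_REALTIME (0x100);
 * FUTEX_PRIVATE_FLAG (0x80) is absent. */
```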
Sebastian
* Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
2024-10-28 12:13 [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (4 preceding siblings ...)
2024-10-28 17:50 ` [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
@ 2024-10-29 11:10 ` Juri Lelli
2024-10-29 15:06 ` Sebastian Andrzej Siewior
2024-10-31 15:56 ` Sebastian Andrzej Siewior
6 siblings, 1 reply; 13+ messages in thread
From: Juri Lelli @ 2024-10-29 11:10 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Peter Zijlstra, Valentin Schneider, Waiman Long
Hi Sebastian,
On 28/10/24 13:13, Sebastian Andrzej Siewior wrote:
> Hi,
>
> this is a follow up on
> https://lore.kernel.org/ZwVOMgBMxrw7BU9A@jlelli-thinkpadt14gen4.remote.csb
Thank you so much for working on this!
> and adds support for a task local futex_hash_bucket. It can be created via
> prctl(). The last patch in the series enables it once the first thread is
> created.
>
> I've been watching how this auto-create behaves: so far dpkg creates
> threads and uses the local hashmap. systemd-journal, on the other hand,
> forks a thread from time to time and I haven't seen it using the hashmap.
> Need to do more testing.
I ported it to one of our kernels with the intent of asking perf folks
to have a go at it (after some manual smoke testing maybe). It will
take a couple of weeks or so to get numbers back.
Do you need any specific additional info to be collected while running? I
saw your reply about usage. If you want to agree on what to collect, feel
free to send out the debug patch I guess you used for that.
Of course I'm also going to play with it myself and holler if I find any
issue.
Best,
Juri
* Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
2024-10-29 11:10 ` Juri Lelli
@ 2024-10-29 15:06 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-29 15:06 UTC (permalink / raw)
To: Juri Lelli
Cc: linux-kernel, André Almeida, Darren Hart, Davidlohr Bueso,
Ingo Molnar, Peter Zijlstra, Valentin Schneider, Waiman Long
On 2024-10-29 12:10:25 [+0100], Juri Lelli wrote:
> Hi Sebastian,
Hi Juri,
> > I've been watching how this auto-create behaves: so far dpkg creates
> > threads and uses the local hashmap. systemd-journal, on the other hand,
> > forks a thread from time to time and I haven't seen it using the hashmap.
> > Need to do more testing.
>
> I ported it to one of our kernels with the intent of asking perf folks
> to have a go at it (after some manual smoke testing maybe). It will
> take a couple of weeks or so to get numbers back.
Thanks.
> Do you need specific additional info to possibly be collected while
> running? I saw your reply about usage. If you want to agree on what to
> collect feel free to send out the debug patch I guess you used for that.
If you run specific locking test cases, you could try setting the number of
slots upfront (instead of relying on the default of 4) and see how this
affects the performance. Also there is a cap at 16; you might want to raise
it to 1024, try some higher numbers, and see how this affects performance.
The prctl() interface should make it easy to set/get the values.
The default of 4 might be too conservative.
That would give an idea what a sane default value and upper limit might be.
The hunk attached (against the to-be-posted v3) adds counters to see how
many auto-allocated hashes were used vs. not used. In my tests the number
of unused hash buckets was very small, so I don't think it matters.
> Best,
> Juri
Sebastian
---------------------->8---------------------
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3b8c8975cd493..aa2a0d059b1a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -248,6 +248,7 @@ struct signal_struct {
* and may have inconsistent
* permissions.
*/
+ unsigned int futex_hash_used;
unsigned int futex_hash_mask;
struct futex_hash_bucket *futex_hash_bucket;
} __randomize_layout;
diff --git a/kernel/fork.c b/kernel/fork.c
index e792a43934363..341331778032a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -945,10 +945,19 @@ static void mmdrop_async(struct mm_struct *mm)
}
}
+extern atomic64_t futex_hash_stats_used;
+extern atomic64_t futex_hash_stats_unused;
+
static inline void free_signal_struct(struct signal_struct *sig)
{
taskstats_tgid_free(sig);
sched_autogroup_exit(sig);
+ if (sig->futex_hash_bucket) {
+ if (sig->futex_hash_used)
+ atomic64_inc(&futex_hash_stats_used);
+ else
+ atomic64_inc(&futex_hash_stats_unused);
+ }
kfree(sig->futex_hash_bucket);
/*
* __mmdrop is not safe to call from softirq context on x86 due to
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index b48abf2e97c25..04a597736cb00 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
#include <linux/fault-inject.h>
#include <linux/slab.h>
#include <linux/prctl.h>
+#include <linux/proc_fs.h>
#include "futex.h"
#include "../locking/rtmutex_common.h"
@@ -132,8 +133,10 @@ struct futex_hash_bucket *futex_hash(union futex_key *key)
key->both.offset);
fhb = current->signal->futex_hash_bucket;
- if (fhb && futex_key_is_private(key))
+ if (fhb && futex_key_is_private(key)) {
+ current->signal->futex_hash_used = 1;
return &fhb[hash & current->signal->futex_hash_mask];
+ }
return &futex_queues[hash & (futex_hashsize - 1)];
}
@@ -1202,8 +1205,13 @@ static int futex_hash_allocate(unsigned int hash_slots)
return 0;
}
+atomic64_t futex_hash_stats_used;
+atomic64_t futex_hash_stats_unused;
+atomic64_t futex_hash_stats_auto_create;
+
int futex_hash_allocate_default(void)
{
+ atomic64_inc(&futex_hash_stats_auto_create);
return futex_hash_allocate(0);
}
@@ -1235,6 +1243,19 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3,
return ret;
}
+static int proc_show_futex_stats(struct seq_file *seq, void *offset)
+{
+ long fh_used, fh_unused, fh_auto_create;
+
+ fh_used = atomic64_read(&futex_hash_stats_used);
+ fh_unused = atomic64_read(&futex_hash_stats_unused);
+ fh_auto_create = atomic64_read(&futex_hash_stats_auto_create);
+
+ seq_printf(seq, "used: %ld unused: %ld auto: %ld\n",
+ fh_used, fh_unused, fh_auto_create);
+ return 0;
+}
+
static int __init futex_init(void)
{
unsigned int futex_shift;
@@ -1255,6 +1276,7 @@ static int __init futex_init(void)
for (i = 0; i < futex_hashsize; i++)
futex_hash_bucket_init(&futex_queues[i]);
+ proc_create_single("futex_stats", 0, NULL, proc_show_futex_stats);
return 0;
}
core_initcall(futex_init);
* Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
2024-10-28 12:13 [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
` (5 preceding siblings ...)
2024-10-29 11:10 ` Juri Lelli
@ 2024-10-31 15:56 ` Sebastian Andrzej Siewior
2024-10-31 17:47 ` Sebastian Andrzej Siewior
` (2 more replies)
6 siblings, 3 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-31 15:56 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long
On 2024-10-28 13:13:54 [+0100], To linux-kernel@vger.kernel.org wrote:
> Need to do
> more testing.
So there is "perf bench futex hash". On a 256 CPU NUMA box:
perf bench futex hash -t 240 -m -s -b $hb
with hb ranging over 2 … 131072 (I moved the allocation to kvmalloc) I get
the following (averaged over three runs):
buckets op/sec
2 9158.33
4 21665.66 + ~136%
8 44686.66 + ~106
16 84144.33 + ~ 88
32 139998.33 + ~ 66
64 279957.0 + ~ 99
128 509533.0 + ~100
256 1019846.0 + ~100
512 1634940.0 + ~ 60
1024 1834859.33 + ~ 12
1868129.33 (global hash, 65536 slots)
2048 1912071.33 + ~ 4
4096 1918686.66 + ~ 0
8192 1922285.66 + ~ 0
16384 1923017.0 + ~ 0
32768 1923319.0 + ~ 0
65536 1932906.0 + ~ 0
131072 2042571.33 + ~ 5
Doubling the hash size almost doubles the ops/sec up to 256 slots. Beyond
2048 slots the increase is almost noise (except for the last entry).
Pinning the bench to individual CPUs belonging to a NUMA node and
running the same test with 110 threads only (avg over 5 runs):
ops/sec global ops/sec local
node 0 2278572.2 2534827.4
node 1 2229838.6 2437498.8
node 0+1 2542602.4 2535749.8
<--->
RAW numbers:
futex hash table entries: 65536 (order: 10, 4194304 bytes, vmalloc hugepage)
Run summary [PID 4541]: 240 threads, each operating on 1024 [private] futexes for 10 secs.
Averaged 1883542 operations/sec (+- 0,28%), total secs = 10
Averaged 1864680 operations/sec (+- 0,31%), total secs = 10
Averaged 1856166 operations/sec (+- 0,32%), total secs = 10
1868129.3333333333
====
Run summary [PID 6247]: 240 threads, hash slots: 2 each operating on 1024 [private] futexes for 10 secs.
Averaged 9219 operations/sec (+- 0,19%), total secs = 10
Averaged 9185 operations/sec (+- 0,18%), total secs = 10
Averaged 9071 operations/sec (+- 0,20%), total secs = 10
9158.333333333334
Run summary [PID 6970]: 240 threads, hash slots: 4 each operating on 1024 [private] futexes for 10 secs.
Averaged 16911 operations/sec (+- 0,29%), total secs = 10
Averaged 24145 operations/sec (+- 0,17%), total secs = 10
Averaged 23941 operations/sec (+- 0,17%), total secs = 10
21665.666666666668
Run summary [PID 7693]: 240 threads, hash slots: 8 each operating on 1024 [private] futexes for 10 secs.
Averaged 45376 operations/sec (+- 0,25%), total secs = 10
Averaged 44587 operations/sec (+- 0,17%), total secs = 10
Averaged 44097 operations/sec (+- 0,26%), total secs = 10
44686.666666666664
Run summary [PID 8416]: 240 threads, hash slots: 16 each operating on 1024 [private] futexes for 10 secs.
Averaged 84547 operations/sec (+- 0,25%), total secs = 10
Averaged 84672 operations/sec (+- 0,18%), total secs = 10
Averaged 83214 operations/sec (+- 0,26%), total secs = 10
84144.33333333333
Run summary [PID 9139]: 240 threads, hash slots: 32 each operating on 1024 [private] futexes for 10 secs.
Averaged 163342 operations/sec (+- 0,55%), total secs = 10
Averaged 127630 operations/sec (+- 0,28%), total secs = 10
Averaged 129023 operations/sec (+- 0,27%), total secs = 10
139998.33333333334
Run summary [PID 9862]: 240 threads, hash slots: 64 each operating on 1024 [private] futexes for 10 secs.
Averaged 279627 operations/sec (+- 0,29%), total secs = 10
Averaged 279572 operations/sec (+- 0,21%), total secs = 10
Averaged 280672 operations/sec (+- 0,26%), total secs = 10
279957.0
Run summary [PID 10585]: 240 threads, hash slots: 128 each operating on 1024 [private] futexes for 10 secs.
Averaged 508759 operations/sec (+- 0,21%), total secs = 10
Averaged 511253 operations/sec (+- 0,22%), total secs = 10
Averaged 508587 operations/sec (+- 0,26%), total secs = 10
509533.0
Run summary [PID 11308]: 240 threads, hash slots: 256 each operating on 1024 [private] futexes for 10 secs.
Averaged 1023552 operations/sec (+- 0,10%), total secs = 10
Averaged 1034426 operations/sec (+- 0,11%), total secs = 10
Averaged 1001560 operations/sec (+- 0,10%), total secs = 10
1019846.0
Run summary [PID 12031]: 240 threads, hash slots: 512 each operating on 1024 [private] futexes for 10 secs.
Averaged 1636187 operations/sec (+- 0,22%), total secs = 10
Averaged 1607427 operations/sec (+- 0,23%), total secs = 10
Averaged 1661206 operations/sec (+- 0,24%), total secs = 10
1634940.0
Run summary [PID 12756]: 240 threads, hash slots: 1024 each operating on 1024 [private] futexes for 10 secs.
Averaged 1833474 operations/sec (+- 0,24%), total secs = 10
Averaged 1835817 operations/sec (+- 0,24%), total secs = 10
Averaged 1835287 operations/sec (+- 0,25%), total secs = 10
1834859.3333333333
Run summary [PID 13479]: 240 threads, hash slots: 2048 each operating on 1024 [private] futexes for 10 secs.
Averaged 1915836 operations/sec (+- 0,29%), total secs = 10
Averaged 1907866 operations/sec (+- 0,28%), total secs = 10
Averaged 1912512 operations/sec (+- 0,29%), total secs = 10
1912071.3333333333
Run summary [PID 14202]: 240 threads, hash slots: 4096 each operating on 1024 [private] futexes for 10 secs.
Averaged 1916947 operations/sec (+- 0,27%), total secs = 10
Averaged 1918102 operations/sec (+- 0,28%), total secs = 10
Averaged 1921011 operations/sec (+- 0,29%), total secs = 10
1918686.6666666667
Run summary [PID 14925]: 240 threads, hash slots: 8192 each operating on 1024 [private] futexes for 10 secs.
Averaged 1916001 operations/sec (+- 0,27%), total secs = 10
Averaged 1923156 operations/sec (+- 0,27%), total secs = 10
Averaged 1927700 operations/sec (+- 0,27%), total secs = 10
1922285.6666666667
Run summary [PID 15648]: 240 threads, hash slots: 16384 each operating on 1024 [private] futexes for 10 secs.
Averaged 1928497 operations/sec (+- 0,28%), total secs = 10
Averaged 1916906 operations/sec (+- 0,27%), total secs = 10
Averaged 1923648 operations/sec (+- 0,26%), total secs = 10
1923017.0
Run summary [PID 16371]: 240 threads, hash slots: 32768 each operating on 1024 [private] futexes for 10 secs.
Averaged 1920425 operations/sec (+- 0,27%), total secs = 10
Averaged 1923449 operations/sec (+- 0,27%), total secs = 10
Averaged 1926083 operations/sec (+- 0,29%), total secs = 10
1923319.0
Run summary [PID 17094]: 240 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 1927007 operations/sec (+- 0,28%), total secs = 10
Averaged 1935182 operations/sec (+- 0,28%), total secs = 10
Averaged 1936529 operations/sec (+- 0,28%), total secs = 10
1932906.0
Run summary [PID 17817]: 240 threads, hash slots: 131072 each operating on 1024 [private] futexes for 10 secs.
Averaged 2033664 operations/sec (+- 0,32%), total secs = 10
Averaged 2060081 operations/sec (+- 0,33%), total secs = 10
Averaged 2033969 operations/sec (+- 0,32%), total secs = 10
2042571.3333333333
----
bigeasy@z3:~$ taskset -pc $$; ./run-numa.sh
pid 7679's current affinity list: 64-127,192-255
====
# Running 'futex/hash' benchmark:
Run summary [PID 23094]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2180419 operations/sec (+- 0,77%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23205]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2258612 operations/sec (+- 0,87%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23317]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2245819 operations/sec (+- 0,80%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23428]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2231469 operations/sec (+- 0,81%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23539]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2232874 operations/sec (+- 0,78%), total secs = 10
====
# Running 'futex/hash' benchmark:
Run summary [PID 23650]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2469636 operations/sec (+- 0,92%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23761]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2432942 operations/sec (+- 0,91%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23872]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2411433 operations/sec (+- 0,90%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 23983]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2438380 operations/sec (+- 0,94%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24094]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2435103 operations/sec (+- 0,94%), total secs = 10
====
bigeasy@z3:~$ taskset -pc $$; ./run-numa.sh
pid 9731's current affinity list: 0-63,128-191
====
# Running 'futex/hash' benchmark:
Run summary [PID 24207]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2206612 operations/sec (+- 0,75%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24318]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2321819 operations/sec (+- 0,85%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24429]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2238386 operations/sec (+- 0,77%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24541]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2325869 operations/sec (+- 0,85%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24652]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2300175 operations/sec (+- 0,82%), total secs = 10
====
# Running 'futex/hash' benchmark:
Run summary [PID 24763]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2530561 operations/sec (+- 0,96%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24874]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2573315 operations/sec (+- 1,03%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 24985]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2517479 operations/sec (+- 0,99%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25096]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2554631 operations/sec (+- 1,01%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25207]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2498151 operations/sec (+- 0,94%), total secs = 10
====
bigeasy@z3:~$ taskset -pc $$; ./run-numa.sh
pid 10975's current affinity list: 0-255
====
# Running 'futex/hash' benchmark:
Run summary [PID 25324]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2561817 operations/sec (+- 0,14%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25435]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2539522 operations/sec (+- 0,11%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25546]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2532349 operations/sec (+- 0,11%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25657]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2539481 operations/sec (+- 0,11%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25768]: 110 threads, hash slots: -65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2539843 operations/sec (+- 0,13%), total secs = 10
====
# Running 'futex/hash' benchmark:
Run summary [PID 25879]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2540858 operations/sec (+- 0,50%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 25990]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2550342 operations/sec (+- 0,48%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 26101]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2522785 operations/sec (+- 0,48%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 26212]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2528686 operations/sec (+- 0,49%), total secs = 10
# Running 'futex/hash' benchmark:
Run summary [PID 26323]: 110 threads, hash slots: 65536 each operating on 1024 [private] futexes for 10 secs.
Averaged 2536078 operations/sec (+- 0,48%), total secs = 10
====
Sebastian
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
2024-10-31 15:56 ` Sebastian Andrzej Siewior
@ 2024-10-31 17:47 ` Sebastian Andrzej Siewior
2024-11-01 11:08 ` Sebastian Andrzej Siewior
2024-10-31 20:18 ` Waiman Long
2024-10-31 20:28 ` Waiman Long
2 siblings, 1 reply; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-31 17:47 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long
On 2024-10-31 16:56:43 [+0100], To linux-kernel@vger.kernel.org wrote:
> Pinning the bench to individual CPUs belonging to a NUMA node and
> running the same test with 110 threads only (avg over 5 runs):
> ops/sec global ops/sec local
> node 0 2278572.2 2534827.4
> node 1 2229838.6 2437498.8
> node 0+1 2542602.4 2535749.8
Running on node 1, with variable slot size:
hash slots ops/sec
2 43292.2
4 81829.2
8 156903.4
16 297063.6
32 554229.4
64 962158.4
128 1615859.6
256 2106941.4
512 2269494.8
1024 2328782.6
2048 2342981.6
4096 2337705.2
8192 2334141.4
16384 2334237.6
32768 2339262.2
65536 2438800.4
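The same pattern discussed elsewhere in the thread shows up in this table: doubling the slot count roughly doubles throughput until the hash stops being the bottleneck. A small sketch computing the step-to-step gain from the node-1 numbers above (only the first ten rows transcribed):

```python
# ops/sec by slot count, taken from the node-1 table above.
ops = [(2, 43292.2), (4, 81829.2), (8, 156903.4), (16, 297063.6),
       (32, 554229.4), (64, 962158.4), (128, 1615859.6),
       (256, 2106941.4), (512, 2269494.8), (1024, 2328782.6)]
# Percentage increase of each row over the previous one; the gain
# collapses once the slot count passes ~256.
for (_, prev), (slots, cur) in zip(ops, ops[1:]):
    print(f"{slots:5d} slots: +{(cur / prev - 1) * 100:6.1f}%")
```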
Sebastian
* Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
2024-10-31 15:56 ` Sebastian Andrzej Siewior
2024-10-31 17:47 ` Sebastian Andrzej Siewior
@ 2024-10-31 20:18 ` Waiman Long
2024-10-31 20:28 ` Waiman Long
2 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2024-10-31 20:18 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider
On 10/31/24 11:56 AM, Sebastian Andrzej Siewior wrote:
> On 2024-10-28 13:13:54 [+0100], To linux-kernel@vger.kernel.org wrote:
>> Need to do
>> more testing.
> So there is "perf bench futex hash". On a 256 CPU NUMA box:
> perf bench futex hash -t 240 -m -s -b $hb
> and hb 2 … 131072 (moved the allocation to kvmalloc) I get the following
> (averaged over three runs)
>
> buckets op/sec
> 2 9158.33
> 4 21665.66 + ~136%
> 8 44686.66 + ~106
> 16 84144.33 + ~ 88
> 32 139998.33 + ~ 66
> 64 279957.0 + ~ 99
> 128 509533.0 + ~100
> 256 1019846.0 + ~100
> 512 1634940.0 + ~ 60
> 1024 1834859.33 + ~ 12
> 1868129.33 (global hash, 65536 slots)
> 2048 1912071.33 + ~ 4
> 4096 1918686.66 + ~ 0
> 8192 1922285.66 + ~ 0
> 16384 1923017.0 + ~ 0
> 32768 1923319.0 + ~ 0
> 65536 1932906.0 + ~ 0
> 131072 2042571.33 + ~ 5
>
> By doubling the hash size the ops/sec almost double until 256 slots.
> After 2048 slots the increase is almost noise (except for the last
> entry).
>
> Pinning the bench to individual CPUs belonging to a NUMA node and
> running the same test with 110 threads only (avg over 5 runs):
> ops/sec global ops/sec local
> node 0 2278572.2 2534827.4
> node 1 2229838.6 2437498.8
> node 0+1 2542602.4 2535749.8
Looking at the performance data, we should probably use the global hash
table to maximize throughput if latency isn't important.
AFAICT, the reason why patch 4 allocates a local hash whenever the first
thread is created is to avoid a race where the same futex is hashed on
both the local and the global hash tables. Correct me if my
understanding is incorrect. That will force all multithreaded processes
to use local hash tables for private futexes even if they don't care
about latency.
Maybe we should limit the auto local hash table allocation to RT
processes only. To avoid the race, we could add a flag to indicate
whether a private futex has ever been hashed in the kernel and skip
local hash creation in that case, and probably also when prctl() is
called to create the local hash table.
My 2 cents.
Cheers,
* Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
2024-10-31 15:56 ` Sebastian Andrzej Siewior
2024-10-31 17:47 ` Sebastian Andrzej Siewior
2024-10-31 20:18 ` Waiman Long
@ 2024-10-31 20:28 ` Waiman Long
2 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2024-10-31 20:28 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider
On 10/31/24 11:56 AM, Sebastian Andrzej Siewior wrote:
> On 2024-10-28 13:13:54 [+0100], To linux-kernel@vger.kernel.org wrote:
>> Need to do
>> more testing.
> So there is "perf bench futex hash". On a 256 CPU NUMA box:
> perf bench futex hash -t 240 -m -s -b $hb
> and hb 2 … 131072 (moved the allocation to kvmalloc) I get the following
> (averaged over three runs)
>
> buckets op/sec
> 2 9158.33
> 4 21665.66 + ~136%
> 8 44686.66 + ~106
> 16 84144.33 + ~ 88
> 32 139998.33 + ~ 66
> 64 279957.0 + ~ 99
> 128 509533.0 + ~100
> 256 1019846.0 + ~100
> 512 1634940.0 + ~ 60
> 1024 1834859.33 + ~ 12
> 1868129.33 (global hash, 65536 slots)
> 2048 1912071.33 + ~ 4
> 4096 1918686.66 + ~ 0
> 8192 1922285.66 + ~ 0
> 16384 1923017.0 + ~ 0
> 32768 1923319.0 + ~ 0
> 65536 1932906.0 + ~ 0
> 131072 2042571.33 + ~ 5
>
> By doubling the hash size the ops/sec almost double until 256 slots.
> After 2048 slots the increase is almost noise (except for the last
> entry).
Looking at the performance data, we should probably use the global hash
map to maximize throughput if latency isn't important.
AFAICT, the reason why patch 4 creates a local hash map when the first
thread is created is to avoid a race of the same futex being hashed on
both the local and the global hash maps. Correct me if my understanding
is incorrect. So all the multithreaded processes will have to use local
hash maps for their private futexes even if they don't care about latency.
Maybe we should limit the auto local hash map creation to RT processes
only, where latency is important. To avoid the race, we could add a
flag to indicate whether a private futex hashing operation has ever
been done before and prevent the creation of a local hash map after
that.
My 2 cents.
Cheers,
Longman
* Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
2024-10-31 17:47 ` Sebastian Andrzej Siewior
@ 2024-11-01 11:08 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-01 11:08 UTC (permalink / raw)
To: linux-kernel
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar,
Juri Lelli, Peter Zijlstra, Valentin Schneider, Waiman Long
On 2024-10-31 18:47:40 [+0100], To linux-kernel@vger.kernel.org wrote:
> On 2024-10-31 16:56:43 [+0100], To linux-kernel@vger.kernel.org wrote:
Since all of this can be scripted and I can have one kernel with …,
I hooked up various hash algorithms to see where we end up.
240 threads, same box.
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
| buckets | jhash2 (regular) | jhash2 (addr+offs) | xxhash | hash_long | crc32c | crc32 | siphash | hsiphash |
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
| 2 | 9,172.4 | 9,175.8 | 9,116.4 | 9,497.2 | 9,317.6 | 9,564.0 | 9,091.8 | 9,217.8 |
| 4 | 23,370.8 | 22,611.0 | 20,917.2 | 17,780.6 | 18,185.6 | 17,305.4 | 20,415.0 | 20,885.4 |
| 8 | 44,378.2 | 44,898.4 | 44,713.8 | 42,943.8 | 45,151.8 | 45,149.6 | 44,601.4 | 44,739.4 |
| 16 | 84,567.2 | 84,190.0 | 84,645.2 | 84,737.4 | 86,970.2 | 85,036.8 | 83,142.0 | 85,485.0 |
| 32 | 131,059.2 | 127,895.4 | 127,953.8 | 126,631.2 | 132,293.0 | 125,622.2 | 127,038.4 | 126,322.8 |
| 64 | 285,339.0 | 284,488.8 | 288,109.2 | 268,630.4 | 289,783.8 | 285,281.0 | 285,111.2 | 288,104.4 |
| 128 | 510,550.0 | 515,596.6 | 526,738.0 | 557,349.6 | 508,871.6 | 524,447.0 | 512,482.8 | 513,963.0 |
| 256 | 1,038,348.6 | 1,034,837.4 | 1,042,341.4 | 1,060,650.4 | 1,039,328.6 | 1,098,865.8 | 1,042,759.4 | 1,026,998.6 |
| 512 | 1,626,287.8 | 1,640,112.0 | 1,622,828.8 | 1,637,973.4 | 1,677,108.6 | 1,707,027.2 | 1,588,240.6 | 1,628,800.8 |
| 1024 | 1,827,878.6 | 1,849,074.4 | 1,836,483.8 | 1,776,366.4 | 1,884,670.8 | 1,842,734.2 | 1,765,815.0 | 1,822,137.8 |
| 2048 | 1,905,406.4 | 1,928,399.2 | 1,903,506.0 | 1,822,750.8 | 1,946,141.6 | 1,907,584.6 | 1,830,906.8 | 1,887,678.2 |
| 4096 | 1,912,522.6 | 1,929,667.4 | 1,907,121.6 | 1,847,231.6 | 1,949,908.0 | 1,927,728.6 | 1,834,648.0 | 1,893,792.2 |
| 8192 | 1,912,352.6 | 1,935,078.4 | 1,915,500.4 | 1,853,232.2 | 1,973,339.2 | 1,958,150.4 | 1,840,190.8 | 1,896,981.6 |
| 16384 | 1,917,836.8 | 1,941,917.0 | 1,910,106.0 | 1,863,751.4 | 1,955,101.4 | 1,947,673.2 | 1,836,488.2 | 1,898,002.0 |
| 32768 | 1,919,074.6 | 1,937,200.2 | 1,914,704.8 | 1,872,348.0 | 1,974,182.2 | 1,959,147.2 | 1,837,694.6 | 1,896,566.6 |
| 65536 | 1,930,988.0 | 1,959,076.0 | 1,926,927.6 | 1,873,267.6 | 1,914,420.8 | 1,951,292.4 | 1,849,658.6 | 1,910,334.6 |
| 131072 | 2,023,509.4 | 2,050,380.4 | 2,037,104.6 | 1,990,559.6 | 2,003,758.4 | 1,978,931.2 | 1,946,145.2 | 2,007,205.6 |
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
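All of the hash functions compared above plug into the same bucket-selection step: hash the key, then mask with the power-of-two table size so the mask replaces a modulo. A minimal sketch of that step (blake2b is only a stand-in for the kernel's jhash2, and the address is made up):

```python
import hashlib

def bucket(addr: int, nbuckets: int) -> int:
    """Map a (hypothetical) futex address to one of nbuckets slots.

    nbuckets must be a power of two so `& (nbuckets - 1)` selects the
    slot. Sketch only: the kernel hashes the futex key with jhash2 (or
    one of the alternatives benchmarked above), not blake2b.
    """
    assert nbuckets and nbuckets & (nbuckets - 1) == 0
    h = int.from_bytes(
        hashlib.blake2b(addr.to_bytes(8, "little"),
                        digest_size=4).digest(), "little")
    return h & (nbuckets - 1)

print(bucket(0x7ffd1234, 4096))
```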
Intel(R) Xeon(R) CPU E7-8890 v3, 144 CPUs, 4 nodes.
Test using 140 threads, 0 buckets means global hash:
+---------+-------------+
| buckets | ops/sec |
+---------+-------------+
| 0 | 2,644,742.8 |
| 2 | 21,750.2 |
| 4 | 37,537.2 |
| 8 | 69,950.6 |
| 16 | 127,722.0 |
| 32 | 225,479.2 |
| 64 | 401,335.6 |
| 128 | 753,714.8 |
| 256 | 1,376,116.0 |
| 512 | 2,008,764.2 |
| 1024 | 2,386,441.2 |
| 2048 | 2,564,764.0 |
| 4096 | 2,851,801.2 |
| 8192 | 2,862,999.6 |
| 16384 | 2,521,325.0 |
| 32768 | 2,421,839.2 |
| 65536 | 2,483,676.0 |
| 131072 | 2,733,504.2 |
+---------+-------------+
Binding the test to an individual NUMA node, 34 threads:
+---------+-------------+-------------+-------------+-------------+
| buckets | node 0 | node 1 | node 2 | node 3 |
+---------+-------------+-------------+-------------+-------------+
| 0 | 4,149,878.4 | 4,149,079.8 | 4,148,085.2 | 4,149,420.6 |
| 2 | 194,714.4 | 197,382.8 | 191,967.0 | 193,510.6 |
| 4 | 363,778.6 | 360,700.2 | 364,293.6 | 361,830.2 |
| 8 | 681,770.4 | 673,973.0 | 658,601.6 | 662,212.0 |
| 16 | 1,201,256.4 | 1,177,681.0 | 1,195,749.4 | 1,181,200.2 |
| 32 | 2,002,673.2 | 1,989,139.0 | 1,988,264.4 | 1,981,004.8 |
| 64 | 2,963,416.0 | 2,962,292.0 | 2,957,491.6 | 2,964,479.6 |
| 128 | 3,499,580.0 | 3,495,971.2 | 3,495,537.6 | 3,499,902.8 |
| 256 | 3,713,251.2 | 3,711,806.4 | 3,716,935.4 | 3,715,458.2 |
| 512 | 3,800,606.4 | 3,801,960.4 | 3,813,903.4 | 3,809,076.6 |
| 1024 | 3,840,679.0 | 3,839,486.4 | 3,841,558.6 | 3,838,641.4 |
| 2048 | 3,867,732.8 | 3,866,216.2 | 3,858,603.4 | 3,848,031.6 |
| 4096 | 3,806,776.8 | 3,819,237.8 | 3,813,381.4 | 3,800,440.2 |
| 8192 | 3,815,358.4 | 3,806,204.2 | 3,804,171.2 | 3,795,476.2 |
| 16384 | 3,865,728.6 | 3,883,038.4 | 3,871,992.0 | 3,857,763.4 |
| 32768 | 4,017,227.0 | 4,025,249.8 | 4,022,779.4 | 4,009,740.8 |
| 65536 | 4,188,410.0 | 4,186,900.8 | 4,195,128.4 | 4,190,580.8 |
| 131072 | 4,334,937.0 | 4,335,978.8 | 4,327,250.2 | 4,332,567.8 |
+---------+-------------+-------------+-------------+-------------+
140 threads, all nodes for the algorithms test:
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
| buckets | jhash2 (regular) | jhash2 (addr+offs) | xxhash | hash_long | crc32c | crc32 | siphash | hsiphash |
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
| 2 | 21,346.0 | 21,321.8 | 20,598.4 | 23,403.0 | 23,336.6 | 21,232.8 | 21,011.4 | 20,661.0 |
| 4 | 38,220.0 | 37,712.0 | 37,421.6 | 39,206.4 | 39,086.2 | 40,098.2 | 37,144.2 | 37,209.8 |
| 8 | 68,470.8 | 68,994.4 | 69,373.6 | 73,973.0 | 70,306.8 | 70,396.0 | 68,950.8 | 69,366.6 |
| 16 | 126,612.2 | 127,433.2 | 128,121.2 | 133,981.8 | 127,268.0 | 130,204.4 | 126,594.4 | 127,812.8 |
| 32 | 224,943.0 | 224,695.2 | 222,879.6 | 227,023.8 | 220,036.4 | 217,311.2 | 224,100.0 | 223,442.8 |
| 64 | 406,235.6 | 399,020.2 | 407,580.6 | 413,988.6 | 404,817.4 | 394,156.0 | 411,282.8 | 389,992.6 |
| 128 | 758,259.0 | 759,423.2 | 755,778.8 | 774,913.8 | 765,497.8 | 763,987.8 | 748,676.8 | 749,303.6 |
| 256 | 1,381,720.6 | 1,380,707.6 | 1,372,685.0 | 1,357,849.0 | 1,331,275.2 | 1,430,867.4 | 1,377,411.6 | 1,374,432.2 |
| 512 | 2,001,912.4 | 2,011,120.8 | 1,993,617.8 | 2,331,041.0 | 2,097,737.0 | 2,079,965.6 | 1,971,513.8 | 1,989,508.6 |
| 1024 | 2,378,279.6 | 2,412,139.6 | 2,371,655.4 | 2,650,416.8 | 2,477,507.8 | 2,456,023.8 | 2,309,010.4 | 2,353,854.2 |
| 2048 | 2,560,923.0 | 2,604,756.2 | 2,544,586.6 | 2,658,535.8 | 2,631,261.0 | 2,628,532.0 | 2,459,461.2 | 2,523,348.0 |
| 4096 | 2,855,199.2 | 2,942,364.8 | 2,822,369.8 | 2,998,159.4 | 2,936,124.2 | 2,919,140.6 | 2,694,488.8 | 2,794,201.4 |
| 8192 | 2,868,792.8 | 2,953,256.8 | 2,834,506.0 | 2,993,257.8 | 2,924,754.2 | 2,941,119.0 | 2,705,526.4 | 2,806,921.2 |
| 16384 | 2,527,784.0 | 2,595,100.2 | 2,498,789.8 | 2,610,646.8 | 2,540,535.4 | 2,550,376.0 | 2,398,098.4 | 2,475,184.4 |
| 32768 | 2,427,199.8 | 2,492,474.2 | 2,408,768.4 | 2,486,733.6 | 2,381,828.0 | 2,425,293.0 | 2,312,774.0 | 2,384,687.6 |
| 65536 | 2,489,441.8 | 2,554,741.4 | 2,465,692.0 | 2,666,031.8 | 2,419,651.8 | 2,515,099.8 | 2,368,451.8 | 2,438,185.6 |
| 131072 | 2,745,458.4 | 2,820,823.0 | 2,720,660.6 | 3,282,233.0 | 2,625,217.6 | 2,466,424.0 | 2,597,005.2 | 2,680,356.4 |
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
And now something smaller: an Intel(R) Xeon(R) CPU E5-2650 0, 32 CPUs
in total.
28 threads used for the test:
+---------+-------------+
| buckets | ops/sec |
+---------+-------------+
| 0 | 2,344,905.8 |
| 2 | 91,881.2 |
| 4 | 168,243.0 |
| 8 | 310,982.2 |
| 16 | 550,534.4 |
| 32 | 884,066.0 |
| 64 | 1,475,389.4 |
| 128 | 1,949,364.6 |
| 256 | 2,142,025.8 |
| 512 | 2,234,222.2 |
| 1024 | 2,267,931.8 |
| 2048 | 2,287,753.4 |
| 4096 | 2,315,330.4 |
| 8192 | 2,337,878.2 |
| 16384 | 2,444,502.2 |
+---------+-------------+
14 threads limited to a node:
+---------+-------------+-------------+
| buckets | node 0 | node 1 |
+---------+-------------+-------------+
| 0 | 2,761,709.8 | 2,765,630.0 |
| 2 | 397,527.8 | 397,126.8 |
| 4 | 718,205.0 | 719,615.2 |
| 8 | 1,350,627.4 | 1,305,201.4 |
| 16 | 1,992,643.4 | 1,989,499.2 |
| 32 | 2,365,813.6 | 2,357,618.6 |
| 64 | 2,554,185.8 | 2,555,256.8 |
| 128 | 2,646,479.0 | 2,654,572.6 |
| 256 | 2,679,394.4 | 2,698,002.4 |
| 512 | 2,713,385.6 | 2,723,413.6 |
| 1024 | 2,719,330.6 | 2,733,464.6 |
| 2048 | 2,730,376.6 | 2,738,581.6 |
| 4096 | 2,704,520.6 | 2,720,546.4 |
| 8192 | 2,773,213.4 | 2,782,565.6 |
| 16384 | 2,863,843.2 | 2,858,963.2 |
+---------+-------------+-------------+
And now the algorithms, 28 threads.
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
| buckets | jhash2 (regular) | jhash2 (addr+offs) | xxhash | hash_long | crc32c | crc32 | siphash | hsiphash |
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
| 2 | 92,557.8 | 92,815.2 | 93,172.2 | 103,097.6 | 97,403.2 | 92,629.6 | 94,030.8 | 91,847.2 |
| 4 | 165,385.2 | 167,200.0 | 168,681.2 | 177,600.2 | 172,851.2 | 173,423.6 | 167,814.6 | 168,136.8 |
| 8 | 319,044.0 | 317,291.6 | 318,322.0 | 342,179.4 | 318,252.6 | 323,456.6 | 319,079.6 | 317,106.2 |
| 16 | 555,103.6 | 556,075.0 | 563,529.0 | 595,052.8 | 537,199.2 | 557,180.4 | 554,498.8 | 550,170.4 |
| 32 | 896,751.8 | 908,569.4 | 908,687.4 | 852,593.2 | 892,222.6 | 919,105.0 | 874,487.8 | 920,554.6 |
| 64 | 1,488,013.0 | 1,500,952.6 | 1,467,258.8 | 1,528,428.2 | 1,530,458.6 | 1,526,439.6 | 1,459,185.2 | 1,480,434.0 |
| 128 | 1,944,216.0 | 1,974,618.6 | 1,927,277.6 | 1,748,598.4 | 1,989,212.0 | 1,975,526.2 | 1,839,080.4 | 1,903,844.4 |
| 256 | 2,142,823.0 | 2,185,436.6 | 2,126,787.8 | 2,194,752.2 | 2,189,521.2 | 2,164,454.2 | 1,987,121.0 | 2,081,487.0 |
| 512 | 2,232,887.4 | 2,279,553.4 | 2,215,265.8 | 2,274,402.6 | 2,278,595.6 | 2,262,156.4 | 2,047,572.8 | 2,169,430.8 |
| 1024 | 2,269,308.2 | 2,312,200.0 | 2,250,841.0 | 2,278,423.2 | 2,328,832.6 | 2,288,490.0 | 2,075,494.2 | 2,190,907.8 |
| 2048 | 2,281,539.0 | 2,336,340.6 | 2,255,446.8 | 2,221,195.4 | 2,374,069.2 | 2,330,833.2 | 2,083,151.4 | 2,196,610.0 |
| 4096 | 2,315,628.8 | 2,367,224.6 | 2,284,841.4 | 2,397,385.8 | 2,373,043.2 | 2,394,276.6 | 2,104,235.0 | 2,233,600.0 |
| 8192 | 2,341,296.8 | 2,401,307.8 | 2,320,777.4 | 2,336,329.0 | 2,331,216.4 | 2,391,361.6 | 2,122,129.4 | 2,250,452.6 |
| 16384 | 2,435,181.6 | 2,509,588.2 | 2,422,407.8 | 2,378,702.8 | 2,514,325.8 | 2,552,565.8 | 2,200,619.0 | 2,350,706.0 |
+---------+------------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+
Sebastian
Thread overview: 13+ messages
2024-10-28 12:13 [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 1/4] futex: Create helper function to initialize a hash slot Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 2/4] futex: Add basic infrastructure for local task local hash Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 3/4] futex: Use the task local hashmap Sebastian Andrzej Siewior
2024-10-28 12:13 ` [RFC PATCH v2 4/4] futex: Allow automatic allocation of process wide futex hash Sebastian Andrzej Siewior
2024-10-28 17:50 ` [RFC v2 PATCH 0/4] futex: Add support task local hash maps Sebastian Andrzej Siewior
2024-10-29 11:10 ` Juri Lelli
2024-10-29 15:06 ` Sebastian Andrzej Siewior
2024-10-31 15:56 ` Sebastian Andrzej Siewior
2024-10-31 17:47 ` Sebastian Andrzej Siewior
2024-11-01 11:08 ` Sebastian Andrzej Siewior
2024-10-31 20:18 ` Waiman Long
2024-10-31 20:28 ` Waiman Long