* futex performance regression from "futex: Allow automatic allocation of process wide futex hash" @ 2025-06-03 19:00 Chris Mason 2025-06-04 9:28 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 22+ messages in thread From: Chris Mason @ 2025-06-03 19:00 UTC (permalink / raw) To: Sebastian Andrzej Siewior, Peter Zijlstra, linux-kernel Hi everyone, While testing Peter's latest scheduler patches against current Linus git, I found a pretty big performance regression with schbench: https://github.com/masoncl/schbench The command line I was using: schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 Bisecting the problem I landed on commit: commit 7c4f75a21f636486d2969d9b6680403ea8483539 (HEAD -> update) Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Wed Apr 16 18:29:13 2025 +0200 futex: Allow automatic allocation of process wide futex hash Allocate a private futex hash with 16 slots if a task forks its first thread. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-14bigeasy@linutronix.de schbench uses one futex per thread, and the command line ends up allocating 1024 threads, so the default bucket size used by this commit is just too small. Using 2048 buckets makes the problem go away. On my big turin system, this commit slows down RPS by 36%. But even a VM on a skylake machine sees a 29% difference. schbench is a microbenchmark, so grain of salt on all of this, but I think our defaults are probably too low. -chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-03 19:00 futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Chris Mason @ 2025-06-04 9:28 ` Sebastian Andrzej Siewior 2025-06-04 15:48 ` Chris Mason 0 siblings, 1 reply; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2025-06-04 9:28 UTC (permalink / raw) To: Chris Mason; +Cc: Peter Zijlstra, linux-kernel On 2025-06-03 15:00:43 [-0400], Chris Mason wrote: > Hi everyone, Hi, > While testing Peter's latest scheduler patches against current Linus > git, I found a pretty big performance regression with schbench: > > https://github.com/masoncl/schbench > > The command line I was using: > > schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 > … > schbench uses one futex per thread, and the command line ends up > allocating 1024 threads, so the default bucket size used by this commit > is just too small. Using 2048 buckets makes the problem go away. There is also this pthread_mutex_t but yeah > On my big turin system, this commit slows down RPS by 36%. But even a > VM on a skylake machine sees a 29% difference. > > schbench is a microbenchmark, so grain of salt on all of this, but I > think our defaults are probably too low. we could diff --git a/kernel/futex/core.c b/kernel/futex/core.c index abbd97c2fcba8..9046f3d9693e7 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void) scoped_guard(rcu) { threads = min_t(unsigned int, get_nr_threads(current), - num_online_cpus()); + num_online_cpus() * 2); fph = rcu_dereference(current->mm->futex_phash); if (fph) { which would increase it to 2048 as Chris asks for. Then there the possibility of diff --git a/kernel/futex/core.c b/kernel/futex/core.c index abbd97c2fcba8..a19a96cc09c9e 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -1680,6 +1680,8 @@ int futex_hash_allocate_default(void) { unsigned int threads, buckets, current_buckets = 0; struct futex_private_hash *fph; + bool current_immutable = false; + unsigned int flags = 0; if (!current->mm) return 0; @@ -1695,6 +1697,7 @@ int futex_hash_allocate_default(void) return 0; current_buckets = fph->hash_mask + 1; + current_immutable = fph->immutable; } } @@ -1705,10 +1708,13 @@ int futex_hash_allocate_default(void) buckets = roundup_pow_of_two(4 * threads); buckets = clamp(buckets, 16, futex_hashmask + 1); - if (current_buckets >= buckets) - return 0; + if (current_buckets >= buckets) { + if (current_immutable) + return 0; + flags = FH_IMMUTABLE; + } - return futex_hash_allocate(buckets, 0); + return futex_hash_allocate(buckets, flags); } static int futex_hash_get_slots(void) to make hash immutable once the upper limit has been reached. There will be no more auto-resize. One could argue that if the user did not touch it, he might not do it at all. This would avoid the reference accounting. Some testing: 256 cores, 2xNUMA: | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1 | average rps: 785 446.07 Futex HBs: 1024 immutable: 0 | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1 | average rps: 736 769.77 Futex HBs: 2048 immutable: 0 | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1 144 cores, 4xNUMA: | average rps: 2 691 912.55 Futex HBs: 0 immutable: 1 | average rps: 1 306 443.68 Futex HBs: 1024 immutable: 0 | average rps: 2 471 382.28 Futex HBs: 1024 immutable: 1 | average rps: 1 269 503.90 Futex HBs: 2048 immutable: 0 | average rps: 2 656 228.67 Futex HBs: 2048 immutable: 1 tested with this on top: diff --git a/schbench.c b/schbench.c index 1be3e280f5c38..40a5f0091111e 100644 --- a/schbench.c +++ b/schbench.c @@ -19,6 +19,8 @@ #include <unistd.h> #include <errno.h> #include <getopt.h> +#include <linux/prctl.h> +#include <sys/prctl.h> #include <sys/time.h> #include <time.h> #include <string.h> @@ -42,6 +44,9 @@ #define USEC_PER_SEC (1000000) +static int futex_size = -1; +static int futex_flags; + /* -m number of message threads */ static int message_threads = 1; /* -t number of workers per message thread */ @@ -127,7 +132,7 @@ enum { HELP_LONG_OPT = 1, }; -char *option_string = "p:m:M:W:t:Cr:R:w:i:z:A:n:F:Lj:s:J:"; +char *option_string = "p:m:M:W:t:Cr:R:w:i:z:A:n:F:Lj:s:J:H:I"; static struct option long_options[] = { {"pipe", required_argument, 0, 'p'}, {"message-threads", required_argument, 0, 'm'}, @@ -176,6 +181,29 @@ static void print_usage(void) exit(1); } +#ifndef PR_FUTEX_HASH +#define PR_FUTEX_HASH 78 +# define PR_FUTEX_HASH_SET_SLOTS 1 +# define FH_FLAG_IMMUTABLE (1ULL << 0) +# define PR_FUTEX_HASH_GET_SLOTS 2 +# define PR_FUTEX_HASH_GET_IMMUTABLE 3 +#endif + +static int futex_hash_slots_set(unsigned int slots, int flags) +{ + return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, flags); +} + +static int futex_hash_slots_get(void) +{ + return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS); +} + +static int futex_hash_immutable_get(void) +{ + return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE); +} + /* * returns 0 if we fail to parse and 1 if we succeed */ @@ -347,6 +375,13 @@ static void parse_options(int ac, char **av) exit(1); } break; + case 'H': + futex_size = atoi(optarg); + break; + case 'I': + futex_flags = FH_FLAG_IMMUTABLE; + break; + case '?': case HELP_LONG_OPT: print_usage(); @@ -1811,6 +1846,9 @@ int main(int ac, char **av) } } + if (futex_size >= 0) + futex_hash_slots_set(futex_size, futex_flags); + requests_per_sec /= message_threads; loops_per_sec = 0; stopping = 0; @@ -1920,6 +1958,8 @@ int main(int ac, char **av) } free(message_threads_mem); + fprintf(stderr, "Futex HBs: %d immutable: %d\n", futex_hash_slots_get(), + futex_hash_immutable_get()); return 0; } Comments? > -chris Sebastian ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-04 9:28 ` Sebastian Andrzej Siewior @ 2025-06-04 15:48 ` Chris Mason 2025-06-04 20:08 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 22+ messages in thread From: Chris Mason @ 2025-06-04 15:48 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Peter Zijlstra, linux-kernel On 6/4/25 5:28 AM, Sebastian Andrzej Siewior wrote: > On 2025-06-03 15:00:43 [-0400], Chris Mason wrote: >> Hi everyone, > Hi, > >> While testing Peter's latest scheduler patches against current Linus >> git, I found a pretty big performance regression with schbench: >> >> https://github.com/masoncl/schbench >> >> The command line I was using: >> >> schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 >> > … >> schbench uses one futex per thread, and the command line ends up >> allocating 1024 threads, so the default bucket size used by this commit >> is just too small. Using 2048 buckets makes the problem go away. > > There is also this pthread_mutex_t but yeah Yeah, but -L disables the pthread nonsense. > >> On my big turin system, this commit slows down RPS by 36%. But even a >> VM on a skylake machine sees a 29% difference. >> >> schbench is a microbenchmark, so grain of salt on all of this, but I >> think our defaults are probably too low. > > we could > > diff --git a/kernel/futex/core.c b/kernel/futex/core.c > index abbd97c2fcba8..9046f3d9693e7 100644 > --- a/kernel/futex/core.c > +++ b/kernel/futex/core.c > @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void) > scoped_guard(rcu) { > threads = min_t(unsigned int, > get_nr_threads(current), > - num_online_cpus()); > + num_online_cpus() * 2); > > fph = rcu_dereference(current->mm->futex_phash); > if (fph) { > > which would increase it to 2048 as Chris asks for. I haven't followed these changes, so asking some extra questions. This would bump to num_online_cpus() * 2, which probably isn't 2048 right? We've got large systems that are basically dedicated to single workloads, and those will probably miss the larger global hash table, regressing like schbench did. Then we have large systems spread over multiple big workloads that will love the private tables. In either case, I think growing the hash table as a multiple of thread count instead of cpu count will probably better reflect the crazy things multi-threaded applications do? At any rate, I don't think we want applications to need prctl to get back to the performance they had on older kernels. For people that want to avoid that memory overhead, I'm assuming they want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should make that more clear. > Then there the possibility of > > diff --git a/kernel/futex/core.c b/kernel/futex/core.c> index abbd97c2fcba8..a19a96cc09c9e 100644 > --- a/kernel/futex/core.c > +++ b/kernel/futex/core.c > @@ -1680,6 +1680,8 @@ int futex_hash_allocate_default(void) > { > unsigned int threads, buckets, current_buckets = 0; > struct futex_private_hash *fph; > + bool current_immutable = false; > + unsigned int flags = 0; > > if (!current->mm) > return 0; > @@ -1695,6 +1697,7 @@ int futex_hash_allocate_default(void) > return 0; > > current_buckets = fph->hash_mask + 1; > + current_immutable = fph->immutable; > } > } > > @@ -1705,10 +1708,13 @@ int futex_hash_allocate_default(void) > buckets = roundup_pow_of_two(4 * threads); > buckets = clamp(buckets, 16, futex_hashmask + 1); > > - if (current_buckets >= buckets) > - return 0; > + if (current_buckets >= buckets) { > + if (current_immutable) > + return 0; > + flags = FH_IMMUTABLE; > + } > > - return futex_hash_allocate(buckets, 0); > + return futex_hash_allocate(buckets, flags); > } > > static int futex_hash_get_slots(void) > > to make hash immutable once the upper limit has been reached. There will > be no more auto-resize. One could argue that if the user did not touch > it, he might not do it at all. > > This would avoid the reference accounting. Some testing: > > 256 cores, 2xNUMA: > | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1 > | average rps: 785 446.07 Futex HBs: 1024 immutable: 0 > | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1> | average rps: 736 769.77 Futex HBs: 2048 immutable: 0 > | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1 How long are these runs? That's a huge benefit from being immutable (1.5M vs 736K?) but the hash table churn should be confined to early in the schbench run right? > > 144 cores, 4xNUMA: > | average rps: 2 691 912.55 Futex HBs: 0 immutable: 1 > | average rps: 1 306 443.68 Futex HBs: 1024 immutable: 0 > | average rps: 2 471 382.28 Futex HBs: 1024 immutable: 1 > | average rps: 1 269 503.90 Futex HBs: 2048 immutable: 0 > | average rps: 2 656 228.67 Futex HBs: 2048 immutable: 1 > > tested with this on top: This schbench hunk is just testing the performance impact of different bucket sizes, but hopefully we don't need it long term unless we want to play with even bigger hash tables? -chris > > diff --git a/schbench.c b/schbench.c > index 1be3e280f5c38..40a5f0091111e 100644 > --- a/schbench.c > +++ b/schbench.c > @@ -19,6 +19,8 @@ > #include <unistd.h> > #include <errno.h> > #include <getopt.h> > +#include <linux/prctl.h> > +#include <sys/prctl.h> > #include <sys/time.h> > #include <time.h> > #include <string.h> > @@ -42,6 +44,9 @@ > > #define USEC_PER_SEC (1000000) > > +static int futex_size = -1; > +static int futex_flags; > + > /* -m number of message threads */ > static int message_threads = 1; > /* -t number of workers per message thread */ > @@ -127,7 +132,7 @@ enum { > HELP_LONG_OPT = 1, > }; > > -char *option_string = "p:m:M:W:t:Cr:R:w:i:z:A:n:F:Lj:s:J:"; > +char *option_string = "p:m:M:W:t:Cr:R:w:i:z:A:n:F:Lj:s:J:H:I"; > static struct option long_options[] = { > {"pipe", required_argument, 0, 'p'}, > {"message-threads", required_argument, 0, 'm'}, > @@ -176,6 +181,29 @@ static void print_usage(void) > exit(1); > } > > +#ifndef PR_FUTEX_HASH > +#define PR_FUTEX_HASH 78 > +# define PR_FUTEX_HASH_SET_SLOTS 1 > +# define FH_FLAG_IMMUTABLE (1ULL << 0) > +# define PR_FUTEX_HASH_GET_SLOTS 2 > +# define PR_FUTEX_HASH_GET_IMMUTABLE 3 > +#endif > + > +static int futex_hash_slots_set(unsigned int slots, int flags) > +{ > + return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, flags); > +} > + > +static int futex_hash_slots_get(void) > +{ > + return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS); > +} > + > +static int futex_hash_immutable_get(void) > +{ > + return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE); > +} > + > /* > * returns 0 if we fail to parse and 1 if we succeed > */ > @@ -347,6 +375,13 @@ static void parse_options(int ac, char **av) > exit(1); > } > break; > + case 'H': > + futex_size = atoi(optarg); > + break; > + case 'I': > + futex_flags = FH_FLAG_IMMUTABLE; > + break; > + > case '?': > case HELP_LONG_OPT: > print_usage(); > @@ -1811,6 +1846,9 @@ int main(int ac, char **av) > } > } > > + if (futex_size >= 0) > + futex_hash_slots_set(futex_size, futex_flags); > + > requests_per_sec /= message_threads; > loops_per_sec = 0; > stopping = 0; > @@ -1920,6 +1958,8 @@ int main(int ac, char **av) > } > free(message_threads_mem); > > + fprintf(stderr, "Futex HBs: %d immutable: %d\n", futex_hash_slots_get(), > + futex_hash_immutable_get()); > > return 0; > } > > > Comments? > >> -chris > > Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-04 15:48 ` Chris Mason @ 2025-06-04 20:08 ` Sebastian Andrzej Siewior 2025-06-06 0:55 ` Chris Mason 0 siblings, 1 reply; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2025-06-04 20:08 UTC (permalink / raw) To: Chris Mason; +Cc: Peter Zijlstra, linux-kernel On 2025-06-04 11:48:05 [-0400], Chris Mason wrote: > > --- a/kernel/futex/core.c > > +++ b/kernel/futex/core.c > > @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void) > > scoped_guard(rcu) { > > threads = min_t(unsigned int, > > get_nr_threads(current), > > - num_online_cpus()); > > + num_online_cpus() * 2); > > > > fph = rcu_dereference(current->mm->futex_phash); > > if (fph) { > > > > which would increase it to 2048 as Chris asks for. > > I haven't followed these changes, so asking some extra questions. This > would bump to num_online_cpus() * 2, which probably isn't 2048 right? Assuming you have 256 CPUs and -m 4 -t 256 creates 4 * 256 = 1024 then threads = 512 gets computed (= min(1024, 256 * 2)). Later we do 512 * 4 (roundup_pow_of_two(4 * threads)). This makes 2048 hash buckets. > We've got large systems that are basically dedicated to single > workloads, and those will probably miss the larger global hash table, > regressing like schbench did. Then we have large systems spread over > multiple big workloads that will love the private tables. > > In either case, I think growing the hash table as a multiple of thread > count instead of cpu count will probably better reflect the crazy things > multi-threaded applications do? At any rate, I don't think we want > applications to need prctl to get back to the performance they had on > older kernels. This is only an issue if all you CPUs spend their time in the kernel using the hash buckets at the same time. This was the case in every benchmark I've seen so far. Your thing might be closer to an actual workload. > For people that want to avoid that memory overhead, I'm assuming they > want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should > make that more clear. > > > Then there the possibility of … > > 256 cores, 2xNUMA: > > | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1 > > | average rps: 785 446.07 Futex HBs: 1024 immutable: 0 > > | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1> | average > rps: 736 769.77 Futex HBs: 2048 immutable: 0 > > | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1 > > > How long are these runs? That's a huge benefit from being immutable > (1.5M vs 736K?) but the hash table churn should be confined to early in > the schbench run right? I think 30 secs or so. I used your command line. The 256 core box showed a higher improvement than the 144 one. I attached a patch against schbench in the previous mail, I did then ./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 -H 1024 -I … > This schbench hunk is just testing the performance impact of different > bucket sizes, but hopefully we don't need it long term unless we want to > play with even bigger hash tables? If you do "-H 0" then you should get the "old" behaviour. However the hash bucket are spread now over the NUMA nodes: | dmesg |grep -i futex | [ 0.501736] futex hash table entries: 32768 (2097152 bytes on 2 NUMA nodes, total 4096 KiB, linear). Now there are 32768 hash buckets on both NUMA nodes. Depending on the hash it computes, it uses the data structures on NUMA node 1 or 2. The old code allocated 65536 hash buckets via vmalloc(). The bigger hash table isn't always the answer. Yes, you could play around figure out what works best for you. The problem is that the hash is based on the mm and the (user) memory address. So on each run you will get a different hash and therefore different collisions. If you end up having many hash collisions and then block on the same lock then yes, larger hash table will be the cure. If you have many threads doing the futex syscall simultaneously then making the hash immutable avoids two atomic ops on the same memory address. This would be my favorite. Now that I think about it, it might be possible to move the atomic ops the hash bucket itself. Then it wouldn't bounce the cache line so much. Making the hash immutable is simpler. > -chris Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-04 20:08 ` Sebastian Andrzej Siewior @ 2025-06-06 0:55 ` Chris Mason 2025-06-06 7:06 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 22+ messages in thread From: Chris Mason @ 2025-06-06 0:55 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Peter Zijlstra, linux-kernel On 6/4/25 4:08 PM, Sebastian Andrzej Siewior wrote: > On 2025-06-04 11:48:05 [-0400], Chris Mason wrote: >>> --- a/kernel/futex/core.c >>> +++ b/kernel/futex/core.c >>> @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void) >>> scoped_guard(rcu) { >>> threads = min_t(unsigned int, >>> get_nr_threads(current), >>> - num_online_cpus()); >>> + num_online_cpus() * 2); >>> >>> fph = rcu_dereference(current->mm->futex_phash); >>> if (fph) { >>> >>> which would increase it to 2048 as Chris asks for. >> >> I haven't followed these changes, so asking some extra questions. This >> would bump to num_online_cpus() * 2, which probably isn't 2048 right? > > Assuming you have 256 CPUs and -m 4 -t 256 creates 4 * 256 = 1024 then > threads = 512 gets computed (= min(1024, 256 * 2)). Later we do 512 * 4 > (roundup_pow_of_two(4 * threads)). This makes 2048 hash buckets. > Ah right, I missed the x4. >> We've got large systems that are basically dedicated to single >> workloads, and those will probably miss the larger global hash table, >> regressing like schbench did. Then we have large systems spread over >> multiple big workloads that will love the private tables. >> >> In either case, I think growing the hash table as a multiple of thread >> count instead of cpu count will probably better reflect the crazy things >> multi-threaded applications do? At any rate, I don't think we want >> applications to need prctl to get back to the performance they had on >> older kernels. > > This is only an issue if all you CPUs spend their time in the kernel > using the hash buckets at the same time. > This was the case in every benchmark I've seen so far. Your thing might > be closer to an actual workload. > I didn't spend a ton of time looking at the perf profiles of the slower kernel, was the bottleneck in the hash chain length or in contention for the buckets? >> For people that want to avoid that memory overhead, I'm assuming they >> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should >> make that more clear. >> >>> Then there the possibility of > … >>> 256 cores, 2xNUMA: >>> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1 >>> | average rps: 785 446.07 Futex HBs: 1024 immutable: 0 >>> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1> | average >> rps: 736 769.77 Futex HBs: 2048 immutable: 0 >>> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1 >> >> >> How long are these runs? That's a huge benefit from being immutable >> (1.5M vs 736K?) but the hash table churn should be confined to early in >> the schbench run right? > > I think 30 secs or so. I used your command line. Ah ok, my command line is 60 seconds. It feels like something is strange for the immutable flag to make it that much faster? schbench starts all the threads up front, so it should hit steady state pretty quickly. More on NUMA below, but I'll benchmark with the immutable flag on the turin box in the morning to see if it is the extra atomics. > The 256 core box showed > a higher improvement than the 144 one. I attached a patch against > schbench in the previous mail, I did then > ./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 -H 1024 -I > > … >> This schbench hunk is just testing the performance impact of different >> bucket sizes, but hopefully we don't need it long term unless we want to >> play with even bigger hash tables? > > If you do "-H 0" then you should get the "old" behaviour. However the hash > bucket are spread now over the NUMA nodes: > > | dmesg |grep -i futex > | [ 0.501736] futex hash table entries: 32768 (2097152 bytes on 2 NUMA nodes, total 4096 KiB, linear). > > Now there are 32768 hash buckets on both NUMA nodes. Depending on the > hash it computes, it uses the data structures on NUMA node 1 or 2. The > old code allocated 65536 hash buckets via vmalloc(). So I wanted to mention this earlier and forgot, but schbench obviously isn't numa aware at all. This combination of command line options basically just has the 4 message threads waking up the 256 worker threads (per message thread) after scribbling a timestamp into shared memory. From a schbench pov at least, we'll get much more stable numbers by sticking to a single socket. <makes numa placement hand gestures> > > The bigger hash table isn't always the answer. Yes, you could play > around figure out what works best for you. The problem is that the hash > is based on the mm and the (user) memory address. So on each run you > will get a different hash and therefore different collisions. > If you end up having many hash collisions and then block on the same > lock then yes, larger hash table will be the cure. If you have many > threads doing the futex syscall simultaneously then making the hash > immutable avoids two atomic ops on the same memory address. > This would be my favorite. > > Now that I think about it, it might be possible to move the atomic ops > the hash bucket itself. Then it wouldn't bounce the cache line so much. > Making the hash immutable is simpler. Going back to your diff, if we have a process growing the total number of threads, can we set FH_IMMUTABLE too early? As the number of threads increases, eventually we'll pick the 2x num_cpus, but that'll take a while? I do see what you mean about immutable being simpler though, I'll get some numbers on those atomics. -chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-06 0:55 ` Chris Mason @ 2025-06-06 7:06 ` Sebastian Andrzej Siewior 2025-06-06 21:06 ` Chris Mason ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2025-06-06 7:06 UTC (permalink / raw) To: Chris Mason; +Cc: Peter Zijlstra, linux-kernel On 2025-06-05 20:55:27 [-0400], Chris Mason wrote: > >> We've got large systems that are basically dedicated to single > >> workloads, and those will probably miss the larger global hash table, > >> regressing like schbench did. Then we have large systems spread over > >> multiple big workloads that will love the private tables. > >> > >> In either case, I think growing the hash table as a multiple of thread > >> count instead of cpu count will probably better reflect the crazy things > >> multi-threaded applications do? At any rate, I don't think we want > >> applications to need prctl to get back to the performance they had on > >> older kernels. > > > > This is only an issue if all you CPUs spend their time in the kernel > > using the hash buckets at the same time. > > This was the case in every benchmark I've seen so far. Your thing might > > be closer to an actual workload. > > > > I didn't spend a ton of time looking at the perf profiles of the slower > kernel, was the bottleneck in the hash chain length or in contention for > the buckets? Every futex operation does a rcuref_get() (which is an atomic inc) on the private hash. This is before anything else happens. If you have two threads, on two CPUs, which simultaneously do a futex() operation then both do this rcuref_get(). That atomic inc ensures that the cacheline bounces from one CPU to the other. On the exit of the syscall there is a matching rcuref_put(). > >> For people that want to avoid that memory overhead, I'm assuming they > >> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should > >> make that more clear. > >> > >>> Then there the possibility of > > … > >>> 256 cores, 2xNUMA: > >>> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1 > >>> | average rps: 785 446.07 Futex HBs: 1024 immutable: 0 > >>> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1> | average > >> rps: 736 769.77 Futex HBs: 2048 immutable: 0 > >>> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1 > >> > >> > >> How long are these runs? That's a huge benefit from being immutable > >> (1.5M vs 736K?) but the hash table churn should be confined to early in > >> the schbench run right? > > > > I think 30 secs or so. I used your command line. > > Ah ok, my command line is 60 seconds. It feels like something is > strange for the immutable flag to make it that much faster? schbench > starts all the threads up front, so it should hit steady state pretty > quickly. More on NUMA below, but I'll benchmark with the immutable flag > on the turin box in the morning to see if it is the extra atomics. That immutable flag makes this rcuref_get()/ put() go away. The price is that you can't change the size of the private hash anymore. So if your workload works best with a hash size of X and you don't intend to change it during the runtime of the program, set the immutable flag. > > The 256 core box showed > > a higher improvement than the 144 one. I attached a patch against > > schbench in the previous mail, I did then > > ./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 -H 1024 -I > > > > … > >> This schbench hunk is just testing the performance impact of different > >> bucket sizes, but hopefully we don't need it long term unless we want to > >> play with even bigger hash tables? > > > > If you do "-H 0" then you should get the "old" behaviour. However the hash > > bucket are spread now over the NUMA nodes: > > > > | dmesg |grep -i futex > > | [ 0.501736] futex hash table entries: 32768 (2097152 bytes on 2 NUMA nodes, total 4096 KiB, linear). > > > > Now there are 32768 hash buckets on both NUMA nodes. Depending on the > > hash it computes, it uses the data structures on NUMA node 1 or 2. The > > old code allocated 65536 hash buckets via vmalloc(). > > So I wanted to mention this earlier and forgot, but schbench obviously > isn't numa aware at all. This combination of command line options > basically just has the 4 message threads waking up the 256 worker > threads (per message thread) after scribbling a timestamp into shared > memory. From a schbench pov at least, we'll get much more stable > numbers by sticking to a single socket. <makes numa placement hand > gestures> Earlier the global hash was somehow spread via vmalloc(). Now it is spread slightly different. If you request the private hash, then it is allocated on the current node. Unless you move from one node the other, this should be good. > > The bigger hash table isn't always the answer. Yes, you could play > > around figure out what works best for you. The problem is that the hash > > is based on the mm and the (user) memory address. So on each run you > > will get a different hash and therefore different collisions. > > If you end up having many hash collisions and then block on the same > > lock then yes, larger hash table will be the cure. If you have many > > threads doing the futex syscall simultaneously then making the hash > > immutable avoids two atomic ops on the same memory address. > > This would be my favorite. > > > > Now that I think about it, it might be possible to move the atomic ops > > the hash bucket itself. Then it wouldn't bounce the cache line so much. > > Making the hash immutable is simpler. > > Going back to your diff, if we have a process growing the total number > of threads, can we set FH_IMMUTABLE too early? As the number of threads > increases, eventually we'll pick the 2x num_cpus, but that'll take a while? If you refer to the schbench diff, then set it early. Once the prctl() to set the size of the private hash, there will be no resize by the kernel. If you refer to the kernel diff where set the FH_IMMUTABLE flag, then it is set once the upper limit is reached (that was the plan in case I did the logic wrong). Which means at that point it won't increase any further because of the CPU limit. The only way how you can reach it too early is if you offline CPUs. > I do see what you mean about immutable being simpler though, I'll get > some numbers on those atomics. I have another idea to add a refcount to the hb slot itself. I hope that it won't be that expensive if a atomic inc/dec occurs on different memory locations. But then maybe freeze it if the upper limit is reached. But that might help if we don't reach the upper limit. > -chris Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-06 7:06 ` Sebastian Andrzej Siewior @ 2025-06-06 21:06 ` Chris Mason 2025-06-06 22:17 ` Chris Mason 2025-06-24 19:01 ` Peter Zijlstra 2 siblings, 0 replies; 22+ messages in thread From: Chris Mason @ 2025-06-06 21:06 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Peter Zijlstra, linux-kernel On 6/6/25 3:06 AM, Sebastian Andrzej Siewior wrote: > On 2025-06-05 20:55:27 [-0400], Chris Mason wrote: >>>> We've got large systems that are basically dedicated to single >>>> workloads, and those will probably miss the larger global hash table, >>>> regressing like schbench did. Then we have large systems spread over >>>> multiple big workloads that will love the private tables. >>>> >>>> In either case, I think growing the hash table as a multiple of thread >>>> count instead of cpu count will probably better reflect the crazy things >>>> multi-threaded applications do? At any rate, I don't think we want >>>> applications to need prctl to get back to the performance they had on >>>> older kernels. >>> >>> This is only an issue if all you CPUs spend their time in the kernel >>> using the hash buckets at the same time. >>> This was the case in every benchmark I've seen so far. Your thing might >>> be closer to an actual workload. >>> >> >> I didn't spend a ton of time looking at the perf profiles of the slower >> kernel, was the bottleneck in the hash chain length or in contention for >> the buckets? > > Every futex operation does a rcuref_get() (which is an atomic inc) on > the private hash. This is before anything else happens. If you have two > threads, on two CPUs, which simultaneously do a futex() operation then > both do this rcuref_get(). That atomic inc ensures that the cacheline > bounces from one CPU to the other. On the exit of the syscall there is a > matching rcuref_put(). > >>>> For people that want to avoid that memory overhead, I'm assuming they >>>> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should >>>> make that more clear. >>>> >>>>> Then there the possibility of >>> … >>>>> 256 cores, 2xNUMA: >>>>> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1 >>>>> | average rps: 785 446.07 Futex HBs: 1024 immutable: 0 >>>>> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1> | average >>>> rps: 736 769.77 Futex HBs: 2048 immutable: 0 >>>>> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1 >>>> >>>> >>>> How long are these runs? That's a huge benefit from being immutable >>>> (1.5M vs 736K?) but the hash table churn should be confined to early in >>>> the schbench run right? >>> >>> I think 30 secs or so. I used your command line. >> >> Ah ok, my command line is 60 seconds. It feels like something is >> strange for the immutable flag to make it that much faster? schbench >> starts all the threads up front, so it should hit steady state pretty >> quickly. More on NUMA below, but I'll benchmark with the immutable flag >> on the turin box in the morning to see if it is the extra atomics. > > That immutable flag makes this rcuref_get()/ put() go away. The price is > that you can't change the size of the private hash anymore. So if your > workload works best with a hash size of X and you don't intend to change > it during the runtime of the program, set the immutable flag. ok, for the benchmarks, I hard coded the number of buckets at 1024 and the only thing I flipped was the immutable flag. I was running on top of c0c9379f235df33a12ceae94370ad80c5278324d (just today's Linus). schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 1024 bucket without immutable: RPS 2 297 856 1024 with immutable: RPS 3 665 920 This is pretty similar to what 6.15 obtains, so I'm hoping we can find a way to get the 6.15 performance levels without the applications needing to call prctl manually. I took some profiles on the 1024 + non-immutable 16.92% schbench [kernel.kallsyms] [k] futex_hash_put | |--8.93%--fpost | |--6.99%--xlist_wake_all | --0.58%--entry_SYSCALL_64 syscall 13.27% schbench libc.so.6 [.] syscall | |--8.90%--xlist_wake_all | --3.98%--entry_SYSCALL_64 syscall 12.36% schbench [kernel.kallsyms] [k] entry_SYSCALL_64 | |--11.04%--__futex_hash | futex_hash | futex_wake | do_futex | __x64_sys_futex | do_syscall_64 | entry_SYSCALL_64_after_hwframe | syscall | --1.23%--futex_hash futex_wake do_futex __x64_sys_futex do_syscall_64 entry_SYSCALL_64_after_hwframe syscall Percent │ 0xffffffff813f2f80 <futex_hash_put>: 0.05 │ → callq __fentry__ 0.06 │ pushq %rbx 0.08 │ movq 0x18(%rdi),%rbx 0.05 │ testq %rbx,%rbx 0.03 │ ↓ je 15 12.49 │ cmpb $0x0,0x21(%rbx) 12.58 │ ↓ je 17 │15: popq %rbx │ ← retq 0.24 │17: movl $0xffffffff,%esi 12.76 │ lock │ xaddl %esi,(%rbx) 7.80 │ ┌──subl $0x1,%esi 22.96 │ ├──js 27 23.05 │ │ popq %rbx 7.85 │ │← retq │27:└─→movq %rbx,%rdi │ → callq rcuref_put_slowpath │ testb %al,%al │ ↑ je 15 │ movq 0x18(%rbx),%rdi │ popq %rbx │ → jmp wake_up_var If you like profiles with line numbers: 12830 samples (10.24%) Comms: schbench futex_hash_put @ /root/linux-6.15/kernel/futex/core.c:179:2 futex_private_hash_put @ /root/linux-6.15/kernel/futex/core.c:148:6 [inlined] futex_private_hash_put @ /root/linux-6.15/kernel/futex/core.c:153:6 [inlined] rcuref_put @ /root/linux-6.15/./include/linux/rcuref.h:173:13 [inlined] __rcuref_put @ /root/linux-6.15/./include/linux/rcuref.h:110:5 [inlined] futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:200:1 do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10 __x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 __se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined] __do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined] do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7 do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined] entry_SYSCALL_64_after_hwframe 13795 samples (11.01%) Comms: schbench futex_hash @ /root/linux-6.15/kernel/futex/core.c:311:15 futex_private_hash_get @ /root/linux-6.15/kernel/futex/core.c:143:5 [inlined] futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:172:2 class_hb_constructor @ /root/linux-6.15/kernel/futex/futex.h:242:1 [inlined] do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10 __x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 __se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined] __do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined] do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7 do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined] entry_SYSCALL_64_after_hwframe 14364 samples (11.46%) Comms: schbench entry_SYSCALL_64 17612 samples (14.05%) Comms: schbench futex_hash @ /root/linux-6.15/kernel/futex/core.c:311:15 futex_private_hash_get @ /root/linux-6.15/kernel/futex/core.c:145:9 [inlined] rcuref_get @ /root/linux-6.15/./include/linux/rcuref.h:87:5 [inlined] futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:172:2 class_hb_constructor @ /root/linux-6.15/kernel/futex/futex.h:242:1 [inlined] do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10 __x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 __se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined] __do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined] do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7 do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined] entry_SYSCALL_64_after_hwframe With immutable, futexes are not in the top 10. -chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-06 7:06 ` Sebastian Andrzej Siewior 2025-06-06 21:06 ` Chris Mason @ 2025-06-06 22:17 ` Chris Mason 2025-06-24 19:01 ` Peter Zijlstra 2 siblings, 0 replies; 22+ messages in thread From: Chris Mason @ 2025-06-06 22:17 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Peter Zijlstra, linux-kernel On 6/6/25 3:06 AM, Sebastian Andrzej Siewior wrote: > On 2025-06-05 20:55:27 [-0400], Chris Mason wrote: [ ... ] >> Going back to your diff, if we have a process growing the total number >> of threads, can we set FH_IMMUTABLE too early? As the number of threads >> increases, eventually we'll pick the 2x num_cpus, but that'll take a while? > > If you refer to the schbench diff, then set it early. Once the prctl() > to set the size of the private hash, there will be no resize by the > kernel. > > If you refer to the kernel diff where set the FH_IMMUTABLE flag, then it > is set once the upper limit is reached (that was the plan in case I did > the logic wrong). Which means at that point it won't increase any > further because of the CPU limit. The only way how you can reach it too > early is if you offline CPUs. I pulled your immutable diff on top of c0c9379f235d. Just the immutable diff, not the first suggestion that bumped num_online_cpus() * 2. The RPS did not improve (~2.5M RPS). Looking at the futex_private_hash of the process: >>> task.mm.futex_phash *(struct futex_private_hash *)0xff110003bbd15000 = { .users = (rcuref_t){ .refcnt = (atomic_t){ .counter = (int)0, }, }, .hash_mask = (unsigned int)15, .rcu = (struct callback_head){ .next = (struct callback_head *)0x0, .func = (void (*)(struct callback_head *))0x0, }, .mm = (void *)0xff110003321a2400, .custom = (bool)0, .immutable = (bool)1, .queues = (struct futex_hash_bucket []){}, } It's because our calculation on the max is based on min(get_nr_threads(current), num_online_cpus()) get_nr_threads starts smaller than num_online_cpus(), so we immediately decide we've maxed out the buckets? -chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-06 7:06 ` Sebastian Andrzej Siewior 2025-06-06 21:06 ` Chris Mason 2025-06-06 22:17 ` Chris Mason @ 2025-06-24 19:01 ` Peter Zijlstra 2025-06-26 11:01 ` Chris Mason ` (2 more replies) 2 siblings, 3 replies; 22+ messages in thread From: Peter Zijlstra @ 2025-06-24 19:01 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Chris Mason, linux-kernel, Thomas Gleixner On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote: > On 2025-06-05 20:55:27 [-0400], Chris Mason wrote: > > >> We've got large systems that are basically dedicated to single > > >> workloads, and those will probably miss the larger global hash table, > > >> regressing like schbench did. Then we have large systems spread over > > >> multiple big workloads that will love the private tables. > > >> > > >> In either case, I think growing the hash table as a multiple of thread > > >> count instead of cpu count will probably better reflect the crazy things > > >> multi-threaded applications do? At any rate, I don't think we want > > >> applications to need prctl to get back to the performance they had on > > >> older kernels. > > > > > > This is only an issue if all you CPUs spend their time in the kernel > > > using the hash buckets at the same time. > > > This was the case in every benchmark I've seen so far. Your thing might > > > be closer to an actual workload. > > > > > > > I didn't spend a ton of time looking at the perf profiles of the slower > > kernel, was the bottleneck in the hash chain length or in contention for > > the buckets? > > Every futex operation does a rcuref_get() (which is an atomic inc) on > the private hash. This is before anything else happens. If you have two > threads, on two CPUs, which simultaneously do a futex() operation then > both do this rcuref_get(). That atomic inc ensures that the cacheline > bounces from one CPU to the other. On the exit of the syscall there is a > matching rcuref_put(). How about something like this (very lightly tested)... the TL;DR is that it turns all those refcounts into per-cpu ops when there is no hash replacement pending (eg. the normal case), and only folds the lot into an atomic when we really care about it. There's some sharp corners still.. but it boots and survives the (slightly modified) selftest. --- include/linux/futex.h | 12 +- include/linux/mm_types.h | 5 + kernel/futex/core.c | 238 +++++++++++++++++++-- .../selftests/futex/functional/futex_priv_hash.c | 11 +- 4 files changed, 239 insertions(+), 27 deletions(-) diff --git a/include/linux/futex.h b/include/linux/futex.h index b37193653e6b..5f92c7a8ba57 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -85,17 +85,11 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4) #ifdef CONFIG_FUTEX_PRIVATE_HASH int futex_hash_allocate_default(void); void futex_hash_free(struct mm_struct *mm); - -static inline void futex_mm_init(struct mm_struct *mm) -{ - RCU_INIT_POINTER(mm->futex_phash, NULL); - mm->futex_phash_new = NULL; - mutex_init(&mm->futex_hash_lock); -} +int futex_mm_init(struct mm_struct *mm); #else /* !CONFIG_FUTEX_PRIVATE_HASH */ static inline int futex_hash_allocate_default(void) { return 0; } -static inline void futex_hash_free(struct mm_struct *mm) { } +static inline int futex_hash_free(struct mm_struct *mm) { return 0; } static inline void futex_mm_init(struct mm_struct *mm) { } #endif /* CONFIG_FUTEX_PRIVATE_HASH */ @@ -118,7 +112,7 @@ static inline int futex_hash_allocate_default(void) { return 0; } -static inline void futex_hash_free(struct mm_struct *mm) { } +static inline int futex_hash_free(struct mm_struct *mm) { return 0; } static inline void futex_mm_init(struct mm_struct *mm) { } #endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index d6b91e8a66d6..0f0662157066 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1070,6 +1070,11 @@ struct mm_struct { struct mutex futex_hash_lock; struct futex_private_hash __rcu *futex_phash; struct futex_private_hash *futex_phash_new; + /* futex-ref */ + unsigned long futex_batches; + struct rcu_head futex_rcu; + atomic_long_t futex_atomic; + unsigned int __percpu *futex_ref; #endif unsigned long hiwater_rss; /* High-watermark of RSS usage */ diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 90d53fb0ee9e..cfa7c105a7dc 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -42,7 +42,6 @@ #include <linux/fault-inject.h> #include <linux/slab.h> #include <linux/prctl.h> -#include <linux/rcuref.h> #include <linux/mempolicy.h> #include <linux/mmap_lock.h> @@ -65,7 +64,7 @@ static struct { #define futex_queues (__futex_data.queues) struct futex_private_hash { - rcuref_t users; + int state; unsigned int hash_mask; struct rcu_head rcu; void *mm; @@ -129,6 +128,12 @@ static struct futex_hash_bucket * __futex_hash(union futex_key *key, struct futex_private_hash *fph); #ifdef CONFIG_FUTEX_PRIVATE_HASH +static bool futex_ref_get(struct futex_private_hash *fph); +static bool futex_ref_put(struct futex_private_hash *fph); +static bool futex_ref_is_dead(struct futex_private_hash *fph); + +enum { FR_PERCPU = 0, FR_ATOMIC }; + static inline bool futex_key_is_private(union futex_key *key) { /* @@ -142,15 +147,14 @@ bool futex_private_hash_get(struct futex_private_hash *fph) { if (fph->immutable) return true; - return rcuref_get(&fph->users); + return futex_ref_get(fph); } void futex_private_hash_put(struct futex_private_hash *fph) { - /* Ignore return value, last put is verified via rcuref_is_dead() */ if (fph->immutable) return; - if (rcuref_put(&fph->users)) + if (futex_ref_put(fph)) wake_up_var(fph->mm); } @@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm, fph = rcu_dereference_protected(mm->futex_phash, lockdep_is_held(&mm->futex_hash_lock)); if (fph) { - if (!rcuref_is_dead(&fph->users)) { + if (!futex_ref_is_dead(fph)) { mm->futex_phash_new = new; return false; } futex_rehash_private(fph, new); } - rcu_assign_pointer(mm->futex_phash, new); + new->state = FR_PERCPU; + scoped_guard (rcu) { + mm->futex_batches = get_state_synchronize_rcu(); + rcu_assign_pointer(mm->futex_phash, new); + } kvfree_rcu(fph, rcu); return true; } @@ -289,9 +297,7 @@ struct futex_private_hash *futex_private_hash(void) if (!fph) return NULL; - if (fph->immutable) - return fph; - if (rcuref_get(&fph->users)) + if (futex_private_hash_get(fph)) return fph; } futex_pivot_hash(mm); @@ -1527,16 +1533,214 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb, #define FH_IMMUTABLE 0x02 #ifdef CONFIG_FUTEX_PRIVATE_HASH + +/* + * futex-ref + * + * Heavily inspired by percpu-rwsem/percpu-refcount; not reusing any of that + * code because it just doesn't fit right. + * + * Dual counter, per-cpu / atomic approach like percpu-refcount, except it + * re-initializes the state automatically, such that the fph swizzle is also a + * transition back to per-cpu. + */ + +static void futex_ref_rcu(struct rcu_head *head); + +static void __futex_ref_atomic_begin(struct futex_private_hash *fph) +{ + struct mm_struct *mm = fph->mm; + + /* + * The counter we're about to switch to must have fully switched; + * otherwise it would be impossible for it to have reported success + * from futex_ref_is_dead(). + */ + WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0); + + /* + * Set the atomic to the bias value such that futex_ref_{get,put}() + * will never observe 0. Will be fixed up in __futex_ref_atomic_end() + * when folding in the percpu count. + */ + atomic_long_set(&mm->futex_atomic, LONG_MAX); + smp_store_release(&fph->state, FR_ATOMIC); + + call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu); +} + +static void __futex_ref_atomic_end(struct futex_private_hash *fph) +{ + struct mm_struct *mm = fph->mm; + unsigned int count = 0; + long ret; + int cpu; + + /* + * Per __futex_ref_atomic_begin() the state of the fph must be ATOMIC + * and per this RCU callback, everybody must now observe this state and + * use the atomic variable. + */ + WARN_ON_ONCE(fph->state != FR_ATOMIC); + + /* + * Therefore the per-cpu counter is now stable, sum and reset. + */ + for_each_possible_cpu(cpu) { + unsigned int *ptr = per_cpu_ptr(mm->futex_ref, cpu); + count += *ptr; + *ptr = 0; + } + + /* + * Re-init for the next cycle. + */ + this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */ + + /* + * Add actual count, subtract bias and initial refcount. + * + * The moment this atomic operation happens, futex_ref_is_dead() can + * become true. + */ + ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex_atomic); + if (!ret) + wake_up_var(mm); + + WARN_ON_ONCE(ret < 0); +} + +static void futex_ref_rcu(struct rcu_head *head) +{ + struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu); + struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash); + + if (fph->state == FR_PERCPU) { + /* + * Per this extra grace-period, everybody must now observe + * fph as the current fph and no previously observed fph's + * are in-flight. + * + * Notably, nobody will now rely on the atomic + * futex_ref_is_dead() state anymore so we can begin the + * migration of the per-cpu counter into the atomic. + */ + __futex_ref_atomic_begin(fph); + return; + } + + __futex_ref_atomic_end(fph); +} + +/* + * Drop the initial refcount and transition to atomics. + */ +static void futex_ref_drop(struct futex_private_hash *fph) +{ + struct mm_struct *mm = fph->mm; + + /* + * Can only transition the current fph; + */ + WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph); + + /* + * In order to avoid the following scenario: + * + * futex_hash() __futex_pivot_hash() + * guard(rcu); guard(mm->futex_hash_lock); + * fph = mm->futex_phash; + * rcu_assign_pointer(&mm->futex_phash, new); + * futex_hash_allocate() + * futex_ref_drop() + * fph->state = FR_ATOMIC; + * atomic_set(, BIAS); + * + * futex_private_hash_get(fph); // OOPS + * + * Where an old fph (which is FR_ATOMIC) and should fail on + * inc_not_zero, will succeed because a new transition is started and + * the atomic is bias'ed away from 0. + * + * There must be at least one full grace-period between publishing a + * new fph and trying to replace it. + */ + if (poll_state_synchronize_rcu(mm->futex_batches)) { + /* + * There was a grace-period, we can begin now. + */ + __futex_ref_atomic_begin(fph); + return; + } + + call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu); +} + +static bool futex_ref_get(struct futex_private_hash *fph) +{ + struct mm_struct *mm = fph->mm; + + guard(preempt)(); + + if (smp_load_acquire(&fph->state) == FR_PERCPU) { + this_cpu_inc(*mm->futex_ref); + return true; + } + + return atomic_long_inc_not_zero(&mm->futex_atomic); +} + +static bool futex_ref_put(struct futex_private_hash *fph) +{ + struct mm_struct *mm = fph->mm; + + guard(preempt)(); + + if (smp_load_acquire(&fph->state) == FR_PERCPU) { + this_cpu_dec(*mm->futex_ref); + return false; + } + + return atomic_long_dec_and_test(&mm->futex_atomic); +} + +static bool futex_ref_is_dead(struct futex_private_hash *fph) +{ + struct mm_struct *mm = fph->mm; + + guard(preempt)(); + + if (smp_load_acquire(&fph->state) == FR_PERCPU) + return false; + + return atomic_long_read(&mm->futex_atomic) == 0; +} + +// TODO propagate error +int futex_mm_init(struct mm_struct *mm) +{ + mutex_init(&mm->futex_hash_lock); + RCU_INIT_POINTER(mm->futex_phash, NULL); + mm->futex_phash_new = NULL; + /* futex-ref */ + atomic_long_set(&mm->futex_atomic, 0); + mm->futex_batches = get_state_synchronize_rcu(); + mm->futex_ref = alloc_percpu(unsigned int); + if (!mm->futex_ref) + return -ENOMEM; + this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */ + return 0; +} + void futex_hash_free(struct mm_struct *mm) { struct futex_private_hash *fph; + free_percpu(mm->futex_ref); kvfree(mm->futex_phash_new); fph = rcu_dereference_raw(mm->futex_phash); - if (fph) { - WARN_ON_ONCE(rcuref_read(&fph->users) > 1); + if (fph) kvfree(fph); - } } static bool futex_pivot_pending(struct mm_struct *mm) @@ -1549,7 +1753,7 @@ static bool futex_pivot_pending(struct mm_struct *mm) return true; fph = rcu_dereference(mm->futex_phash); - return rcuref_is_dead(&fph->users); + return futex_ref_is_dead(fph); } static bool futex_hash_less(struct futex_private_hash *a, @@ -1598,11 +1802,11 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags) } } - fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN); + fph = kvzalloc(struct_size(fph, queues, hash_slots), + GFP_KERNEL_ACCOUNT | __GFP_NOWARN); if (!fph) return -ENOMEM; - rcuref_init(&fph->users, 1); fph->hash_mask = hash_slots ? hash_slots - 1 : 0; fph->custom = custom; fph->immutable = !!(flags & FH_IMMUTABLE); @@ -1645,7 +1849,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags) * allocated a replacement hash, drop the initial * reference on the existing hash. */ - futex_private_hash_put(cur); + futex_ref_drop(cur); } if (new) { diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c index 24a92dc94eb8..1238a7178715 100644 --- a/tools/testing/selftests/futex/functional/futex_priv_hash.c +++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c @@ -97,6 +97,7 @@ static void create_max_threads(void *(*thread_fn)(void *)) ret = pthread_create(&threads[i], NULL, thread_fn, NULL); if (ret) ksft_exit_fail_msg("pthread_create failed: %m\n"); + usleep(5000); } } @@ -131,6 +132,7 @@ int main(int argc, char *argv[]) int use_global_hash = 0; int ret; int c; + int retry = 2; while ((c = getopt(argc, argv, "cghv:")) != -1) { switch (c) { @@ -208,11 +210,18 @@ int main(int argc, char *argv[]) */ ksft_print_msg("Online CPUs: %d\n", online_cpus); if (online_cpus > 16) { +again: futex_slotsn = futex_hash_slots_get(); if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) { ksft_print_msg("Expected increase of hash buckets but got: %d -> %d\n", futex_slots1, futex_slotsn); - ksft_exit_fail_msg(test_msg_auto_inc); + if (--retry) { + usleep(500000); + goto again; + } else { + ksft_test_result_fail(test_msg_auto_inc); +// ksft_exit_fail_msg(test_msg_auto_inc); + } } ksft_test_result_pass(test_msg_auto_inc); } else { ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-24 19:01 ` Peter Zijlstra @ 2025-06-26 11:01 ` Chris Mason 2025-06-26 13:17 ` Peter Zijlstra 2025-06-26 13:48 ` futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Sebastian Andrzej Siewior 2025-06-27 22:48 ` Tim Chen 2 siblings, 1 reply; 22+ messages in thread From: Chris Mason @ 2025-06-26 11:01 UTC (permalink / raw) To: Peter Zijlstra, Sebastian Andrzej Siewior; +Cc: linux-kernel, Thomas Gleixner On 6/24/25 3:01 PM, Peter Zijlstra wrote: > On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote: >> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote: >>>>> We've got large systems that are basically dedicated to single >>>>> workloads, and those will probably miss the larger global hash table, >>>>> regressing like schbench did. Then we have large systems spread over >>>>> multiple big workloads that will love the private tables. >>>>> >>>>> In either case, I think growing the hash table as a multiple of thread >>>>> count instead of cpu count will probably better reflect the crazy things >>>>> multi-threaded applications do? At any rate, I don't think we want >>>>> applications to need prctl to get back to the performance they had on >>>>> older kernels. >>>> >>>> This is only an issue if all you CPUs spend their time in the kernel >>>> using the hash buckets at the same time. >>>> This was the case in every benchmark I've seen so far. Your thing might >>>> be closer to an actual workload. >>>> >>> >>> I didn't spend a ton of time looking at the perf profiles of the slower >>> kernel, was the bottleneck in the hash chain length or in contention for >>> the buckets? >> >> Every futex operation does a rcuref_get() (which is an atomic inc) on >> the private hash. This is before anything else happens. If you have two >> threads, on two CPUs, which simultaneously do a futex() operation then >> both do this rcuref_get(). That atomic inc ensures that the cacheline >> bounces from one CPU to the other. On the exit of the syscall there is a >> matching rcuref_put(). > > How about something like this (very lightly tested)... > > the TL;DR is that it turns all those refcounts into per-cpu ops when > there is no hash replacement pending (eg. the normal case), and only > folds the lot into an atomic when we really care about it. > > There's some sharp corners still.. but it boots and survives the > (slightly modified) selftest. I can get some benchmarks going of this, thanks. For 6.16, is the goal to put something like this in, or default to the global hash table until we've nailed it down? I'd vote for defaulting to global for one more release. -chris ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-26 11:01 ` Chris Mason @ 2025-06-26 13:17 ` Peter Zijlstra 2025-06-26 13:50 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2025-06-26 13:17 UTC (permalink / raw) To: Chris Mason; +Cc: Sebastian Andrzej Siewior, linux-kernel, Thomas Gleixner On Thu, Jun 26, 2025 at 07:01:23AM -0400, Chris Mason wrote: > On 6/24/25 3:01 PM, Peter Zijlstra wrote: > > On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote: > >> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote: > >>>>> We've got large systems that are basically dedicated to single > >>>>> workloads, and those will probably miss the larger global hash table, > >>>>> regressing like schbench did. Then we have large systems spread over > >>>>> multiple big workloads that will love the private tables. > >>>>> > >>>>> In either case, I think growing the hash table as a multiple of thread > >>>>> count instead of cpu count will probably better reflect the crazy things > >>>>> multi-threaded applications do? At any rate, I don't think we want > >>>>> applications to need prctl to get back to the performance they had on > >>>>> older kernels. > >>>> > >>>> This is only an issue if all you CPUs spend their time in the kernel > >>>> using the hash buckets at the same time. > >>>> This was the case in every benchmark I've seen so far. Your thing might > >>>> be closer to an actual workload. > >>>> > >>> > >>> I didn't spend a ton of time looking at the perf profiles of the slower > >>> kernel, was the bottleneck in the hash chain length or in contention for > >>> the buckets? > >> > >> Every futex operation does a rcuref_get() (which is an atomic inc) on > >> the private hash. This is before anything else happens. If you have two > >> threads, on two CPUs, which simultaneously do a futex() operation then > >> both do this rcuref_get(). That atomic inc ensures that the cacheline > >> bounces from one CPU to the other. On the exit of the syscall there is a > >> matching rcuref_put(). > > > > How about something like this (very lightly tested)... > > > > the TL;DR is that it turns all those refcounts into per-cpu ops when > > there is no hash replacement pending (eg. the normal case), and only > > folds the lot into an atomic when we really care about it. > > > > There's some sharp corners still.. but it boots and survives the > > (slightly modified) selftest. > > I can get some benchmarks going of this, thanks. For 6.16, is the goal > to put something like this in, or default to the global hash table until > we've nailed it down? > > I'd vote for defaulting to global for one more release. Probably best to do that; means we don't have to rush crazy code :-) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-26 13:17 ` Peter Zijlstra @ 2025-06-26 13:50 ` Sebastian Andrzej Siewior 2025-06-27 11:04 ` Peter Zijlstra 0 siblings, 1 reply; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2025-06-26 13:50 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner On 2025-06-26 15:17:15 [+0200], Peter Zijlstra wrote: > > I'd vote for defaulting to global for one more release. > > Probably best to do that; means we don't have to rush crazy code :-) If we do that, that means we need to ensure that the private hash is not deployed while a thread is created and the prctl() function can only work create the private hash if the application is single threaded. Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-26 13:50 ` Sebastian Andrzej Siewior @ 2025-06-27 11:04 ` Peter Zijlstra 2025-06-27 12:14 ` Sebastian Andrzej Siewior 2025-06-30 14:50 ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior 0 siblings, 2 replies; 22+ messages in thread From: Peter Zijlstra @ 2025-06-27 11:04 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Chris Mason, linux-kernel, Thomas Gleixner On Thu, Jun 26, 2025 at 03:50:34PM +0200, Sebastian Andrzej Siewior wrote: > On 2025-06-26 15:17:15 [+0200], Peter Zijlstra wrote: > > > I'd vote for defaulting to global for one more release. > > > > Probably best to do that; means we don't have to rush crazy code :-) > > If we do that, that means we need to ensure that the private hash is not > deployed while a thread is created and the prctl() function can only > work create the private hash if the application is single threaded. I was thinking something like so.. --- diff --git a/init/Kconfig b/init/Kconfig index af4c2f085455..666783eb50ab 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1716,9 +1716,13 @@ config FUTEX_PI depends on FUTEX && RT_MUTEXES default y +# +# marked broken for performance reasons; gives us one more cycle to sort things out. +# config FUTEX_PRIVATE_HASH bool depends on FUTEX && !BASE_SMALL && MMU + depends on BROKEN default y config FUTEX_MPOL ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-27 11:04 ` Peter Zijlstra @ 2025-06-27 12:14 ` Sebastian Andrzej Siewior 2025-06-30 14:50 ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior 1 sibling, 0 replies; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2025-06-27 12:14 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner On 2025-06-27 13:04:57 [+0200], Peter Zijlstra wrote: > On Thu, Jun 26, 2025 at 03:50:34PM +0200, Sebastian Andrzej Siewior wrote: > > On 2025-06-26 15:17:15 [+0200], Peter Zijlstra wrote: > > > > I'd vote for defaulting to global for one more release. > > > > > > Probably best to do that; means we don't have to rush crazy code :-) > > > > If we do that, that means we need to ensure that the private hash is not > > deployed while a thread is created and the prctl() function can only > > work create the private hash if the application is single threaded. > > I was thinking something like so.. Okay. That increases the motivation to fix it in one cycle. Anything you want me to do here? Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH 2025-06-27 11:04 ` Peter Zijlstra 2025-06-27 12:14 ` Sebastian Andrzej Siewior @ 2025-06-30 14:50 ` Sebastian Andrzej Siewior 2025-07-01 13:13 ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior 2025-07-16 1:34 ` [PATCH] " kernel test robot 1 sibling, 2 replies; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2025-06-30 14:50 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner Chris Mason reported a performance regression on big iron. Reports of this kind were usually reported as part of a micro benchmark but Chris' test did mimic his real workload. This makes it a real regression. The root cause is rcuref_get() which is invoked during each futex operation. If all threads of an application do this simultaneously then it leads to cache line bouncing and the performance drops. Disable FUTEX_PRIVATE_HASH entirely for this cycle. The performance regression will be addressed in the following cycle enabling the option again. Reported-by: Chris Mason <clm@meta.com> Closes: https://lore.kernel.org/all/3ad05298-351e-4d61-9972-ca45a0a50e33@meta.com/ Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> --- init/Kconfig | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/init/Kconfig b/init/Kconfig index af4c2f0854554..666783eb50abd 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1716,9 +1716,13 @@ config FUTEX_PI depends on FUTEX && RT_MUTEXES default y +# +# marked broken for performance reasons; gives us one more cycle to sort things out. +# config FUTEX_PRIVATE_HASH bool depends on FUTEX && !BASE_SMALL && MMU + depends on BROKEN default y config FUTEX_MPOL -- 2.50.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [tip: locking/urgent] futex: Temporary disable FUTEX_PRIVATE_HASH 2025-06-30 14:50 ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior @ 2025-07-01 13:13 ` tip-bot2 for Sebastian Andrzej Siewior 2025-07-16 1:34 ` [PATCH] " kernel test robot 1 sibling, 0 replies; 22+ messages in thread From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-07-01 13:13 UTC (permalink / raw) To: linux-tip-commits Cc: Chris Mason, Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86, linux-kernel The following commit has been merged into the locking/urgent branch of tip: Commit-ID: 9a57c3773152a3ff2c35cc8325e088d011c9f83b Gitweb: https://git.kernel.org/tip/9a57c3773152a3ff2c35cc8325e088d011c9f83b Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> AuthorDate: Mon, 30 Jun 2025 16:50:34 +02:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Tue, 01 Jul 2025 15:02:05 +02:00 futex: Temporary disable FUTEX_PRIVATE_HASH Chris Mason reported a performance regression on big iron. Reports of this kind were usually reported as part of a micro benchmark but Chris' test did mimic his real workload. This makes it a real regression. The root cause is rcuref_get() which is invoked during each futex operation. If all threads of an application do this simultaneously then it leads to cache line bouncing and the performance drops. Disable FUTEX_PRIVATE_HASH entirely for this cycle. The performance regression will be addressed in the following cycle enabling the option again. Closes: https://lore.kernel.org/all/3ad05298-351e-4d61-9972-ca45a0a50e33@meta.com/ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250630145034.8JnINEaS@linutronix.de --- init/Kconfig | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/init/Kconfig b/init/Kconfig index af4c2f0..666783e 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1716,9 +1716,13 @@ config FUTEX_PI depends on FUTEX && RT_MUTEXES default y +# +# marked broken for performance reasons; gives us one more cycle to sort things out. +# config FUTEX_PRIVATE_HASH bool depends on FUTEX && !BASE_SMALL && MMU + depends on BROKEN default y config FUTEX_MPOL ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH 2025-06-30 14:50 ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior 2025-07-01 13:13 ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior @ 2025-07-16 1:34 ` kernel test robot 1 sibling, 0 replies; 22+ messages in thread From: kernel test robot @ 2025-07-16 1:34 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: oe-lkp, lkp, Chris Mason, linux-kernel, Peter Zijlstra, Thomas Gleixner, oliver.sang Hello, from the commit message, we know this 'temporary disable' is to address a performance regression. so we still send out this report FYI what's the possible performance impact. however, our team focus on micro benchmark, so, anyway, just FYI. kernel test robot noticed a 1.9% improvement of perf-bench-futex.ops/s on: commit: bc1aa469e545fe16a62d501e095630cccc3fe1c4 ("[PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH") url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/futex-Temporary-disable-FUTEX_PRIVATE_HASH/20250630-225317 base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git d0b3b7b22dfa1f4b515fd3a295b3fd958f9e81af patch link: https://lore.kernel.org/all/20250630145034.8JnINEaS@linutronix.de/ patch subject: [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH testcase: perf-bench-futex config: x86_64-rhel-9.4 compiler: gcc-12 test machine: 192 threads 2 sockets Intel(R) Xeon(R) 6740E CPU @ 2.4GHz (Sierra Forest) with 256G memory parameters: runtime: 300s nr_task: 100% test: hash shared: shared cpufreq_governor: performance Details are as below: --------------------------------------------------------------------------------------------------> The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20250716/202507160223.2a483e8b-lkp@intel.com ========================================================================================= compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/shared/tbox_group/test/testcase: gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/300s/shared/lkp-srf-2sp2/hash/perf-bench-futex commit: v6.16-rc4 bc1aa469e5 ("futex: Temporary disable FUTEX_PRIVATE_HASH") v6.16-rc4 bc1aa469e545fe16a62d501e095 ---------------- --------------------------- %stddev %change %stddev \ | \ 2249734 +1.9% 2291792 perf-bench-futex.ops/s 6622 +3.7% 6868 perf-bench-futex.time.user_time 115.83 ± 12% +40.1% 162.33 ± 18% perf-sched.wait_and_delay.count.irqentry_exit_to_user_mode.asm_sysvec_call_function_single.[unknown].[unknown] 4.40 ± 19% +343.0% 19.51 ± 64% perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone 4.15 ± 17% +347.3% 18.55 ± 69% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone 21.00 ± 63% +121.4% 46.50 ± 11% perf-c2c.DRAM.local 4129 ± 57% +225.5% 13444 perf-c2c.DRAM.remote 67558 ± 56% +66.3% 112351 perf-c2c.HITM.local 4045 ± 57% +227.2% 13237 perf-c2c.HITM.remote 71603 ± 57% +75.4% 125588 perf-c2c.HITM.total 1.674e+08 ± 20% +19.0% 1.991e+08 perf-stat.i.cache-misses 4.096e+08 ± 20% +19.4% 4.891e+08 perf-stat.i.cache-references 0.36 +2.9% 0.37 perf-stat.overall.MPKI 3127 -1.8% 3070 perf-stat.overall.cycles-between-cache-misses 1.669e+08 ± 20% +18.9% 1.985e+08 perf-stat.ps.cache-misses 4.085e+08 ± 20% +19.3% 4.874e+08 perf-stat.ps.cache-references 1.622e+14 -1.0% 1.606e+14 perf-stat.total.instructions Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-24 19:01 ` Peter Zijlstra 2025-06-26 11:01 ` Chris Mason @ 2025-06-26 13:48 ` Sebastian Andrzej Siewior 2025-06-26 14:36 ` Peter Zijlstra 2025-06-27 22:48 ` Tim Chen 2 siblings, 1 reply; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2025-06-26 13:48 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner On 2025-06-24 21:01:18 [+0200], Peter Zijlstra wrote: > How about something like this (very lightly tested)... > > the TL;DR is that it turns all those refcounts into per-cpu ops when > there is no hash replacement pending (eg. the normal case), and only > folds the lot into an atomic when we really care about it. so we have per-CPU counter and on resize we wait one RCU grace period to ensure everybody observed current fph, switch to atomics and wait one grace period to ensure everyone is using atomics. Last step is to align the atomic counter with the per-CPU counters and once the counter reaches 0 perform the swap. This looks fine :) Due to the RCU grace period, the swap takes longer than before. I guess that is why you said earlier with to use srcu. For things like "hackbench -T" you end up creating a new hash on every thread creation which is not applied because RCU takes a while. This could be optimized later by checking if the hash in futex_phash_new matches the requested size. > There's some sharp corners still.. but it boots and survives the > (slightly modified) selftest. The refcount does not pop up in perf so that is good. > --- a/kernel/futex/core.c > +++ b/kernel/futex/core.c > @@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm, > fph = rcu_dereference_protected(mm->futex_phash, > lockdep_is_held(&mm->futex_hash_lock)); > if (fph) { > - if (!rcuref_is_dead(&fph->users)) { > + if (!futex_ref_is_dead(fph)) { > mm->futex_phash_new = new; > return false; > } > > futex_rehash_private(fph, new); > } > - rcu_assign_pointer(mm->futex_phash, new); > + new->state = FR_PERCPU; > + scoped_guard (rcu) { We do space or we don't? It looks like sched/ does while the remaining bits of the kernel mostly don't. I don't care but we could (later) adjust it for futex towards one direction. > + mm->futex_batches = get_state_synchronize_rcu(); > + rcu_assign_pointer(mm->futex_phash, new); > + } > kvfree_rcu(fph, rcu); > return true; > } … > +static void futex_ref_drop(struct futex_private_hash *fph) … > + call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu); Do you think it would improve with srcu or it is not worth it? > +} > + > +static bool futex_ref_get(struct futex_private_hash *fph) > +{ > + struct mm_struct *mm = fph->mm; > + > + guard(preempt)(); > + > + if (smp_load_acquire(&fph->state) == FR_PERCPU) { > + this_cpu_inc(*mm->futex_ref); > + return true; > + } > + > + return atomic_long_inc_not_zero(&mm->futex_atomic); > +} > + > +static bool futex_ref_put(struct futex_private_hash *fph) > +{ > + struct mm_struct *mm = fph->mm; > + > + guard(preempt)(); > + > + if (smp_load_acquire(&fph->state) == FR_PERCPU) { > + this_cpu_dec(*mm->futex_ref); > + return false; > + } > + > + return atomic_long_dec_and_test(&mm->futex_atomic); > +} > + > +static bool futex_ref_is_dead(struct futex_private_hash *fph) > +{ > + struct mm_struct *mm = fph->mm; > + > + guard(preempt)(); > + > + if (smp_load_acquire(&fph->state) == FR_PERCPU) > + return false; > + > + return atomic_long_read(&mm->futex_atomic) == 0; > +} Why preempt_disable()? Is it just an optimized version of rcu_read_lock()? I don't understand why. You don't even go for __this_cpu_inc() so I a bit puzzled. Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-26 13:48 ` futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Sebastian Andrzej Siewior @ 2025-06-26 14:36 ` Peter Zijlstra 2025-06-27 12:24 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2025-06-26 14:36 UTC (permalink / raw) To: Sebastian Andrzej Siewior; +Cc: Chris Mason, linux-kernel, Thomas Gleixner On Thu, Jun 26, 2025 at 03:48:20PM +0200, Sebastian Andrzej Siewior wrote: > On 2025-06-24 21:01:18 [+0200], Peter Zijlstra wrote: > > How about something like this (very lightly tested)... > > > > the TL;DR is that it turns all those refcounts into per-cpu ops when > > there is no hash replacement pending (eg. the normal case), and only > > folds the lot into an atomic when we really care about it. > > so we have per-CPU counter and on resize we wait one RCU grace period to > ensure everybody observed current fph, switch to atomics and wait one > grace period to ensure everyone is using atomics. Last step is to align > the atomic counter with the per-CPU counters and once the counter > reaches 0 perform the swap. > > This looks fine :) Due to the RCU grace period, the swap takes longer > than before. Yes, but given this is supposed to be 'rare', I didn't think that was a problem. > I guess that is why you said earlier with to use srcu. For > things like "hackbench -T" you end up creating a new hash on every > thread creation which is not applied because RCU takes a while. > This could be optimized later by checking if the hash in futex_phash_new > matches the requested size. So the main benefit of using SRCU (I have half a patch and a head-ache to show for it) is that you don't need the per-mm-per-cpu memory storage. The down-side is that you share the 'refcount' between all mm's so it gets to be even slower. The 'problem' is that reasoning about stuff comes even harder. But the main idea is to replace mm->futex_phash with a 'tombstone' value to force __futex_hash() into taking the slow-path and hitting mm->futex_hash_lock. Then you do call_srcu() to wait one SRCU period, this has everybody stalled, and guarantees the fph you had is now unused so we can rehash. Then replace the tombstone with the new hash and start over. It's just that stuff like futex_hash_get() gets somewhat tricky, you have to ensure the SRCU periods overlap. Anyway, that approach should be feasible as well, just not sure of the trade-offs. > > There's some sharp corners still.. but it boots and survives the > > (slightly modified) selftest. > > The refcount does not pop up in perf so that is good. \o/ > > --- a/kernel/futex/core.c > > +++ b/kernel/futex/core.c > > @@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm, > > fph = rcu_dereference_protected(mm->futex_phash, > > lockdep_is_held(&mm->futex_hash_lock)); > > if (fph) { > > - if (!rcuref_is_dead(&fph->users)) { > > + if (!futex_ref_is_dead(fph)) { > > mm->futex_phash_new = new; > > return false; > > } > > > > futex_rehash_private(fph, new); > > } > > - rcu_assign_pointer(mm->futex_phash, new); > > + new->state = FR_PERCPU; > > + scoped_guard (rcu) { > > We do space or we don't? It looks like sched/ does while the remaining > bits of the kernel mostly don't. I don't care but we could (later) > adjust it for futex towards one direction. Oh, scoped_guard is for and we write for loops with a space on. Perhaps its because I wrote it all and know this. But yeah, meh :-) > > + mm->futex_batches = get_state_synchronize_rcu(); > > + rcu_assign_pointer(mm->futex_phash, new); > > + } > > kvfree_rcu(fph, rcu); > > return true; > > } > … > > +static void futex_ref_drop(struct futex_private_hash *fph) > … > > + call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu); > > Do you think it would improve with srcu or it is not worth it? Per the above, I'm not sure about the trade-offs. I think the SRCU approach can be made to work -- I just got me head in a twist. > > +} > > + > > +static bool futex_ref_get(struct futex_private_hash *fph) > > +{ > > + struct mm_struct *mm = fph->mm; > > + > > + guard(preempt)(); > > + > > + if (smp_load_acquire(&fph->state) == FR_PERCPU) { > > + this_cpu_inc(*mm->futex_ref); > > + return true; > > + } > > + > > + return atomic_long_inc_not_zero(&mm->futex_atomic); > > +} > > + > > +static bool futex_ref_put(struct futex_private_hash *fph) > > +{ > > + struct mm_struct *mm = fph->mm; > > + > > + guard(preempt)(); > > + > > + if (smp_load_acquire(&fph->state) == FR_PERCPU) { > > + this_cpu_dec(*mm->futex_ref); > > + return false; > > + } > > + > > + return atomic_long_dec_and_test(&mm->futex_atomic); > > +} > > + > > +static bool futex_ref_is_dead(struct futex_private_hash *fph) > > +{ > > + struct mm_struct *mm = fph->mm; > > + > > + guard(preempt)(); > > + > > + if (smp_load_acquire(&fph->state) == FR_PERCPU) > > + return false; > > + > > + return atomic_long_read(&mm->futex_atomic) == 0; > > +} > > Why preempt_disable()? Is it just an optimized version of > rcu_read_lock()? I don't understand why. You don't even go for > __this_cpu_inc() so I a bit puzzled. The code morphed a lot over the two days it took to write this. But yeah, preempt or rcu doesn't really matter here. I didn't yet think about __this_cpu, its not really relevant on x86. If you want to make that relaxation you have to consider IRQ and SoftIRQ handling though. Given we have this_cpu_inc() in the RCU callback, which is ran from SoftIRQ context, this might not be safe. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-26 14:36 ` Peter Zijlstra @ 2025-06-27 12:24 ` Sebastian Andrzej Siewior 0 siblings, 0 replies; 22+ messages in thread From: Sebastian Andrzej Siewior @ 2025-06-27 12:24 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner On 2025-06-26 16:36:38 [+0200], Peter Zijlstra wrote: > > I guess that is why you said earlier with to use srcu. For > > things like "hackbench -T" you end up creating a new hash on every > > thread creation which is not applied because RCU takes a while. > > This could be optimized later by checking if the hash in futex_phash_new > > matches the requested size. > > So the main benefit of using SRCU (I have half a patch and a head-ache > to show for it) is that you don't need the per-mm-per-cpu memory > storage. The down-side is that you share the 'refcount' between all mm's > so it gets to be even slower. I see. So it is not just s/rcu/srcu/ for the hopefully quicker grace period since it does not affect whole system. It a different algorithm. > The 'problem' is that reasoning about stuff comes even harder. But the > main idea is to replace mm->futex_phash with a 'tombstone' value to > force __futex_hash() into taking the slow-path and hitting > mm->futex_hash_lock. > > Then you do call_srcu() to wait one SRCU period, this has everybody > stalled, and guarantees the fph you had is now unused so we can rehash. > Then replace the tombstone with the new hash and start over. > > It's just that stuff like futex_hash_get() gets somewhat tricky, you > have to ensure the SRCU periods overlap. > > Anyway, that approach should be feasible as well, just not sure of the > trade-offs. The current looks reasonable enough and it survived yesterday's testing. > > Why preempt_disable()? Is it just an optimized version of > > rcu_read_lock()? I don't understand why. You don't even go for > > __this_cpu_inc() so I a bit puzzled. > > The code morphed a lot over the two days it took to write this. But > yeah, preempt or rcu doesn't really matter here. > > I didn't yet think about __this_cpu, its not really relevant on x86. If > you want to make that relaxation you have to consider IRQ and SoftIRQ > handling though. Given we have this_cpu_inc() in the RCU callback, which > is ran from SoftIRQ context, this might not be safe. Nah, I am all for a simple cguard(rcu) here. Sebastian ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-24 19:01 ` Peter Zijlstra 2025-06-26 11:01 ` Chris Mason 2025-06-26 13:48 ` futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Sebastian Andrzej Siewior @ 2025-06-27 22:48 ` Tim Chen 2025-06-28 8:23 ` Peter Zijlstra 2 siblings, 1 reply; 22+ messages in thread From: Tim Chen @ 2025-06-27 22:48 UTC (permalink / raw) To: Peter Zijlstra, Sebastian Andrzej Siewior Cc: Chris Mason, linux-kernel, Thomas Gleixner On Tue, 2025-06-24 at 21:01 +0200, Peter Zijlstra wrote: > > How about something like this (very lightly tested)... > > the TL;DR is that it turns all those refcounts into per-cpu ops when > there is no hash replacement pending (eg. the normal case), and only > folds the lot into an atomic when we really care about it. > > There's some sharp corners still.. but it boots and survives the > (slightly modified) selftest. > > --- > include/linux/futex.h | 12 +- > include/linux/mm_types.h | 5 + > kernel/futex/core.c | 238 +++++++++++++++++++-- > .../selftests/futex/functional/futex_priv_hash.c | 11 +- > 4 files changed, 239 insertions(+), 27 deletions(-) > > diff --git a/include/linux/futex.h b/include/linux/futex.h > index b37193653e6b..5f92c7a8ba57 100644 > --- a/include/linux/futex.h > +++ b/include/linux/futex.h > @@ -85,17 +85,11 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4) > #ifdef CONFIG_FUTEX_PRIVATE_HASH > int futex_hash_allocate_default(void); > void futex_hash_free(struct mm_struct *mm); > - > -static inline void futex_mm_init(struct mm_struct *mm) > -{ > - RCU_INIT_POINTER(mm->futex_phash, NULL); > - mm->futex_phash_new = NULL; > - mutex_init(&mm->futex_hash_lock); > -} > +int futex_mm_init(struct mm_struct *mm); > > #else /* !CONFIG_FUTEX_PRIVATE_HASH */ > static inline int futex_hash_allocate_default(void) { return 0; } > -static inline void futex_hash_free(struct mm_struct *mm) { } > +static inline int futex_hash_free(struct mm_struct *mm) { return 0; } Minor nit. futex_hash_free() is defined to return void for CONFIG_FUTEX_PRIVATE_HASH config. But is now defined to return int for !CONFIG_FUTEX_PRIVATE_HASH and !CONFIG_FUTEX configs. But it seems like nobody is actually checking the return value. It seems to make more sense to keep the return value as void here so the return value is consistent? Tim > static inline void futex_mm_init(struct mm_struct *mm) { } > #endif /* CONFIG_FUTEX_PRIVATE_HASH */ > > @@ -118,7 +112,7 @@ static inline int futex_hash_allocate_default(void) > { > return 0; > } > -static inline void futex_hash_free(struct mm_struct *mm) { } > +static inline int futex_hash_free(struct mm_struct *mm) { return 0; } > static inline void futex_mm_init(struct mm_struct *mm) { } > > #endif ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash" 2025-06-27 22:48 ` Tim Chen @ 2025-06-28 8:23 ` Peter Zijlstra 0 siblings, 0 replies; 22+ messages in thread From: Peter Zijlstra @ 2025-06-28 8:23 UTC (permalink / raw) To: Tim Chen Cc: Sebastian Andrzej Siewior, Chris Mason, linux-kernel, Thomas Gleixner On Fri, Jun 27, 2025 at 03:48:12PM -0700, Tim Chen wrote: > On Tue, 2025-06-24 at 21:01 +0200, Peter Zijlstra wrote: > > > > How about something like this (very lightly tested)... > > > > the TL;DR is that it turns all those refcounts into per-cpu ops when > > there is no hash replacement pending (eg. the normal case), and only > > folds the lot into an atomic when we really care about it. > > > > There's some sharp corners still.. but it boots and survives the > > (slightly modified) selftest. > > > > --- > > include/linux/futex.h | 12 +- > > include/linux/mm_types.h | 5 + > > kernel/futex/core.c | 238 +++++++++++++++++++-- > > .../selftests/futex/functional/futex_priv_hash.c | 11 +- > > 4 files changed, 239 insertions(+), 27 deletions(-) > > > > diff --git a/include/linux/futex.h b/include/linux/futex.h > > index b37193653e6b..5f92c7a8ba57 100644 > > --- a/include/linux/futex.h > > +++ b/include/linux/futex.h > > @@ -85,17 +85,11 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4) > > #ifdef CONFIG_FUTEX_PRIVATE_HASH > > int futex_hash_allocate_default(void); > > void futex_hash_free(struct mm_struct *mm); > > - > > -static inline void futex_mm_init(struct mm_struct *mm) > > -{ > > - RCU_INIT_POINTER(mm->futex_phash, NULL); > > - mm->futex_phash_new = NULL; > > - mutex_init(&mm->futex_hash_lock); > > -} > > +int futex_mm_init(struct mm_struct *mm); > > > > #else /* !CONFIG_FUTEX_PRIVATE_HASH */ > > static inline int futex_hash_allocate_default(void) { return 0; } > > -static inline void futex_hash_free(struct mm_struct *mm) { } > > +static inline int futex_hash_free(struct mm_struct *mm) { return 0; } > > Minor nit. > > futex_hash_free() is defined to return void for CONFIG_FUTEX_PRIVATE_HASH > config. But is now defined to return int for !CONFIG_FUTEX_PRIVATE_HASH and !CONFIG_FUTEX configs. > But it seems like nobody is actually checking the return value. It > seems to make more sense to keep the return value as void here so > the return value is consistent? YEah, not sure what happened there. There's a TODO item to actually make sure someone looks at the futex_mm_init() return value, I guess that's a good moment to clean this up too :-) ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-07-16 1:35 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-06-03 19:00 futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Chris Mason 2025-06-04 9:28 ` Sebastian Andrzej Siewior 2025-06-04 15:48 ` Chris Mason 2025-06-04 20:08 ` Sebastian Andrzej Siewior 2025-06-06 0:55 ` Chris Mason 2025-06-06 7:06 ` Sebastian Andrzej Siewior 2025-06-06 21:06 ` Chris Mason 2025-06-06 22:17 ` Chris Mason 2025-06-24 19:01 ` Peter Zijlstra 2025-06-26 11:01 ` Chris Mason 2025-06-26 13:17 ` Peter Zijlstra 2025-06-26 13:50 ` Sebastian Andrzej Siewior 2025-06-27 11:04 ` Peter Zijlstra 2025-06-27 12:14 ` Sebastian Andrzej Siewior 2025-06-30 14:50 ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior 2025-07-01 13:13 ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior 2025-07-16 1:34 ` [PATCH] " kernel test robot 2025-06-26 13:48 ` futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Sebastian Andrzej Siewior 2025-06-26 14:36 ` Peter Zijlstra 2025-06-27 12:24 ` Sebastian Andrzej Siewior 2025-06-27 22:48 ` Tim Chen 2025-06-28 8:23 ` Peter Zijlstra
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.