* futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
@ 2025-06-03 19:00 Chris Mason
  2025-06-04  9:28 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Mason @ 2025-06-03 19:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Peter Zijlstra, linux-kernel

Hi everyone,

While testing Peter's latest scheduler patches against current Linus
git, I found a pretty big performance regression with schbench:

https://github.com/masoncl/schbench

The command line I was using:

schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0

Bisecting the problem I landed on commit:

commit 7c4f75a21f636486d2969d9b6680403ea8483539 (HEAD -> update)
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Apr 16 18:29:13 2025 +0200

futex: Allow automatic allocation of process wide futex hash

Allocate a private futex hash with 16 slots if a task forks its first
thread.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-14-bigeasy@linutronix.de


schbench uses one futex per thread, and the command line ends up
allocating 1024 threads, so the default bucket size used by this commit
is just too small.  Using 2048 buckets makes the problem go away.

On my big Turin system, this commit slows down RPS by 36%.  But even a
VM on a Skylake machine sees a 29% difference.

schbench is a microbenchmark, so grain of salt on all of this, but I
think our defaults are probably too low.

-chris



* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-03 19:00 futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Chris Mason
@ 2025-06-04  9:28 ` Sebastian Andrzej Siewior
  2025-06-04 15:48   ` Chris Mason
  0 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-04  9:28 UTC (permalink / raw)
  To: Chris Mason; +Cc: Peter Zijlstra, linux-kernel

On 2025-06-03 15:00:43 [-0400], Chris Mason wrote:
> Hi everyone,
Hi,

> While testing Peter's latest scheduler patches against current Linus
> git, I found a pretty big performance regression with schbench:
> 
> https://github.com/masoncl/schbench
> 
> The command line I was using:
> 
> schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0
> 
> schbench uses one futex per thread, and the command line ends up
> allocating 1024 threads, so the default bucket size used by this commit
> is just too small.  Using 2048 buckets makes the problem go away.

There is also this pthread_mutex_t but yeah

> On my big turin system, this commit slows down RPS by 36%.  But even a
> VM on a skylake machine sees a 29% difference.
> 
> schbench is a microbenchmark, so grain of salt on all of this, but I
> think our defaults are probably too low.

we could 

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index abbd97c2fcba8..9046f3d9693e7 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void)
 	scoped_guard(rcu) {
 		threads = min_t(unsigned int,
 				get_nr_threads(current),
-				num_online_cpus());
+				num_online_cpus() * 2);
 
 		fph = rcu_dereference(current->mm->futex_phash);
 		if (fph) {

which would increase it to 2048 as Chris asks for.
Then there is the possibility of 

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index abbd97c2fcba8..a19a96cc09c9e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1680,6 +1680,8 @@ int futex_hash_allocate_default(void)
 {
 	unsigned int threads, buckets, current_buckets = 0;
 	struct futex_private_hash *fph;
+	bool current_immutable = false;
+	unsigned int flags = 0;
 
 	if (!current->mm)
 		return 0;
@@ -1695,6 +1697,7 @@ int futex_hash_allocate_default(void)
 				return 0;
 
 			current_buckets = fph->hash_mask + 1;
+			current_immutable = fph->immutable;
 		}
 	}
 
@@ -1705,10 +1708,13 @@ int futex_hash_allocate_default(void)
 	buckets = roundup_pow_of_two(4 * threads);
 	buckets = clamp(buckets, 16, futex_hashmask + 1);
 
-	if (current_buckets >= buckets)
-		return 0;
+	if (current_buckets >= buckets) {
+		if (current_immutable)
+			return 0;
+		flags = FH_IMMUTABLE;
+	}
 
-	return futex_hash_allocate(buckets, 0);
+	return futex_hash_allocate(buckets, flags);
 }
 
 static int futex_hash_get_slots(void)

to make the hash immutable once the upper limit has been reached. There
will be no more auto-resize. One could argue that if the user has not
touched it by that point, he probably never will.

This would avoid the reference accounting. Some testing:

256 cores, 2xNUMA:
| average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
| average rps:   785 446.07 Futex HBs: 1024 immutable: 0
| average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
| average rps:   736 769.77 Futex HBs: 2048 immutable: 0
| average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1

144 cores, 4xNUMA:
| average rps: 2 691 912.55 Futex HBs: 0 immutable: 1
| average rps: 1 306 443.68 Futex HBs: 1024 immutable: 0
| average rps: 2 471 382.28 Futex HBs: 1024 immutable: 1
| average rps: 1 269 503.90 Futex HBs: 2048 immutable: 0
| average rps: 2 656 228.67 Futex HBs: 2048 immutable: 1

tested with this on top:

diff --git a/schbench.c b/schbench.c
index 1be3e280f5c38..40a5f0091111e 100644
--- a/schbench.c
+++ b/schbench.c
@@ -19,6 +19,8 @@
 #include <unistd.h>
 #include <errno.h>
 #include <getopt.h>
+#include <linux/prctl.h>
+#include <sys/prctl.h>
 #include <sys/time.h>
 #include <time.h>
 #include <string.h>
@@ -42,6 +44,9 @@
 
 #define USEC_PER_SEC (1000000)
 
+static int futex_size = -1;
+static int futex_flags;
+
 /* -m number of message threads */
 static int message_threads = 1;
 /* -t  number of workers per message thread */
@@ -127,7 +132,7 @@ enum {
 	HELP_LONG_OPT = 1,
 };
 
-char *option_string = "p:m:M:W:t:Cr:R:w:i:z:A:n:F:Lj:s:J:";
+char *option_string = "p:m:M:W:t:Cr:R:w:i:z:A:n:F:Lj:s:J:H:I";
 static struct option long_options[] = {
 	{"pipe", required_argument, 0, 'p'},
 	{"message-threads", required_argument, 0, 'm'},
@@ -176,6 +181,29 @@ static void print_usage(void)
 	exit(1);
 }
 
+#ifndef PR_FUTEX_HASH
+#define PR_FUTEX_HASH                   78
+# define PR_FUTEX_HASH_SET_SLOTS        1
+# define FH_FLAG_IMMUTABLE              (1ULL << 0)
+# define PR_FUTEX_HASH_GET_SLOTS        2
+# define PR_FUTEX_HASH_GET_IMMUTABLE    3
+#endif
+
+static int futex_hash_slots_set(unsigned int slots, int flags)
+{
+	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, flags);
+}
+
+static int futex_hash_slots_get(void)
+{
+	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
+}
+
+static int futex_hash_immutable_get(void)
+{
+	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+}
+
 /*
  * returns 0 if we fail to parse and 1 if we succeed
  */
@@ -347,6 +375,13 @@ static void parse_options(int ac, char **av)
 				exit(1);
 			}
 			break;
+		case 'H':
+			futex_size = atoi(optarg);
+			break;
+		case 'I':
+			futex_flags = FH_FLAG_IMMUTABLE;
+			break;
+
 		case '?':
 		case HELP_LONG_OPT:
 			print_usage();
@@ -1811,6 +1846,9 @@ int main(int ac, char **av)
 		}
 	}
 
+	if (futex_size >= 0)
+		futex_hash_slots_set(futex_size, futex_flags);
+
 	requests_per_sec /= message_threads;
 	loops_per_sec = 0;
 	stopping = 0;
@@ -1920,6 +1958,8 @@ int main(int ac, char **av)
 	}
 	free(message_threads_mem);
 
+	fprintf(stderr, "Futex HBs: %d immutable: %d\n", futex_hash_slots_get(),
+		futex_hash_immutable_get());
 
 	return 0;
 }


Comments?

> -chris

Sebastian


* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-04  9:28 ` Sebastian Andrzej Siewior
@ 2025-06-04 15:48   ` Chris Mason
  2025-06-04 20:08     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Mason @ 2025-06-04 15:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Peter Zijlstra, linux-kernel

On 6/4/25 5:28 AM, Sebastian Andrzej Siewior wrote:
> On 2025-06-03 15:00:43 [-0400], Chris Mason wrote:
>> Hi everyone,
> Hi,
> 
>> While testing Peter's latest scheduler patches against current Linus
>> git, I found a pretty big performance regression with schbench:
>>
>> https://github.com/masoncl/schbench 
>>
>> The command line I was using:
>>
>> schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0
>>
> …
>> schbench uses one futex per thread, and the command line ends up
>> allocating 1024 threads, so the default bucket size used by this commit
>> is just too small.  Using 2048 buckets makes the problem go away.
> 
> There is also this pthread_mutex_t but yeah

Yeah, but -L disables the pthread nonsense.

> 
>> On my big turin system, this commit slows down RPS by 36%.  But even a
>> VM on a skylake machine sees a 29% difference.
>>
>> schbench is a microbenchmark, so grain of salt on all of this, but I
>> think our defaults are probably too low.
> 
> we could 
> 
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index abbd97c2fcba8..9046f3d9693e7 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void)
>  	scoped_guard(rcu) {
>  		threads = min_t(unsigned int,
>  				get_nr_threads(current),
> -				num_online_cpus());
> +				num_online_cpus() * 2);
>  
>  		fph = rcu_dereference(current->mm->futex_phash);
>  		if (fph) {
> 
> which would increase it to 2048 as Chris asks for.

I haven't followed these changes, so asking some extra questions.  This
would bump to num_online_cpus() * 2, which probably isn't 2048 right?

We've got large systems that are basically dedicated to single
workloads, and those will probably miss the larger global hash table,
regressing like schbench did.  Then we have large systems spread over
multiple big workloads that will love the private tables.

In either case, I think growing the hash table as a multiple of thread
count instead of cpu count will probably better reflect the crazy things
multi-threaded applications do?  At any rate, I don't think we want
applications to need prctl to get back to the performance they had on
older kernels.

For people that want to avoid that memory overhead, I'm assuming they
want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should
make that more clear.

> Then there the possibility of 
>
> diff --git a/kernel/futex/core.c b/kernel/futex/core.c
> index abbd97c2fcba8..a19a96cc09c9e 100644
> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -1680,6 +1680,8 @@ int futex_hash_allocate_default(void)
>  {
>  	unsigned int threads, buckets, current_buckets = 0;
>  	struct futex_private_hash *fph;
> +	bool current_immutable = false;
> +	unsigned int flags = 0;
>  
>  	if (!current->mm)
>  		return 0;
> @@ -1695,6 +1697,7 @@ int futex_hash_allocate_default(void)
>  				return 0;
>  
>  			current_buckets = fph->hash_mask + 1;
> +			current_immutable = fph->immutable;
>  		}
>  	}
>  
> @@ -1705,10 +1708,13 @@ int futex_hash_allocate_default(void)
>  	buckets = roundup_pow_of_two(4 * threads);
>  	buckets = clamp(buckets, 16, futex_hashmask + 1);
>  
> -	if (current_buckets >= buckets)
> -		return 0;
> +	if (current_buckets >= buckets) {
> +		if (current_immutable)
> +			return 0;
> +		flags = FH_IMMUTABLE;
> +	}
>  
> -	return futex_hash_allocate(buckets, 0);
> +	return futex_hash_allocate(buckets, flags);
>  }
>  
>  static int futex_hash_get_slots(void)
> 
> to make hash immutable once the upper limit has been reached. There will
> be no more auto-resize. One could argue that if the user did not touch
> it, he might not do it at all.
> 
> This would avoid the reference accounting. Some testing:
> 
> 256 cores, 2xNUMA:
> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
> | average rps:   785 446.07 Futex HBs: 1024 immutable: 0
> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
> | average rps:   736 769.77 Futex HBs: 2048 immutable: 0
> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1


How long are these runs?  That's a huge benefit from being immutable
(1.5M vs 736K?) but the hash table churn should be confined to early in
the schbench run right?

> 
> 144 cores, 4xNUMA:
> | average rps: 2 691 912.55 Futex HBs: 0 immutable: 1
> | average rps: 1 306 443.68 Futex HBs: 1024 immutable: 0
> | average rps: 2 471 382.28 Futex HBs: 1024 immutable: 1
> | average rps: 1 269 503.90 Futex HBs: 2048 immutable: 0
> | average rps: 2 656 228.67 Futex HBs: 2048 immutable: 1
> 
> tested with this on top:

This schbench hunk is just testing the performance impact of different
bucket sizes, but hopefully we don't need it long term unless we want to
play with even bigger hash tables?

-chris

> 
> diff --git a/schbench.c b/schbench.c
> index 1be3e280f5c38..40a5f0091111e 100644
> --- a/schbench.c
> +++ b/schbench.c
> @@ -19,6 +19,8 @@
>  #include <unistd.h>
>  #include <errno.h>
>  #include <getopt.h>
> +#include <linux/prctl.h>
> +#include <sys/prctl.h>
>  #include <sys/time.h>
>  #include <time.h>
>  #include <string.h>
> @@ -42,6 +44,9 @@
>  
>  #define USEC_PER_SEC (1000000)
>  
> +static int futex_size = -1;
> +static int futex_flags;
> +
>  /* -m number of message threads */
>  static int message_threads = 1;
>  /* -t  number of workers per message thread */
> @@ -127,7 +132,7 @@ enum {
>  	HELP_LONG_OPT = 1,
>  };
>  
> -char *option_string = "p:m:M:W:t:Cr:R:w:i:z:A:n:F:Lj:s:J:";
> +char *option_string = "p:m:M:W:t:Cr:R:w:i:z:A:n:F:Lj:s:J:H:I";
>  static struct option long_options[] = {
>  	{"pipe", required_argument, 0, 'p'},
>  	{"message-threads", required_argument, 0, 'm'},
> @@ -176,6 +181,29 @@ static void print_usage(void)
>  	exit(1);
>  }
>  
> +#ifndef PR_FUTEX_HASH
> +#define PR_FUTEX_HASH                   78
> +# define PR_FUTEX_HASH_SET_SLOTS        1
> +# define FH_FLAG_IMMUTABLE              (1ULL << 0)
> +# define PR_FUTEX_HASH_GET_SLOTS        2
> +# define PR_FUTEX_HASH_GET_IMMUTABLE    3
> +#endif
> +
> +static int futex_hash_slots_set(unsigned int slots, int flags)
> +{
> +	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, flags);
> +}
> +
> +static int futex_hash_slots_get(void)
> +{
> +	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
> +}
> +
> +static int futex_hash_immutable_get(void)
> +{
> +	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
> +}
> +
>  /*
>   * returns 0 if we fail to parse and 1 if we succeed
>   */
> @@ -347,6 +375,13 @@ static void parse_options(int ac, char **av)
>  				exit(1);
>  			}
>  			break;
> +		case 'H':
> +			futex_size = atoi(optarg);
> +			break;
> +		case 'I':
> +			futex_flags = FH_FLAG_IMMUTABLE;
> +			break;
> +
>  		case '?':
>  		case HELP_LONG_OPT:
>  			print_usage();
> @@ -1811,6 +1846,9 @@ int main(int ac, char **av)
>  		}
>  	}
>  
> +	if (futex_size >= 0)
> +		futex_hash_slots_set(futex_size, futex_flags);
> +
>  	requests_per_sec /= message_threads;
>  	loops_per_sec = 0;
>  	stopping = 0;
> @@ -1920,6 +1958,8 @@ int main(int ac, char **av)
>  	}
>  	free(message_threads_mem);
>  
> +	fprintf(stderr, "Futex HBs: %d immutable: %d\n", futex_hash_slots_get(),
> +		futex_hash_immutable_get());
>  
>  	return 0;
>  }
> 
> 
> Comments?
> 
>> -chris
> 
> Sebastian



* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-04 15:48   ` Chris Mason
@ 2025-06-04 20:08     ` Sebastian Andrzej Siewior
  2025-06-06  0:55       ` Chris Mason
  0 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-04 20:08 UTC (permalink / raw)
  To: Chris Mason; +Cc: Peter Zijlstra, linux-kernel

On 2025-06-04 11:48:05 [-0400], Chris Mason wrote:
> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void)
> >  	scoped_guard(rcu) {
> >  		threads = min_t(unsigned int,
> >  				get_nr_threads(current),
> > -				num_online_cpus());
> > +				num_online_cpus() * 2);
> >  
> >  		fph = rcu_dereference(current->mm->futex_phash);
> >  		if (fph) {
> > 
> > which would increase it to 2048 as Chris asks for.
> 
> I haven't followed these changes, so asking some extra questions.  This
> would bump to num_online_cpus() * 2, which probably isn't 2048 right?

Assuming you have 256 CPUs, -m 4 -t 256 creates 4 * 256 = 1024 threads, so
threads = 512 gets computed (= min(1024, 256 * 2)). Later we do 512 * 4
(roundup_pow_of_two(4 * threads)). This makes 2048 hash buckets.
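
A minimal standalone sketch of that calculation (assuming 256 online CPUs
and the proposed "* 2" change; the clamp against the global table size is
left out because it does not change these numbers):

#include <stdio.h>

/* round up to the next power of two, like the kernel helper */
static unsigned int roundup_pow_of_two(unsigned int x)
{
	unsigned int r = 1;

	while (r < x)
		r <<= 1;
	return r;
}

int main(void)
{
	unsigned int nr_threads = 4 * 256;	/* -m 4 -t 256 => 1024 threads */
	unsigned int online_cpus = 256;		/* assumed machine size */
	unsigned int threads, buckets;

	/* threads = min(get_nr_threads(current), num_online_cpus() * 2) */
	threads = nr_threads < online_cpus * 2 ? nr_threads : online_cpus * 2;
	buckets = roundup_pow_of_two(4 * threads);

	printf("threads=%u buckets=%u\n", threads, buckets);	/* 512, 2048 */
	return 0;
}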

> We've got large systems that are basically dedicated to single
> workloads, and those will probably miss the larger global hash table,
> regressing like schbench did.  Then we have large systems spread over
> multiple big workloads that will love the private tables.
> 
> In either case, I think growing the hash table as a multiple of thread
> count instead of cpu count will probably better reflect the crazy things
> multi-threaded applications do?  At any rate, I don't think we want
> applications to need prctl to get back to the performance they had on
> older kernels.

This is only an issue if all your CPUs spend their time in the kernel
using the hash buckets at the same time.
This was the case in every benchmark I've seen so far. Your thing might
be closer to an actual workload.

> For people that want to avoid that memory overhead, I'm assuming they
> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should
> make that more clear.
> 
> > Then there the possibility of 
> > 256 cores, 2xNUMA:
> > | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
> > | average rps:   785 446.07 Futex HBs: 1024 immutable: 0
> > | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
> > | average rps:   736 769.77 Futex HBs: 2048 immutable: 0
> > | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1
> 
> 
> How long are these runs?  That's a huge benefit from being immutable
> (1.5M vs 736K?) but the hash table churn should be confined to early in
> the schbench run right?

I think 30 secs or so. I used your command line. The 256 core box showed
a higher improvement than the 144 core one. I attached a patch against
schbench in the previous mail, and then ran:
	./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 -H 1024 -I

…
> This schbench hunk is just testing the performance impact of different
> bucket sizes, but hopefully we don't need it long term unless we want to
> play with even bigger hash tables?

If you do "-H 0" then you should get the "old" behaviour. However the hash
bucket are spread now over the NUMA nodes:

| dmesg |grep -i futex
| [    0.501736] futex hash table entries: 32768 (2097152 bytes on 2 NUMA nodes, total 4096 KiB, linear).

Now there are 32768 hash buckets on both NUMA nodes. Depending on the
hash it computes, it uses the data structures on NUMA node 1 or 2. The
old code allocated 65536 hash buckets via vmalloc().

The bigger hash table isn't always the answer. Yes, you could play
around and figure out what works best for you. The problem is that the
hash is based on the mm and the (user) memory address, so on each run
you will get a different hash and therefore different collisions.
If you end up having many hash collisions and then block on the same
lock then yes, larger hash table will be the cure. If you have many
threads doing the futex syscall simultaneously then making the hash
immutable avoids two atomic ops on the same memory address.
This would be my favorite.

Now that I think about it, it might be possible to move the atomic ops
to the hash bucket itself. Then they wouldn't bounce the cache line so
much. Making the hash immutable is simpler.
 
> -chris

Sebastian


* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-04 20:08     ` Sebastian Andrzej Siewior
@ 2025-06-06  0:55       ` Chris Mason
  2025-06-06  7:06         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Mason @ 2025-06-06  0:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Peter Zijlstra, linux-kernel

On 6/4/25 4:08 PM, Sebastian Andrzej Siewior wrote:
> On 2025-06-04 11:48:05 [-0400], Chris Mason wrote:
>>> --- a/kernel/futex/core.c
>>> +++ b/kernel/futex/core.c
>>> @@ -1687,7 +1687,7 @@ int futex_hash_allocate_default(void)
>>>  	scoped_guard(rcu) {
>>>  		threads = min_t(unsigned int,
>>>  				get_nr_threads(current),
>>> -				num_online_cpus());
>>> +				num_online_cpus() * 2);
>>>  
>>>  		fph = rcu_dereference(current->mm->futex_phash);
>>>  		if (fph) {
>>>
>>> which would increase it to 2048 as Chris asks for.
>>
>> I haven't followed these changes, so asking some extra questions.  This
>> would bump to num_online_cpus() * 2, which probably isn't 2048 right?
> 
> Assuming you have 256 CPUs and -m 4 -t 256 creates 4 * 256 = 1024 then
> threads = 512 gets computed (= min(1024, 256 * 2)). Later we do 512 * 4
> (roundup_pow_of_two(4 * threads)). This makes 2048 hash buckets.
> 

Ah right, I missed the x4.

>> We've got large systems that are basically dedicated to single
>> workloads, and those will probably miss the larger global hash table,
>> regressing like schbench did.  Then we have large systems spread over
>> multiple big workloads that will love the private tables.
>>
>> In either case, I think growing the hash table as a multiple of thread
>> count instead of cpu count will probably better reflect the crazy things
>> multi-threaded applications do?  At any rate, I don't think we want
>> applications to need prctl to get back to the performance they had on
>> older kernels.
> 
> This is only an issue if all you CPUs spend their time in the kernel
> using the hash buckets at the same time.
> This was the case in every benchmark I've seen so far. Your thing might
> be closer to an actual workload.
> 

I didn't spend a ton of time looking at the perf profiles of the slower
kernel, was the bottleneck in the hash chain length or in contention for
the buckets?

>> For people that want to avoid that memory overhead, I'm assuming they
>> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should
>> make that more clear.
>>
>>> Then there the possibility of 
> …
>>> 256 cores, 2xNUMA:
>>> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
>>> | average rps:   785 446.07 Futex HBs: 1024 immutable: 0
>>> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
>>> | average rps:   736 769.77 Futex HBs: 2048 immutable: 0
>>> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1
>>
>>
>> How long are these runs?  That's a huge benefit from being immutable
>> (1.5M vs 736K?) but the hash table churn should be confined to early in
>> the schbench run right?
> 
> I think 30 secs or so. I used your command line. 

Ah ok, my command line is 60 seconds.  It feels like something is
strange for the immutable flag to make it that much faster?  schbench
starts all the threads up front, so it should hit steady state pretty
quickly.  More on NUMA below, but I'll benchmark with the immutable flag
on the turin box in the morning to see if it is the extra atomics.

> The 256 core box showed
> a higher improvement than the 144 one. I attached a patch against
> schbench in the previous mail, I did then
> 	./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 -H 1024 -I
> 
> …
>> This schbench hunk is just testing the performance impact of different
>> bucket sizes, but hopefully we don't need it long term unless we want to
>> play with even bigger hash tables?
> 
> If you do "-H 0" then you should get the "old" behaviour. However the hash
> bucket are spread now over the NUMA nodes:
> 
> | dmesg |grep -i futex
> | [    0.501736] futex hash table entries: 32768 (2097152 bytes on 2 NUMA nodes, total 4096 KiB, linear).
> 
> Now there are 32768 hash buckets on both NUMA nodes. Depending on the
> hash it computes, it uses the data structures on NUMA node 1 or 2. The
> old code allocated 65536 hash buckets via vmalloc().

So I wanted to mention this earlier and forgot, but schbench obviously
isn't numa aware at all.  This combination of command line options
basically just has the 4 message threads waking up the 256 worker
threads (per message thread) after scribbling a timestamp into shared
memory.  From a schbench pov at least, we'll get much more stable
numbers by sticking to a single socket.  <makes numa placement hand
gestures>

> 
> The bigger hash table isn't always the answer. Yes, you could play
> around figure out what works best for you. The problem is that the hash
> is based on the mm and the (user) memory address. So on each run you
> will get a different hash and therefore different collisions.
> If you end up having many hash collisions and then block on the same
> lock then yes, larger hash table will be the cure. If you have many
> threads doing the futex syscall simultaneously then making the hash
> immutable avoids two atomic ops on the same memory address.
> This would be my favorite.
> 
> Now that I think about it, it might be possible to move the atomic ops
> the hash bucket itself. Then it wouldn't bounce the cache line so much.
> Making the hash immutable is simpler.

Going back to your diff, if we have a process growing the total number
of threads, can we set FH_IMMUTABLE too early?  As the number of threads
increases, eventually we'll pick the 2x num_cpus, but that'll take a while?

I do see what you mean about immutable being simpler though, I'll get
some numbers on those atomics.

-chris



* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-06  0:55       ` Chris Mason
@ 2025-06-06  7:06         ` Sebastian Andrzej Siewior
  2025-06-06 21:06           ` Chris Mason
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-06  7:06 UTC (permalink / raw)
  To: Chris Mason; +Cc: Peter Zijlstra, linux-kernel

On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
> >> We've got large systems that are basically dedicated to single
> >> workloads, and those will probably miss the larger global hash table,
> >> regressing like schbench did.  Then we have large systems spread over
> >> multiple big workloads that will love the private tables.
> >>
> >> In either case, I think growing the hash table as a multiple of thread
> >> count instead of cpu count will probably better reflect the crazy things
> >> multi-threaded applications do?  At any rate, I don't think we want
> >> applications to need prctl to get back to the performance they had on
> >> older kernels.
> > 
> > This is only an issue if all you CPUs spend their time in the kernel
> > using the hash buckets at the same time.
> > This was the case in every benchmark I've seen so far. Your thing might
> > be closer to an actual workload.
> > 
> 
> I didn't spend a ton of time looking at the perf profiles of the slower
> kernel, was the bottleneck in the hash chain length or in contention for
> the buckets?

Every futex operation does a rcuref_get() (which is an atomic inc) on
the private hash. This is before anything else happens. If you have two
threads, on two CPUs, which simultaneously do a futex() operation then
both do this rcuref_get(). That atomic inc ensures that the cacheline
bounces from one CPU to the other. On the exit of the syscall there is a
matching rcuref_put().

> >> For people that want to avoid that memory overhead, I'm assuming they
> >> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should
> >> make that more clear.
> >>
> >>> Then there the possibility of 
> > …
> >>> 256 cores, 2xNUMA:
> >>> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
> >>> | average rps:   785 446.07 Futex HBs: 1024 immutable: 0
> >>> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
> >>> | average rps:   736 769.77 Futex HBs: 2048 immutable: 0
> >>> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1
> >>
> >>
> >> How long are these runs?  That's a huge benefit from being immutable
> >> (1.5M vs 736K?) but the hash table churn should be confined to early in
> >> the schbench run right?
> > 
> > I think 30 secs or so. I used your command line. 
> 
> Ah ok, my command line is 60 seconds.  It feels like something is
> strange for the immutable flag to make it that much faster?  schbench
> starts all the threads up front, so it should hit steady state pretty
> quickly.  More on NUMA below, but I'll benchmark with the immutable flag
> on the turin box in the morning to see if it is the extra atomics.

That immutable flag makes this rcuref_get()/ put() go away. The price is
that you can't change the size of the private hash anymore. So if your
workload works best with a hash size of X and you don't intend to change
it during the runtime of the program, set the immutable flag.
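
For reference, a minimal sketch of doing that from an application,
reusing the PR_FUTEX_HASH constants from the schbench patch earlier in
this thread (1024 slots is just an example value):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
# define PR_FUTEX_HASH_SET_SLOTS	1
# define FH_FLAG_IMMUTABLE		(1ULL << 0)
# define PR_FUTEX_HASH_GET_SLOTS	2
#endif

int main(void)
{
	/*
	 * Ask for 1024 buckets and freeze the size for the process
	 * lifetime; typically done once, before the worker threads are
	 * created.
	 */
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, 1024, FH_FLAG_IMMUTABLE))
		perror("PR_FUTEX_HASH_SET_SLOTS");

	printf("futex hash slots: %d\n",
	       (int)prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS));
	return 0;
}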

> > The 256 core box showed
> > a higher improvement than the 144 one. I attached a patch against
> > schbench in the previous mail, I did then
> > 	./schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0 -H 1024 -I
> > 
> > …
> >> This schbench hunk is just testing the performance impact of different
> >> bucket sizes, but hopefully we don't need it long term unless we want to
> >> play with even bigger hash tables?
> > 
> > If you do "-H 0" then you should get the "old" behaviour. However the hash
> > bucket are spread now over the NUMA nodes:
> > 
> > | dmesg |grep -i futex
> > | [    0.501736] futex hash table entries: 32768 (2097152 bytes on 2 NUMA nodes, total 4096 KiB, linear).
> > 
> > Now there are 32768 hash buckets on both NUMA nodes. Depending on the
> > hash it computes, it uses the data structures on NUMA node 1 or 2. The
> > old code allocated 65536 hash buckets via vmalloc().
> 
> So I wanted to mention this earlier and forgot, but schbench obviously
> isn't numa aware at all.  This combination of command line options
> basically just has the 4 message threads waking up the 256 worker
> threads (per message thread) after scribbling a timestamp into shared
> memory.  From a schbench pov at least, we'll get much more stable
> numbers by sticking to a single socket.  <makes numa placement hand
> gestures>

Earlier the global hash was somehow spread via vmalloc(). Now it is
spread slightly differently. If you request the private hash, then it is
allocated on the current node. Unless you move from one node to the
other, this should be good.

> > The bigger hash table isn't always the answer. Yes, you could play
> > around figure out what works best for you. The problem is that the hash
> > is based on the mm and the (user) memory address. So on each run you
> > will get a different hash and therefore different collisions.
> > If you end up having many hash collisions and then block on the same
> > lock then yes, larger hash table will be the cure. If you have many
> > threads doing the futex syscall simultaneously then making the hash
> > immutable avoids two atomic ops on the same memory address.
> > This would be my favorite.
> > 
> > Now that I think about it, it might be possible to move the atomic ops
> > the hash bucket itself. Then it wouldn't bounce the cache line so much.
> > Making the hash immutable is simpler.
> 
> Going back to your diff, if we have a process growing the total number
> of threads, can we set FH_IMMUTABLE too early?  As the number of threads
> increases, eventually we'll pick the 2x num_cpus, but that'll take a while?

If you refer to the schbench diff, then set it early. Once the prctl()
to set the size of the private hash has been issued, there will be no
resize by the kernel.

If you refer to the kernel diff which sets the FH_IMMUTABLE flag, then it
is set once the upper limit is reached (at least that was the plan, in
case I got the logic wrong). Which means at that point it won't increase
any further because of the CPU limit. The only way you can reach it too
early is if you offline CPUs.

> I do see what you mean about immutable being simpler though, I'll get
> some numbers on those atomics.

I have another idea: add a refcount to the hb slot itself. I hope that
it won't be that expensive if the atomic inc/dec occurs on different
memory locations. But then maybe freeze it once the upper limit is
reached. Still, it might help if we don't reach the upper limit.

> -chris

Sebastian


* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-06  7:06         ` Sebastian Andrzej Siewior
@ 2025-06-06 21:06           ` Chris Mason
  2025-06-06 22:17           ` Chris Mason
  2025-06-24 19:01           ` Peter Zijlstra
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Mason @ 2025-06-06 21:06 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Peter Zijlstra, linux-kernel

On 6/6/25 3:06 AM, Sebastian Andrzej Siewior wrote:
> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
>>>> We've got large systems that are basically dedicated to single
>>>> workloads, and those will probably miss the larger global hash table,
>>>> regressing like schbench did.  Then we have large systems spread over
>>>> multiple big workloads that will love the private tables.
>>>>
>>>> In either case, I think growing the hash table as a multiple of thread
>>>> count instead of cpu count will probably better reflect the crazy things
>>>> multi-threaded applications do?  At any rate, I don't think we want
>>>> applications to need prctl to get back to the performance they had on
>>>> older kernels.
>>>
>>> This is only an issue if all you CPUs spend their time in the kernel
>>> using the hash buckets at the same time.
>>> This was the case in every benchmark I've seen so far. Your thing might
>>> be closer to an actual workload.
>>>
>>
>> I didn't spend a ton of time looking at the perf profiles of the slower
>> kernel, was the bottleneck in the hash chain length or in contention for
>> the buckets?
> 
> Every futex operation does a rcuref_get() (which is an atomic inc) on
> the private hash. This is before anything else happens. If you have two
> threads, on two CPUs, which simultaneously do a futex() operation then
> both do this rcuref_get(). That atomic inc ensures that the cacheline
> bounces from one CPU to the other. On the exit of the syscall there is a
> matching rcuref_put().
> 
>>>> For people that want to avoid that memory overhead, I'm assuming they
>>>> want the CONFIG_FUTEX_PRIVATE_HASH off, so the Kconfig help text should
>>>> make that more clear.
>>>>
>>>>> Then there the possibility of 
>>> …
>>>>> 256 cores, 2xNUMA:
>>>>> | average rps: 1 701 947.02 Futex HBs: 0 immutable: 1
>>>>> | average rps:   785 446.07 Futex HBs: 1024 immutable: 0
>>>>> | average rps: 1 586 755.62 Futex HBs: 1024 immutable: 1
>>>>> | average rps:   736 769.77 Futex HBs: 2048 immutable: 0
>>>>> | average rps: 1 555 182.52 Futex HBs: 2048 immutable: 1
>>>>
>>>>
>>>> How long are these runs?  That's a huge benefit from being immutable
>>>> (1.5M vs 736K?) but the hash table churn should be confined to early in
>>>> the schbench run right?
>>>
>>> I think 30 secs or so. I used your command line. 
>>
>> Ah ok, my command line is 60 seconds.  It feels like something is
>> strange for the immutable flag to make it that much faster?  schbench
>> starts all the threads up front, so it should hit steady state pretty
>> quickly.  More on NUMA below, but I'll benchmark with the immutable flag
>> on the turin box in the morning to see if it is the extra atomics.
> 
> That immutable flag makes this rcuref_get()/ put() go away. The price is
> that you can't change the size of the private hash anymore. So if your
> workload works best with a hash size of X and you don't intend to change
> it during the runtime of the program, set the immutable flag.

ok, for the benchmarks, I hard coded the number of buckets at 1024 and
the only thing I flipped was the immutable flag.  I was running on top
of c0c9379f235df33a12ceae94370ad80c5278324d (just today's Linus).

schbench -L -m 4 -M auto -t 256 -n 0 -r 60 -s 0

1024 bucket without immutable:

RPS 2 297 856

1024 with immutable:

RPS 3 665 920

This is pretty similar to what 6.15 obtains, so I'm hoping we can find a
way to get the 6.15 performance levels without the applications needing
to call prctl manually.

I took some profiles on the 1024 + non-immutable

    16.92%  schbench         [kernel.kallsyms]  [k] futex_hash_put
            |
            |--8.93%--fpost
            |
            |--6.99%--xlist_wake_all
            |
             --0.58%--entry_SYSCALL_64
                       syscall

    13.27%  schbench         libc.so.6          [.] syscall
            |
            |--8.90%--xlist_wake_all
            |
             --3.98%--entry_SYSCALL_64
                       syscall

    12.36%  schbench         [kernel.kallsyms]  [k] entry_SYSCALL_64
            |
            |--11.04%--__futex_hash
            |          futex_hash
            |          futex_wake
            |          do_futex
            |          __x64_sys_futex
            |          do_syscall_64
            |          entry_SYSCALL_64_after_hwframe
            |          syscall
            |
             --1.23%--futex_hash
                       futex_wake
                       do_futex
                       __x64_sys_futex
                       do_syscall_64
                       entry_SYSCALL_64_after_hwframe
                       syscall

Percent │    0xffffffff813f2f80 <futex_hash_put>:
   0.05 │    → callq __fentry__
   0.06 │      pushq %rbx
   0.08 │      movq  0x18(%rdi),%rbx
   0.05 │      testq %rbx,%rbx
   0.03 │    ↓ je    15
  12.49 │      cmpb  $0x0,0x21(%rbx)
  12.58 │    ↓ je    17
        │15:   popq  %rbx
        │    ← retq
   0.24 │17:   movl  $0xffffffff,%esi
  12.76 │      lock
        │      xaddl %esi,(%rbx)
   7.80 │   ┌──subl  $0x1,%esi
  22.96 │   ├──js    27
  23.05 │   │  popq  %rbx
   7.85 │   │← retq
        │27:└─→movq  %rbx,%rdi
        │    → callq rcuref_put_slowpath
        │      testb %al,%al
        │    ↑ je    15
        │      movq  0x18(%rbx),%rdi
        │      popq  %rbx
        │    → jmp   wake_up_var

If you like profiles with line numbers:
12830 samples (10.24%) Comms: schbench
                 futex_hash_put @ /root/linux-6.15/kernel/futex/core.c:179:2
                  futex_private_hash_put @ /root/linux-6.15/kernel/futex/core.c:148:6 [inlined]
                  futex_private_hash_put @ /root/linux-6.15/kernel/futex/core.c:153:6 [inlined]
                  rcuref_put @ /root/linux-6.15/./include/linux/rcuref.h:173:13 [inlined]
                  __rcuref_put @ /root/linux-6.15/./include/linux/rcuref.h:110:5 [inlined]
                 futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:200:1
                 do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10
                 __x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1
                  __se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined]
                  __do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined]
                 do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7
                  do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined]
                 entry_SYSCALL_64_after_hwframe
13795 samples (11.01%) Comms: schbench
                 futex_hash @ /root/linux-6.15/kernel/futex/core.c:311:15
                  futex_private_hash_get @ /root/linux-6.15/kernel/futex/core.c:143:5 [inlined]
                 futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:172:2
                  class_hb_constructor @ /root/linux-6.15/kernel/futex/futex.h:242:1 [inlined]
                 do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10
                 __x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1
                  __se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined]
                  __do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined]
                 do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7
                  do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined]
                 entry_SYSCALL_64_after_hwframe
14364 samples (11.46%) Comms: schbench
                 entry_SYSCALL_64
17612 samples (14.05%) Comms: schbench
                 futex_hash @ /root/linux-6.15/kernel/futex/core.c:311:15
                  futex_private_hash_get @ /root/linux-6.15/kernel/futex/core.c:145:9 [inlined]
                  rcuref_get @ /root/linux-6.15/./include/linux/rcuref.h:87:5 [inlined]
                 futex_wake @ /root/linux-6.15/kernel/futex/waitwake.c:172:2
                  class_hb_constructor @ /root/linux-6.15/kernel/futex/futex.h:242:1 [inlined]
                 do_futex @ /root/linux-6.15/kernel/futex/syscalls.c:107:10
                 __x64_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1
                  __se_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:160:1 [inlined]
                  __do_sys_futex @ /root/linux-6.15/kernel/futex/syscalls.c:179:9 [inlined]
                 do_syscall_64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:94:7
                  do_syscall_x64 @ /root/linux-6.15/arch/x86/entry/syscall_64.c:63:12 [inlined]
                 entry_SYSCALL_64_after_hwframe

With immutable, futexes are not in the top 10.

-chris


* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-06  7:06         ` Sebastian Andrzej Siewior
  2025-06-06 21:06           ` Chris Mason
@ 2025-06-06 22:17           ` Chris Mason
  2025-06-24 19:01           ` Peter Zijlstra
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Mason @ 2025-06-06 22:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Peter Zijlstra, linux-kernel

On 6/6/25 3:06 AM, Sebastian Andrzej Siewior wrote:
> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:

[ ... ]

>> Going back to your diff, if we have a process growing the total number
>> of threads, can we set FH_IMMUTABLE too early?  As the number of threads
>> increases, eventually we'll pick the 2x num_cpus, but that'll take a while?
> 
> If you refer to the schbench diff, then set it early. Once the prctl()
> to set the size of the private hash, there will be no resize by the
> kernel.
> 
> If you refer to the kernel diff where set the FH_IMMUTABLE flag, then it
> is set once the upper limit is reached (that was the plan in case I did
> the logic wrong). Which means at that point it won't increase any
> further because of the CPU limit. The only way how you can reach it too
> early is if you offline CPUs.

I pulled your immutable diff on top of c0c9379f235d.  Just the immutable
diff, not the first suggestion that bumped num_online_cpus() * 2.

The RPS did not improve (~2.5M RPS).  Looking at the futex_private_hash
of the process:

>>> task.mm.futex_phash
*(struct futex_private_hash *)0xff110003bbd15000 = {
	.users = (rcuref_t){
		.refcnt = (atomic_t){
			.counter = (int)0,
		},
	},
	.hash_mask = (unsigned int)15,
	.rcu = (struct callback_head){
		.next = (struct callback_head *)0x0,
		.func = (void (*)(struct callback_head *))0x0,
	},
	.mm = (void *)0xff110003321a2400,
	.custom = (bool)0,
	.immutable = (bool)1,
	.queues = (struct futex_hash_bucket []){},
}

It's because our calculation on the max is based on
min(get_nr_threads(current), num_online_cpus())

get_nr_threads starts smaller than num_online_cpus(), so we immediately
decide we've maxed out the buckets?
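
Spelling out the arithmetic with the immutable diff applied (assuming
256 online CPUs, and that futex_hash_allocate_default() runs again
shortly after the initial 16-slot hash has been created): the process
still has only a handful of threads at that point, say 3, so
threads = min(3, 256) = 3 and
buckets = clamp(roundup_pow_of_two(4 * 3), 16, ...) = 16.
current_buckets is also 16 (hash_mask 15 + 1), so
current_buckets >= buckets already holds and FH_IMMUTABLE gets set while
the hash is still at its minimum size.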

-chris



* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-06  7:06         ` Sebastian Andrzej Siewior
  2025-06-06 21:06           ` Chris Mason
  2025-06-06 22:17           ` Chris Mason
@ 2025-06-24 19:01           ` Peter Zijlstra
  2025-06-26 11:01             ` Chris Mason
                               ` (2 more replies)
  2 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-06-24 19:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Chris Mason, linux-kernel, Thomas Gleixner

On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
> > >> We've got large systems that are basically dedicated to single
> > >> workloads, and those will probably miss the larger global hash table,
> > >> regressing like schbench did.  Then we have large systems spread over
> > >> multiple big workloads that will love the private tables.
> > >>
> > >> In either case, I think growing the hash table as a multiple of thread
> > >> count instead of cpu count will probably better reflect the crazy things
> > >> multi-threaded applications do?  At any rate, I don't think we want
> > >> applications to need prctl to get back to the performance they had on
> > >> older kernels.
> > > 
> > > This is only an issue if all you CPUs spend their time in the kernel
> > > using the hash buckets at the same time.
> > > This was the case in every benchmark I've seen so far. Your thing might
> > > be closer to an actual workload.
> > > 
> > 
> > I didn't spend a ton of time looking at the perf profiles of the slower
> > kernel, was the bottleneck in the hash chain length or in contention for
> > the buckets?
> 
> Every futex operation does a rcuref_get() (which is an atomic inc) on
> the private hash. This is before anything else happens. If you have two
> threads, on two CPUs, which simultaneously do a futex() operation then
> both do this rcuref_get(). That atomic inc ensures that the cacheline
> bounces from one CPU to the other. On the exit of the syscall there is a
> matching rcuref_put().

How about something like this (very lightly tested)...

the TL;DR is that it turns all those refcounts into per-cpu ops when
there is no hash replacement pending (eg. the normal case), and only
folds the lot into an atomic when we really care about it.

There's some sharp corners still.. but it boots and survives the
(slightly modified) selftest.

---
 include/linux/futex.h                              |  12 +-
 include/linux/mm_types.h                           |   5 +
 kernel/futex/core.c                                | 238 +++++++++++++++++++--
 .../selftests/futex/functional/futex_priv_hash.c   |  11 +-
 4 files changed, 239 insertions(+), 27 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index b37193653e6b..5f92c7a8ba57 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -85,17 +85,11 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
 void futex_hash_free(struct mm_struct *mm);
-
-static inline void futex_mm_init(struct mm_struct *mm)
-{
-	RCU_INIT_POINTER(mm->futex_phash, NULL);
-	mm->futex_phash_new = NULL;
-	mutex_init(&mm->futex_hash_lock);
-}
+int futex_mm_init(struct mm_struct *mm);
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
 static inline int futex_hash_allocate_default(void) { return 0; }
-static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
 static inline void futex_mm_init(struct mm_struct *mm) { }
 #endif /* CONFIG_FUTEX_PRIVATE_HASH */
 
@@ -118,7 +112,7 @@ static inline int futex_hash_allocate_default(void)
 {
 	return 0;
 }
-static inline void futex_hash_free(struct mm_struct *mm) { }
+static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
 static inline void futex_mm_init(struct mm_struct *mm) { }
 
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6b91e8a66d6..0f0662157066 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1070,6 +1070,11 @@ struct mm_struct {
 		struct mutex			futex_hash_lock;
 		struct futex_private_hash	__rcu *futex_phash;
 		struct futex_private_hash	*futex_phash_new;
+		/* futex-ref */
+		unsigned long			futex_batches;
+		struct rcu_head			futex_rcu;
+		atomic_long_t			futex_atomic;
+		unsigned int			__percpu *futex_ref;
 #endif
 
 		unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 90d53fb0ee9e..cfa7c105a7dc 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -42,7 +42,6 @@
 #include <linux/fault-inject.h>
 #include <linux/slab.h>
 #include <linux/prctl.h>
-#include <linux/rcuref.h>
 #include <linux/mempolicy.h>
 #include <linux/mmap_lock.h>
 
@@ -65,7 +64,7 @@ static struct {
 #define futex_queues	(__futex_data.queues)
 
 struct futex_private_hash {
-	rcuref_t	users;
+	int		state;
 	unsigned int	hash_mask;
 	struct rcu_head	rcu;
 	void		*mm;
@@ -129,6 +128,12 @@ static struct futex_hash_bucket *
 __futex_hash(union futex_key *key, struct futex_private_hash *fph);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
+static bool futex_ref_get(struct futex_private_hash *fph);
+static bool futex_ref_put(struct futex_private_hash *fph);
+static bool futex_ref_is_dead(struct futex_private_hash *fph);
+
+enum { FR_PERCPU = 0, FR_ATOMIC };
+
 static inline bool futex_key_is_private(union futex_key *key)
 {
 	/*
@@ -142,15 +147,14 @@ bool futex_private_hash_get(struct futex_private_hash *fph)
 {
 	if (fph->immutable)
 		return true;
-	return rcuref_get(&fph->users);
+	return futex_ref_get(fph);
 }
 
 void futex_private_hash_put(struct futex_private_hash *fph)
 {
-	/* Ignore return value, last put is verified via rcuref_is_dead() */
 	if (fph->immutable)
 		return;
-	if (rcuref_put(&fph->users))
+	if (futex_ref_put(fph))
 		wake_up_var(fph->mm);
 }
 
@@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
 	fph = rcu_dereference_protected(mm->futex_phash,
 					lockdep_is_held(&mm->futex_hash_lock));
 	if (fph) {
-		if (!rcuref_is_dead(&fph->users)) {
+		if (!futex_ref_is_dead(fph)) {
 			mm->futex_phash_new = new;
 			return false;
 		}
 
 		futex_rehash_private(fph, new);
 	}
-	rcu_assign_pointer(mm->futex_phash, new);
+	new->state = FR_PERCPU;
+	scoped_guard (rcu) {
+		mm->futex_batches = get_state_synchronize_rcu();
+		rcu_assign_pointer(mm->futex_phash, new);
+	}
 	kvfree_rcu(fph, rcu);
 	return true;
 }
@@ -289,9 +297,7 @@ struct futex_private_hash *futex_private_hash(void)
 		if (!fph)
 			return NULL;
 
-		if (fph->immutable)
-			return fph;
-		if (rcuref_get(&fph->users))
+		if (futex_private_hash_get(fph))
 			return fph;
 	}
 	futex_pivot_hash(mm);
@@ -1527,16 +1533,214 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 #define FH_IMMUTABLE	0x02
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
+
+/*
+ * futex-ref
+ *
+ * Heavily inspired by percpu-rwsem/percpu-refcount; not reusing any of that
+ * code because it just doesn't fit right.
+ *
+ * Dual counter, per-cpu / atomic approach like percpu-refcount, except it
+ * re-initializes the state automatically, such that the fph swizzle is also a
+ * transition back to per-cpu.
+ */
+
+static void futex_ref_rcu(struct rcu_head *head);
+
+static void __futex_ref_atomic_begin(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	/*
+	 * The counter we're about to switch to must have fully switched;
+	 * otherwise it would be impossible for it to have reported success
+	 * from futex_ref_is_dead().
+	 */
+	WARN_ON_ONCE(atomic_long_read(&mm->futex_atomic) != 0);
+
+	/*
+	 * Set the atomic to the bias value such that futex_ref_{get,put}()
+	 * will never observe 0. Will be fixed up in __futex_ref_atomic_end()
+	 * when folding in the percpu count.
+	 */
+	atomic_long_set(&mm->futex_atomic, LONG_MAX);
+	smp_store_release(&fph->state, FR_ATOMIC);
+
+	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+}
+
+static void __futex_ref_atomic_end(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+	unsigned int count = 0;
+	long ret;
+	int cpu;
+
+	/*
+	 * Per __futex_ref_atomic_begin() the state of the fph must be ATOMIC
+	 * and per this RCU callback, everybody must now observe this state and
+	 * use the atomic variable.
+	 */
+	WARN_ON_ONCE(fph->state != FR_ATOMIC);
+
+	/*
+	 * Therefore the per-cpu counter is now stable, sum and reset.
+	 */
+	for_each_possible_cpu(cpu) {
+		unsigned int *ptr = per_cpu_ptr(mm->futex_ref, cpu);
+		count += *ptr;
+		*ptr = 0;
+	}
+
+	/*
+	 * Re-init for the next cycle.
+	 */
+	this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+
+	/*
+	 * Add actual count, subtract bias and initial refcount.
+	 *
+	 * The moment this atomic operation happens, futex_ref_is_dead() can
+	 * become true.
+	 */
+	ret = atomic_long_add_return(count - LONG_MAX - 1, &mm->futex_atomic);
+	if (!ret)
+		wake_up_var(mm);
+
+	WARN_ON_ONCE(ret < 0);
+}
+
+static void futex_ref_rcu(struct rcu_head *head)
+{
+	struct mm_struct *mm = container_of(head, struct mm_struct, futex_rcu);
+	struct futex_private_hash *fph = rcu_dereference_raw(mm->futex_phash);
+
+	if (fph->state == FR_PERCPU) {
+		/*
+		 * Per this extra grace-period, everybody must now observe
+		 * fph as the current fph and no previously observed fph's
+		 * are in-flight.
+		 *
+		 * Notably, nobody will now rely on the atomic
+		 * futex_ref_is_dead() state anymore so we can begin the
+		 * migration of the per-cpu counter into the atomic.
+		 */
+		__futex_ref_atomic_begin(fph);
+		return;
+	}
+
+	__futex_ref_atomic_end(fph);
+}
+
+/*
+ * Drop the initial refcount and transition to atomics.
+ */
+static void futex_ref_drop(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	/*
+	 * Can only transition the current fph;
+	 */
+	WARN_ON_ONCE(rcu_dereference_raw(mm->futex_phash) != fph);
+
+	/*
+	 * In order to avoid the following scenario:
+	 *
+	 * futex_hash()			__futex_pivot_hash()
+	 *   guard(rcu);		  guard(mm->futex_hash_lock);
+	 *   fph = mm->futex_phash;
+	 *				  rcu_assign_pointer(&mm->futex_phash, new);
+	 *				futex_hash_allocate()
+	 *				  futex_ref_drop()
+	 *				    fph->state = FR_ATOMIC;
+	 *				    atomic_set(, BIAS);
+	 *
+	 *   futex_private_hash_get(fph); // OOPS
+	 *
+	 * Where an old fph (which is FR_ATOMIC) and should fail on
+	 * inc_not_zero, will succeed because a new transition is started and
+	 * the atomic is bias'ed away from 0.
+	 *
+	 * There must be at least one full grace-period between publishing a
+	 * new fph and trying to replace it.
+	 */
+	if (poll_state_synchronize_rcu(mm->futex_batches)) {
+		/*
+		 * There was a grace-period, we can begin now.
+		 */
+		__futex_ref_atomic_begin(fph);
+		return;
+	}
+
+	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
+}
+
+static bool futex_ref_get(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	guard(preempt)();
+
+	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
+		this_cpu_inc(*mm->futex_ref);
+		return true;
+	}
+
+	return atomic_long_inc_not_zero(&mm->futex_atomic);
+}
+
+static bool futex_ref_put(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	guard(preempt)();
+
+	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
+		this_cpu_dec(*mm->futex_ref);
+		return false;
+	}
+
+	return atomic_long_dec_and_test(&mm->futex_atomic);
+}
+
+static bool futex_ref_is_dead(struct futex_private_hash *fph)
+{
+	struct mm_struct *mm = fph->mm;
+
+	guard(preempt)();
+
+	if (smp_load_acquire(&fph->state) == FR_PERCPU)
+		return false;
+
+	return atomic_long_read(&mm->futex_atomic) == 0;
+}
+
+// TODO propagate error
+int futex_mm_init(struct mm_struct *mm)
+{
+	mutex_init(&mm->futex_hash_lock);
+	RCU_INIT_POINTER(mm->futex_phash, NULL);
+	mm->futex_phash_new = NULL;
+	/* futex-ref */
+	atomic_long_set(&mm->futex_atomic, 0);
+	mm->futex_batches = get_state_synchronize_rcu();
+	mm->futex_ref = alloc_percpu(unsigned int);
+	if (!mm->futex_ref)
+		return -ENOMEM;
+	this_cpu_inc(*mm->futex_ref); /* 0 -> 1 */
+	return 0;
+}
+
 void futex_hash_free(struct mm_struct *mm)
 {
 	struct futex_private_hash *fph;
 
+	free_percpu(mm->futex_ref);
 	kvfree(mm->futex_phash_new);
 	fph = rcu_dereference_raw(mm->futex_phash);
-	if (fph) {
-		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+	if (fph)
 		kvfree(fph);
-	}
 }
 
 static bool futex_pivot_pending(struct mm_struct *mm)
@@ -1549,7 +1753,7 @@ static bool futex_pivot_pending(struct mm_struct *mm)
 		return true;
 
 	fph = rcu_dereference(mm->futex_phash);
-	return rcuref_is_dead(&fph->users);
+	return futex_ref_is_dead(fph);
 }
 
 static bool futex_hash_less(struct futex_private_hash *a,
@@ -1598,11 +1802,11 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 		}
 	}
 
-	fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+	fph = kvzalloc(struct_size(fph, queues, hash_slots),
+		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (!fph)
 		return -ENOMEM;
 
-	rcuref_init(&fph->users, 1);
 	fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
 	fph->custom = custom;
 	fph->immutable = !!(flags & FH_IMMUTABLE);
@@ -1645,7 +1849,7 @@ static int futex_hash_allocate(unsigned int hash_slots, unsigned int flags)
 				 * allocated a replacement hash, drop the initial
 				 * reference on the existing hash.
 				 */
-				futex_private_hash_put(cur);
+				futex_ref_drop(cur);
 			}
 
 			if (new) {
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index 24a92dc94eb8..1238a7178715 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -97,6 +97,7 @@ static void create_max_threads(void *(*thread_fn)(void *))
 		ret = pthread_create(&threads[i], NULL, thread_fn, NULL);
 		if (ret)
 			ksft_exit_fail_msg("pthread_create failed: %m\n");
+		usleep(5000);
 	}
 }
 
@@ -131,6 +132,7 @@ int main(int argc, char *argv[])
 	int use_global_hash = 0;
 	int ret;
 	int c;
+	int retry = 2;
 
 	while ((c = getopt(argc, argv, "cghv:")) != -1) {
 		switch (c) {
@@ -208,11 +210,18 @@ int main(int argc, char *argv[])
 	 */
 	ksft_print_msg("Online CPUs: %d\n", online_cpus);
 	if (online_cpus > 16) {
+again:
 		futex_slotsn = futex_hash_slots_get();
 		if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) {
 			ksft_print_msg("Expected increase of hash buckets but got: %d -> %d\n",
 				       futex_slots1, futex_slotsn);
-			ksft_exit_fail_msg(test_msg_auto_inc);
+			if (--retry) {
+				usleep(500000);
+				goto again;
+			} else {
+				ksft_test_result_fail(test_msg_auto_inc);
+//				ksft_exit_fail_msg(test_msg_auto_inc);
+			}
 		}
 		ksft_test_result_pass(test_msg_auto_inc);
 	} else {

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-24 19:01           ` Peter Zijlstra
@ 2025-06-26 11:01             ` Chris Mason
  2025-06-26 13:17               ` Peter Zijlstra
  2025-06-26 13:48             ` futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Sebastian Andrzej Siewior
  2025-06-27 22:48             ` Tim Chen
  2 siblings, 1 reply; 22+ messages in thread
From: Chris Mason @ 2025-06-26 11:01 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior; +Cc: linux-kernel, Thomas Gleixner

On 6/24/25 3:01 PM, Peter Zijlstra wrote:
> On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote:
>> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
>>>>> We've got large systems that are basically dedicated to single
>>>>> workloads, and those will probably miss the larger global hash table,
>>>>> regressing like schbench did.  Then we have large systems spread over
>>>>> multiple big workloads that will love the private tables.
>>>>>
>>>>> In either case, I think growing the hash table as a multiple of thread
>>>>> count instead of cpu count will probably better reflect the crazy things
>>>>> multi-threaded applications do?  At any rate, I don't think we want
>>>>> applications to need prctl to get back to the performance they had on
>>>>> older kernels.
>>>>
>>>>> This is only an issue if all your CPUs spend their time in the kernel
>>>> using the hash buckets at the same time.
>>>> This was the case in every benchmark I've seen so far. Your thing might
>>>> be closer to an actual workload.
>>>>
>>>
>>> I didn't spend a ton of time looking at the perf profiles of the slower
>>> kernel, was the bottleneck in the hash chain length or in contention for
>>> the buckets?
>>
>> Every futex operation does a rcuref_get() (which is an atomic inc) on
>> the private hash. This is before anything else happens. If you have two
>> threads, on two CPUs, which simultaneously do a futex() operation then
>> both do this rcuref_get(). That atomic inc ensures that the cacheline
>> bounces from one CPU to the other. On the exit of the syscall there is a
>> matching rcuref_put().
> 
> How about something like this (very lightly tested)...
> 
> the TL;DR is that it turns all those refcounts into per-cpu ops when
> there is no hash replacement pending (eg. the normal case), and only
> folds the lot into an atomic when we really care about it.
> 
> There's some sharp corners still.. but it boots and survives the
> (slightly modified) selftest.

I can get some benchmarks going with this, thanks.  For 6.16, is the goal
to put something like this in, or default to the global hash table until
we've nailed it down?

I'd vote for defaulting to global for one more release.

-chris


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-26 11:01             ` Chris Mason
@ 2025-06-26 13:17               ` Peter Zijlstra
  2025-06-26 13:50                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-06-26 13:17 UTC (permalink / raw)
  To: Chris Mason; +Cc: Sebastian Andrzej Siewior, linux-kernel, Thomas Gleixner

On Thu, Jun 26, 2025 at 07:01:23AM -0400, Chris Mason wrote:
> On 6/24/25 3:01 PM, Peter Zijlstra wrote:
> > On Fri, Jun 06, 2025 at 09:06:38AM +0200, Sebastian Andrzej Siewior wrote:
> >> On 2025-06-05 20:55:27 [-0400], Chris Mason wrote:
> >>>>> We've got large systems that are basically dedicated to single
> >>>>> workloads, and those will probably miss the larger global hash table,
> >>>>> regressing like schbench did.  Then we have large systems spread over
> >>>>> multiple big workloads that will love the private tables.
> >>>>>
> >>>>> In either case, I think growing the hash table as a multiple of thread
> >>>>> count instead of cpu count will probably better reflect the crazy things
> >>>>> multi-threaded applications do?  At any rate, I don't think we want
> >>>>> applications to need prctl to get back to the performance they had on
> >>>>> older kernels.
> >>>>
> >>>> This is only an issue if all your CPUs spend their time in the kernel
> >>>> using the hash buckets at the same time.
> >>>> This was the case in every benchmark I've seen so far. Your thing might
> >>>> be closer to an actual workload.
> >>>>
> >>>
> >>> I didn't spend a ton of time looking at the perf profiles of the slower
> >>> kernel, was the bottleneck in the hash chain length or in contention for
> >>> the buckets?
> >>
> >> Every futex operation does a rcuref_get() (which is an atomic inc) on
> >> the private hash. This is before anything else happens. If you have two
> >> threads, on two CPUs, which simultaneously do a futex() operation then
> >> both do this rcuref_get(). That atomic inc ensures that the cacheline
> >> bounces from one CPU to the other. On the exit of the syscall there is a
> >> matching rcuref_put().
> > 
> > How about something like this (very lightly tested)...
> > 
> > the TL;DR is that it turns all those refcounts into per-cpu ops when
> > there is no hash replacement pending (eg. the normal case), and only
> > folds the lot into an atomic when we really care about it.
> > 
> > There's some sharp corners still.. but it boots and survives the
> > (slightly modified) selftest.
> 
> I can get some benchmarks going with this, thanks.  For 6.16, is the goal
> to put something like this in, or default to the global hash table until
> we've nailed it down?
> 
> I'd vote for defaulting to global for one more release.

Probably best to do that; means we don't have to rush crazy code :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-24 19:01           ` Peter Zijlstra
  2025-06-26 11:01             ` Chris Mason
@ 2025-06-26 13:48             ` Sebastian Andrzej Siewior
  2025-06-26 14:36               ` Peter Zijlstra
  2025-06-27 22:48             ` Tim Chen
  2 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-26 13:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner

On 2025-06-24 21:01:18 [+0200], Peter Zijlstra wrote:
> How about something like this (very lightly tested)...
> 
> the TL;DR is that it turns all those refcounts into per-cpu ops when
> there is no hash replacement pending (eg. the normal case), and only
> folds the lot into an atomic when we really care about it.

so we have a per-CPU counter and on resize we wait one RCU grace period
to ensure everybody observed the current fph, then switch to atomics and
wait one grace period to ensure everyone is using atomics. The last step
is to align the atomic counter with the per-CPU counters and, once the
counter reaches 0, perform the swap.

This looks fine :) Due to the RCU grace period, the swap takes longer
than before. I guess that is why you said earlier to use srcu. For
things like "hackbench -T" you end up creating a new hash on every
thread creation which is not applied because RCU takes a while.
This could be optimized later by checking if the hash in futex_phash_new
matches the requested size.
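Something along these lines, just as an illustration -- a hypothetical
sketch, not part of any posted patch, placed early in futex_hash_allocate():

	/*
	 * Hypothetical sketch: reuse a still-pending replacement hash if it
	 * already has the requested number of slots, instead of queueing yet
	 * another allocation behind the RCU grace period.
	 */
	scoped_guard(mutex, &mm->futex_hash_lock) {
		struct futex_private_hash *pending = mm->futex_phash_new;

		if (pending && pending->hash_mask + 1 == hash_slots)
			return 0;	/* a matching resize is already queued */
	}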

> There's some sharp corners still.. but it boots and survives the
> (slightly modified) selftest.

The refcount does not pop up in perf so that is good.

> --- a/kernel/futex/core.c
> +++ b/kernel/futex/core.c
> @@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
>  	fph = rcu_dereference_protected(mm->futex_phash,
>  					lockdep_is_held(&mm->futex_hash_lock));
>  	if (fph) {
> -		if (!rcuref_is_dead(&fph->users)) {
> +		if (!futex_ref_is_dead(fph)) {
>  			mm->futex_phash_new = new;
>  			return false;
>  		}
>  
>  		futex_rehash_private(fph, new);
>  	}
> -	rcu_assign_pointer(mm->futex_phash, new);
> +	new->state = FR_PERCPU;
> +	scoped_guard (rcu) {

Do we put a space there or not? It looks like sched/ does while the
remaining bits of the kernel mostly don't. I don't really care, but we
could (later) settle futex on one style.

> +		mm->futex_batches = get_state_synchronize_rcu();
> +		rcu_assign_pointer(mm->futex_phash, new);
> +	}
>  	kvfree_rcu(fph, rcu);
>  	return true;
>  }
> +static void futex_ref_drop(struct futex_private_hash *fph)
> +	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);

Do you think it would improve with srcu or it is not worth it?

> +}
> +
> +static bool futex_ref_get(struct futex_private_hash *fph)
> +{
> +	struct mm_struct *mm = fph->mm;
> +
> +	guard(preempt)();
> +
> +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> +		this_cpu_inc(*mm->futex_ref);
> +		return true;
> +	}
> +
> +	return atomic_long_inc_not_zero(&mm->futex_atomic);
> +}
> +
> +static bool futex_ref_put(struct futex_private_hash *fph)
> +{
> +	struct mm_struct *mm = fph->mm;
> +
> +	guard(preempt)();
> +
> +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> +		this_cpu_dec(*mm->futex_ref);
> +		return false;
> +	}
> +
> +	return atomic_long_dec_and_test(&mm->futex_atomic);
> +}
> +
> +static bool futex_ref_is_dead(struct futex_private_hash *fph)
> +{
> +	struct mm_struct *mm = fph->mm;
> +
> +	guard(preempt)();
> +
> +	if (smp_load_acquire(&fph->state) == FR_PERCPU)
> +		return false;
> +
> +	return atomic_long_read(&mm->futex_atomic) == 0;
> +}

Why preempt_disable()? Is it just an optimized version of
rcu_read_lock()? I don't understand why. You don't even go for
__this_cpu_inc(), so I am a bit puzzled.

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-26 13:17               ` Peter Zijlstra
@ 2025-06-26 13:50                 ` Sebastian Andrzej Siewior
  2025-06-27 11:04                   ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-26 13:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner

On 2025-06-26 15:17:15 [+0200], Peter Zijlstra wrote:
> > I'd vote for defaulting to global for one more release.
> 
> Probably best to do that; means we don't have to rush crazy code :-)

If we do that, we need to ensure that the private hash is not
deployed while a thread is being created, and that the prctl() function
can only create the private hash while the application is still single
threaded.
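E.g. something along these lines in the prctl path -- a hypothetical
sketch, not from any posted patch:

	/*
	 * Hypothetical: refuse to create a private hash on request once the
	 * process has more than one thread, so a private hash can never show
	 * up underneath already running threads.
	 */
	if ((flags & FH_CUSTOM) && get_nr_threads(current) > 1)
		return -EBUSY;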

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-26 13:48             ` futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Sebastian Andrzej Siewior
@ 2025-06-26 14:36               ` Peter Zijlstra
  2025-06-27 12:24                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-06-26 14:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Chris Mason, linux-kernel, Thomas Gleixner

On Thu, Jun 26, 2025 at 03:48:20PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-24 21:01:18 [+0200], Peter Zijlstra wrote:
> > How about something like this (very lightly tested)...
> > 
> > the TL;DR is that it turns all those refcounts into per-cpu ops when
> > there is no hash replacement pending (eg. the normal case), and only
> > folds the lot into an atomic when we really care about it.
> 
> so we have a per-CPU counter and on resize we wait one RCU grace period
> to ensure everybody observed the current fph, then switch to atomics and
> wait one grace period to ensure everyone is using atomics. The last step
> is to align the atomic counter with the per-CPU counters and, once the
> counter reaches 0, perform the swap.
> 
> This looks fine :) Due to the RCU grace period, the swap takes longer
> than before. 

Yes, but given this is supposed to be 'rare', I didn't think that was a
problem.

> I guess that is why you said earlier to use srcu. For
> things like "hackbench -T" you end up creating a new hash on every
> thread creation which is not applied because RCU takes a while.
> This could be optimized later by checking if the hash in futex_phash_new
> matches the requested size.

So the main benefit of using SRCU (I have half a patch and a head-ache
to show for it) is that you don't need the per-mm-per-cpu memory
storage. The down-side is that you share the 'refcount' between all mm's
so it gets to be even slower.

The 'problem' is that reasoning about stuff becomes even harder. But the
main idea is to replace mm->futex_phash with a 'tombstone' value to
force __futex_hash() into taking the slow-path and hitting
mm->futex_hash_lock.

Then you do call_srcu() to wait one SRCU period, this has everybody
stalled, and guarantees the fph you had is now unused so we can rehash.
Then replace the tombstone with the new hash and start over.

It's just that stuff like futex_hash_get() gets somewhat tricky, you
have to ensure the SRCU periods overlap.

Anyway, that approach should be feasible as well, just not sure of the
trade-offs.
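Just to make the idea concrete, a hand-wavy sketch -- it assumes a futex-wide
srcu_struct and a FUTEX_TOMBSTONE sentinel, neither of which exists in the
posted patch, and it waits synchronously instead of using call_srcu():

	static void futex_pivot_hash_srcu(struct mm_struct *mm,
					  struct futex_private_hash *new)
	{
		struct futex_private_hash *old;

		lockdep_assert_held(&mm->futex_hash_lock);

		/* Force __futex_hash() into the slow path under futex_hash_lock. */
		old = rcu_replace_pointer(mm->futex_phash, FUTEX_TOMBSTONE,
					  lockdep_is_held(&mm->futex_hash_lock));

		/* One SRCU period: nobody can still be using 'old' after this. */
		synchronize_srcu(&futex_srcu);

		futex_rehash_private(old, new);
		rcu_assign_pointer(mm->futex_phash, new);
		kvfree_rcu(old, rcu);
	}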

> > There's some sharp corners still.. but it boots and survives the
> > (slightly modified) selftest.
> 
> The refcount does not pop up in perf so that is good.

\o/

> > --- a/kernel/futex/core.c
> > +++ b/kernel/futex/core.c
> > @@ -243,14 +247,18 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
> >  	fph = rcu_dereference_protected(mm->futex_phash,
> >  					lockdep_is_held(&mm->futex_hash_lock));
> >  	if (fph) {
> > -		if (!rcuref_is_dead(&fph->users)) {
> > +		if (!futex_ref_is_dead(fph)) {
> >  			mm->futex_phash_new = new;
> >  			return false;
> >  		}
> >  
> >  		futex_rehash_private(fph, new);
> >  	}
> > -	rcu_assign_pointer(mm->futex_phash, new);
> > +	new->state = FR_PERCPU;
> > +	scoped_guard (rcu) {
> 
> We do space or we don't? It looks like sched/ does while the remaining
> bits of the kernel mostly don't. I don't care but we could (later)
> adjust it for futex towards one direction.

Oh, scoped_guard expands to a for loop, and we write for loops with a
space after the keyword. Perhaps it's because I wrote it all and know
this. But yeah, meh :-)

> > +		mm->futex_batches = get_state_synchronize_rcu();
> > +		rcu_assign_pointer(mm->futex_phash, new);
> > +	}
> >  	kvfree_rcu(fph, rcu);
> >  	return true;
> >  }
> …
> > +static void futex_ref_drop(struct futex_private_hash *fph)
> …
> > +	call_rcu_hurry(&mm->futex_rcu, futex_ref_rcu);
> 
> Do you think it would improve with srcu or it is not worth it?

Per the above, I'm not sure about the trade-offs. I think the SRCU
approach can be made to work -- I just got my head in a twist.

> > +}
> > +
> > +static bool futex_ref_get(struct futex_private_hash *fph)
> > +{
> > +	struct mm_struct *mm = fph->mm;
> > +
> > +	guard(preempt)();
> > +
> > +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > +		this_cpu_inc(*mm->futex_ref);
> > +		return true;
> > +	}
> > +
> > +	return atomic_long_inc_not_zero(&mm->futex_atomic);
> > +}
> > +
> > +static bool futex_ref_put(struct futex_private_hash *fph)
> > +{
> > +	struct mm_struct *mm = fph->mm;
> > +
> > +	guard(preempt)();
> > +
> > +	if (smp_load_acquire(&fph->state) == FR_PERCPU) {
> > +		this_cpu_dec(*mm->futex_ref);
> > +		return false;
> > +	}
> > +
> > +	return atomic_long_dec_and_test(&mm->futex_atomic);
> > +}
> > +
> > +static bool futex_ref_is_dead(struct futex_private_hash *fph)
> > +{
> > +	struct mm_struct *mm = fph->mm;
> > +
> > +	guard(preempt)();
> > +
> > +	if (smp_load_acquire(&fph->state) == FR_PERCPU)
> > +		return false;
> > +
> > +	return atomic_long_read(&mm->futex_atomic) == 0;
> > +}
> 
> Why preempt_disable()? Is it just an optimized version of
> rcu_read_lock()? I don't understand why. You don't even go for
> __this_cpu_inc(), so I am a bit puzzled.

The code morphed a lot over the two days it took to write this. But
yeah, preempt or rcu doesn't really matter here.

I didn't yet think about __this_cpu, it's not really relevant on x86. If
you want to make that relaxation you have to consider IRQ and SoftIRQ
handling though. Given we have this_cpu_inc() in the RCU callback, which
is run from SoftIRQ context, this might not be safe.
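Roughly, the difference being (illustration only, not code from the patch):

	/*
	 * this_cpu_inc() is guaranteed safe against interrupts and softirqs
	 * touching the same per-cpu counter (a single instruction on x86);
	 * __this_cpu_inc() only relies on the preemption protection the
	 * caller already provides, so on architectures where it is a plain
	 * read-modify-write an interrupt can corrupt the count.
	 */
	this_cpu_inc(*mm->futex_ref);	/* safe vs. the softirq RCU callback */
	__this_cpu_inc(*mm->futex_ref);	/* not safe if a softirq updates it too */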


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-26 13:50                 ` Sebastian Andrzej Siewior
@ 2025-06-27 11:04                   ` Peter Zijlstra
  2025-06-27 12:14                     ` Sebastian Andrzej Siewior
  2025-06-30 14:50                     ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-06-27 11:04 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: Chris Mason, linux-kernel, Thomas Gleixner

On Thu, Jun 26, 2025 at 03:50:34PM +0200, Sebastian Andrzej Siewior wrote:
> On 2025-06-26 15:17:15 [+0200], Peter Zijlstra wrote:
> > > I'd vote for defaulting to global for one more release.
> > 
> > Probably best to do that; means we don't have to rush crazy code :-)
> 
> If we do that, we need to ensure that the private hash is not
> deployed while a thread is being created, and that the prctl() function
> can only create the private hash while the application is still single
> threaded.

I was thinking something like so..

---
diff --git a/init/Kconfig b/init/Kconfig
index af4c2f085455..666783eb50ab 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1716,9 +1716,13 @@ config FUTEX_PI
 	depends on FUTEX && RT_MUTEXES
 	default y
 
+#
+# marked broken for performance reasons; gives us one more cycle to sort things out.
+#
 config FUTEX_PRIVATE_HASH
 	bool
 	depends on FUTEX && !BASE_SMALL && MMU
+	depends on BROKEN
 	default y
 
 config FUTEX_MPOL

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-27 11:04                   ` Peter Zijlstra
@ 2025-06-27 12:14                     ` Sebastian Andrzej Siewior
  2025-06-30 14:50                     ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior
  1 sibling, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-27 12:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner

On 2025-06-27 13:04:57 [+0200], Peter Zijlstra wrote:
> On Thu, Jun 26, 2025 at 03:50:34PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2025-06-26 15:17:15 [+0200], Peter Zijlstra wrote:
> > > > I'd vote for defaulting to global for one more release.
> > > 
> > > Probably best to do that; means we don't have to rush crazy code :-)
> > 
> > If we do that, we need to ensure that the private hash is not
> > deployed while a thread is being created, and that the prctl() function
> > can only create the private hash while the application is still single
> > threaded.
> 
> I was thinking something like so..

Okay. That increases the motivation to fix it in one cycle. 
Anything you want me to do here?

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-26 14:36               ` Peter Zijlstra
@ 2025-06-27 12:24                 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-27 12:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner

On 2025-06-26 16:36:38 [+0200], Peter Zijlstra wrote:
> > I guess that is why you said earlier to use srcu. For
> > things like "hackbench -T" you end up creating a new hash on every
> > thread creation which is not applied because RCU takes a while.
> > This could be optimized later by checking if the hash in futex_phash_new
> > matches the requested size.
> 
> So the main benefit of using SRCU (I have half a patch and a head-ache
> to show for it) is that you don't need the per-mm-per-cpu memory
> storage. The down-side is that you share the 'refcount' between all mm's
> so it gets to be even slower.

I see. So it is not just s/rcu/srcu/ for the hopefully quicker grace
period since it does not affect the whole system. It is a different
algorithm.

> The 'problem' is that reasoning about stuff becomes even harder. But the
> main idea is to replace mm->futex_phash with a 'tombstone' value to
> force __futex_hash() into taking the slow-path and hitting
> mm->futex_hash_lock.
> 
> Then you do call_srcu() to wait one SRCU period, this has everybody
> stalled, and guarantees the fph you had is now unused so we can rehash.
> Then replace the tombstone with the new hash and start over.
> 
> It's just that stuff like futex_hash_get() gets somewhat tricky, you
> have to ensure the SRCU periods overlap.
> 
> Anyway, that approach should be feasible as well, just not sure of the
> trade-offs.

The current looks reasonable enough and it survived yesterday's testing.

> > Why preempt_disable()? Is it just an optimized version of
> > rcu_read_lock()? I don't understand why. You don't even go for
> > __this_cpu_inc(), so I am a bit puzzled.
> 
> The code morphed a lot over the two days it took to write this. But
> yeah, preempt or rcu doesn't really matter here.
> 
> I didn't yet think about __this_cpu, it's not really relevant on x86. If
> you want to make that relaxation you have to consider IRQ and SoftIRQ
> handling though. Given we have this_cpu_inc() in the RCU callback, which
> is run from SoftIRQ context, this might not be safe.

Nah, I am all for a simple guard(rcu) here.

Sebastian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-24 19:01           ` Peter Zijlstra
  2025-06-26 11:01             ` Chris Mason
  2025-06-26 13:48             ` futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Sebastian Andrzej Siewior
@ 2025-06-27 22:48             ` Tim Chen
  2025-06-28  8:23               ` Peter Zijlstra
  2 siblings, 1 reply; 22+ messages in thread
From: Tim Chen @ 2025-06-27 22:48 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: Chris Mason, linux-kernel, Thomas Gleixner

On Tue, 2025-06-24 at 21:01 +0200, Peter Zijlstra wrote:
> 
> How about something like this (very lightly tested)...
> 
> the TL;DR is that it turns all those refcounts into per-cpu ops when
> there is no hash replacement pending (eg. the normal case), and only
> folds the lot into an atomic when we really care about it.
> 
> There's some sharp corners still.. but it boots and survives the
> (slightly modified) selftest.
> 
> ---
>  include/linux/futex.h                              |  12 +-
>  include/linux/mm_types.h                           |   5 +
>  kernel/futex/core.c                                | 238 +++++++++++++++++++--
>  .../selftests/futex/functional/futex_priv_hash.c   |  11 +-
>  4 files changed, 239 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index b37193653e6b..5f92c7a8ba57 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -85,17 +85,11 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
>  #ifdef CONFIG_FUTEX_PRIVATE_HASH
>  int futex_hash_allocate_default(void);
>  void futex_hash_free(struct mm_struct *mm);
> -
> -static inline void futex_mm_init(struct mm_struct *mm)
> -{
> -	RCU_INIT_POINTER(mm->futex_phash, NULL);
> -	mm->futex_phash_new = NULL;
> -	mutex_init(&mm->futex_hash_lock);
> -}
> +int futex_mm_init(struct mm_struct *mm);
>  
>  #else /* !CONFIG_FUTEX_PRIVATE_HASH */
>  static inline int futex_hash_allocate_default(void) { return 0; }
> -static inline void futex_hash_free(struct mm_struct *mm) { }
> +static inline int futex_hash_free(struct mm_struct *mm) { return 0; }

Minor nit.

futex_hash_free() is defined to return void for the CONFIG_FUTEX_PRIVATE_HASH
config, but it is now defined to return int for the !CONFIG_FUTEX_PRIVATE_HASH
and !CONFIG_FUTEX configs. Nobody seems to actually check the return value,
so it would make more sense to keep the return type void here so that it
stays consistent?
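
I.e. just keep the stub as it was before this patch:

	static inline void futex_hash_free(struct mm_struct *mm) { }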

Tim

>  static inline void futex_mm_init(struct mm_struct *mm) { }
>  #endif /* CONFIG_FUTEX_PRIVATE_HASH */
>  
> @@ -118,7 +112,7 @@ static inline int futex_hash_allocate_default(void)
>  {
>  	return 0;
>  }
> -static inline void futex_hash_free(struct mm_struct *mm) { }
> +static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
>  static inline void futex_mm_init(struct mm_struct *mm) { }
>  
>  #endif


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: futex performance regression from "futex: Allow automatic allocation of process wide futex hash"
  2025-06-27 22:48             ` Tim Chen
@ 2025-06-28  8:23               ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-06-28  8:23 UTC (permalink / raw)
  To: Tim Chen
  Cc: Sebastian Andrzej Siewior, Chris Mason, linux-kernel,
	Thomas Gleixner

On Fri, Jun 27, 2025 at 03:48:12PM -0700, Tim Chen wrote:
> On Tue, 2025-06-24 at 21:01 +0200, Peter Zijlstra wrote:
> > 
> > How about something like this (very lightly tested)...
> > 
> > the TL;DR is that it turns all those refcounts into per-cpu ops when
> > there is no hash replacement pending (eg. the normal case), and only
> > folds the lot into an atomic when we really care about it.
> > 
> > There's some sharp corners still.. but it boots and survives the
> > (slightly modified) selftest.
> > 
> > ---
> >  include/linux/futex.h                              |  12 +-
> >  include/linux/mm_types.h                           |   5 +
> >  kernel/futex/core.c                                | 238 +++++++++++++++++++--
> >  .../selftests/futex/functional/futex_priv_hash.c   |  11 +-
> >  4 files changed, 239 insertions(+), 27 deletions(-)
> > 
> > diff --git a/include/linux/futex.h b/include/linux/futex.h
> > index b37193653e6b..5f92c7a8ba57 100644
> > --- a/include/linux/futex.h
> > +++ b/include/linux/futex.h
> > @@ -85,17 +85,11 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
> >  #ifdef CONFIG_FUTEX_PRIVATE_HASH
> >  int futex_hash_allocate_default(void);
> >  void futex_hash_free(struct mm_struct *mm);
> > -
> > -static inline void futex_mm_init(struct mm_struct *mm)
> > -{
> > -	RCU_INIT_POINTER(mm->futex_phash, NULL);
> > -	mm->futex_phash_new = NULL;
> > -	mutex_init(&mm->futex_hash_lock);
> > -}
> > +int futex_mm_init(struct mm_struct *mm);
> >  
> >  #else /* !CONFIG_FUTEX_PRIVATE_HASH */
> >  static inline int futex_hash_allocate_default(void) { return 0; }
> > -static inline void futex_hash_free(struct mm_struct *mm) { }
> > +static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
> 
> Minor nit.
> 
> futex_hash_free() is defined to return void for CONFIG_FUTEX_PRIVATE_HASH
> config. But is now defined to return int for !CONFIG_FUTEX_PRIVATE_HASH and !CONFIG_FUTEX configs.
> But it seems like nobody is actually checking the return value.  It
> seems to make more sense to keep the return value as void here so
> the return value is consistent?

Yeah, not sure what happened there. There's a TODO item to actually make
sure someone looks at the futex_mm_init() return value, I guess that's a
good moment to clean this up too :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH
  2025-06-27 11:04                   ` Peter Zijlstra
  2025-06-27 12:14                     ` Sebastian Andrzej Siewior
@ 2025-06-30 14:50                     ` Sebastian Andrzej Siewior
  2025-07-01 13:13                       ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
  2025-07-16  1:34                       ` [PATCH] " kernel test robot
  1 sibling, 2 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-06-30 14:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Chris Mason, linux-kernel, Thomas Gleixner

Chris Mason reported a performance regression on big iron. Reports of
this kind usually come from micro benchmarks, but Chris' test mimics his
real workload, which makes it a real regression.

The root cause is rcuref_get() which is invoked during each futex
operation. If all threads of an application do this simultaneously then
it leads to cache line bouncing and the performance drops.

Disable FUTEX_PRIVATE_HASH entirely for this cycle. The performance
regression will be addressed in the following cycle enabling the option
again.

Reported-by: Chris Mason <clm@meta.com>
Closes: https://lore.kernel.org/all/3ad05298-351e-4d61-9972-ca45a0a50e33@meta.com/
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 init/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index af4c2f0854554..666783eb50abd 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1716,9 +1716,13 @@ config FUTEX_PI
 	depends on FUTEX && RT_MUTEXES
 	default y
 
+#
+# marked broken for performance reasons; gives us one more cycle to sort things out.
+#
 config FUTEX_PRIVATE_HASH
 	bool
 	depends on FUTEX && !BASE_SMALL && MMU
+	depends on BROKEN
 	default y
 
 config FUTEX_MPOL
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip: locking/urgent] futex: Temporary disable FUTEX_PRIVATE_HASH
  2025-06-30 14:50                     ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior
@ 2025-07-01 13:13                       ` tip-bot2 for Sebastian Andrzej Siewior
  2025-07-16  1:34                       ` [PATCH] " kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2025-07-01 13:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Chris Mason, Sebastian Andrzej Siewior, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID:     9a57c3773152a3ff2c35cc8325e088d011c9f83b
Gitweb:        https://git.kernel.org/tip/9a57c3773152a3ff2c35cc8325e088d011c9f83b
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Mon, 30 Jun 2025 16:50:34 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 01 Jul 2025 15:02:05 +02:00

futex: Temporary disable FUTEX_PRIVATE_HASH

Chris Mason reported a performance regression on big iron. Reports of
this kind usually come from micro benchmarks, but Chris' test mimics his
real workload, which makes it a real regression.

The root cause is rcuref_get() which is invoked during each futex
operation. If all threads of an application do this simultaneously then
it leads to cache line bouncing and the performance drops.

Disable FUTEX_PRIVATE_HASH entirely for this cycle. The performance
regression will be addressed in the following cycle enabling the option
again.

Closes: https://lore.kernel.org/all/3ad05298-351e-4d61-9972-ca45a0a50e33@meta.com/
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250630145034.8JnINEaS@linutronix.de
---
 init/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index af4c2f0..666783e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1716,9 +1716,13 @@ config FUTEX_PI
 	depends on FUTEX && RT_MUTEXES
 	default y
 
+#
+# marked broken for performance reasons; gives us one more cycle to sort things out.
+#
 config FUTEX_PRIVATE_HASH
 	bool
 	depends on FUTEX && !BASE_SMALL && MMU
+	depends on BROKEN
 	default y
 
 config FUTEX_MPOL

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH
  2025-06-30 14:50                     ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior
  2025-07-01 13:13                       ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
@ 2025-07-16  1:34                       ` kernel test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2025-07-16  1:34 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: oe-lkp, lkp, Chris Mason, linux-kernel, Peter Zijlstra,
	Thomas Gleixner, oliver.sang



Hello,


From the commit message, we know this 'temporary disable' is meant to address
a performance regression, so we are still sending out this report FYI on the
possible performance impact. However, our team focuses on micro benchmarks,
so, anyway, just FYI.


kernel test robot noticed a 1.9% improvement of perf-bench-futex.ops/s on:


commit: bc1aa469e545fe16a62d501e095630cccc3fe1c4 ("[PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH")
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/futex-Temporary-disable-FUTEX_PRIVATE_HASH/20250630-225317
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git d0b3b7b22dfa1f4b515fd3a295b3fd958f9e81af
patch link: https://lore.kernel.org/all/20250630145034.8JnINEaS@linutronix.de/
patch subject: [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH

testcase: perf-bench-futex
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 192 threads 2 sockets Intel(R) Xeon(R) 6740E  CPU @ 2.4GHz (Sierra Forest) with 256G memory
parameters:

	runtime: 300s
	nr_task: 100%
	test: hash
	shared: shared
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250716/202507160223.2a483e8b-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/shared/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/300s/shared/lkp-srf-2sp2/hash/perf-bench-futex

commit: 
  v6.16-rc4
  bc1aa469e5 ("futex: Temporary disable FUTEX_PRIVATE_HASH")

       v6.16-rc4 bc1aa469e545fe16a62d501e095 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   2249734            +1.9%    2291792        perf-bench-futex.ops/s
      6622            +3.7%       6868        perf-bench-futex.time.user_time
    115.83 ± 12%     +40.1%     162.33 ± 18%  perf-sched.wait_and_delay.count.irqentry_exit_to_user_mode.asm_sysvec_call_function_single.[unknown].[unknown]
      4.40 ± 19%    +343.0%      19.51 ± 64%  perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      4.15 ± 17%    +347.3%      18.55 ± 69%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
     21.00 ± 63%    +121.4%      46.50 ± 11%  perf-c2c.DRAM.local
      4129 ± 57%    +225.5%      13444        perf-c2c.DRAM.remote
     67558 ± 56%     +66.3%     112351        perf-c2c.HITM.local
      4045 ± 57%    +227.2%      13237        perf-c2c.HITM.remote
     71603 ± 57%     +75.4%     125588        perf-c2c.HITM.total
 1.674e+08 ± 20%     +19.0%  1.991e+08        perf-stat.i.cache-misses
 4.096e+08 ± 20%     +19.4%  4.891e+08        perf-stat.i.cache-references
      0.36            +2.9%       0.37        perf-stat.overall.MPKI
      3127            -1.8%       3070        perf-stat.overall.cycles-between-cache-misses
 1.669e+08 ± 20%     +18.9%  1.985e+08        perf-stat.ps.cache-misses
 4.085e+08 ± 20%     +19.3%  4.874e+08        perf-stat.ps.cache-references
 1.622e+14            -1.0%  1.606e+14        perf-stat.total.instructions




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-07-16  1:35 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-03 19:00 futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Chris Mason
2025-06-04  9:28 ` Sebastian Andrzej Siewior
2025-06-04 15:48   ` Chris Mason
2025-06-04 20:08     ` Sebastian Andrzej Siewior
2025-06-06  0:55       ` Chris Mason
2025-06-06  7:06         ` Sebastian Andrzej Siewior
2025-06-06 21:06           ` Chris Mason
2025-06-06 22:17           ` Chris Mason
2025-06-24 19:01           ` Peter Zijlstra
2025-06-26 11:01             ` Chris Mason
2025-06-26 13:17               ` Peter Zijlstra
2025-06-26 13:50                 ` Sebastian Andrzej Siewior
2025-06-27 11:04                   ` Peter Zijlstra
2025-06-27 12:14                     ` Sebastian Andrzej Siewior
2025-06-30 14:50                     ` [PATCH] futex: Temporary disable FUTEX_PRIVATE_HASH Sebastian Andrzej Siewior
2025-07-01 13:13                       ` [tip: locking/urgent] " tip-bot2 for Sebastian Andrzej Siewior
2025-07-16  1:34                       ` [PATCH] " kernel test robot
2025-06-26 13:48             ` futex performance regression from "futex: Allow automatic allocation of process wide futex hash" Sebastian Andrzej Siewior
2025-06-26 14:36               ` Peter Zijlstra
2025-06-27 12:24                 ` Sebastian Andrzej Siewior
2025-06-27 22:48             ` Tim Chen
2025-06-28  8:23               ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).