[PATCH 0/2] execve scalability issues, part 1

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] execve scalability issues, part 1
@ 2023-08-21 20:28 Mateusz Guzik
  2023-08-21 20:28 ` [PATCH 1/2] pcpcntr: add group allocation/free Mateusz Guzik
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-21 20:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: dennis, tj, cl, akpm, shakeelb, linux-mm, Mateusz Guzik

To start I figured I'm going to bench about as friendly case as it gets
-- statically linked *separate* binaries all doing execve in a loop.

I borrowed the bench from found here:
http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)

It prints a result every second (warning: first line is garbage).

My test box is temporarily only 26 cores and even at this scale I run
into massive lock contention stemming from back-to-back calls to
percpu_counter_init (and _destroy later).

While not a panacea, one simple thing to do here is to batch these ops.
Since the term "batching" is already used in the file, I decided to
refer to it as "grouping" instead.

Even if this code could be patched to dodge these counters,  I would
argue a high-traffic alloc/free consumer is only a matter of time so it
makes sense to facilitate it.

With the fix I get an ok win, to quote from the commit:
> Even at a very modest scale of 26 cores (ops/s):
> before: 133543.63
> after:  186061.81 (+39%)

> While with the patch these allocations remain a significant problem,
> the primary bottleneck shifts to:
> 
>     __pv_queued_spin_lock_slowpath+1
>     _raw_spin_lock_irqsave+57
>     folio_lruvec_lock_irqsave+91
>     release_pages+590
>     tlb_batch_pages_flush+61
>     tlb_finish_mmu+101
>     exit_mmap+327
>     __mmput+61
>     begin_new_exec+1245
>     load_elf_binary+712
>     bprm_execve+644
>     do_execveat_common.isra.0+429
>     __x64_sys_execve+50
>     do_syscall_64+46
>     entry_SYSCALL_64_after_hwframe+110

I intend to do more work on the area to mostly sort it out, but I would
not mind if someone else took the hammer to folio. :)

With this out of the way I'll be looking at some form of caching to
eliminate these allocs as a problem.

Thoughts?

Mateusz Guzik (2):
  pcpcntr: add group allocation/free
  fork: group allocation of per-cpu counters for mm struct

 include/linux/percpu_counter.h | 19 ++++++++---
 kernel/fork.c                  | 13 ++------
 lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
 3 files changed, 60 insertions(+), 33 deletions(-)

-- 
2.39.2

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/2] pcpcntr: add group allocation/free
  2023-08-21 20:28 [PATCH 0/2] execve scalability issues, part 1 Mateusz Guzik
@ 2023-08-21 20:28 ` Mateusz Guzik
  2023-08-22 13:37   ` Vegard Nossum
  2023-08-22 17:02   ` Dennis Zhou
  2023-08-21 20:28 ` [PATCH 2/2] fork: group allocation of per-cpu counters for mm struct Mateusz Guzik
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-21 20:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: dennis, tj, cl, akpm, shakeelb, linux-mm, Mateusz Guzik

Allocations and frees are globally serialized on the pcpu lock (and the
CPU hotplug lock if enabled, which is the case on Debian).

At least one frequent consumer allocates 4 back-to-back counters (and
frees them in the same manner), exacerbating the problem.

While this does not fully remedy scalability issues, it is a step
towards that goal and provides immediate relief.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 include/linux/percpu_counter.h | 19 ++++++++---
 lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
 2 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 75b73c83bc9d..ff5850b07124 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -30,17 +30,26 @@ struct percpu_counter {
 
 extern int percpu_counter_batch;
 
-int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
-			  struct lock_class_key *key);
+int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
+			  struct lock_class_key *key, u32 count);
 
-#define percpu_counter_init(fbc, value, gfp)				\
+#define percpu_counter_init_many(fbc, value, gfp, count)		\
 	({								\
 		static struct lock_class_key __key;			\
 									\
-		__percpu_counter_init(fbc, value, gfp, &__key);		\
+		__percpu_counter_init_many(fbc, value, gfp, &__key, count);\
 	})
 
-void percpu_counter_destroy(struct percpu_counter *fbc);
+
+#define percpu_counter_init(fbc, value, gfp)				\
+	percpu_counter_init_many(fbc, value, gfp, 1)
+
+void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 count);
+static inline void percpu_counter_destroy(struct percpu_counter *fbc)
+{
+	percpu_counter_destroy_many(fbc, 1);
+}
+
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 5004463c4f9f..2a33cf23df55 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -151,48 +151,73 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
-int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
-			  struct lock_class_key *key)
+int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
+			  struct lock_class_key *key, u32 count)
 {
 	unsigned long flags __maybe_unused;
+	s32 __percpu *counters;
+	u32 i;
 
-	raw_spin_lock_init(&fbc->lock);
-	lockdep_set_class(&fbc->lock, key);
-	fbc->count = amount;
-	fbc->counters = alloc_percpu_gfp(s32, gfp);
-	if (!fbc->counters)
+	counters = __alloc_percpu_gfp(sizeof(*counters) * count,
+				      sizeof(*counters), gfp);
+	if (!counters) {
+		fbc[0].counters = NULL;
 		return -ENOMEM;
+	}
 
-	debug_percpu_counter_activate(fbc);
+	for (i = 0; i < count; i++) {
+		raw_spin_lock_init(&fbc[i].lock);
+		lockdep_set_class(&fbc[i].lock, key);
+#ifdef CONFIG_HOTPLUG_CPU
+		INIT_LIST_HEAD(&fbc[i].list);
+#endif
+		fbc[i].count = amount;
+		fbc[i].counters = &counters[i];
+
+		debug_percpu_counter_activate(&fbc[i]);
+	}
 
 #ifdef CONFIG_HOTPLUG_CPU
-	INIT_LIST_HEAD(&fbc->list);
 	spin_lock_irqsave(&percpu_counters_lock, flags);
-	list_add(&fbc->list, &percpu_counters);
+	for (i = 0; i < count; i++) {
+		list_add(&fbc[i].list, &percpu_counters);
+	}
 	spin_unlock_irqrestore(&percpu_counters_lock, flags);
 #endif
 	return 0;
 }
-EXPORT_SYMBOL(__percpu_counter_init);
+EXPORT_SYMBOL(__percpu_counter_init_many);
 
-void percpu_counter_destroy(struct percpu_counter *fbc)
+void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 count)
 {
 	unsigned long flags __maybe_unused;
+	u32 i;
 
-	if (!fbc->counters)
+	if (WARN_ON_ONCE(!fbc))
 		return;
 
-	debug_percpu_counter_deactivate(fbc);
+	if (!fbc[0].counters)
+		return;
+
+	for (i = 0; i < count; i++) {
+		debug_percpu_counter_deactivate(&fbc[i]);
+	}
 
 #ifdef CONFIG_HOTPLUG_CPU
 	spin_lock_irqsave(&percpu_counters_lock, flags);
-	list_del(&fbc->list);
+	for (i = 0; i < count; i++) {
+		list_del(&fbc[i].list);
+	}
 	spin_unlock_irqrestore(&percpu_counters_lock, flags);
 #endif
-	free_percpu(fbc->counters);
-	fbc->counters = NULL;
+
+	free_percpu(fbc[0].counters);
+
+	for (i = 0; i < count; i++) {
+		fbc[i].counters = NULL;
+	}
 }
-EXPORT_SYMBOL(percpu_counter_destroy);
+EXPORT_SYMBOL(percpu_counter_destroy_many);
 
 int percpu_counter_batch __read_mostly = 32;
 EXPORT_SYMBOL(percpu_counter_batch);
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] pcpcntr: add group allocation/free
  2023-08-21 20:28 ` [PATCH 1/2] pcpcntr: add group allocation/free Mateusz Guzik
@ 2023-08-22 13:37   ` Vegard Nossum
  2023-08-22 14:06     ` Mateusz Guzik
  2023-08-22 17:02   ` Dennis Zhou
  1 sibling, 1 reply; 31+ messages in thread
From: Vegard Nossum @ 2023-08-22 13:37 UTC (permalink / raw)
  To: Mateusz Guzik, linux-kernel; +Cc: dennis, tj, cl, akpm, shakeelb, linux-mm


Testing out a review style with very detailed comments. Let me know if
you hate it. Review notes:

On 8/21/23 22:28, Mateusz Guzik wrote:
> Allocations and frees are globally serialized on the pcpu lock (and the
> CPU hotplug lock if enabled, which is the case on Debian).
> 
> At least one frequent consumer allocates 4 back-to-back counters (and
> frees them in the same manner), exacerbating the problem.
> 
> While this does not fully remedy scalability issues, it is a step
> towards that goal and provides immediate relief.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>   include/linux/percpu_counter.h | 19 ++++++++---
>   lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>   2 files changed, 57 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 75b73c83bc9d..ff5850b07124 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -30,17 +30,26 @@ struct percpu_counter {
>   
>   extern int percpu_counter_batch;
>   
> -int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
> -			  struct lock_class_key *key);
> +int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
> +			  struct lock_class_key *key, u32 count);

renaming and adding a u32 count argument

>   
> -#define percpu_counter_init(fbc, value, gfp)				\
> +#define percpu_counter_init_many(fbc, value, gfp, count)		\

adding a count argument

>   	({								\
>   		static struct lock_class_key __key;			\
>   									\
> -		__percpu_counter_init(fbc, value, gfp, &__key);		\
> +		__percpu_counter_init_many(fbc, value, gfp, &__key, count);\

renaming and passing count along

>   	})
>   
> -void percpu_counter_destroy(struct percpu_counter *fbc);
> +
> +#define percpu_counter_init(fbc, value, gfp)				\
> +	percpu_counter_init_many(fbc, value, gfp, 1)
> +
> +void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 count);
> +static inline void percpu_counter_destroy(struct percpu_counter *fbc)
> +{
> +	percpu_counter_destroy_many(fbc, 1);
> +}
> +

wrappers for the count == 1 case

>   void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
>   void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
>   			      s32 batch);
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 5004463c4f9f..2a33cf23df55 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -151,48 +151,73 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
>   }
>   EXPORT_SYMBOL(__percpu_counter_sum);
>   
> -int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
> -			  struct lock_class_key *key)
> +int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
> +			  struct lock_class_key *key, u32 count)
>   {
>   	unsigned long flags __maybe_unused;
> +	s32 __percpu *counters;
> +	u32 i;
>   
> -	raw_spin_lock_init(&fbc->lock);
> -	lockdep_set_class(&fbc->lock, key);
> -	fbc->count = amount;
> -	fbc->counters = alloc_percpu_gfp(s32, gfp);
> -	if (!fbc->counters)
> +	counters = __alloc_percpu_gfp(sizeof(*counters) * count,
> +				      sizeof(*counters), gfp);

The second argument here is the alignment. I see other callers using
__alignof__(type), which is what alloc_percpu_gfp() does as well. In
practice I think it shouldn't matter, but for clarity/consistency maybe
this should be __alignof__ as well?

Presumably multiplication overflow is not an issue here as it is with
kmalloc and friends since the count can't be controlled by userspace.

> +	if (!counters) {
> +		fbc[0].counters = NULL;
>   		return -ENOMEM;
> +	}

Checked that __alloc_percpu_gfp() returns NULL on failure.

Checked that nothing else before this in the function needs cleanup.

In the old code, fbc->count would have gotten initialized but it
shouldn't matter here, I think, as long as the counter is never activated.

I'm not sure why only fbc[0].counters is set to NULL, should this happen
for all the "count" members? [PS: percpu_counter_destroy_many() below
has a check for fbc[0].counters]

>   
> -	debug_percpu_counter_activate(fbc);
> +	for (i = 0; i < count; i++) {
> +		raw_spin_lock_init(&fbc[i].lock);
> +		lockdep_set_class(&fbc[i].lock, key);
> +#ifdef CONFIG_HOTPLUG_CPU
> +		INIT_LIST_HEAD(&fbc[i].list);
> +#endif
> +		fbc[i].count = amount;
> +		fbc[i].counters = &counters[i];
> +
> +		debug_percpu_counter_activate(&fbc[i]);

Checked that this can't return an error.

> +	}
>   
>   #ifdef CONFIG_HOTPLUG_CPU
> -	INIT_LIST_HEAD(&fbc->list);
>   	spin_lock_irqsave(&percpu_counters_lock, flags);
> -	list_add(&fbc->list, &percpu_counters);
> +	for (i = 0; i < count; i++) {
> +		list_add(&fbc[i].list, &percpu_counters);
> +	}
>   	spin_unlock_irqrestore(&percpu_counters_lock, flags);
>   #endif

each counter is added to the list while the spinlock is held

>   	return 0;

Nothing here can fail after the initial allocation so no cleanup/error
handling is needed before returning.

>   }
> -EXPORT_SYMBOL(__percpu_counter_init);
> +EXPORT_SYMBOL(__percpu_counter_init_many);
>   
> -void percpu_counter_destroy(struct percpu_counter *fbc)
> +void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 count)
>   {
>   	unsigned long flags __maybe_unused;
> +	u32 i;
>   
> -	if (!fbc->counters)
> +	if (WARN_ON_ONCE(!fbc))
>   		return;

This change is misleading, but correct; the WARN_ON_ONCE() is newly
added and the old check is modified below:

>   
> -	debug_percpu_counter_deactivate(fbc);
> +	if (!fbc[0].counters)
> +		return;

(this explains why only fbc[0] was NULL-ed out above in the allocation
function...)

> +
> +	for (i = 0; i < count; i++) {
> +		debug_percpu_counter_deactivate(&fbc[i]);
> +	}

Double checked that _activate() was not called in the cases where we
would return early from this function.

>   
>   #ifdef CONFIG_HOTPLUG_CPU
>   	spin_lock_irqsave(&percpu_counters_lock, flags);
> -	list_del(&fbc->list);
> +	for (i = 0; i < count; i++) {
> +		list_del(&fbc[i].list);
> +	}
>   	spin_unlock_irqrestore(&percpu_counters_lock, flags);
>   #endif
> -	free_percpu(fbc->counters);
> -	fbc->counters = NULL;
> +
> +	free_percpu(fbc[0].counters);
> +
> +	for (i = 0; i < count; i++) {
> +		fbc[i].counters = NULL;
> +	}
>   }

Looks correct to me; fbc[0].counters is the actual allocation so only
that gets passed to free_percpu().

> -EXPORT_SYMBOL(percpu_counter_destroy);
> +EXPORT_SYMBOL(percpu_counter_destroy_many);
>   
>   int percpu_counter_batch __read_mostly = 32;
>   EXPORT_SYMBOL(percpu_counter_batch);

Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>

In summary, my only slight concern is sizeof(*counters) being passed as
the alignment to __alloc_percpu_gfp() when maybe it would be more
appropriate to pass __alignof__() -- not that it makes a difference at
runtime since both are 4 for s32.

One other thing: I find it a bit odd that the "amount" parameter
(initial value) is s64 when the counters themselves are s32. Maybe just
a leftover from an old version?


Vegard


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] pcpcntr: add group allocation/free
  2023-08-22 13:37   ` Vegard Nossum
@ 2023-08-22 14:06     ` Mateusz Guzik
  0 siblings, 0 replies; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-22 14:06 UTC (permalink / raw)
  To: Vegard Nossum; +Cc: linux-kernel, dennis, tj, cl, akpm, shakeelb, linux-mm

On 8/22/23, Vegard Nossum <vegard.nossum@oracle.com> wrote:
>
> Testing out a review style with very detailed comments. Let me know if
> you hate it. Review notes:
>

I do, very noisy and I don't think it adds any value. ;)

If something like this becomes the norm I'm confident people are going
to start skimming responses to their mail, as opposed to reading them.

But then again, I had serious disagreement with review folk in the past, so...

> On 8/21/23 22:28, Mateusz Guzik wrote:
>> +	counters = __alloc_percpu_gfp(sizeof(*counters) * count,
>> +				      sizeof(*counters), gfp);
>
> The second argument here is the alignment. I see other callers using
> __alignof__(type), which is what alloc_percpu_gfp() does as well. In
> practice I think it shouldn't matter, but for clarity/consistency maybe
> this should be __alignof__ as well?
>

Ye, I neglected to patch it up after a whipping out a PoC.

> Presumably multiplication overflow is not an issue here as it is with
> kmalloc and friends since the count can't be controlled by userspace.
>

I wanted to assert on the count being sensible to begin with, but
there is no general "only assert with debug enabled" macro. Perhaps a
warn_once + bail out would be preferred?

>> +	if (!counters) {
>> +		fbc[0].counters = NULL;
>>   		return -ENOMEM;
>> +	}
>
> Checked that __alloc_percpu_gfp() returns NULL on failure.
>
> Checked that nothing else before this in the function needs cleanup.
>
> In the old code, fbc->count would have gotten initialized but it
> shouldn't matter here, I think, as long as the counter is never activated.
>
> I'm not sure why only fbc[0].counters is set to NULL, should this happen
> for all the "count" members? [PS: percpu_counter_destroy_many() below
> has a check for fbc[0].counters]
>

Consumers looked fishy to me with zeroing the counter prior to calling
the init func.

I added the NULL assignment so that a call to destroy does nothing.

> In summary, my only slight concern is sizeof(*counters) being passed as
> the alignment to __alloc_percpu_gfp() when maybe it would be more
> appropriate to pass __alignof__() -- not that it makes a difference at
> runtime since both are 4 for s32.
>

Agreed, will patch later.

> One other thing: I find it a bit odd that the "amount" parameter
> (initial value) is s64 when the counters themselves are s32. Maybe just
> a leftover from an old version?
>

I don't know the reasoning by the authors, but seems a clear case to
me that the per-CPU counts were left int-sized to reduce memory usage
and reduce deviation between the central counter and the real state.

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] pcpcntr: add group allocation/free
  2023-08-21 20:28 ` [PATCH 1/2] pcpcntr: add group allocation/free Mateusz Guzik
  2023-08-22 13:37   ` Vegard Nossum
@ 2023-08-22 17:02   ` Dennis Zhou
  1 sibling, 0 replies; 31+ messages in thread
From: Dennis Zhou @ 2023-08-22 17:02 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: linux-kernel, tj, cl, akpm, shakeelb, linux-mm

Hi Mateusz,

On Mon, Aug 21, 2023 at 10:28:28PM +0200, Mateusz Guzik wrote:
> Allocations and frees are globally serialized on the pcpu lock (and the
> CPU hotplug lock if enabled, which is the case on Debian).
> 
> At least one frequent consumer allocates 4 back-to-back counters (and
> frees them in the same manner), exacerbating the problem.
> 
> While this does not fully remedy scalability issues, it is a step
> towards that goal and provides immediate relief.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

As I mentioned yesterday, I like this approach because instead of making
percpu smarter, we're batching against the higher level inits which I'm
assuming will have additional synchronization requirements.

Below is mostly style changes to conform and a few nits wrt naming.

> ---
>  include/linux/percpu_counter.h | 19 ++++++++---
>  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>  2 files changed, 57 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 75b73c83bc9d..ff5850b07124 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -30,17 +30,26 @@ struct percpu_counter {
>  
>  extern int percpu_counter_batch;
>  
> -int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
> -			  struct lock_class_key *key);
> +int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
> +			  struct lock_class_key *key, u32 count);

1. Can we move count to before the lock_class_key?
2. count is an overloaded term between percpu_counters and
percpu_refcount. Maybe `nr` or `nr_counters` is better?

>  
> -#define percpu_counter_init(fbc, value, gfp)				\
> +#define percpu_counter_init_many(fbc, value, gfp, count)		\
>  	({								\
>  		static struct lock_class_key __key;			\
>  									\
> -		__percpu_counter_init(fbc, value, gfp, &__key);		\
> +		__percpu_counter_init_many(fbc, value, gfp, &__key, count);\
>  	})
>  
> -void percpu_counter_destroy(struct percpu_counter *fbc);
> +
> +#define percpu_counter_init(fbc, value, gfp)				\
> +	percpu_counter_init_many(fbc, value, gfp, 1)
> +
> +void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 count);
> +static inline void percpu_counter_destroy(struct percpu_counter *fbc)
> +{
> +	percpu_counter_destroy_many(fbc, 1);
> +}
> +
>  void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
>  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
>  			      s32 batch);
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 5004463c4f9f..2a33cf23df55 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -151,48 +151,73 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
>  }
>  EXPORT_SYMBOL(__percpu_counter_sum);
>  
> -int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
> -			  struct lock_class_key *key)
> +int __percpu_counter_init_many(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
> +			  struct lock_class_key *key, u32 count)
>  {
>  	unsigned long flags __maybe_unused;
> +	s32 __percpu *counters;
> +	u32 i;
>  
> -	raw_spin_lock_init(&fbc->lock);
> -	lockdep_set_class(&fbc->lock, key);
> -	fbc->count = amount;
> -	fbc->counters = alloc_percpu_gfp(s32, gfp);
> -	if (!fbc->counters)
> +	counters = __alloc_percpu_gfp(sizeof(*counters) * count,
> +				      sizeof(*counters), gfp);


So while a bit moot, I think it'd be nice to clarify what we should do
here wrt alignment and batch allocation. There has been a case in the
past where alignment > size.

This is from my batch api implementation:
+       /*
+        * When allocating a batch we need to align the size so that each
+        * element in the batch will have the appropriate alignment.
+        */
+       size = ALIGN(size, align);

So I think some variation of:
    element_size = ALIGN(sizeof(*counters, __alignof__(*counters)));
    counters = __alloc_percpu_gfp(nr * element_size, __alignof__(*counters), gfp);

While this isn't necessary here for s32's, I think it would be nice to
just set the precedent so we preserve alignment asks for future users if
they use this as a template.

> +	if (!counters) {
> +		fbc[0].counters = NULL;
>  		return -ENOMEM;
> +	}
>  
> -	debug_percpu_counter_activate(fbc);
> +	for (i = 0; i < count; i++) {
> +		raw_spin_lock_init(&fbc[i].lock);
> +		lockdep_set_class(&fbc[i].lock, key);
> +#ifdef CONFIG_HOTPLUG_CPU
> +		INIT_LIST_HEAD(&fbc[i].list);
> +#endif
> +		fbc[i].count = amount;
> +		fbc[i].counters = &counters[i];

This would have to become some (void *) math tho with element_size.

> +
> +		debug_percpu_counter_activate(&fbc[i]);
> +	}
>  
>  #ifdef CONFIG_HOTPLUG_CPU
> -	INIT_LIST_HEAD(&fbc->list);
>  	spin_lock_irqsave(&percpu_counters_lock, flags);
> -	list_add(&fbc->list, &percpu_counters);
> +	for (i = 0; i < count; i++) {
> +		list_add(&fbc[i].list, &percpu_counters);
> +	}

nit: we don't add {} for single line loops.

>  	spin_unlock_irqrestore(&percpu_counters_lock, flags);
>  #endif
>  	return 0;
>  }
> -EXPORT_SYMBOL(__percpu_counter_init);
> +EXPORT_SYMBOL(__percpu_counter_init_many);
>  
> -void percpu_counter_destroy(struct percpu_counter *fbc)
> +void percpu_counter_destroy_many(struct percpu_counter *fbc, u32 count)
>  {
>  	unsigned long flags __maybe_unused;
> +	u32 i;
>  
> -	if (!fbc->counters)
> +	if (WARN_ON_ONCE(!fbc))
>  		return;
>  
> -	debug_percpu_counter_deactivate(fbc);
> +	if (!fbc[0].counters)
> +		return;
> +
> +	for (i = 0; i < count; i++) {
> +		debug_percpu_counter_deactivate(&fbc[i]);
> +	}
>  

nit: no {}.

>  #ifdef CONFIG_HOTPLUG_CPU
>  	spin_lock_irqsave(&percpu_counters_lock, flags);
> -	list_del(&fbc->list);
> +	for (i = 0; i < count; i++) {
> +		list_del(&fbc[i].list);
> +	}

nit: no {}.

>  	spin_unlock_irqrestore(&percpu_counters_lock, flags);
>  #endif
> -	free_percpu(fbc->counters);
> -	fbc->counters = NULL;
> +
> +	free_percpu(fbc[0].counters);
> +
> +	for (i = 0; i < count; i++) {
> +		fbc[i].counters = NULL;
> +	}

nit: no {}.

>  }
> -EXPORT_SYMBOL(percpu_counter_destroy);
> +EXPORT_SYMBOL(percpu_counter_destroy_many);
>  
>  int percpu_counter_batch __read_mostly = 32;
>  EXPORT_SYMBOL(percpu_counter_batch);
> -- 
> 2.39.2
> 

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 2/2] fork: group allocation of per-cpu counters for mm struct
  2023-08-21 20:28 [PATCH 0/2] execve scalability issues, part 1 Mateusz Guzik
  2023-08-21 20:28 ` [PATCH 1/2] pcpcntr: add group allocation/free Mateusz Guzik
@ 2023-08-21 20:28 ` Mateusz Guzik
  2023-08-21 21:20   ` Matthew Wilcox
  2023-08-21 20:42 ` [PATCH 0/2] execve scalability issues, part 1 Matthew Wilcox
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-21 20:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: dennis, tj, cl, akpm, shakeelb, linux-mm, Mateusz Guzik

A trivial execve scalability test which tries to be very friendly
(statically linked binaries, all separate) is predominantly bottlenecked
by back-to-back per-cpu counter allocations which serialize on global
locks.

Ease the pain by allocating and freeing them in one go.

Bench can be found here:
http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)

Even at a very modest scale of 26 cores (ops/s):
before:	133543.63
after:	186061.81 (+39%)

While with the patch these allocations remain a significant problem,
the primary bottleneck shifts to:

    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irqsave+57
    folio_lruvec_lock_irqsave+91
    release_pages+590
    tlb_batch_pages_flush+61
    tlb_finish_mmu+101
    exit_mmap+327
    __mmput+61
    begin_new_exec+1245
    load_elf_binary+712
    bprm_execve+644
    do_execveat_common.isra.0+429
    __x64_sys_execve+50
    do_syscall_64+46
    entry_SYSCALL_64_after_hwframe+110

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 kernel/fork.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d2e12b6d2b18..86ff78e001c1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -909,8 +909,6 @@ static void cleanup_lazy_tlbs(struct mm_struct *mm)
  */
 void __mmdrop(struct mm_struct *mm)
 {
-	int i;
-
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
 
@@ -925,9 +923,8 @@ void __mmdrop(struct mm_struct *mm)
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
+	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 
-	for (i = 0; i < NR_MM_COUNTERS; i++)
-		percpu_counter_destroy(&mm->rss_stat[i]);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1252,7 +1249,6 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
-	int i;
 
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
@@ -1301,17 +1297,14 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm))
 		goto fail_cid;
 
-	for (i = 0; i < NR_MM_COUNTERS; i++)
-		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
-			goto fail_pcpu;
+	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT, NR_MM_COUNTERS))
+		goto fail_pcpu;
 
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
 fail_pcpu:
-	while (i > 0)
-		percpu_counter_destroy(&mm->rss_stat[--i]);
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] fork: group allocation of per-cpu counters for mm struct
  2023-08-21 20:28 ` [PATCH 2/2] fork: group allocation of per-cpu counters for mm struct Mateusz Guzik
@ 2023-08-21 21:20   ` Matthew Wilcox
  0 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox @ 2023-08-21 21:20 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: linux-kernel, dennis, tj, cl, akpm, shakeelb, linux-mm

On Mon, Aug 21, 2023 at 10:28:29PM +0200, Mateusz Guzik wrote:
> While with the patch these allocations remain a significant problem,
> the primary bottleneck shifts to:
> 
>     __pv_queued_spin_lock_slowpath+1
>     _raw_spin_lock_irqsave+57
>     folio_lruvec_lock_irqsave+91
>     release_pages+590
>     tlb_batch_pages_flush+61
>     tlb_finish_mmu+101
>     exit_mmap+327
>     __mmput+61
>     begin_new_exec+1245
>     load_elf_binary+712
>     bprm_execve+644
>     do_execveat_common.isra.0+429
>     __x64_sys_execve+50

Looking at this more closely, I don't think the patches I sent are going
to help much.  I'd say the primary problem you have is that you're trying
to free _a lot_ of pages at once on all CPUs.  Since it's the exit_mmap()
path, these are going to be the anonymous pages allocated to this task
(not the file pages it has mmaped).  The large anonymous folios work may
help you out here by decreasing the number of folios we have to manage,
and thus the length of time the LRU lock has to be held for.  It's not
an immediate solution, but I think it'll do the job once it lands.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-21 20:28 [PATCH 0/2] execve scalability issues, part 1 Mateusz Guzik
  2023-08-21 20:28 ` [PATCH 1/2] pcpcntr: add group allocation/free Mateusz Guzik
  2023-08-21 20:28 ` [PATCH 2/2] fork: group allocation of per-cpu counters for mm struct Mateusz Guzik
@ 2023-08-21 20:42 ` Matthew Wilcox
  2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
  2023-08-21 21:07 ` [PATCH 0/2] execve scalability issues, part 1 Dennis Zhou
  2023-08-26 18:33 ` Mateusz Guzik
  4 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2023-08-21 20:42 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: linux-kernel, dennis, tj, cl, akpm, shakeelb, linux-mm

On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> > While with the patch these allocations remain a significant problem,
> > the primary bottleneck shifts to:
> > 
> >     __pv_queued_spin_lock_slowpath+1
> >     _raw_spin_lock_irqsave+57
> >     folio_lruvec_lock_irqsave+91
> >     release_pages+590
> >     tlb_batch_pages_flush+61
> >     tlb_finish_mmu+101
> >     exit_mmap+327
> >     __mmput+61
> >     begin_new_exec+1245
> >     load_elf_binary+712
> >     bprm_execve+644
> >     do_execveat_common.isra.0+429
> >     __x64_sys_execve+50
> >     do_syscall_64+46
> >     entry_SYSCALL_64_after_hwframe+110
> 
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)

Funny you should ask ... these patches are from ~3 weeks ago.  They may
or may not apply to a current tree, but I'm working on this area.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/7] mm: Make folios_put() the basis of release_pages()
  2023-08-21 20:42 ` [PATCH 0/2] execve scalability issues, part 1 Matthew Wilcox
@ 2023-08-21 20:44   ` Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 2/7] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
                       ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:44 UTC (permalink / raw)
  To: Mateusz Guzik, linux-kernel, dennis, tj, cl, akpm, shakeelb,
	linux-mm
  Cc: Matthew Wilcox (Oracle)

By making release_pages() call folios_put(), we can get rid of the calls
to compound_head() for the callers that already know they have folios.
We can also get rid of the lock_batch tracking as we know the size of
the batch is limited by folio_batch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h | 19 ++----------
 mm/mlock.c         |  2 +-
 mm/swap.c          | 77 +++++++++++++++++++++++++++-------------------
 3 files changed, 48 insertions(+), 50 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 85568e2b2556..c1fc81da8dff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,6 +36,7 @@ struct anon_vma;
 struct anon_vma_chain;
 struct user_struct;
 struct pt_regs;
+struct folio_batch;
 
 extern int sysctl_page_lock_unfairness;
 
@@ -1497,23 +1498,7 @@ typedef union {
 } release_pages_arg __attribute__ ((__transparent_union__));
 
 void release_pages(release_pages_arg, int nr);
-
-/**
- * folios_put - Decrement the reference count on an array of folios.
- * @folios: The folios.
- * @nr: How many folios there are.
- *
- * Like folio_put(), but for an array of folios.  This is more efficient
- * than writing the loop yourself as it will optimise the locks which
- * need to be taken if the folios are freed.
- *
- * Context: May be called in process or interrupt context, but not in NMI
- * context.  May be called while holding a spinlock.
- */
-static inline void folios_put(struct folio **folios, unsigned int nr)
-{
-	release_pages(folios, nr);
-}
+void folios_put(struct folio_batch *folios);
 
 static inline void put_page(struct page *page)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c21..67ec840cf5f1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -206,7 +206,7 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
 
 	if (lruvec)
 		unlock_page_lruvec_irq(lruvec);
-	folios_put(fbatch->folios, folio_batch_count(fbatch));
+	folios_put(fbatch);
 	folio_batch_reinit(fbatch);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..11ca25d4843f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -221,7 +221,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
-	folios_put(fbatch->folios, folio_batch_count(fbatch));
+	folios_put(fbatch);
 	folio_batch_reinit(fbatch);
 }
 
@@ -946,41 +946,25 @@ void lru_cache_disable(void)
 }
 
 /**
- * release_pages - batched put_page()
- * @arg: array of pages to release
- * @nr: number of pages
+ * folios_put - Decrement the reference count on a batch of folios.
+ * @folios: The folios.
  *
- * Decrement the reference count on all the pages in @arg.  If it
- * fell to zero, remove the page from the LRU and free it.
+ * Like folio_put(), but for a batch of folios.  This is more efficient
+ * than writing the loop yourself as it will optimise the locks which
+ * need to be taken if the folios are freed.
  *
- * Note that the argument can be an array of pages, encoded pages,
- * or folio pointers. We ignore any encoded bits, and turn any of
- * them into just a folio that gets free'd.
+ * Context: May be called in process or interrupt context, but not in NMI
+ * context.  May be called while holding a spinlock.
  */
-void release_pages(release_pages_arg arg, int nr)
+void folios_put(struct folio_batch *folios)
 {
 	int i;
-	struct encoded_page **encoded = arg.encoded_pages;
 	LIST_HEAD(pages_to_free);
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
-	unsigned int lock_batch;
 
-	for (i = 0; i < nr; i++) {
-		struct folio *folio;
-
-		/* Turn any of the argument types into a folio */
-		folio = page_folio(encoded_page_ptr(encoded[i]));
-
-		/*
-		 * Make sure the IRQ-safe lock-holding time does not get
-		 * excessive with a continuous string of pages from the
-		 * same lruvec. The lock is held only if lruvec != NULL.
-		 */
-		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
-			unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = NULL;
-		}
+	for (i = 0; i < folios->nr; i++) {
+		struct folio *folio = folios->folios[i];
 
 		if (is_huge_zero_page(&folio->page))
 			continue;
@@ -1010,13 +994,8 @@ void release_pages(release_pages_arg arg, int nr)
 		}
 
 		if (folio_test_lru(folio)) {
-			struct lruvec *prev_lruvec = lruvec;
-
 			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
 									&flags);
-			if (prev_lruvec != lruvec)
-				lock_batch = 0;
-
 			lruvec_del_folio(lruvec, folio);
 			__folio_clear_lru_flags(folio);
 		}
@@ -1041,6 +1020,40 @@ void release_pages(release_pages_arg arg, int nr)
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
 }
+EXPORT_SYMBOL(folios_put);
+
+/**
+ * release_pages - batched put_page()
+ * @arg: array of pages to release
+ * @nr: number of pages
+ *
+ * Decrement the reference count on all the pages in @arg.  If it
+ * fell to zero, remove the page from the LRU and free it.
+ *
+ * Note that the argument can be an array of pages, encoded pages,
+ * or folio pointers. We ignore any encoded bits, and turn any of
+ * them into just a folio that gets free'd.
+ */
+void release_pages(release_pages_arg arg, int nr)
+{
+	struct folio_batch fbatch;
+	struct encoded_page **encoded = arg.encoded_pages;
+	int i;
+
+	folio_batch_init(&fbatch);
+	for (i = 0; i < nr; i++) {
+		/* Turn any of the argument types into a folio */
+		struct folio *folio = page_folio(encoded_page_ptr(encoded[i]));
+
+		if (folio_batch_add(&fbatch, folio) > 0)
+			continue;
+		folios_put(&fbatch);
+		fbatch.nr = 0;
+	}
+
+	if (fbatch.nr)
+		folios_put(&fbatch);
+}
 EXPORT_SYMBOL(release_pages);
 
 /*
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/7] mm: Convert free_unref_page_list() to use folios
  2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
@ 2023-08-21 20:44     ` Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 3/7] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:44 UTC (permalink / raw)
  To: Mateusz Guzik, linux-kernel, dennis, tj, cl, akpm, shakeelb,
	linux-mm
  Cc: Matthew Wilcox (Oracle)

Most of its callees are not yet ready to accept a folio, but we know
all of the pages passed in are actually folios because they're linked
through ->lru.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/page_alloc.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81b1c7e3a28b..2f2185929fcb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2498,17 +2498,17 @@ void free_unref_page(struct page *page, unsigned int order)
 void free_unref_page_list(struct list_head *list)
 {
 	unsigned long __maybe_unused UP_flags;
-	struct page *page, *next;
+	struct folio *folio, *next;
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
 	int batch_count = 0;
 	int migratetype;
 
 	/* Prepare pages for freeing */
-	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_to_pfn(page);
-		if (!free_unref_page_prepare(page, pfn, 0)) {
-			list_del(&page->lru);
+	list_for_each_entry_safe(folio, next, list, lru) {
+		unsigned long pfn = folio_pfn(folio);
+		if (!free_unref_page_prepare(&folio->page, pfn, 0)) {
+			list_del(&folio->lru);
 			continue;
 		}
 
@@ -2516,24 +2516,25 @@ void free_unref_page_list(struct list_head *list)
 		 * Free isolated pages directly to the allocator, see
 		 * comment in free_unref_page.
 		 */
-		migratetype = get_pcppage_migratetype(page);
+		migratetype = get_pcppage_migratetype(&folio->page);
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
+			list_del(&folio->lru);
+			free_one_page(folio_zone(folio), &folio->page, pfn,
+					0, migratetype, FPI_NONE);
 			continue;
 		}
 	}
 
-	list_for_each_entry_safe(page, next, list, lru) {
-		struct zone *zone = page_zone(page);
+	list_for_each_entry_safe(folio, next, list, lru) {
+		struct zone *zone = folio_zone(folio);
 
-		list_del(&page->lru);
-		migratetype = get_pcppage_migratetype(page);
+		list_del(&folio->lru);
+		migratetype = get_pcppage_migratetype(&folio->page);
 
 		/*
 		 * Either different zone requiring a different pcp lock or
 		 * excessive lock hold times when freeing a large list of
-		 * pages.
+		 * folios.
 		 */
 		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
 			if (pcp) {
@@ -2544,15 +2545,16 @@ void free_unref_page_list(struct list_head *list)
 			batch_count = 0;
 
 			/*
-			 * trylock is necessary as pages may be getting freed
+			 * trylock is necessary as folios may be getting freed
 			 * from IRQ or SoftIRQ context after an IO completion.
 			 */
 			pcp_trylock_prepare(UP_flags);
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, page_to_pfn(page),
-					      0, migratetype, FPI_NONE);
+				free_one_page(zone, &folio->page,
+						folio_pfn(folio), 0,
+						migratetype, FPI_NONE);
 				locked_zone = NULL;
 				continue;
 			}
@@ -2566,8 +2568,8 @@ void free_unref_page_list(struct list_head *list)
 		if (unlikely(migratetype >= MIGRATE_PCPTYPES))
 			migratetype = MIGRATE_MOVABLE;
 
-		trace_mm_page_free_batched(page);
-		free_unref_page_commit(zone, pcp, page, migratetype, 0);
+		trace_mm_page_free_batched(&folio->page);
+		free_unref_page_commit(zone, pcp, &folio->page, migratetype, 0);
 		batch_count++;
 	}
 
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/7] mm: Add free_unref_folios()
  2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 2/7] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
@ 2023-08-21 20:44     ` Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 4/7] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:44 UTC (permalink / raw)
  To: Mateusz Guzik, linux-kernel, dennis, tj, cl, akpm, shakeelb,
	linux-mm
  Cc: Matthew Wilcox (Oracle)

Iterate over a folio_batch rather than a linked list.  This is
easier for the CPU to prefetch and has a batch count naturally
built in so we don't need to track it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h   |  5 +++--
 mm/page_alloc.c | 59 ++++++++++++++++++++++++++++++-------------------
 2 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5c777b6779fa..3e6b448e7d63 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -440,8 +440,9 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
-extern void free_unref_page(struct page *page, unsigned int order);
-extern void free_unref_page_list(struct list_head *list);
+void free_unref_page(struct page *page, unsigned int order);
+void free_unref_folios(struct folio_batch *fbatch);
+void free_unref_page_list(struct list_head *list);
 
 extern void zone_pcp_reset(struct zone *zone);
 extern void zone_pcp_disable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f2185929fcb..4354938ca3b0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -32,6 +32,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
+#include <linux/pagevec.h>
 #include <linux/memory_hotplug.h>
 #include <linux/nodemask.h>
 #include <linux/vmstat.h>
@@ -2493,24 +2494,21 @@ void free_unref_page(struct page *page, unsigned int order)
 }
 
 /*
- * Free a list of 0-order pages
+ * Free a batch of 0-order pages
  */
-void free_unref_page_list(struct list_head *list)
+void free_unref_folios(struct folio_batch *folios)
 {
 	unsigned long __maybe_unused UP_flags;
-	struct folio *folio, *next;
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
-	int batch_count = 0;
-	int migratetype;
+	int i, j, migratetype;
 
-	/* Prepare pages for freeing */
-	list_for_each_entry_safe(folio, next, list, lru) {
+	/* Prepare folios for freeing */
+	for (i = 0, j = 0; i < folios->nr; i++) {
+		struct folio *folio = folios->folios[i];
 		unsigned long pfn = folio_pfn(folio);
-		if (!free_unref_page_prepare(&folio->page, pfn, 0)) {
-			list_del(&folio->lru);
+		if (!free_unref_page_prepare(&folio->page, pfn, 0))
 			continue;
-		}
 
 		/*
 		 * Free isolated pages directly to the allocator, see
@@ -2518,34 +2516,31 @@ void free_unref_page_list(struct list_head *list)
 		 */
 		migratetype = get_pcppage_migratetype(&folio->page);
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			list_del(&folio->lru);
 			free_one_page(folio_zone(folio), &folio->page, pfn,
 					0, migratetype, FPI_NONE);
 			continue;
 		}
+		if (j != i)
+			folios->folios[j] = folio;
+		j++;
 	}
+	folios->nr = j;
 
-	list_for_each_entry_safe(folio, next, list, lru) {
+	for (i = 0; i < folios->nr; i++) {
+		struct folio *folio = folios->folios[i];
 		struct zone *zone = folio_zone(folio);
 
-		list_del(&folio->lru);
 		migratetype = get_pcppage_migratetype(&folio->page);
 
-		/*
-		 * Either different zone requiring a different pcp lock or
-		 * excessive lock hold times when freeing a large list of
-		 * folios.
-		 */
-		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
+		/* Different zone requires a different pcp lock */
+		if (zone != locked_zone) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
 				pcp_trylock_finish(UP_flags);
 			}
 
-			batch_count = 0;
-
 			/*
-			 * trylock is necessary as folios may be getting freed
+			 * trylock is necessary as pages may be getting freed
 			 * from IRQ or SoftIRQ context after an IO completion.
 			 */
 			pcp_trylock_prepare(UP_flags);
@@ -2570,7 +2565,6 @@ void free_unref_page_list(struct list_head *list)
 
 		trace_mm_page_free_batched(&folio->page);
 		free_unref_page_commit(zone, pcp, &folio->page, migratetype, 0);
-		batch_count++;
 	}
 
 	if (pcp) {
@@ -2579,6 +2573,25 @@ void free_unref_page_list(struct list_head *list)
 	}
 }
 
+void free_unref_page_list(struct list_head *list)
+{
+	struct folio_batch fbatch;
+
+	folio_batch_init(&fbatch);
+	while (!list_empty(list)) {
+		struct folio *folio = list_first_entry(list, struct folio, lru);
+
+		list_del(&folio->lru);
+		if (folio_batch_add(&fbatch, folio) > 0)
+			continue;
+		free_unref_folios(&fbatch);
+		fbatch.nr = 0;
+	}
+
+	if (fbatch.nr)
+		free_unref_folios(&fbatch);
+}
+
 /*
  * split_page takes a non-compound higher-order page, and splits it into
  * n (1<<order) sub-pages: page[0..n]
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/7] mm: Use folios_put() in __folio_batch_release()
  2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 2/7] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 3/7] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
@ 2023-08-21 20:44     ` Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 5/7] memcg: Add mem_cgroup_uncharge_batch() Matthew Wilcox (Oracle)
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:44 UTC (permalink / raw)
  To: Mateusz Guzik, linux-kernel, dennis, tj, cl, akpm, shakeelb,
	linux-mm
  Cc: Matthew Wilcox (Oracle)

There's no need to indirect through release_pages() and iterate
over this batch of folios an extra time; we can just use the batch
that we have.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/swap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 11ca25d4843f..9d31185dc27b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1072,7 +1072,7 @@ void __folio_batch_release(struct folio_batch *fbatch)
 		lru_add_drain();
 		fbatch->percpu_pvec_drained = true;
 	}
-	release_pages(fbatch->folios, folio_batch_count(fbatch));
+	folios_put(fbatch);
 	folio_batch_reinit(fbatch);
 }
 EXPORT_SYMBOL(__folio_batch_release);
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/7] memcg: Add mem_cgroup_uncharge_batch()
  2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
                       ` (2 preceding siblings ...)
  2023-08-21 20:44     ` [PATCH 4/7] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
@ 2023-08-21 20:44     ` Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 6/7] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 7/7] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
  5 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:44 UTC (permalink / raw)
  To: Mateusz Guzik, linux-kernel, dennis, tj, cl, akpm, shakeelb,
	linux-mm
  Cc: Matthew Wilcox (Oracle)

Almost identical to mem_cgroup_uncharge_list(), except it takes a
folio_batch instead of a list_head.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/memcontrol.h | 12 ++++++++++++
 mm/memcontrol.c            | 13 +++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3bd00f224224..4ea6f8399b05 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -708,6 +708,14 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
 	__mem_cgroup_uncharge_list(page_list);
 }
 
+void __mem_cgroup_uncharge_batch(struct folio_batch *fbatch);
+static inline void mem_cgroup_uncharge_batch(struct folio_batch *fbatch)
+{
+	if (mem_cgroup_disabled())
+		return;
+	__mem_cgroup_uncharge_batch(fbatch);
+}
+
 void mem_cgroup_migrate(struct folio *old, struct folio *new);
 
 /**
@@ -1269,6 +1277,10 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
 {
 }
 
+static inline void mem_cgroup_uncharge_batch(struct folio_batch *fbatch)
+{
+}
+
 static inline void mem_cgroup_migrate(struct folio *old, struct folio *new)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 67bda1ceedbe..205aa28c2672 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -33,6 +33,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/vm_event_item.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
@@ -7194,6 +7195,18 @@ void __mem_cgroup_uncharge_list(struct list_head *page_list)
 		uncharge_batch(&ug);
 }
 
+void __mem_cgroup_uncharge_batch(struct folio_batch *fbatch)
+{
+	struct uncharge_gather ug;
+	unsigned int i;
+
+	uncharge_gather_clear(&ug);
+	for (i = 0; i < fbatch->nr; i++)
+		uncharge_folio(fbatch->folios[i], &ug);
+	if (ug.memcg)
+		uncharge_batch(&ug);
+}
+
 /**
  * mem_cgroup_migrate - Charge a folio's replacement.
  * @old: Currently circulating folio.
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/7] mm: Remove use of folio list from folios_put()
  2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
                       ` (3 preceding siblings ...)
  2023-08-21 20:44     ` [PATCH 5/7] memcg: Add mem_cgroup_uncharge_batch() Matthew Wilcox (Oracle)
@ 2023-08-21 20:44     ` Matthew Wilcox (Oracle)
  2023-08-21 20:44     ` [PATCH 7/7] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
  5 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:44 UTC (permalink / raw)
  To: Mateusz Guzik, linux-kernel, dennis, tj, cl, akpm, shakeelb,
	linux-mm
  Cc: Matthew Wilcox (Oracle)

Instead of putting the interesting folios on a list, delete the
uninteresting one from the folio_batch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/swap.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 9d31185dc27b..8ebb10d44de7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -958,12 +958,11 @@ void lru_cache_disable(void)
  */
 void folios_put(struct folio_batch *folios)
 {
-	int i;
-	LIST_HEAD(pages_to_free);
+	int i, j;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
-	for (i = 0; i < folios->nr; i++) {
+	for (i = 0, j = 0; i < folios->nr; i++) {
 		struct folio *folio = folios->folios[i];
 
 		if (is_huge_zero_page(&folio->page))
@@ -1012,13 +1011,18 @@ void folios_put(struct folio_batch *folios)
 			count_vm_event(UNEVICTABLE_PGCLEARED);
 		}
 
-		list_add(&folio->lru, &pages_to_free);
+		if (j != i)
+			folios->folios[j] = folio;
+		j++;
 	}
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
+	folios->nr = j;
+	if (!j)
+		return;
 
-	mem_cgroup_uncharge_list(&pages_to_free);
-	free_unref_page_list(&pages_to_free);
+	mem_cgroup_uncharge_batch(folios);
+	free_unref_folios(folios);
 }
 EXPORT_SYMBOL(folios_put);
 
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/7] mm: Use free_unref_folios() in put_pages_list()
  2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
                       ` (4 preceding siblings ...)
  2023-08-21 20:44     ` [PATCH 6/7] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
@ 2023-08-21 20:44     ` Matthew Wilcox (Oracle)
  5 siblings, 0 replies; 31+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-08-21 20:44 UTC (permalink / raw)
  To: Mateusz Guzik, linux-kernel, dennis, tj, cl, akpm, shakeelb,
	linux-mm
  Cc: Matthew Wilcox (Oracle)

Break up the list of folios into batches here so that the
folios are more likely to be cache hot when doing the
rest of the processing.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/swap.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 8ebb10d44de7..51a613f477a9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -138,22 +138,26 @@ EXPORT_SYMBOL(__folio_put);
  */
 void put_pages_list(struct list_head *pages)
 {
-	struct folio *folio, *next;
+	struct folio_batch fbatch;
+	struct folio *folio;
 
-	list_for_each_entry_safe(folio, next, pages, lru) {
-		if (!folio_put_testzero(folio)) {
-			list_del(&folio->lru);
+	folio_batch_init(&fbatch);
+	list_for_each_entry(folio, pages, lru) {
+		if (!folio_put_testzero(folio))
 			continue;
-		}
 		if (folio_test_large(folio)) {
-			list_del(&folio->lru);
 			__folio_put_large(folio);
 			continue;
 		}
 		/* LRU flag must be clear because it's passed using the lru */
+		if (folio_batch_add(&fbatch, folio) > 0)
+			continue;
+		free_unref_folios(&fbatch);
+		fbatch.nr = 0;
 	}
 
-	free_unref_page_list(pages);
+	if (fbatch.nr)
+		free_unref_folios(&fbatch);
 	INIT_LIST_HEAD(pages);
 }
 EXPORT_SYMBOL(put_pages_list);
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-21 20:28 [PATCH 0/2] execve scalability issues, part 1 Mateusz Guzik
                   ` (2 preceding siblings ...)
  2023-08-21 20:42 ` [PATCH 0/2] execve scalability issues, part 1 Matthew Wilcox
@ 2023-08-21 21:07 ` Dennis Zhou
  2023-08-21 21:39   ` Mateusz Guzik
  2023-08-26 18:33 ` Mateusz Guzik
  4 siblings, 1 reply; 31+ messages in thread
From: Dennis Zhou @ 2023-08-21 21:07 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: linux-kernel, tj, cl, akpm, shakeelb, linux-mm

Hello,

On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> To start I figured I'm going to bench about as friendly case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
> 
> I borrowed the bench from found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
> 
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
> 
> It prints a result every second (warning: first line is garbage).
> 
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
> 
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
> 

Unfortunately it's taken me longer to get back to and I'm actually not
super happy with the results, I wrote a batch percpu allocation api.
It's better than the starting place, but I'm falling short on the free
path. I am/was also wrestling with the lifetime ideas (should these
lifetimes be percpus problem or call site bound like you've done).

What I like about this approach is you group the percpu_counter lock
which batching percpu allocations wouldn't be able to solve no matter
how well I do.

I'll review this more closely today.

> Even if this code could be patched to dodge these counters,  I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
> 
> With the fix I get an ok win, to quote from the commit:
> > Even at a very modest scale of 26 cores (ops/s):
> > before: 133543.63
> > after:  186061.81 (+39%)
> 
> > While with the patch these allocations remain a significant problem,
> > the primary bottleneck shifts to:
> > 
> >     __pv_queued_spin_lock_slowpath+1
> >     _raw_spin_lock_irqsave+57
> >     folio_lruvec_lock_irqsave+91
> >     release_pages+590
> >     tlb_batch_pages_flush+61
> >     tlb_finish_mmu+101
> >     exit_mmap+327
> >     __mmput+61
> >     begin_new_exec+1245
> >     load_elf_binary+712
> >     bprm_execve+644
> >     do_execveat_common.isra.0+429
> >     __x64_sys_execve+50
> >     do_syscall_64+46
> >     entry_SYSCALL_64_after_hwframe+110
> 
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)
> 
> With this out of the way I'll be looking at some form of caching to
> eliminate these allocs as a problem.
> 

I'm not against caching, this is just my first thought. Caching will
have an impact on the backing pages of percpu. All it takes is 1
allocation on a page for the current allocator to pin n pages of memory.
A few years ago percpu depopulation was implemented so that limits the
amount of resident backing pages.

Maybe the right thing to do is preallocate pools of common sized
allocations so that way they can be recycled such that we don't have to
think too hard about fragmentation that can occur if we populate these
pools over time?

Also as you've pointed out, it wasn't just the percpu allocation being
the bottleneck, but percpu_counter's global lock too for hotplug
support. I'm hazarding a guess most use cases of percpu might have
additional locking requirements too such as percpu_counter.

Thanks,
Dennis

> Thoughts?
> 
> Mateusz Guzik (2):
>   pcpcntr: add group allocation/free
>   fork: group allocation of per-cpu counters for mm struct
> 
>  include/linux/percpu_counter.h | 19 ++++++++---
>  kernel/fork.c                  | 13 ++------
>  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>  3 files changed, 60 insertions(+), 33 deletions(-)
> 
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-21 21:07 ` [PATCH 0/2] execve scalability issues, part 1 Dennis Zhou
@ 2023-08-21 21:39   ` Mateusz Guzik
  2023-08-21 22:29     ` Mateusz Guzik
  0 siblings, 1 reply; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-21 21:39 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: linux-kernel, tj, cl, akpm, shakeelb, linux-mm

On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> > With this out of the way I'll be looking at some form of caching to
> > eliminate these allocs as a problem.
> > 
> 
> I'm not against caching, this is just my first thought. Caching will
> have an impact on the backing pages of percpu. All it takes is 1
> allocation on a page for the current allocator to pin n pages of memory.
> A few years ago percpu depopulation was implemented so that limits the
> amount of resident backing pages.
> 

I'm painfully aware.

> Maybe the right thing to do is preallocate pools of common sized
> allocations so that way they can be recycled such that we don't have to
> think too hard about fragmentation that can occur if we populate these
> pools over time?
> 

This is what I was going to suggest :)

FreeBSD has a per-cpu allocator which pretends to be the same as the
slab allocator, except handing out per-cpu bufs. So far it has sizes 4,
8, 16, 32 and 64 and you can act as if you are mallocing in that size.

Scales perfectly fine of course since it caches objs per-CPU, but there
is some waste and I have 0 idea how it compares to what Linux is doing
on that front.

I stress though that even if you were to carve out certain sizes, a
global lock to handle ops will still kill scalability.

Perhaps granularity better than global, but less than per-CPU would be a
sweet spot for scalabability vs memory waste.

That said...

> Also as you've pointed out, it wasn't just the percpu allocation being
> the bottleneck, but percpu_counter's global lock too for hotplug
> support. I'm hazarding a guess most use cases of percpu might have
> additional locking requirements too such as percpu_counter.
> 

True Fix(tm) is a longer story.

Maybe let's sort out this patchset first, whichever way. :)

> Thanks,
> Dennis
> 
> > Thoughts?
> > 
> > Mateusz Guzik (2):
> >   pcpcntr: add group allocation/free
> >   fork: group allocation of per-cpu counters for mm struct
> > 
> >  include/linux/percpu_counter.h | 19 ++++++++---
> >  kernel/fork.c                  | 13 ++------
> >  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
> >  3 files changed, 60 insertions(+), 33 deletions(-)
> > 
> > -- 
> > 2.39.2
> > 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-21 21:39   ` Mateusz Guzik
@ 2023-08-21 22:29     ` Mateusz Guzik
  2023-08-22  9:51       ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-21 22:29 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: linux-kernel, tj, cl, akpm, shakeelb, linux-mm, jack

On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
>> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
>> > With this out of the way I'll be looking at some form of caching to
>> > eliminate these allocs as a problem.
>> >
>>
>> I'm not against caching, this is just my first thought. Caching will
>> have an impact on the backing pages of percpu. All it takes is 1
>> allocation on a page for the current allocator to pin n pages of memory.
>> A few years ago percpu depopulation was implemented so that limits the
>> amount of resident backing pages.
>>
>
> I'm painfully aware.
>
>> Maybe the right thing to do is preallocate pools of common sized
>> allocations so that way they can be recycled such that we don't have to
>> think too hard about fragmentation that can occur if we populate these
>> pools over time?
>>
>
> This is what I was going to suggest :)
>
> FreeBSD has a per-cpu allocator which pretends to be the same as the
> slab allocator, except handing out per-cpu bufs. So far it has sizes 4,
> 8, 16, 32 and 64 and you can act as if you are mallocing in that size.
>
> Scales perfectly fine of course since it caches objs per-CPU, but there
> is some waste and I have 0 idea how it compares to what Linux is doing
> on that front.
>
> I stress though that even if you were to carve out certain sizes, a
> global lock to handle ops will still kill scalability.
>
> Perhaps granularity better than global, but less than per-CPU would be a
> sweet spot for scalabability vs memory waste.
>
> That said...
>
>> Also as you've pointed out, it wasn't just the percpu allocation being
>> the bottleneck, but percpu_counter's global lock too for hotplug
>> support. I'm hazarding a guess most use cases of percpu might have
>> additional locking requirements too such as percpu_counter.
>>
>
> True Fix(tm) is a longer story.
>
> Maybe let's sort out this patchset first, whichever way. :)
>

So I found the discussion around the original patch with a perf
regression report.

https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/

The reporter suggests dodging the problem by only allocating per-cpu
counters when the process is going multithreaded. Given that there is
still plenty of forever single-threaded procs out there I think that's
does sound like a great plan regardless of what happens with this
patchset.

Almost all access is already done using dedicated routines, so this
should be an afternoon churn to sort out, unless I missed a
showstopper. (maybe there is no good place to stuff a flag/whatever
other indicator about the state of counters?)

That said I'll look into it some time this or next week.

>> Thanks,
>> Dennis
>>
>> > Thoughts?
>> >
>> > Mateusz Guzik (2):
>> >   pcpcntr: add group allocation/free
>> >   fork: group allocation of per-cpu counters for mm struct
>> >
>> >  include/linux/percpu_counter.h | 19 ++++++++---
>> >  kernel/fork.c                  | 13 ++------
>> >  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>> >  3 files changed, 60 insertions(+), 33 deletions(-)
>> >
>> > --
>> > 2.39.2
>> >
>


-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-21 22:29     ` Mateusz Guzik
@ 2023-08-22  9:51       ` Jan Kara
  2023-08-22 14:24         ` Mateusz Guzik
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2023-08-22  9:51 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Dennis Zhou, linux-kernel, tj, cl, akpm, shakeelb, linux-mm, jack

On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > True Fix(tm) is a longer story.
> >
> > Maybe let's sort out this patchset first, whichever way. :)
> >
> 
> So I found the discussion around the original patch with a perf
> regression report.
> 
> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> 
> The reporter suggests dodging the problem by only allocating per-cpu
> counters when the process is going multithreaded. Given that there is
> still plenty of forever single-threaded procs out there I think that's
> does sound like a great plan regardless of what happens with this
> patchset.
> 
> Almost all access is already done using dedicated routines, so this
> should be an afternoon churn to sort out, unless I missed a
> showstopper. (maybe there is no good place to stuff a flag/whatever
> other indicator about the state of counters?)
> 
> That said I'll look into it some time this or next week.

Good, just let me know how it went, I also wanted to start looking into
this to come up with some concrete patches :). What I had in mind was that
we could use 'counters == NULL' as an indication that the counter is still
in 'single counter mode'.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-22  9:51       ` Jan Kara
@ 2023-08-22 14:24         ` Mateusz Guzik
  2023-08-23  9:49           ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-22 14:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dennis Zhou, linux-kernel, tj, cl, akpm, shakeelb, linux-mm

On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> > True Fix(tm) is a longer story.
>> >
>> > Maybe let's sort out this patchset first, whichever way. :)
>> >
>>
>> So I found the discussion around the original patch with a perf
>> regression report.
>>
>> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>>
>> The reporter suggests dodging the problem by only allocating per-cpu
>> counters when the process is going multithreaded. Given that there is
>> still plenty of forever single-threaded procs out there I think that's
>> does sound like a great plan regardless of what happens with this
>> patchset.
>>
>> Almost all access is already done using dedicated routines, so this
>> should be an afternoon churn to sort out, unless I missed a
>> showstopper. (maybe there is no good place to stuff a flag/whatever
>> other indicator about the state of counters?)
>>
>> That said I'll look into it some time this or next week.
>
> Good, just let me know how it went, I also wanted to start looking into
> this to come up with some concrete patches :). What I had in mind was that
> we could use 'counters == NULL' as an indication that the counter is still
> in 'single counter mode'.
>

In the current state there are only pointers to counters in mm_struct
and there is no storage for them in task_struct. So I don't think
merely null-checking the per-cpu stuff is going to cut it -- where
should the single-threaded counters land?

Bonus problem, non-current can modify these counters and this needs to
be safe against current playing with them at the same time. (and it
would be a shame to require current to use atomic on them)

That said, my initial proposal adds a union:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e74ce4a28cd..ea70f0c08286 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -737,7 +737,11 @@ struct mm_struct {

                unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
/proc/PID/auxv */

-               struct percpu_counter rss_stat[NR_MM_COUNTERS];
+               union {
+                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
+                       u64 *rss_stat_single;
+               };
+               bool    magic_flag_stuffed_elsewhere;

                struct linux_binfmt *binfmt;

Then for single-threaded case an area is allocated for NR_MM_COUNTERS
countes * 2 -- first set updated without any synchro by current
thread. Second set only to be modified by others and protected with
mm->arg_lock. The lock protects remote access to the union to begin
with.

Transition to per-CPU operation sets the magic flag (there is plenty
of spare space in mm_struct, I'll find a good home for it without
growing the struct). It would be a one-way street -- a process which
gets a bunch of threads and goes back to one stays with per-CPU.

Then you get the true value of something by adding both counters.

arg_lock is sparingly used, so remote ops are not expected to contend
with anything. In fact their cost is going to go down since percpu
summation takes a spinlock which also disables interrupts.

Local ops should be about the same in cost as they are right now.

I might have missed some detail in the above description, but I think
the approach is decent.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-22 14:24         ` Mateusz Guzik
@ 2023-08-23  9:49           ` Jan Kara
  2023-08-23 10:49             ` David Laight
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Jan Kara @ 2023-08-23  9:49 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jan Kara, Dennis Zhou, linux-kernel, tj, cl, akpm, shakeelb,
	linux-mm

On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> > True Fix(tm) is a longer story.
> >> >
> >> > Maybe let's sort out this patchset first, whichever way. :)
> >> >
> >>
> >> So I found the discussion around the original patch with a perf
> >> regression report.
> >>
> >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> >>
> >> The reporter suggests dodging the problem by only allocating per-cpu
> >> counters when the process is going multithreaded. Given that there is
> >> still plenty of forever single-threaded procs out there I think that's
> >> does sound like a great plan regardless of what happens with this
> >> patchset.
> >>
> >> Almost all access is already done using dedicated routines, so this
> >> should be an afternoon churn to sort out, unless I missed a
> >> showstopper. (maybe there is no good place to stuff a flag/whatever
> >> other indicator about the state of counters?)
> >>
> >> That said I'll look into it some time this or next week.
> >
> > Good, just let me know how it went, I also wanted to start looking into
> > this to come up with some concrete patches :). What I had in mind was that
> > we could use 'counters == NULL' as an indication that the counter is still
> > in 'single counter mode'.
> >
> 
> In the current state there are only pointers to counters in mm_struct
> and there is no storage for them in task_struct. So I don't think
> merely null-checking the per-cpu stuff is going to cut it -- where
> should the single-threaded counters land?

I think you misunderstood. What I wanted to do it to provide a new flavor
of percpu_counter (sharing most of code and definitions) which would have
an option to start as simple counter (indicated by pcc->counters == NULL
and using pcc->count for counting) and then be upgraded by a call to real
percpu thing. Because I think such counters would be useful also on other
occasions than as rss counters.

> Bonus problem, non-current can modify these counters and this needs to
> be safe against current playing with them at the same time. (and it
> would be a shame to require current to use atomic on them)

Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
modifying the counters for other processes. Thanks for pointing this out.

> That said, my initial proposal adds a union:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5e74ce4a28cd..ea70f0c08286 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -737,7 +737,11 @@ struct mm_struct {
> 
>                 unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
> /proc/PID/auxv */
> 
> -               struct percpu_counter rss_stat[NR_MM_COUNTERS];
> +               union {
> +                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
> +                       u64 *rss_stat_single;
> +               };
> +               bool    magic_flag_stuffed_elsewhere;
> 
>                 struct linux_binfmt *binfmt;
> 
> 
> Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> countes * 2 -- first set updated without any synchro by current
> thread. Second set only to be modified by others and protected with
> mm->arg_lock. The lock protects remote access to the union to begin
> with.

arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
with two counters is clever but I'm not 100% convinced the complexity is
really worth it. I'm not sure the overhead of always using an atomic
counter would really be measurable as atomic counter ops in local CPU cache
tend to be cheap. Did you try to measure the difference?

If the second counter proves to be worth it, we could make just that one
atomic to avoid the need for abusing some spinlock.

> Transition to per-CPU operation sets the magic flag (there is plenty
> of spare space in mm_struct, I'll find a good home for it without
> growing the struct). It would be a one-way street -- a process which
> gets a bunch of threads and goes back to one stays with per-CPU.

Agreed with switching to be a one-way street.

> Then you get the true value of something by adding both counters.
> 
> arg_lock is sparingly used, so remote ops are not expected to contend
> with anything. In fact their cost is going to go down since percpu
> summation takes a spinlock which also disables interrupts.
> 
> Local ops should be about the same in cost as they are right now.
> 
> I might have missed some detail in the above description, but I think
> the approach is decent.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23  9:49           ` Jan Kara
@ 2023-08-23 10:49             ` David Laight
  2023-08-23 12:01               ` Mateusz Guzik
  2023-08-23 12:13             ` Mateusz Guzik
  2023-08-23 20:27             ` Dennis Zhou
  2 siblings, 1 reply; 31+ messages in thread
From: David Laight @ 2023-08-23 10:49 UTC (permalink / raw)
  To: 'Jan Kara', Mateusz Guzik
  Cc: Dennis Zhou, linux-kernel@vger.kernel.org, tj@kernel.org,
	cl@linux.com, akpm@linux-foundation.org, shakeelb@google.com,
	linux-mm@kvack.org

From: Jan Kara
> Sent: Wednesday, August 23, 2023 10:49 AM
....
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> >
> >                 unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
> > /proc/PID/auxv */
> >
> > -               struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +               union {
> > +                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +                       u64 *rss_stat_single;
> > +               };
> > +               bool    magic_flag_stuffed_elsewhere;

I wouldn't use a union to save a pointer - it is asking for trouble.

> >
> >                 struct linux_binfmt *binfmt;
> >
> >
> > Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> > countes * 2 -- first set updated without any synchro by current
> > thread. Second set only to be modified by others and protected with
> > mm->arg_lock. The lock protects remote access to the union to begin
> > with.
> 
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?

A separate lock is worse than atomics.
(Although some 32bit arch may have issues with 64bit atomics.)

I think you'll be surprised just how slow atomic ops are.
Even when present in the local cache.
(Probably because any other copies have to be invalidated.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23 10:49             ` David Laight
@ 2023-08-23 12:01               ` Mateusz Guzik
  0 siblings, 0 replies; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-23 12:01 UTC (permalink / raw)
  To: David Laight
  Cc: Jan Kara, Dennis Zhou, linux-kernel@vger.kernel.org,
	tj@kernel.org, cl@linux.com, akpm@linux-foundation.org,
	shakeelb@google.com, linux-mm@kvack.org

On 8/23/23, David Laight <David.Laight@aculab.com> wrote:
> From: Jan Kara
>> Sent: Wednesday, August 23, 2023 10:49 AM
> ....
>> > --- a/include/linux/mm_types.h
>> > +++ b/include/linux/mm_types.h
>> > @@ -737,7 +737,11 @@ struct mm_struct {
>> >
>> >                 unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
>> > /proc/PID/auxv */
>> >
>> > -               struct percpu_counter rss_stat[NR_MM_COUNTERS];
>> > +               union {
>> > +                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
>> > +                       u64 *rss_stat_single;
>> > +               };
>> > +               bool    magic_flag_stuffed_elsewhere;
>
> I wouldn't use a union to save a pointer - it is asking for trouble.
>

I may need to abandon this bit anyway -- counter init adds counters to
a global list and I can't call easily call it like that.

>> >
>> >                 struct linux_binfmt *binfmt;
>> >
>> >
>> > Then for single-threaded case an area is allocated for NR_MM_COUNTERS
>> > countes * 2 -- first set updated without any synchro by current
>> > thread. Second set only to be modified by others and protected with
>> > mm->arg_lock. The lock protects remote access to the union to begin
>> > with.
>>
>> arg_lock seems a bit like a hack. How is it related to rss_stat? The
>> scheme
>> with two counters is clever but I'm not 100% convinced the complexity is
>> really worth it. I'm not sure the overhead of always using an atomic
>> counter would really be measurable as atomic counter ops in local CPU
>> cache
>> tend to be cheap. Did you try to measure the difference?
>
> A separate lock is worse than atomics.
> (Although some 32bit arch may have issues with 64bit atomics.)
>

But in my proposal the separate lock is used to facilitate *NOT* using
atomics by the most common consumer -- the only thread.

The lock is only used for the transition to multithreaded state for
updated by remote parties (both rare compared to updated by current).

> I think you'll be surprised just how slow atomic ops are.
> Even when present in the local cache.
> (Probably because any other copies have to be invalidated.)
>

Agreed. They have always been super expensive on x86-64 (and continue
to be). I keep running to claims they are not, I don't know where
that's coming from.

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23  9:49           ` Jan Kara
  2023-08-23 10:49             ` David Laight
@ 2023-08-23 12:13             ` Mateusz Guzik
  2023-08-23 15:47               ` Jan Kara
  2023-08-23 20:27             ` Dennis Zhou
  2 siblings, 1 reply; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-23 12:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dennis Zhou, linux-kernel, tj, cl, akpm, shakeelb, linux-mm

On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
>> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >> > True Fix(tm) is a longer story.
>> >> >
>> >> > Maybe let's sort out this patchset first, whichever way. :)
>> >> >
>> >>
>> >> So I found the discussion around the original patch with a perf
>> >> regression report.
>> >>
>> >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>> >>
>> >> The reporter suggests dodging the problem by only allocating per-cpu
>> >> counters when the process is going multithreaded. Given that there is
>> >> still plenty of forever single-threaded procs out there I think that's
>> >> does sound like a great plan regardless of what happens with this
>> >> patchset.
>> >>
>> >> Almost all access is already done using dedicated routines, so this
>> >> should be an afternoon churn to sort out, unless I missed a
>> >> showstopper. (maybe there is no good place to stuff a flag/whatever
>> >> other indicator about the state of counters?)
>> >>
>> >> That said I'll look into it some time this or next week.
>> >
>> > Good, just let me know how it went, I also wanted to start looking into
>> > this to come up with some concrete patches :). What I had in mind was
>> > that
>> > we could use 'counters == NULL' as an indication that the counter is
>> > still
>> > in 'single counter mode'.
>> >
>>
>> In the current state there are only pointers to counters in mm_struct
>> and there is no storage for them in task_struct. So I don't think
>> merely null-checking the per-cpu stuff is going to cut it -- where
>> should the single-threaded counters land?
>
> I think you misunderstood. What I wanted to do it to provide a new flavor
> of percpu_counter (sharing most of code and definitions) which would have
> an option to start as simple counter (indicated by pcc->counters == NULL
> and using pcc->count for counting) and then be upgraded by a call to real
> percpu thing. Because I think such counters would be useful also on other
> occasions than as rss counters.
>

Indeed I did -- I had tunnel vision on dodging atomics for current
given remote modifications, which wont be done in your proposal.

I concede your idea solves the problem at hand, I question whether it
is the right to do though. Not my call to make.

>> Then for single-threaded case an area is allocated for NR_MM_COUNTERS
>> countes * 2 -- first set updated without any synchro by current
>> thread. Second set only to be modified by others and protected with
>> mm->arg_lock. The lock protects remote access to the union to begin
>> with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?
>

arg_lock is not as is, it would have to be renamed to something more generic.

Atomics on x86-64 are very expensive to this very day. Here is a
sample measurement of 2 atomics showing up done by someone else:
https://lore.kernel.org/oe-lkp/202308141149.d38fdf91-oliver.sang@intel.com/T/#u

tl;dr it is *really* bad.

> If the second counter proves to be worth it, we could make just that one
> atomic to avoid the need for abusing some spinlock.
>

The spinlock would be there to synchronize against the transition to
per-cpu -- any trickery is avoided and we trivially know for a fact
the remote party either sees the per-cpu state if transitioned, or
local if not. Then one easily knows no updates have been lost and the
buf for 2 sets of counters can be safely freed.

While writing down the idea previously I did not realize the per-cpu
counter ops disable interrupts around the op. That's already very slow
and the trip should be comparable to paying for an atomic (as in the
patch which introduced percpu counters here slowed things down for
single-threaded processes).

With your proposal the atomic would be there, but interrupt trip could
be avoided. This would roughly maintain the current cost of doing the
op (as in it would not get /worse/). My patch would make it lower.

All that said, I'm going to refrain from writing a patch for the time
being. If powers to be decide on your approach, I'm not going to argue
-- I don't think either is a clear winner over the other.

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23 12:13             ` Mateusz Guzik
@ 2023-08-23 15:47               ` Jan Kara
  2023-08-23 16:10                 ` Mateusz Guzik
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2023-08-23 15:47 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jan Kara, Dennis Zhou, linux-kernel, tj, cl, akpm, shakeelb,
	linux-mm

On Wed 23-08-23 14:13:20, Mateusz Guzik wrote:
> On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> >> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> >> Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> >> countes * 2 -- first set updated without any synchro by current
> >> thread. Second set only to be modified by others and protected with
> >> mm->arg_lock. The lock protects remote access to the union to begin
> >> with.
> >
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> >
> 
> arg_lock is not as is, it would have to be renamed to something more generic.

Ah, OK.

> Atomics on x86-64 are very expensive to this very day. Here is a
> sample measurement of 2 atomics showing up done by someone else:
> https://lore.kernel.org/oe-lkp/202308141149.d38fdf91-oliver.sang@intel.com/T/#u
> 
> tl;dr it is *really* bad.

I didn't express myself well. Sure atomics are expensive compared to plain
arithmetic operations. But I wanted to say - we had atomics for RSS
counters before commit f1a7941243 ("mm: convert mm's rss stats into
percpu_counter") and people seemed happy with it until there were many CPUs
contending on the updates. So maybe RSS counters aren't used heavily enough
for the difference to practically matter? Probably operation like faulting
in (or unmapping) tmpfs file has the highest chance of showing the cost of
rss accounting compared to the cost of the remainder of the operation...

> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
> 
> The spinlock would be there to synchronize against the transition to
> per-cpu -- any trickery is avoided and we trivially know for a fact
> the remote party either sees the per-cpu state if transitioned, or
> local if not. Then one easily knows no updates have been lost and the
> buf for 2 sets of counters can be safely freed.

Yeah, the spinlock makes the transition simpler, I agree.

> While writing down the idea previously I did not realize the per-cpu
> counter ops disable interrupts around the op. That's already very slow
> and the trip should be comparable to paying for an atomic (as in the
> patch which introduced percpu counters here slowed things down for
> single-threaded processes).
> 
> With your proposal the atomic would be there, but interrupt trip could
> be avoided. This would roughly maintain the current cost of doing the
> op (as in it would not get /worse/). My patch would make it lower.
> 
> All that said, I'm going to refrain from writing a patch for the time
> being. If powers to be decide on your approach, I'm not going to argue
> -- I don't think either is a clear winner over the other.

I guess we'll need to code it and compare the results :)

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23 15:47               ` Jan Kara
@ 2023-08-23 16:10                 ` Mateusz Guzik
  2023-08-23 16:41                   ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-23 16:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dennis Zhou, linux-kernel, tj, cl, akpm, shakeelb, linux-mm

On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> I didn't express myself well. Sure atomics are expensive compared to plain
> arithmetic operations. But I wanted to say - we had atomics for RSS
> counters before commit f1a7941243 ("mm: convert mm's rss stats into
> percpu_counter") and people seemed happy with it until there were many CPUs
> contending on the updates. So maybe RSS counters aren't used heavily enough
> for the difference to practically matter? Probably operation like faulting
> in (or unmapping) tmpfs file has the highest chance of showing the cost of
> rss accounting compared to the cost of the remainder of the operation...
>

These stats used to be decentralized by storing them in task_struct,
the commit complains about values deviating too much.

The value would get synced every 64 uses, from the diff:
-/* sync counter once per 64 page faults */
-#define TASK_RSS_EVENTS_THRESH (64)
-static void check_sync_rss_stat(struct task_struct *task)
-{
-       if (unlikely(task != current))
-               return;
-       if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
-               sync_mm_rss(task->mm);
-}

other than that it was a non-atomic update in struct thread.

-static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
-{
-       struct task_struct *task = current;
-
-       if (likely(task->mm == mm))
-               task->rss_stat.count[member] += val;
-       else
-               add_mm_counter(mm, member, val);
-}

So the question is how much does this matter. My personal approach is
that avoidable slowdowns (like atomics here) only facilitate further
avoidable slowdowns as people can claim there is a minuscule change in
% to baseline. But if the baseline is already slow....

Anyhow, I just found that patch failed to completely remove
SPLIT_RSS_COUNTING. I'm going to submit something about that later.

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23 16:10                 ` Mateusz Guzik
@ 2023-08-23 16:41                   ` Jan Kara
  2023-08-23 17:12                     ` Mateusz Guzik
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2023-08-23 16:41 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jan Kara, Dennis Zhou, linux-kernel, tj, cl, akpm, shakeelb,
	linux-mm

On Wed 23-08-23 18:10:29, Mateusz Guzik wrote:
> On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> > I didn't express myself well. Sure atomics are expensive compared to plain
> > arithmetic operations. But I wanted to say - we had atomics for RSS
> > counters before commit f1a7941243 ("mm: convert mm's rss stats into
> > percpu_counter") and people seemed happy with it until there were many CPUs
> > contending on the updates. So maybe RSS counters aren't used heavily enough
> > for the difference to practically matter? Probably operation like faulting
> > in (or unmapping) tmpfs file has the highest chance of showing the cost of
> > rss accounting compared to the cost of the remainder of the operation...
> >
> 
> These stats used to be decentralized by storing them in task_struct,
> the commit complains about values deviating too much.
> 
> The value would get synced every 64 uses, from the diff:
> -/* sync counter once per 64 page faults */
> -#define TASK_RSS_EVENTS_THRESH (64)
> -static void check_sync_rss_stat(struct task_struct *task)
> -{
> -       if (unlikely(task != current))
> -               return;
> -       if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
> -               sync_mm_rss(task->mm);
> -}
> 
> other than that it was a non-atomic update in struct thread.
> 
> -static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> -{
> -       struct task_struct *task = current;
> -
> -       if (likely(task->mm == mm))
> -               task->rss_stat.count[member] += val;
> -       else
> -               add_mm_counter(mm, member, val);
> -}

Ah, I see. I already forgot these details since I was checking the
regression back in spring. Now I've just seen the atomic_long_t counters in
task_struct and forgot there used to be also these per-thread ones. Thanks
for refreshing my memory!

> So the question is how much does this matter. My personal approach is
> that avoidable slowdowns (like atomics here) only facilitate further
> avoidable slowdowns as people can claim there is a minuscule change in
> % to baseline. But if the baseline is already slow....

I get your point but over the years I've also learned that premature
optimization isn't good either as we will be dragging the maintenance
burden for a long time ;) It's a balance.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23 16:41                   ` Jan Kara
@ 2023-08-23 17:12                     ` Mateusz Guzik
  0 siblings, 0 replies; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-23 17:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dennis Zhou, linux-kernel, tj, cl, akpm, shakeelb, linux-mm

On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> On Wed 23-08-23 18:10:29, Mateusz Guzik wrote:
>> So the question is how much does this matter. My personal approach is
>> that avoidable slowdowns (like atomics here) only facilitate further
>> avoidable slowdowns as people can claim there is a minuscule change in
>> % to baseline. But if the baseline is already slow....
>
> I get your point but over the years I've also learned that premature
> optimization isn't good either as we will be dragging the maintenance
> burden for a long time ;) It's a balance.
>

Mate, your proposal is not maintenance burden-free either. ;)

I claim mine is simpler and faster for single threaded case, but is
not generic so should another consumer show up with the single vs
multithreaded need, there may or may not be more work to accommodate
it.

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23  9:49           ` Jan Kara
  2023-08-23 10:49             ` David Laight
  2023-08-23 12:13             ` Mateusz Guzik
@ 2023-08-23 20:27             ` Dennis Zhou
  2023-08-24  9:19               ` Jan Kara
  2 siblings, 1 reply; 31+ messages in thread
From: Dennis Zhou @ 2023-08-23 20:27 UTC (permalink / raw)
  To: Jan Kara, Mateusz Guzik; +Cc: linux-kernel, tj, cl, akpm, shakeelb, linux-mm

On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> > True Fix(tm) is a longer story.
> > >> >
> > >> > Maybe let's sort out this patchset first, whichever way. :)
> > >> >
> > >>
> > >> So I found the discussion around the original patch with a perf
> > >> regression report.
> > >>
> > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > >>
> > >> The reporter suggests dodging the problem by only allocating per-cpu
> > >> counters when the process is going multithreaded. Given that there is
> > >> still plenty of forever single-threaded procs out there I think that's
> > >> does sound like a great plan regardless of what happens with this
> > >> patchset.
> > >>
> > >> Almost all access is already done using dedicated routines, so this
> > >> should be an afternoon churn to sort out, unless I missed a
> > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > >> other indicator about the state of counters?)
> > >>
> > >> That said I'll look into it some time this or next week.
> > >
> > > Good, just let me know how it went, I also wanted to start looking into
> > > this to come up with some concrete patches :). What I had in mind was that
> > > we could use 'counters == NULL' as an indication that the counter is still
> > > in 'single counter mode'.
> > >
> > 
> > In the current state there are only pointers to counters in mm_struct
> > and there is no storage for them in task_struct. So I don't think
> > merely null-checking the per-cpu stuff is going to cut it -- where
> > should the single-threaded counters land?
> 
> I think you misunderstood. What I wanted to do it to provide a new flavor
> of percpu_counter (sharing most of code and definitions) which would have
> an option to start as simple counter (indicated by pcc->counters == NULL
> and using pcc->count for counting) and then be upgraded by a call to real
> percpu thing. Because I think such counters would be useful also on other
> occasions than as rss counters.
> 

Kent wrote something similar and sent it out last year [1]. However, the
case slightly differs from what we'd want here, 1 -> 2 threads becomes
percpu vs update rate which a single thread might be able to trigger?

[1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@google.com/

Thanks,
Dennis

> > Bonus problem, non-current can modify these counters and this needs to
> > be safe against current playing with them at the same time. (and it
> > would be a shame to require current to use atomic on them)
> 
> Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> modifying the counters for other processes. Thanks for pointing this out.
> 
> > That said, my initial proposal adds a union:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5e74ce4a28cd..ea70f0c08286 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> > 
> >                 unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
> > /proc/PID/auxv */
> > 
> > -               struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +               union {
> > +                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +                       u64 *rss_stat_single;
> > +               };
> > +               bool    magic_flag_stuffed_elsewhere;
> > 
> >                 struct linux_binfmt *binfmt;
> > 
> > 
> > Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> > countes * 2 -- first set updated without any synchro by current
> > thread. Second set only to be modified by others and protected with
> > mm->arg_lock. The lock protects remote access to the union to begin
> > with.
> 
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?
> 
> If the second counter proves to be worth it, we could make just that one
> atomic to avoid the need for abusing some spinlock.
> 
> > Transition to per-CPU operation sets the magic flag (there is plenty
> > of spare space in mm_struct, I'll find a good home for it without
> > growing the struct). It would be a one-way street -- a process which
> > gets a bunch of threads and goes back to one stays with per-CPU.
> 
> Agreed with switching to be a one-way street.
> 
> > Then you get the true value of something by adding both counters.
> > 
> > arg_lock is sparingly used, so remote ops are not expected to contend
> > with anything. In fact their cost is going to go down since percpu
> > summation takes a spinlock which also disables interrupts.
> > 
> > Local ops should be about the same in cost as they are right now.
> > 
> > I might have missed some detail in the above description, but I think
> > the approach is decent.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-23 20:27             ` Dennis Zhou
@ 2023-08-24  9:19               ` Jan Kara
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2023-08-24  9:19 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Jan Kara, Mateusz Guzik, linux-kernel, tj, cl, akpm, shakeelb,
	linux-mm

On Wed 23-08-23 13:27:56, Dennis Zhou wrote:
> On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > > On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > > >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >> > True Fix(tm) is a longer story.
> > > >> >
> > > >> > Maybe let's sort out this patchset first, whichever way. :)
> > > >> >
> > > >>
> > > >> So I found the discussion around the original patch with a perf
> > > >> regression report.
> > > >>
> > > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > > >>
> > > >> The reporter suggests dodging the problem by only allocating per-cpu
> > > >> counters when the process is going multithreaded. Given that there is
> > > >> still plenty of forever single-threaded procs out there I think that's
> > > >> does sound like a great plan regardless of what happens with this
> > > >> patchset.
> > > >>
> > > >> Almost all access is already done using dedicated routines, so this
> > > >> should be an afternoon churn to sort out, unless I missed a
> > > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > > >> other indicator about the state of counters?)
> > > >>
> > > >> That said I'll look into it some time this or next week.
> > > >
> > > > Good, just let me know how it went, I also wanted to start looking into
> > > > this to come up with some concrete patches :). What I had in mind was that
> > > > we could use 'counters == NULL' as an indication that the counter is still
> > > > in 'single counter mode'.
> > > >
> > > 
> > > In the current state there are only pointers to counters in mm_struct
> > > and there is no storage for them in task_struct. So I don't think
> > > merely null-checking the per-cpu stuff is going to cut it -- where
> > > should the single-threaded counters land?
> > 
> > I think you misunderstood. What I wanted to do it to provide a new flavor
> > of percpu_counter (sharing most of code and definitions) which would have
> > an option to start as simple counter (indicated by pcc->counters == NULL
> > and using pcc->count for counting) and then be upgraded by a call to real
> > percpu thing. Because I think such counters would be useful also on other
> > occasions than as rss counters.
> > 
> 
> Kent wrote something similar and sent it out last year [1]. However, the
> case slightly differs from what we'd want here, 1 -> 2 threads becomes
> percpu vs update rate which a single thread might be able to trigger?

Thanks for the pointer but that version of counters is not really suitable
here as is (but we could factor out some common bits if that work is
happening). 1 thread can easily do 10000 RSS updates per second.

> [1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@google.com/

								Honza

> > > Bonus problem, non-current can modify these counters and this needs to
> > > be safe against current playing with them at the same time. (and it
> > > would be a shame to require current to use atomic on them)
> > 
> > Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> > modifying the counters for other processes. Thanks for pointing this out.
> > 
> > > That said, my initial proposal adds a union:
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 5e74ce4a28cd..ea70f0c08286 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -737,7 +737,11 @@ struct mm_struct {
> > > 
> > >                 unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for
> > > /proc/PID/auxv */
> > > 
> > > -               struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > +               union {
> > > +                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > +                       u64 *rss_stat_single;
> > > +               };
> > > +               bool    magic_flag_stuffed_elsewhere;
> > > 
> > >                 struct linux_binfmt *binfmt;
> > > 
> > > 
> > > Then for single-threaded case an area is allocated for NR_MM_COUNTERS
> > > countes * 2 -- first set updated without any synchro by current
> > > thread. Second set only to be modified by others and protected with
> > > mm->arg_lock. The lock protects remote access to the union to begin
> > > with.
> > 
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> > 
> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
> > 
> > > Transition to per-CPU operation sets the magic flag (there is plenty
> > > of spare space in mm_struct, I'll find a good home for it without
> > > growing the struct). It would be a one-way street -- a process which
> > > gets a bunch of threads and goes back to one stays with per-CPU.
> > 
> > Agreed with switching to be a one-way street.
> > 
> > > Then you get the true value of something by adding both counters.
> > > 
> > > arg_lock is sparingly used, so remote ops are not expected to contend
> > > with anything. In fact their cost is going to go down since percpu
> > > summation takes a spinlock which also disables interrupts.
> > > 
> > > Local ops should be about the same in cost as they are right now.
> > > 
> > > I might have missed some detail in the above description, but I think
> > > the approach is decent.
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] execve scalability issues, part 1
  2023-08-21 20:28 [PATCH 0/2] execve scalability issues, part 1 Mateusz Guzik
                   ` (3 preceding siblings ...)
  2023-08-21 21:07 ` [PATCH 0/2] execve scalability issues, part 1 Dennis Zhou
@ 2023-08-26 18:33 ` Mateusz Guzik
  4 siblings, 0 replies; 31+ messages in thread
From: Mateusz Guzik @ 2023-08-26 18:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: dennis, tj, cl, akpm, shakeelb, linux-mm, jack

On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> To start I figured I'm going to bench about as friendly case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
>
> I borrowed the bench from found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> It prints a result every second (warning: first line is garbage).
>
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
>
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
>
> Even if this code could be patched to dodge these counters,  I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
>
> With the fix I get an ok win, to quote from the commit:
>> Even at a very modest scale of 26 cores (ops/s):
>> before: 133543.63
>> after:  186061.81 (+39%)
>

So to sum up, a v3 of the patchset is queued up here:
https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-next

For interested I temporarily got my hands on something exceeding the
hand watch scale benched above -- a 192-way AMD EPYC 7R13 box (2
sockets x 48 cores x 2 threads).

A 6.5 kernel + the patchset only gets south of 140k execs/s when
running ./static-doexec 192

According to perf top:
  51.04%  [kernel]           [k] osq_lock
   6.82%  [kernel]           [k] __raw_callee_save___kvm_vcpu_is_preempted
   2.98%  [kernel]           [k] _atomic_dec_and_lock_irqsave
   1.62%  [kernel]           [k] rcu_cblist_dequeue
   1.54%  [kernel]           [k] refcount_dec_not_one
   1.51%  [kernel]           [k] __mod_lruvec_page_state
   1.46%  [kernel]           [k] put_cred_rcu
   1.34%  [kernel]           [k] native_queued_spin_lock_slowpath
   0.94%  [kernel]           [k] srso_alias_safe_ret
   0.81%  [kernel]           [k] memset_orig
   0.77%  [kernel]           [k] unmap_page_range
   0.73%  [kernel]           [k] _compound_head
   0.72%  [kernel]           [k] kmem_cache_free

Then bpftrace -e 'kprobe:osq_lock { @[kstack()] = count(); }' shows:

@[
    osq_lock+1
    __mutex_lock_killable_slowpath+19
    mutex_lock_killable+62
    pcpu_alloc+1219
    __alloc_percpu_gfp+18
    __percpu_counter_init_many+43
    mm_init+727
    mm_alloc+78
    alloc_bprm+138
    do_execveat_common.isra.0+103
    __x64_sys_execve+55
    do_syscall_64+54
    entry_SYSCALL_64_after_hwframe+110
]: 637370

@[
    osq_lock+1
    __mutex_lock_killable_slowpath+19
    mutex_lock_killable+62
    pcpu_alloc+1219
    __alloc_percpu+21
    mm_init+577
    mm_alloc+78
    alloc_bprm+138
    do_execveat_common.isra.0+103
    __x64_sys_execve+55
    do_syscall_64+54
    entry_SYSCALL_64_after_hwframe+110
]: 638036

That is per-cpu allocation is still on top at this scale.

But more importantly there are *TWO* unrelated back-to-back per-cpu
allocs -- one by rss counters and one by mm_alloc_cid.

That is to say per-cpu alloc scalability definitely needs to get
fixed, I'll ponder about it.

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2023-08-26 18:33 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-21 20:28 [PATCH 0/2] execve scalability issues, part 1 Mateusz Guzik
2023-08-21 20:28 ` [PATCH 1/2] pcpcntr: add group allocation/free Mateusz Guzik
2023-08-22 13:37   ` Vegard Nossum
2023-08-22 14:06     ` Mateusz Guzik
2023-08-22 17:02   ` Dennis Zhou
2023-08-21 20:28 ` [PATCH 2/2] fork: group allocation of per-cpu counters for mm struct Mateusz Guzik
2023-08-21 21:20   ` Matthew Wilcox
2023-08-21 20:42 ` [PATCH 0/2] execve scalability issues, part 1 Matthew Wilcox
2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 2/7] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 3/7] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 4/7] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 5/7] memcg: Add mem_cgroup_uncharge_batch() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 6/7] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 7/7] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
2023-08-21 21:07 ` [PATCH 0/2] execve scalability issues, part 1 Dennis Zhou
2023-08-21 21:39   ` Mateusz Guzik
2023-08-21 22:29     ` Mateusz Guzik
2023-08-22  9:51       ` Jan Kara
2023-08-22 14:24         ` Mateusz Guzik
2023-08-23  9:49           ` Jan Kara
2023-08-23 10:49             ` David Laight
2023-08-23 12:01               ` Mateusz Guzik
2023-08-23 12:13             ` Mateusz Guzik
2023-08-23 15:47               ` Jan Kara
2023-08-23 16:10                 ` Mateusz Guzik
2023-08-23 16:41                   ` Jan Kara
2023-08-23 17:12                     ` Mateusz Guzik
2023-08-23 20:27             ` Dennis Zhou
2023-08-24  9:19               ` Jan Kara
2023-08-26 18:33 ` Mateusz Guzik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).