linux-mm.kvack.org archive mirror
* [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
@ 2024-07-05  8:48 Saurabh Sengar
  2024-07-05 20:59 ` Andrew Morton
  2024-08-09  5:20 ` Andrew Morton
  0 siblings, 2 replies; 12+ messages in thread
From: Saurabh Sengar @ 2024-07-05  8:48 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel; +Cc: ssengar, wei.liu

The refresh_zone_stat_thresholds() function has two loops, which is
expensive on systems with a large number of CPUs and NUMA nodes.

Below is a rough estimate of the total number of iterations performed by
these loops, based on the number of NUMA nodes and CPUs.

Total number of iterations: nCPU * 2 * Numa * mCPU
Where:
 nCPU = total number of CPUs
 Numa = total number of NUMA nodes
 mCPU = mean number of CPUs online during bring-up (~nCPU/2, e.g. 512 for 1024 total CPUs)

For the system under test with 16 NUMA nodes and 1024 CPUs, this
results in a substantial increase in the number of loop iterations
during boot-up when NUMA is enabled:

No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
takes around 224 ms total for all the CPUs in the system under test.
16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
takes around 4.5 seconds total for all the CPUs in the system under test.
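
As a quick sanity check of the arithmetic above, here is a tiny standalone
userspace program (illustrative only, not part of the patch) that reproduces
both iteration counts:

#include <stdio.h>

/*
 * Illustrative model of the cost estimate above, not kernel code:
 * each of the nCPU CPU-online events runs two loops over
 * (NUMA nodes x mean number of CPUs already online).
 */
static unsigned long iterations(unsigned long ncpu, unsigned long numa)
{
	unsigned long mcpu = ncpu / 2;	/* mean online CPUs during bring-up */

	return ncpu * 2 * numa * mcpu;
}

int main(void)
{
	printf("no NUMA : %lu\n", iterations(1024, 1));		/*  1,048,576 */
	printf("16 NUMA : %lu\n", iterations(1024, 16));	/* 16,777,216 */
	return 0;
}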

Calling this for each CPU is expensive when there are a large number
of CPUs along with multiple NUMA nodes. Fix this by deferring
refresh_zone_stat_thresholds() so that it is called only once, after
all the secondary CPUs are up. Also, register the DYN hooks to keep the
existing hotplug functionality intact.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
 mm/vmstat.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6e3347789eb2..5c0df62238d9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -31,6 +31,7 @@
 
 #include "internal.h"
 
+static int vmstat_late_init_done;
 #ifdef CONFIG_NUMA
 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
 
@@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
 
 static int vmstat_cpu_online(unsigned int cpu)
 {
-	refresh_zone_stat_thresholds();
+	if (vmstat_late_init_done)
+		refresh_zone_stat_thresholds();
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
 		node_set_state(cpu_to_node(cpu), N_CPU);
@@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
+static int __init vmstat_late_init(void)
+{
+	refresh_zone_stat_thresholds();
+	vmstat_late_init_done = 1;
+
+	return 0;
+}
+late_initcall(vmstat_late_init);
 #endif
 
 struct workqueue_struct *mm_percpu_wq;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-07-05  8:48 [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup Saurabh Sengar
@ 2024-07-05 20:59 ` Andrew Morton
  2024-07-09  4:57   ` Saurabh Singh Sengar
  2024-08-09  5:20 ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2024-07-05 20:59 UTC (permalink / raw)
  To: Saurabh Sengar; +Cc: linux-mm, linux-kernel, ssengar, wei.liu

On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@linux.microsoft.com> wrote:

> refresh_zone_stat_thresholds function has two loops which is expensive for
> higher number of CPUs and NUMA nodes.
> 
> Below is the rough estimation of total iterations done by these loops
> based on number of NUMA and CPUs.
> 
> Total number of iterations: nCPU * 2 * Numa * mCPU
> Where:
>  nCPU = total number of CPUs
>  Numa = total number of NUMA nodes
>  mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
> 
> For the system under test with 16 NUMA nodes and 1024 CPUs, this
> results in a substantial increase in the number of loop iterations
> during boot-up when NUMA is enabled:
> 
> No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> takes around 224 ms total for all the CPUs in the system under test.
> 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> takes around 4.5 seconds total for all the CPUs in the system under test.

Did you measure the overall before-and-after times?  IOW, how much of
that 4.5s do we reclaim?

> Calling this for each CPU is expensive when there are large number
> of CPUs along with multiple NUMAs. Fix this by deferring
> refresh_zone_stat_thresholds to be called later at once when all the
> secondary CPUs are up. Also, register the DYN hooks to keep the
> existing hotplug functionality intact.
> 

Seems risky - we'll now have online CPUs which have uninitialized data,
yes?  What assurance do we have that this data won't be accessed?

Another approach might be to make the code a bit smarter - instead of
calculating thresholds for the whole world, we could make incremental
changes to the existing thresholds on behalf of the new resource which
just became available?



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-07-05 20:59 ` Andrew Morton
@ 2024-07-09  4:57   ` Saurabh Singh Sengar
  2024-07-09 21:45     ` Andrew Morton
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Saurabh Singh Sengar @ 2024-07-09  4:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, ssengar, wei.liu

On Fri, Jul 05, 2024 at 01:59:11PM -0700, Andrew Morton wrote:
> On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@linux.microsoft.com> wrote:
> 
> > refresh_zone_stat_thresholds function has two loops which is expensive for
> > higher number of CPUs and NUMA nodes.
> > 
> > Below is the rough estimation of total iterations done by these loops
> > based on number of NUMA and CPUs.
> > 
> > Total number of iterations: nCPU * 2 * Numa * mCPU
> > Where:
> >  nCPU = total number of CPUs
> >  Numa = total number of NUMA nodes
> >  mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
> > 
> > For the system under test with 16 NUMA nodes and 1024 CPUs, this
> > results in a substantial increase in the number of loop iterations
> > during boot-up when NUMA is enabled:
> > 
> > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > takes around 224 ms total for all the CPUs in the system under test.
> > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > takes around 4.5 seconds total for all the CPUs in the system under test.
> 
> Did you measure the overall before-and-after times?  IOW, how much of
> that 4.5s do we reclaim?

This entire gain is reflected in the overall boot time. Most of the Linux
kernel boot process is sequential and doesn't take advantage of SMP.

> 
> > Calling this for each CPU is expensive when there are large number
> > of CPUs along with multiple NUMAs. Fix this by deferring
> > refresh_zone_stat_thresholds to be called later at once when all the
> > secondary CPUs are up. Also, register the DYN hooks to keep the
> > existing hotplug functionality intact.
> > 
> 
> Seems risky - we'll now have online CPUs which have unintialized data,
> yes?  What assurance do we have that this data won't be accessed?

I understand that this data is only accessed by userspace tools, and they can
only access it after late_initcall. Please let me know if there are any other
cases; I will look into addressing them.

> 
> Another approach might be to make the code a bit smarter - instead of
> calculating thresholds for the whole world, we make incremental changes
> to the existing thresholds on behalf of the new resource which just
> became available?

I agree, and I have spent a good amount of time understanding the calculation,
but I couldn't find any obvious way to express everything it does incrementally.

I would be happy to assist if you have any suggestions on how to do this.

- Saurabh


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-07-09  4:57   ` Saurabh Singh Sengar
@ 2024-07-09 21:45     ` Andrew Morton
  2024-08-10  7:04     ` Andrew Morton
  2024-08-23 15:32     ` Christoph Lameter (Ampere)
  2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2024-07-09 21:45 UTC (permalink / raw)
  To: Saurabh Singh Sengar; +Cc: linux-mm, linux-kernel, ssengar, wei.liu

On Mon, 8 Jul 2024 21:57:50 -0700 Saurabh Singh Sengar <ssengar@linux.microsoft.com> wrote:

> this data is only accessed by userspace tools, and they can
> only access it post late_initcall

OK, thanks.  We're at -rc7 now, so I'll queue this for the
post-6.11-rc1 flood.  That way it should get plenty of testing.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-07-05  8:48 [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup Saurabh Sengar
  2024-07-05 20:59 ` Andrew Morton
@ 2024-08-09  5:20 ` Andrew Morton
  2024-08-09  5:49   ` Saurabh Singh Sengar
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2024-08-09  5:20 UTC (permalink / raw)
  To: Saurabh Sengar; +Cc: linux-mm, linux-kernel, ssengar, wei.liu

On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@linux.microsoft.com> wrote:

> refresh_zone_stat_thresholds function has two loops which is expensive for
> higher number of CPUs and NUMA nodes.
> 
> Below is the rough estimation of total iterations done by these loops
> based on number of NUMA and CPUs.
> 
> Total number of iterations: nCPU * 2 * Numa * mCPU
> Where:
>  nCPU = total number of CPUs
>  Numa = total number of NUMA nodes
>  mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
> 
> For the system under test with 16 NUMA nodes and 1024 CPUs, this
> results in a substantial increase in the number of loop iterations
> during boot-up when NUMA is enabled:
> 
> No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> takes around 224 ms total for all the CPUs in the system under test.
> 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> takes around 4.5 seconds total for all the CPUs in the system under test.
> 
> Calling this for each CPU is expensive when there are large number
> of CPUs along with multiple NUMAs. Fix this by deferring
> refresh_zone_stat_thresholds to be called later at once when all the
> secondary CPUs are up. Also, register the DYN hooks to keep the
> existing hotplug functionality intact.
>
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -31,6 +31,7 @@
>  
>  #include "internal.h"
>  
> +static int vmstat_late_init_done;
>  #ifdef CONFIG_NUMA
>  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>  
> @@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
>  
>  static int vmstat_cpu_online(unsigned int cpu)
>  {
> -	refresh_zone_stat_thresholds();
> +	if (vmstat_late_init_done)
> +		refresh_zone_stat_thresholds();
>  
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> @@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
>  	return 0;
>  }
>  
> +static int __init vmstat_late_init(void)
> +{
> +	refresh_zone_stat_thresholds();
> +	vmstat_late_init_done = 1;
> +
> +	return 0;
> +}
> +late_initcall(vmstat_late_init);

OK, so what's happening here.  Once all CPUs are online and running
around doing heaven knows what, one of the CPUs sets up everyone's
thresholds.  So for a period, all the other CPUs are running with
inappropriate threshold values.

So what are all the other CPUs doing at this point in time, and why is
it safe to leave their thresholds in an inappropriate state while they
are doing it?



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-08-09  5:20 ` Andrew Morton
@ 2024-08-09  5:49   ` Saurabh Singh Sengar
  0 siblings, 0 replies; 12+ messages in thread
From: Saurabh Singh Sengar @ 2024-08-09  5:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, ssengar, wei.liu

On Thu, Aug 08, 2024 at 10:20:06PM -0700, Andrew Morton wrote:
> On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@linux.microsoft.com> wrote:
> 
> > refresh_zone_stat_thresholds function has two loops which is expensive for
> > higher number of CPUs and NUMA nodes.
> > 
> > Below is the rough estimation of total iterations done by these loops
> > based on number of NUMA and CPUs.
> > 
> > Total number of iterations: nCPU * 2 * Numa * mCPU
> > Where:
> >  nCPU = total number of CPUs
> >  Numa = total number of NUMA nodes
> >  mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
> > 
> > For the system under test with 16 NUMA nodes and 1024 CPUs, this
> > results in a substantial increase in the number of loop iterations
> > during boot-up when NUMA is enabled:
> > 
> > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > takes around 224 ms total for all the CPUs in the system under test.
> > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > takes around 4.5 seconds total for all the CPUs in the system under test.
> > 
> > Calling this for each CPU is expensive when there are large number
> > of CPUs along with multiple NUMAs. Fix this by deferring
> > refresh_zone_stat_thresholds to be called later at once when all the
> > secondary CPUs are up. Also, register the DYN hooks to keep the
> > existing hotplug functionality intact.
> >
> > ...
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -31,6 +31,7 @@
> >  
> >  #include "internal.h"
> >  
> > +static int vmstat_late_init_done;
> >  #ifdef CONFIG_NUMA
> >  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
> >  
> > @@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
> >  
> >  static int vmstat_cpu_online(unsigned int cpu)
> >  {
> > -	refresh_zone_stat_thresholds();
> > +	if (vmstat_late_init_done)
> > +		refresh_zone_stat_thresholds();
> >  
> >  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
> >  		node_set_state(cpu_to_node(cpu), N_CPU);
> > @@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
> >  	return 0;
> >  }
> >  
> > +static int __init vmstat_late_init(void)
> > +{
> > +	refresh_zone_stat_thresholds();
> > +	vmstat_late_init_done = 1;
> > +
> > +	return 0;
> > +}
> > +late_initcall(vmstat_late_init);
> 
> OK, so what's happening here.  Once all CPUs are online and running
> around doing heaven knows what, one of the CPUs sets up everyone's
> thresholds.  So for a period, all the other CPUs are running with
> inappropriate threshold values.
> 
> So what are all the other CPUs doing at this point in time, and why is
> it safe to leave their thresholds in an inappropriate state while they
> are doing it?

From what I understand, these threshold values are primarily used by
userspace tools, and this data becomes useful only after late_initcall.

If there’s a more effective approach to handle this, please let me know,
and I can investigate further.

- Saurabh


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-07-09  4:57   ` Saurabh Singh Sengar
  2024-07-09 21:45     ` Andrew Morton
@ 2024-08-10  7:04     ` Andrew Morton
  2024-08-12  4:37       ` Saurabh Singh Sengar
  2024-08-23 15:32     ` Christoph Lameter (Ampere)
  2 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2024-08-10  7:04 UTC (permalink / raw)
  To: Saurabh Singh Sengar; +Cc: linux-mm, linux-kernel, ssengar, wei.liu

On Mon, 8 Jul 2024 21:57:50 -0700 Saurabh Singh Sengar <ssengar@linux.microsoft.com> wrote:

> > > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > > takes around 224 ms total for all the CPUs in the system under test.
> > > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > > takes around 4.5 seconds total for all the CPUs in the system under test.
> > 
> > Did you measure the overall before-and-after times?  IOW, how much of
> > that 4.5s do we reclaim?
> 
> This entire gain is accounted in over all boot processi time. Most of the Linux
> kernel boot process is sequential and doesn't take advantage of SMP.

Again, if you were able to measure 4.5s without the patch then you are
able to measure how long this delay is with the patch.  Please share
that number.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-08-10  7:04     ` Andrew Morton
@ 2024-08-12  4:37       ` Saurabh Singh Sengar
  2024-08-13 23:37         ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Saurabh Singh Sengar @ 2024-08-12  4:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, ssengar, wei.liu

On Sat, Aug 10, 2024 at 12:04:04AM -0700, Andrew Morton wrote:
> On Mon, 8 Jul 2024 21:57:50 -0700 Saurabh Singh Sengar <ssengar@linux.microsoft.com> wrote:
> 
> > > > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > > > takes around 224 ms total for all the CPUs in the system under test.
> > > > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > > > takes around 4.5 seconds total for all the CPUs in the system under test.
> > > 
> > > Did you measure the overall before-and-after times?  IOW, how much of
> > > that 4.5s do we reclaim?
> > 
> > This entire gain is accounted in over all boot processi time. Most of the Linux
> > kernel boot process is sequential and doesn't take advantage of SMP.
> 
> Again, if you were able to measure 4.5s without the patch then you are
> able to measure how long this delay is with the patch.  Please share
> that number.

If I understand your question correctly, you're asking about the total time taken by
refresh_zone_stat_thresholds across all of its executions, before and after the fix.

Without this patch, refresh_zone_stat_thresholds was being called 1024 times.
After applying the patch, it is called only once, which is the same as the last
iteration of the earlier 1024 calls. In further testing with this patch, I observed
a 4.5-second improvement in the overall boot time due to this fix, which is the
same as the total time taken by refresh_zone_stat_thresholds without this patch,
leading me to reasonably conclude that refresh_zone_stat_thresholds now takes a
negligible amount of time (likely just a few milliseconds). If you would like
precise timing details for a single refresh_zone_stat_thresholds execution, please
let me know, and I can run the tests again and provide the results in a few days.

- Saurabh


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-08-12  4:37       ` Saurabh Singh Sengar
@ 2024-08-13 23:37         ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2024-08-13 23:37 UTC (permalink / raw)
  To: Saurabh Singh Sengar; +Cc: linux-mm, linux-kernel, ssengar, wei.liu

On Sun, 11 Aug 2024 21:37:54 -0700 Saurabh Singh Sengar <ssengar@linux.microsoft.com> wrote:

> Without this patch, refresh_zone_stat_threshold was being called 1024 times.
> After applying the patch, it is called only once, which is same as the last
> iteration of the earlier 1024 calls. Further testing with this patch, I observed
> a 4.5-second improvement in the overall boot timing due to this fix, which is
> same as the total time taken by refresh_zone_stat_thresholds without thie patch,
> leading me to reasonably conclude that refresh_zone_stat_threshold now takes a
> negligible amount of time (likely just a few milliseconds).

Thanks, nice.  I pasted this into the changelog.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-07-09  4:57   ` Saurabh Singh Sengar
  2024-07-09 21:45     ` Andrew Morton
  2024-08-10  7:04     ` Andrew Morton
@ 2024-08-23 15:32     ` Christoph Lameter (Ampere)
  2024-08-28  5:37       ` Saurabh Singh Sengar
  2 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-08-23 15:32 UTC (permalink / raw)
  To: Saurabh Singh Sengar
  Cc: Andrew Morton, linux-mm, linux-kernel, ssengar, wei.liu

On Mon, 8 Jul 2024, Saurabh Singh Sengar wrote:

> > > Calling this for each CPU is expensive when there are large number
> > > of CPUs along with multiple NUMAs. Fix this by deferring
> > > refresh_zone_stat_thresholds to be called later at once when all the
> > > secondary CPUs are up. Also, register the DYN hooks to keep the
> > > existing hotplug functionality intact.
> > >
> >
> > Seems risky - we'll now have online CPUs which have unintialized data,
> > yes?  What assurance do we have that this data won't be accessed?
>
> I understand that this data is only accessed by userspace tools, and they can
> only access it post late_initcall. Please let me know if there are any other
> cases, I will look to address them.

stat_threshold is used in all statistics functions that modify VM
counters. It is core to the functioning of the VM statistics.

However, if the threshold is zero (not initialized), the VM counter
handling will simply be less effective because it will no longer do the
per-CPU counter diffs. This may increase contention and eat up the benefit
you are getting from deferring the calculation of the thresholds.
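
To make the batching described above concrete, here is a minimal userspace
model of the idea (invented names such as pcpu_counter and mod_counter; it
does not reproduce the real __mod_zone_page_state() code, only the
diff-vs-threshold mechanism):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct pcpu_counter {
	long diff;		/* CPU-local pending delta */
	long stat_threshold;	/* 0 until thresholds are refreshed */
};

/* Fold the per-CPU diff into the shared counter only once it exceeds
 * the threshold; with a threshold of 0 every update hits the shared
 * (contended) counter. */
static void mod_counter(struct pcpu_counter *pc, atomic_long *global, long delta)
{
	long x = pc->diff + delta;

	if (labs(x) > pc->stat_threshold) {
		atomic_fetch_add(global, x);	/* expensive shared update */
		x = 0;
	}
	pc->diff = x;				/* cheap, stays per-CPU */
}

int main(void)
{
	atomic_long global = 0;
	struct pcpu_counter zeroed = { 0, 0 }, tuned = { 0, 125 };

	for (int i = 0; i < 1000; i++) {
		mod_counter(&zeroed, &global, 1);	/* 1000 shared updates */
		mod_counter(&tuned, &global, 1);	/* only a handful of shared updates */
	}
	printf("global = %ld\n", atomic_load(&global) + tuned.diff);
	return 0;
}

With a zero (uninitialized) threshold every update lands on the shared
counter, which is the contention concern; with a tuned threshold most
updates stay CPU-local.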

What may be more promising is to make it possible to calculate the
thresholds per CPU instead of recalculating the thresholds for every zone
again and again.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-08-23 15:32     ` Christoph Lameter (Ampere)
@ 2024-08-28  5:37       ` Saurabh Singh Sengar
  2024-08-28 15:43         ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 12+ messages in thread
From: Saurabh Singh Sengar @ 2024-08-28  5:37 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Andrew Morton, linux-mm, linux-kernel, ssengar, wei.liu

On Fri, Aug 23, 2024 at 08:32:43AM -0700, Christoph Lameter (Ampere) wrote:
> On Mon, 8 Jul 2024, Saurabh Singh Sengar wrote:
> 
> > > > Calling this for each CPU is expensive when there are large number
> > > > of CPUs along with multiple NUMAs. Fix this by deferring
> > > > refresh_zone_stat_thresholds to be called later at once when all the
> > > > secondary CPUs are up. Also, register the DYN hooks to keep the
> > > > existing hotplug functionality intact.
> > > >
> > >
> > > Seems risky - we'll now have online CPUs which have unintialized data,
> > > yes?  What assurance do we have that this data won't be accessed?
> >
> > I understand that this data is only accessed by userspace tools, and they can
> > only access it post late_initcall. Please let me know if there are any other
> > cases, I will look to address them.
> 
> stat_threshold is used in all statistics functions that modify VM
> counters. It is core to the functioning of the VM statistics.
> 
> However, if the threshold is zero (not initialized) then the VM counter
> handling will simply be less effective because it will not do the per cpu
> counter diffs anymore. This may increase contention and eat up the benefit
> you are getting from deferring the calculation of the threshholds.

Christoph,

Thank you for your review. I would like to gain a better understanding of how
to measure contention. Could you please inform me if there is any recommended
method for doing so?

In my testing, this patch has resulted in a significant improvement in boot time.

> 
> What may be more promising is to make it possible to calculate the
> threshholds per cpu instead of recalculating the thresholds for every zone
> again and again.

I am happy to explore alternatives; can you please share more details about
this approach? Are you referring to avoiding the repetition of the calculation
below?

mem = zone_managed_pages(zone) >> (27 - PAGE_SHIFT);

- Saurabh


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup
  2024-08-28  5:37       ` Saurabh Singh Sengar
@ 2024-08-28 15:43         ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-08-28 15:43 UTC (permalink / raw)
  To: Saurabh Singh Sengar
  Cc: Andrew Morton, linux-mm, linux-kernel, ssengar, wei.liu

On Tue, 27 Aug 2024, Saurabh Singh Sengar wrote:

> Thank you for your review. I would like to gain a better understanding of how
> to measure contention. Could you please inform me if there is any recommended
> method for doing so?
>
> In my testing, this patch has resulted in a significant improvement in boot time.

Good.

There was a question from Andrew as to what happens if the
threshold is not initialized. I have no specific benchmarks to measure the
effect.

> > What may be more promising is to make it possible to calculate the
> > threshholds per cpu instead of recalculating the thresholds for every zone
> > again and again.
>
> I am happy to explore alternatives, can you please share more details around
> this approach. Are you referring to avoiding the repetition of the calculation
> below?
>
> mem = zone_managed_pages(zone) >> (27 - PAGE_SHIFT);

The thresholds vary with the zone setup in a NUMA node. So one approach
would be to calculate these values for each node once, and only
update them when memory is brought online or offlined.

Then the parameters for each per-CPU pcp could be set from the per-node /
zone information when a CPU is brought up. The overhead would be minimal
and there would be no need for the loops.
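
A rough standalone sketch of this idea (hypothetical; node_threshold,
cpu_online() and the CPU-to-node mapping below are stand-ins for
illustration, not the kernel's actual data structures):

#include <stdio.h>

#define NR_NODES 16
#define NR_CPUS  1024

static int node_threshold[NR_NODES];	/* recomputed on memory hotplug only */
static int cpu_threshold[NR_CPUS];

static void update_node_thresholds(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		node_threshold[nid] = 125;	/* stand-in for the real per-zone calculation */
}

static void cpu_online(int cpu)
{
	int nid = cpu % NR_NODES;		/* stand-in for cpu_to_node() */

	cpu_threshold[cpu] = node_threshold[nid];	/* O(1), no nested loops */
}

int main(void)
{
	update_node_thresholds();		/* once, not once per onlined CPU */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_online(cpu);
	printf("cpu 1023 threshold = %d\n", cpu_threshold[1023]);
	return 0;
}

Bringing up each CPU then becomes a constant-time copy, and the heavy
recalculation runs only when a node's memory configuration changes.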

My company has machines with a similar number of CPUs, and it would be great
to have a clean solution here.



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-08-28 15:44 UTC | newest]

Thread overview: 12+ messages
2024-07-05  8:48 [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup Saurabh Sengar
2024-07-05 20:59 ` Andrew Morton
2024-07-09  4:57   ` Saurabh Singh Sengar
2024-07-09 21:45     ` Andrew Morton
2024-08-10  7:04     ` Andrew Morton
2024-08-12  4:37       ` Saurabh Singh Sengar
2024-08-13 23:37         ` Andrew Morton
2024-08-23 15:32     ` Christoph Lameter (Ampere)
2024-08-28  5:37       ` Saurabh Singh Sengar
2024-08-28 15:43         ` Christoph Lameter (Ampere)
2024-08-09  5:20 ` Andrew Morton
2024-08-09  5:49   ` Saurabh Singh Sengar
