[PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
@ 2026-04-01 13:57 Breno Leitao
  2026-04-01 14:25 ` Johannes Weiner
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Breno Leitao @ 2026-04-01 13:57 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, kas, shakeel.butt, usama.arif,
	kernel-team, Breno Leitao

vmstat_update uses round_jiffies_relative() when re-queuing itself,
which aligns all CPUs' timers to the same second boundary.  When many
CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
free_pcppages_bulk() simultaneously, serializing on zone->lock and
hitting contention.

Introduce vmstat_spread_delay() which distributes each CPU's
vmstat_update evenly across the stat interval instead of aligning them.

This does not increase the number of timer interrupts — each CPU still
fires once per interval. The timers are simply staggered rather than
aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
wake idle CPUs regardless of scheduling; the spread only affects CPUs
that are already active

`perf lock contention` shows 7.5x reduction in zone->lock contention
(872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
system under memory pressure.

Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
memory allocation bursts.  Lock contention was measured with:

  perf lock contention -a -b -S free_pcppages_bulk

Results with KASAN enabled:

  free_pcppages_bulk contention (KASAN):
  +--------------+----------+----------+
  | Metric       | No fix   | With fix |
  +--------------+----------+----------+
  | Contentions  |      872 |      117 |
  | Total wait   | 199.43ms | 80.76ms  |
  | Max wait     |   4.19ms | 35.76ms  |
  +--------------+----------+----------+

Results without KASAN:

  free_pcppages_bulk contention (no KASAN):
  +--------------+----------+----------+
  | Metric       | No fix   | With fix |
  +--------------+----------+----------+
  | Contentions  |      240 |      133 |
  | Total wait   |  34.01ms | 24.61ms  |
  | Max wait     |   965us  |  1.35ms  |
  +--------------+----------+----------+

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/vmstat.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2370c6fb1fcd..2e94bd765606 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
 }
 #endif /* CONFIG_PROC_FS */
 
+/*
+ * Return a per-cpu delay that spreads vmstat_update work across the stat
+ * interval.  Without this, round_jiffies_relative() aligns every CPU's
+ * timer to the same second boundary, causing a thundering-herd on
+ * zone->lock when multiple CPUs drain PCP pages simultaneously via
+ * decay_pcp_high() -> free_pcppages_bulk().
+ */
+static unsigned long vmstat_spread_delay(void)
+{
+	unsigned long interval = sysctl_stat_interval;
+	unsigned int nr_cpus = num_online_cpus();
+
+	if (nr_cpus <= 1)
+		return round_jiffies_relative(interval);
+
+	/*
+	 * Spread per-cpu vmstat work evenly across the interval.  Don't
+	 * use round_jiffies_relative() here -- it would snap every CPU
+	 * back to the same second boundary, defeating the spread.
+	 */
+	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
+}
+
 static void vmstat_update(struct work_struct *w)
 {
 	if (refresh_cpu_vm_stats(true)) {
@@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
 		 */
 		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
 				this_cpu_ptr(&vmstat_work),
-				round_jiffies_relative(sysctl_stat_interval));
+				vmstat_spread_delay());
 	}
 }
 

---
base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
change-id: 20260401-vmstat-048e0feaf344

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
@ 2026-04-01 14:25 ` Johannes Weiner
  2026-04-01 14:39   ` Breno Leitao
  2026-04-01 14:47 ` Breno Leitao
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2026-04-01 14:25 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, usama.arif, kernel-team

On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
> 
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Nice!

> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;

smp_processor_id() <= nr_cpus, so

	return interval + interval*cpu/nr_cpus

should be equivalent, no?

Other than that,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 14:25 ` Johannes Weiner
@ 2026-04-01 14:39   ` Breno Leitao
  2026-04-01 14:57     ` Johannes Weiner
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2026-04-01 14:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, usama.arif, kernel-team

Hello Johannes,

On Wed, Apr 01, 2026 at 10:25:35AM -0400, Johannes Weiner wrote:
> On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > +	unsigned long interval = sysctl_stat_interval;
> > +	unsigned int nr_cpus = num_online_cpus();
> > +
> > +	if (nr_cpus <= 1)
> > +		return round_jiffies_relative(interval);
> > +
> > +	/*
> > +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > +	 * back to the same second boundary, defeating the spread.
> > +	 */
> > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> 
> smp_processor_id() <= nr_cpus, so
> 
> 	return interval + interval*cpu/nr_cpus
> 
> should be equivalent, no?

nr_cpus is the number of online CPUs, while smp_processor_id() is the
CPU id.

If you offline a CPU, then smp_processor_id() might be bigger than
num_online_cpus()

My goal was to linearly shift the timer and avoid creating gaps when
removing certain CPUs.

Thanks for the review,
--breno


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
  2026-04-01 14:25 ` Johannes Weiner
@ 2026-04-01 14:47 ` Breno Leitao
  2026-04-01 15:01 ` Kiryl Shutsemau
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2026-04-01 14:47 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, kas, shakeel.butt, usama.arif,
	kernel-team

On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+

Sorry, the Max wait time is inverted on both cases.

  free_pcppages_bulk contention (KASAN):
  +--------------+----------+----------+
  | Metric       | No fix   | With fix |
  +--------------+----------+----------+
  | Contentions  |      872 |      117 |
  | Total wait   | 199.43ms | 80.76ms  |
  | Max wait     |  35.76ms | 4.19ms   |
  +--------------+----------+----------+

Results without KASAN:

  free_pcppages_bulk contention (no KASAN):
  +--------------+----------+----------+
  | Metric       | No fix   | With fix |
  +--------------+----------+----------+
  | Contentions  |      240 |      133 |
  | Total wait   |  34.01ms | 24.61ms  |
  | Max wait     |   1.35ms |   965us  |
  +--------------+----------+----------+

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 14:39   ` Breno Leitao
@ 2026-04-01 14:57     ` Johannes Weiner
  0 siblings, 0 replies; 17+ messages in thread
From: Johannes Weiner @ 2026-04-01 14:57 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, usama.arif, kernel-team

On Wed, Apr 01, 2026 at 07:39:28AM -0700, Breno Leitao wrote:
> Hello Johannes,
> 
> On Wed, Apr 01, 2026 at 10:25:35AM -0400, Johannes Weiner wrote:
> > On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> > > +static unsigned long vmstat_spread_delay(void)
> > > +{
> > > +	unsigned long interval = sysctl_stat_interval;
> > > +	unsigned int nr_cpus = num_online_cpus();
> > > +
> > > +	if (nr_cpus <= 1)
> > > +		return round_jiffies_relative(interval);
> > > +
> > > +	/*
> > > +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> > > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > > +	 * back to the same second boundary, defeating the spread.
> > > +	 */
> > > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> > 
> > smp_processor_id() <= nr_cpus, so
> > 
> > 	return interval + interval*cpu/nr_cpus
> > 
> > should be equivalent, no?
> 
> nr_cpus is the number of online CPUs, while smp_processor_id() is the
> CPU id.
> 
> If you offline a CPU, then smp_processor_id() might be bigger than
> num_online_cpus()
> 
> My goal was to linearly shift the timer and avoid creating gaps when
> removing certain CPUs.

Ah makes sense. Plus you'd spill into the next interval otherwise.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
  2026-04-01 14:25 ` Johannes Weiner
  2026-04-01 14:47 ` Breno Leitao
@ 2026-04-01 15:01 ` Kiryl Shutsemau
  2026-04-01 15:23 ` Usama Arif
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Kiryl Shutsemau @ 2026-04-01 15:01 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
	shakeel.butt, usama.arif, kernel-team

On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.

Nice idea.

> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.

Wow. That's huge improvement.

> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
                   ` (2 preceding siblings ...)
  2026-04-01 15:01 ` Kiryl Shutsemau
@ 2026-04-01 15:23 ` Usama Arif
  2026-04-01 15:43   ` Breno Leitao
  2026-04-01 17:46 ` Vlastimil Babka (SUSE)
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Usama Arif @ 2026-04-01 15:23 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Usama Arif, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, kernel-team

On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:

> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
> 
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> +}
> +
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				vmstat_spread_delay());

This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?

vmstat_shepherd() still queues work with delay 0 on all CPUs that
need_update() in its for_each_online_cpu() loop:

      if (!delayed_work_pending(dw) && need_update(cpu))
          queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
simultaneously.

Under sustained memory pressure on a large system, I think the shepherd
fires every sysctl_stat_interval and could re-trigger the same lock
contention?
 
>  	}
>  }
>  
> 
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 15:23 ` Usama Arif
@ 2026-04-01 15:43   ` Breno Leitao
  2026-04-01 15:50     ` Usama Arif
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2026-04-01 15:43 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, kernel-team

On Wed, Apr 01, 2026 at 08:23:40AM -0700, Usama Arif wrote:
> On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:
>
> > vmstat_update uses round_jiffies_relative() when re-queuing itself,
> > which aligns all CPUs' timers to the same second boundary.  When many
> > CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> > free_pcppages_bulk() simultaneously, serializing on zone->lock and
> > hitting contention.
> >
> > Introduce vmstat_spread_delay() which distributes each CPU's
> > vmstat_update evenly across the stat interval instead of aligning them.
> >
> > This does not increase the number of timer interrupts — each CPU still
> > fires once per interval. The timers are simply staggered rather than
> > aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> > wake idle CPUs regardless of scheduling; the spread only affects CPUs
> > that are already active
> >
> > `perf lock contention` shows 7.5x reduction in zone->lock contention
> > (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> > system under memory pressure.
> >
> > Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> > memory allocation bursts.  Lock contention was measured with:
> >
> >   perf lock contention -a -b -S free_pcppages_bulk
> >
> > Results with KASAN enabled:
> >
> >   free_pcppages_bulk contention (KASAN):
> >   +--------------+----------+----------+
> >   | Metric       | No fix   | With fix |
> >   +--------------+----------+----------+
> >   | Contentions  |      872 |      117 |
> >   | Total wait   | 199.43ms | 80.76ms  |
> >   | Max wait     |   4.19ms | 35.76ms  |
> >   +--------------+----------+----------+
> >
> > Results without KASAN:
> >
> >   free_pcppages_bulk contention (no KASAN):
> >   +--------------+----------+----------+
> >   | Metric       | No fix   | With fix |
> >   +--------------+----------+----------+
> >   | Contentions  |      240 |      133 |
> >   | Total wait   |  34.01ms | 24.61ms  |
> >   | Max wait     |   965us  |  1.35ms  |
> >   +--------------+----------+----------+
> >
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> >  mm/vmstat.c | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 2370c6fb1fcd..2e94bd765606 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> >  }
> >  #endif /* CONFIG_PROC_FS */
> >
> > +/*
> > + * Return a per-cpu delay that spreads vmstat_update work across the stat
> > + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> > + * timer to the same second boundary, causing a thundering-herd on
> > + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> > + * decay_pcp_high() -> free_pcppages_bulk().
> > + */
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > +	unsigned long interval = sysctl_stat_interval;
> > +	unsigned int nr_cpus = num_online_cpus();
> > +
> > +	if (nr_cpus <= 1)
> > +		return round_jiffies_relative(interval);
> > +
> > +	/*
> > +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > +	 * back to the same second boundary, defeating the spread.
> > +	 */
> > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> > +}
> > +
> >  static void vmstat_update(struct work_struct *w)
> >  {
> >  	if (refresh_cpu_vm_stats(true)) {
> > @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
> >  		 */
> >  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> >  				this_cpu_ptr(&vmstat_work),
> > -				round_jiffies_relative(sysctl_stat_interval));
> > +				vmstat_spread_delay());
>
> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
>
> vmstat_shepherd() still queues work with delay 0 on all CPUs that
> need_update() in its for_each_online_cpu() loop:
>
>       if (!delayed_work_pending(dw) && need_update(cpu))
>           queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
> simultaneously.
>
> Under sustained memory pressure on a large system, I think the shepherd
> fires every sysctl_stat_interval and could re-trigger the same lock
> contention?

Good point - incorporating similar spreading logic in vmstat_shepherd()
would indeed address the simultaneous queueing issue you've described.

Should I include this in a v2 of this patch, or would you prefer it as
a separate follow-up patch?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 15:43   ` Breno Leitao
@ 2026-04-01 15:50     ` Usama Arif
  2026-04-01 15:52       ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Usama Arif @ 2026-04-01 15:50 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, kernel-team



On 01/04/2026 18:43, Breno Leitao wrote:
> On Wed, Apr 01, 2026 at 08:23:40AM -0700, Usama Arif wrote:
>> On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:
>>
>>> vmstat_update uses round_jiffies_relative() when re-queuing itself,
>>> which aligns all CPUs' timers to the same second boundary.  When many
>>> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
>>> free_pcppages_bulk() simultaneously, serializing on zone->lock and
>>> hitting contention.
>>>
>>> Introduce vmstat_spread_delay() which distributes each CPU's
>>> vmstat_update evenly across the stat interval instead of aligning them.
>>>
>>> This does not increase the number of timer interrupts — each CPU still
>>> fires once per interval. The timers are simply staggered rather than
>>> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
>>> wake idle CPUs regardless of scheduling; the spread only affects CPUs
>>> that are already active
>>>
>>> `perf lock contention` shows 7.5x reduction in zone->lock contention
>>> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
>>> system under memory pressure.
>>>
>>> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
>>> memory allocation bursts.  Lock contention was measured with:
>>>
>>>   perf lock contention -a -b -S free_pcppages_bulk
>>>
>>> Results with KASAN enabled:
>>>
>>>   free_pcppages_bulk contention (KASAN):
>>>   +--------------+----------+----------+
>>>   | Metric       | No fix   | With fix |
>>>   +--------------+----------+----------+
>>>   | Contentions  |      872 |      117 |
>>>   | Total wait   | 199.43ms | 80.76ms  |
>>>   | Max wait     |   4.19ms | 35.76ms  |
>>>   +--------------+----------+----------+
>>>
>>> Results without KASAN:
>>>
>>>   free_pcppages_bulk contention (no KASAN):
>>>   +--------------+----------+----------+
>>>   | Metric       | No fix   | With fix |
>>>   +--------------+----------+----------+
>>>   | Contentions  |      240 |      133 |
>>>   | Total wait   |  34.01ms | 24.61ms  |
>>>   | Max wait     |   965us  |  1.35ms  |
>>>   +--------------+----------+----------+
>>>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 2370c6fb1fcd..2e94bd765606 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>>>  }
>>>  #endif /* CONFIG_PROC_FS */
>>>
>>> +/*
>>> + * Return a per-cpu delay that spreads vmstat_update work across the stat
>>> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
>>> + * timer to the same second boundary, causing a thundering-herd on
>>> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
>>> + * decay_pcp_high() -> free_pcppages_bulk().
>>> + */
>>> +static unsigned long vmstat_spread_delay(void)
>>> +{
>>> +	unsigned long interval = sysctl_stat_interval;
>>> +	unsigned int nr_cpus = num_online_cpus();
>>> +
>>> +	if (nr_cpus <= 1)
>>> +		return round_jiffies_relative(interval);
>>> +
>>> +	/*
>>> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
>>> +	 * use round_jiffies_relative() here -- it would snap every CPU
>>> +	 * back to the same second boundary, defeating the spread.
>>> +	 */
>>> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
>>> +}
>>> +
>>>  static void vmstat_update(struct work_struct *w)
>>>  {
>>>  	if (refresh_cpu_vm_stats(true)) {
>>> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>>>  		 */
>>>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>>>  				this_cpu_ptr(&vmstat_work),
>>> -				round_jiffies_relative(sysctl_stat_interval));
>>> +				vmstat_spread_delay());
>>
>> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
>>
>> vmstat_shepherd() still queues work with delay 0 on all CPUs that
>> need_update() in its for_each_online_cpu() loop:
>>
>>       if (!delayed_work_pending(dw) && need_update(cpu))
>>           queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>>
>> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
>> simultaneously.
>>
>> Under sustained memory pressure on a large system, I think the shepherd
>> fires every sysctl_stat_interval and could re-trigger the same lock
>> contention?
> 
> Good point - incorporating similar spreading logic in vmstat_shepherd()
> would indeed address the simultaneous queueing issue you've described.
> 
> Should I include this in a v2 of this patch, or would you prefer it as
> a separate follow-up patch?

I think it can be a separate follow-up patch, but no strong preference.
For this patch:

Acked-by: Usama Arif <usama.arif@linux.dev>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 15:50     ` Usama Arif
@ 2026-04-01 15:52       ` Breno Leitao
  0 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2026-04-01 15:52 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, kernel-team

On Wed, Apr 01, 2026 at 04:50:03PM +0100, Usama Arif wrote:
> >>
> >> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
> >>
> >> vmstat_shepherd() still queues work with delay 0 on all CPUs that
> >> need_update() in its for_each_online_cpu() loop:
> >>
> >>       if (!delayed_work_pending(dw) && need_update(cpu))
> >>           queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> >>
> >> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
> >> simultaneously.
> >>
> >> Under sustained memory pressure on a large system, I think the shepherd
> >> fires every sysctl_stat_interval and could re-trigger the same lock
> >> contention?
> > 
> > Good point - incorporating similar spreading logic in vmstat_shepherd()
> > would indeed address the simultaneous queueing issue you've described.
> > 
> > Should I include this in a v2 of this patch, or would you prefer it as
> > a separate follow-up patch?
> 
> I think it can be a separate follow-up patch, but no strong preference.

Thanks!

I will send a follow-up patch soon.
--breno

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
                   ` (3 preceding siblings ...)
  2026-04-01 15:23 ` Usama Arif
@ 2026-04-01 17:46 ` Vlastimil Babka (SUSE)
  2026-04-02 12:40   ` Vlastimil Babka (SUSE)
  2026-04-02 12:43   ` Dmitry Ilvokhin
  2026-04-02  7:18 ` Michal Hocko
  2026-04-02 12:49 ` Matthew Wilcox
  6 siblings, 2 replies; 17+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-04-01 17:46 UTC (permalink / raw)
  To: Breno Leitao, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, kas, shakeel.butt, usama.arif,
	kernel-team

On 4/1/26 15:57, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
> 
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Cool!

I noticed __round_jiffies_relative() exists and the description looks like
it's meant for exactly this use case?

> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;

Hm doesn't this mean that lower id cpus will consistently fire in shorter
intervals and higher id in longer intervals? What we want is same interval
but differently offset, no?

> +}
> +
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				vmstat_spread_delay());
>  	}
>  }
>  
> 
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
                   ` (4 preceding siblings ...)
  2026-04-01 17:46 ` Vlastimil Babka (SUSE)
@ 2026-04-02  7:18 ` Michal Hocko
  2026-04-02 12:49 ` Matthew Wilcox
  6 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2026-04-02  7:18 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, linux-mm, linux-kernel, kas, shakeel.butt,
	usama.arif, kernel-team

On Wed 01-04-26 06:57:50, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
> 
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Makes sense
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> +}
> +
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				vmstat_spread_delay());
>  	}
>  }
>  
> 
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 17:46 ` Vlastimil Babka (SUSE)
@ 2026-04-02 12:40   ` Vlastimil Babka (SUSE)
  2026-04-02 13:33     ` Breno Leitao
  2026-04-02 12:43   ` Dmitry Ilvokhin
  1 sibling, 1 reply; 17+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-04-02 12:40 UTC (permalink / raw)
  To: Breno Leitao, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, kas, shakeel.butt, usama.arif,
	kernel-team

On 4/1/26 7:46 PM, Vlastimil Babka (SUSE) wrote:
> On 4/1/26 15:57, Breno Leitao wrote:
>> vmstat_update uses round_jiffies_relative() when re-queuing itself,
>> which aligns all CPUs' timers to the same second boundary.  When many
>> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
>> free_pcppages_bulk() simultaneously, serializing on zone->lock and
>> hitting contention.
>>
>> Introduce vmstat_spread_delay() which distributes each CPU's
>> vmstat_update evenly across the stat interval instead of aligning them.
>>
>> This does not increase the number of timer interrupts — each CPU still
>> fires once per interval. The timers are simply staggered rather than
>> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
>> wake idle CPUs regardless of scheduling; the spread only affects CPUs
>> that are already active
>>
>> `perf lock contention` shows 7.5x reduction in zone->lock contention
>> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
>> system under memory pressure.
>>
>> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
>> memory allocation bursts.  Lock contention was measured with:
>>
>>   perf lock contention -a -b -S free_pcppages_bulk
>>
>> Results with KASAN enabled:
>>
>>   free_pcppages_bulk contention (KASAN):
>>   +--------------+----------+----------+
>>   | Metric       | No fix   | With fix |
>>   +--------------+----------+----------+
>>   | Contentions  |      872 |      117 |
>>   | Total wait   | 199.43ms | 80.76ms  |
>>   | Max wait     |   4.19ms | 35.76ms  |
>>   +--------------+----------+----------+
>>
>> Results without KASAN:
>>
>>   free_pcppages_bulk contention (no KASAN):
>>   +--------------+----------+----------+
>>   | Metric       | No fix   | With fix |
>>   +--------------+----------+----------+
>>   | Contentions  |      240 |      133 |
>>   | Total wait   |  34.01ms | 24.61ms  |
>>   | Max wait     |   965us  |  1.35ms  |
>>   +--------------+----------+----------+
>>
>> Signed-off-by: Breno Leitao <leitao@debian.org>
> 
> Cool!
> 
> I noticed __round_jiffies_relative() exists and the description looks like
> it's meant for exactly this use case?

On closer look, using round_jiffies_relative() as before your patch
means it's calling __round_jiffies_relative(j, raw_smp_processor_id())
so that's already doing this spread internally. You're also relying
smp_processor_id() so it's not about using a different cpu id.

But your patch has better results, why? I still think it's not doing
what it intends - I think it makes every cpu have different interval
length (up to twice the original length), not skew. Is it that, or that
the 3 jiffies skew per cpu used in round_jiffies_common() is
insufficient? Or it a bug in its skew implementation?

Ideally once that's clear, the findings could be used to improve
round_jiffies_common() and hopefully there's nothing here that's vmstat
specific.

Thanks,
Vlastimil

>> ---
>>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 2370c6fb1fcd..2e94bd765606 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>>  }
>>  #endif /* CONFIG_PROC_FS */
>>  
>> +/*
>> + * Return a per-cpu delay that spreads vmstat_update work across the stat
>> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
>> + * timer to the same second boundary, causing a thundering-herd on
>> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
>> + * decay_pcp_high() -> free_pcppages_bulk().
>> + */
>> +static unsigned long vmstat_spread_delay(void)
>> +{
>> +	unsigned long interval = sysctl_stat_interval;
>> +	unsigned int nr_cpus = num_online_cpus();
>> +
>> +	if (nr_cpus <= 1)
>> +		return round_jiffies_relative(interval);
>> +
>> +	/*
>> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
>> +	 * use round_jiffies_relative() here -- it would snap every CPU
>> +	 * back to the same second boundary, defeating the spread.
>> +	 */
>> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> 
> Hm doesn't this mean that lower id cpus will consistently fire in shorter
> intervals and higher id in longer intervals? What we want is same interval
> but differently offset, no?
> 
>> +}
>> +
>>  static void vmstat_update(struct work_struct *w)
>>  {
>>  	if (refresh_cpu_vm_stats(true)) {
>> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>>  		 */
>>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>>  				this_cpu_ptr(&vmstat_work),
>> -				round_jiffies_relative(sysctl_stat_interval));
>> +				vmstat_spread_delay());
>>  	}
>>  }
>>  
>>
>> ---
>> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
>> change-id: 20260401-vmstat-048e0feaf344
>>
>> Best regards,
>> --  
>> Breno Leitao <leitao@debian.org>
>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 17:46 ` Vlastimil Babka (SUSE)
  2026-04-02 12:40   ` Vlastimil Babka (SUSE)
@ 2026-04-02 12:43   ` Dmitry Ilvokhin
  1 sibling, 0 replies; 17+ messages in thread
From: Dmitry Ilvokhin @ 2026-04-02 12:43 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Breno Leitao, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, kas, shakeel.butt, usama.arif,
	kernel-team

On Wed, Apr 01, 2026 at 07:46:35PM +0200, Vlastimil Babka (SUSE) wrote:

[...]

> > +/*
> > + * Return a per-cpu delay that spreads vmstat_update work across the stat
> > + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> > + * timer to the same second boundary, causing a thundering-herd on
> > + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> > + * decay_pcp_high() -> free_pcppages_bulk().
> > + */
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > +	unsigned long interval = sysctl_stat_interval;
> > +	unsigned int nr_cpus = num_online_cpus();
> > +
> > +	if (nr_cpus <= 1)
> > +		return round_jiffies_relative(interval);
> > +
> > +	/*
> > +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > +	 * back to the same second boundary, defeating the spread.
> > +	 */
> > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> 
> Hm doesn't this mean that lower id cpus will consistently fire in shorter
> intervals and higher id in longer intervals? What we want is same interval
> but differently offset, no?

Yes, I think that's a valid concern, this effectively skews the
interval rather than just introducing a phase offset.

I initially thought this might explain the increase in max wait, but it
turns out the columns were just swapped.

Spreading the initial scheduling and then requeueing with a constant
interval sounds like a reasonable alternative, e.g. below.

From 56ed7e17b32f0a7ce433caed87650b0de8246c4e Mon Sep 17 00:00:00 2001
From: Dmitry Ilvokhin <d@ilvokhin.com>
Date: Thu, 2 Apr 2026 04:49:06 -0700
Subject: [PATCH] mm/vmstat: stagger per-cpu vmstat updates to avoid zone->lock
 contention

Fix by spreading the shepherd's initial wakeup across the stat plain
sysctl_stat_interval to preserve the stagger. Every CPU still fires once
per interval, same frequency, different phase.

Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
---
 mm/vmstat.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2370c6fb1fcd..aee99786718a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2042,7 +2042,7 @@ static void vmstat_update(struct work_struct *w)
 		 */
 		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
 				this_cpu_ptr(&vmstat_work),
-				round_jiffies_relative(sysctl_stat_interval));
+				sysctl_stat_interval);
 	}
 }
 
@@ -2140,7 +2140,8 @@ static void vmstat_shepherd(struct work_struct *w)
 				continue;
 
 			if (!delayed_work_pending(dw) && need_update(cpu))
-				queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+				queue_delayed_work_on(cpu, mm_percpu_wq, dw,
+					(sysctl_stat_interval * cpu) / nr_cpu_ids);
 		}
 
 		cond_resched();
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
                   ` (5 preceding siblings ...)
  2026-04-02  7:18 ` Michal Hocko
@ 2026-04-02 12:49 ` Matthew Wilcox
  2026-04-02 13:26   ` Breno Leitao
  6 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2026-04-02 12:49 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, usama.arif, kernel-team

On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.

But, uh, round_jiffies_relative() is _supposed_ to do that!  Look at
both the documentation and implementation.  Why isn't it working?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-02 12:49 ` Matthew Wilcox
@ 2026-04-02 13:26   ` Breno Leitao
  0 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2026-04-02 13:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kas,
	shakeel.butt, usama.arif, kernel-team

Hello Matthew,

On Thu, Apr 02, 2026 at 01:49:34PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 01, 2026 at 06:57:50AM -0700, Breno Leitao wrote:
> > vmstat_update uses round_jiffies_relative() when re-queuing itself,
> > which aligns all CPUs' timers to the same second boundary.  When many
> > CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> > free_pcppages_bulk() simultaneously, serializing on zone->lock and
> > hitting contention.
> > 
> > Introduce vmstat_spread_delay() which distributes each CPU's
> > vmstat_update evenly across the stat interval instead of aligning them.
> 
> But, uh, round_jiffies_relative() is _supposed_ to do that!  Look at
> both the documentation and implementation.  Why isn't it working?

The documentation is a bit trick on me,  sorry. This is what I read:

  "By rounding these timers to whole seconds, all such timers will fire
   at the same time, rather than at various times spread out. The goal
   of this is to have the CPU wake up less, which saves power."

This is the full documentation:

/**
 * round_jiffies_relative - function to round jiffies to a full second
 * @j: the time in (relative) jiffies that should be rounded
 *
 * round_jiffies_relative() rounds a time delta  in the future (in jiffies)
 * up or down to (approximately) full seconds. This is useful for timers
 * for which the exact time they fire does not matter too much, as long as
 * they fire approximately every X seconds.
 *
 * By rounding these timers to whole seconds, all such timers will fire
 * at the same time, rather than at various times spread out. The goal
 * of this is to have the CPU wake up less, which saves power.
 *
 * The return value is the rounded version of the @j parameter.
 */
unsigned long round_jiffies_relative(unsigned long j)
{
        return __round_jiffies_relative(j, raw_smp_processor_id());
}
EXPORT_SYMBOL_GPL(round_jiffies_relative);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
  2026-04-02 12:40   ` Vlastimil Babka (SUSE)
@ 2026-04-02 13:33     ` Breno Leitao
  0 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2026-04-02 13:33 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, kas, shakeel.butt, usama.arif,
	kernel-team

> > 
> > Cool!
> > 
> > I noticed __round_jiffies_relative() exists and the description looks like
> > it's meant for exactly this use case?
> 
> On closer look, using round_jiffies_relative() as before your patch
> means it's calling __round_jiffies_relative(j, raw_smp_processor_id())
> so that's already doing this spread internally. You're also relying
> smp_processor_id() so it's not about using a different cpu id.
> 
> But your patch has better results, why? I still think it's not doing
> what it intends - I think it makes every cpu have different interval
> length (up to twice the original length), not skew. Is it that, or that
> the 3 jiffies skew per cpu used in round_jiffies_common() is
> insufficient? Or it a bug in its skew implementation?
> 
> Ideally once that's clear, the findings could be used to improve
> round_jiffies_common() and hopefully there's nothing here that's vmstat
> specific.

Excellent observation. I believe there are two key differences:

1) The interval duration now varies per CPU. Specifically, vmstat_update()
   is scheduled at sysctl_stat_interval*2 for the highest CPU with my
   proposed change, rather than a uniform sysctl_stat_interval across
   all CPUs. (as you raised in the first email)

2) round_jiffies_relative() applies a 3-jiffies shift per CPU, whereas
   vmstat_spread_delay distributes all CPUs across the full second
   interval. (My tests were on HZ=1000)

I'll investigate this further to provide more concrete data.

Thanks for the review,
--breno
   

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-04-02 13:33 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
2026-04-01 14:25 ` Johannes Weiner
2026-04-01 14:39   ` Breno Leitao
2026-04-01 14:57     ` Johannes Weiner
2026-04-01 14:47 ` Breno Leitao
2026-04-01 15:01 ` Kiryl Shutsemau
2026-04-01 15:23 ` Usama Arif
2026-04-01 15:43   ` Breno Leitao
2026-04-01 15:50     ` Usama Arif
2026-04-01 15:52       ` Breno Leitao
2026-04-01 17:46 ` Vlastimil Babka (SUSE)
2026-04-02 12:40   ` Vlastimil Babka (SUSE)
2026-04-02 13:33     ` Breno Leitao
2026-04-02 12:43   ` Dmitry Ilvokhin
2026-04-02  7:18 ` Michal Hocko
2026-04-02 12:49 ` Matthew Wilcox
2026-04-02 13:26   ` Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox