public inbox for linux-mm@kvack.org
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Breno Leitao <leitao@debian.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, kas@kernel.org,
	shakeel.butt@linux.dev, usama.arif@linux.dev,
	kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Date: Wed, 8 Apr 2026 12:13:04 +0200	[thread overview]
Message-ID: <a9129d39-f9dc-4f09-b951-203c8e28b600@kernel.org> (raw)
In-Reply-To: <adUhtrhU7c1TJQww@gmail.com>

On 4/7/26 17:39, Breno Leitao wrote:
> On Thu, Apr 02, 2026 at 06:33:17AM -0700, Breno Leitao wrote:
>> > >
>> > > Cool!
>> > >
>> > > I noticed __round_jiffies_relative() exists and the description looks like
>> > > it's meant for exactly this use case?
>> >
>> > On closer look, using round_jiffies_relative() as before your patch
>> > means it's calling __round_jiffies_relative(j, raw_smp_processor_id()),
>> > so that's already doing this spread internally. You're also relying on
>> > smp_processor_id(), so it's not about using a different cpu id.
>> >
>> > But your patch gets better results - why? I still think it's not doing
>> > what it intends: I think it gives every cpu a different interval
>> > length (up to twice the original length), not a skew. Is it that, or
>> > is the 3 jiffies skew per cpu used in round_jiffies_common()
>> > insufficient? Or is it a bug in its skew implementation?
>> >
>> > Ideally once that's clear, the findings could be used to improve
>> > round_jiffies_common() and hopefully there's nothing here that's vmstat
>> > specific.
>>
>> Excellent observation. I believe there are two key differences:
>>
>> 1) The interval duration now varies per CPU. Specifically, vmstat_update()
>>    is scheduled at sysctl_stat_interval*2 for the highest CPU with my
>>    proposed change, rather than a uniform sysctl_stat_interval across
>>    all CPUs (as you raised in your first email).
>>
>> 2) round_jiffies_relative() applies a 3-jiffy shift per CPU, whereas
>>    vmstat_spread_delay() distributes the CPUs across the full one-second
>>    interval. (My tests were with HZ=1000.)
>>
>> I'll investigate this further to provide more concrete data.
> 
> After further investigation, I can confirm that both factors mentioned
> above contribute to the performance improvement.
> 
> However, we certainly don't want scenario (1) where the delay varies per
> CPU, resulting in the last CPU having vmstat_update() scheduled every
> 2 seconds instead of 1 second.

Indeed.

> I've implemented a patch following Dmitry's suggestion, and the
> performance gains are measurable.
> 
> Here's my testing methodology:
> 
> 1) Use ftrace to measure the execution time of refresh_cpu_vm_stats()
>    * Applied a custom instrumentation patch [1]
> 
> 2) Execute stress-ng:
>    * stress-ng --vm 72 --vm-bytes 11256M --vm-method all --timeout 60s ; cat /sys/kernel/debug/tracing/trace
> 
> 3) Parse the output using a Python script [2]
> 
> While the results are not as dramatic as initially reported (since
> approach (1) was good but incorrect), the improvement is still
> substantial:
> 
> 
>   ┌─────────┬────────────┬────────────┬───────┐
>   │ Metric  │ upstream*  │    fix**   │ Delta │
>   ├─────────┼────────────┼────────────┼───────┤
>   │ samples │ 36,981     │ 37,267     │ ~same │
>   ├─────────┼────────────┼────────────┼───────┤
>   │ avg     │ 31,511 ns  │ 21,337 ns  │ -32%  │
>   ├─────────┼────────────┼────────────┼───────┤
>   │ p50     │ 2,644 ns   │ 2,925 ns   │ ~same │
>   ├─────────┼────────────┼────────────┼───────┤
>   │ p99     │ 382,083 ns │ 304,357 ns │ -20%  │
>   ├─────────┼────────────┼────────────┼───────┤
>   │ max     │ 72.6 ms    │ 16.0 ms    │ -78%  │
>   └─────────┴────────────┴────────────┴───────┘

So you have 72 cpus, the vmstat interval is 1s, and what's the CONFIG_HZ?
If it's 1000, that's ~13 jiffies per cpu. Would changing the
round_jiffies_common() implementation to add 13 jiffies per cpu instead of 3
have the same effect?
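
For context: if I read round_jiffies_common() (kernel/time/timer.c) right,
it implements the skew by adding cpu * 3 before rounding to the whole
second and subtracting it again afterwards, so CPUs end up 3 jiffies apart
just below the boundary. A simplified standalone model (it drops the
force_up handling and the "don't round into the past" check of the real
function) shows what a larger per-cpu skew would do:

```c
#include <assert.h>

#define HZ 1000	/* model CONFIG_HZ=1000, as in the tests above */

/*
 * Simplified model of round_jiffies_common()'s skew scheme: add a
 * per-cpu skew, round to the whole second, then subtract the skew
 * again, so CPUs land skew_per_cpu jiffies apart near the boundary.
 */
static unsigned long round_with_skew(unsigned long j, int cpu,
				     int skew_per_cpu)
{
	unsigned long rem;

	j += (unsigned long)cpu * skew_per_cpu;
	rem = j % HZ;

	if (rem < HZ / 4)	/* round down */
		j -= rem;
	else			/* round up */
		j += HZ - rem;

	/* now that we have rounded, subtract the extra skew again */
	return j - (unsigned long)cpu * skew_per_cpu;
}
```

With skew 3, all 72 CPUs round down to the same second and sit in a
213-jiffy window below it; with skew 13 they spread over most of the
interval (cpus 1-19 below the boundary, cpus 20-71 above it), which is
roughly the spacing your spread function produces, so the comparison seems
worth measuring.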

> 
> * Upstream is based on linux-next commit f3e6330d7fe42  ("Add linux-next specific files for 20260407")
> ** "fix" contains the patch below:
> 
> Link: https://github.com/leitao/linux/commit/ac200164df1bda45ee8504cc3db5bff5b696245e [1]
> Link: https://github.com/leitao/linux/commit/baa2ea6ea4c4c2b1df689de6db0a2a6f119e51be [2]
> 
> 
> commit 41b7aaa1a51f07fc1f0db0614d140fbca78463d3
> Author: Breno Leitao <leitao@debian.org>
> Date:   Tue Apr 7 07:56:35 2026 -0700
> 
>     mm/vmstat: spread per-cpu vmstat work to reduce zone->lock contention
> 
>     vmstat_shepherd() queues all per-cpu vmstat_update work with zero delay,
>     and vmstat_update() re-queues itself with round_jiffies_relative(), which
>     clusters timers near the same second boundary due to the small per-CPU
>     spread in round_jiffies_common(). On many-CPU systems this causes
>     thundering-herd contention on zone->lock when multiple CPUs
>     simultaneously call refresh_cpu_vm_stats() -> decay_pcp_high() ->
>     free_pcppages_bulk().
> 
>     Introduce vmstat_spread_delay() to assign each CPU a unique offset
>     distributed evenly across sysctl_stat_interval. The shepherd uses this
>     when initially queuing per-cpu work, and vmstat_update re-queues with a
>     plain sysctl_stat_interval to preserve the spread (round_jiffies_relative
>     would snap CPUs back to the same boundary).
> 
>     Signed-off-by: Breno Leitao <leitao@debian.org>

I think this approach could have the following problems:

- the initially spread delays can drift over time; there's nothing keeping
them in sync
- not using round_jiffies_relative() means firing at different times than
the other timers that do use the rounding, which could work against the
power-saving effect of the rounding
- it's a vmstat-specific workaround for a yet-unclear underlying
suboptimality that's likely not vmstat-specific
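
On the first point, a toy model shows the drift: when a worker re-arms
itself a fixed interval after it *finishes* (instead of rounding to a
fixed boundary), each run's execution time leaks into the phase and
accumulates (hypothetical numbers, timer-wheel granularity ignored):

```c
#include <assert.h>

/*
 * Toy model of a self-re-arming worker: the next firing time is
 * "when this run finished" plus a fixed interval, so any per-run
 * execution time 'exec' shifts the phase forward, run after run.
 */
static unsigned long next_fire(unsigned long fired_at, unsigned long exec,
			       unsigned long interval)
{
	return fired_at + exec + interval;
}
```

After 100 runs at 2 jiffies of work each, the timer is 200 jiffies late
relative to its original second boundary, and each CPU drifts by its own
(varying) execution time, so the initial even spacing erodes.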

> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3704f6ca7a268..8d93eee3b1f75 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2040,6 +2040,22 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
> 
> +/*
> + * Return a per-cpu initial delay that spreads vmstat_update work evenly
> + * across the stat interval, so that CPUs do not all fire at the same
> + * second boundary.
> + */
> +static unsigned long vmstat_spread_delay(int cpu)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return 0;
> +
> +	return (interval * (cpu % nr_cpus)) / nr_cpus;
> +}
> +
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> @@ -2047,10 +2063,13 @@ static void vmstat_update(struct work_struct *w)
>  		 * Counters were updated so we expect more updates
>  		 * to occur in the future. Keep on running the
>  		 * update worker thread.
> +		 * Avoid round_jiffies_relative() here -- it would snap
> +		 * every CPU back to the same second boundary, undoing
> +		 * the initial spread from vmstat_shepherd.
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				sysctl_stat_interval);
>  	}
>  }
> 
> @@ -2148,7 +2167,8 @@ static void vmstat_shepherd(struct work_struct *w)
>  				continue;
> 
>  			if (!delayed_work_pending(dw) && need_update(cpu))
> -				queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> +				queue_delayed_work_on(cpu, mm_percpu_wq, dw,
> +						      vmstat_spread_delay(cpu));
>  		}
> 
>  		cond_resched();
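
To make the intended spacing concrete, here is the patch's offset math
pulled out as a standalone function (assuming HZ=1000, so
sysctl_stat_interval is 1000 jiffies, and 72 online CPUs as in the test
setup):

```c
#include <assert.h>

/*
 * Standalone copy of the patch's vmstat_spread_delay() math: CPU i of
 * nr_cpus gets offset i * interval / nr_cpus, placing the CPUs evenly
 * across one stat interval.
 */
static unsigned long spread_delay(int cpu, unsigned int nr_cpus,
				  unsigned long interval)
{
	if (nr_cpus <= 1)
		return 0;
	return (interval * (unsigned long)(cpu % nr_cpus)) / nr_cpus;
}
```

Consecutive CPUs end up ~13 jiffies apart, matching the back-of-the-envelope
13-jiffies-per-cpu figure for 72 CPUs at HZ=1000.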




Thread overview: 21+ messages
2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
2026-04-01 14:25 ` Johannes Weiner
2026-04-01 14:39   ` Breno Leitao
2026-04-01 14:57     ` Johannes Weiner
2026-04-01 14:47 ` Breno Leitao
2026-04-01 15:01 ` Kiryl Shutsemau
2026-04-01 15:23 ` Usama Arif
2026-04-01 15:43   ` Breno Leitao
2026-04-01 15:50     ` Usama Arif
2026-04-01 15:52       ` Breno Leitao
2026-04-01 17:46 ` Vlastimil Babka (SUSE)
2026-04-02 12:40   ` Vlastimil Babka (SUSE)
2026-04-02 13:33     ` Breno Leitao
2026-04-07 15:39       ` Breno Leitao
2026-04-08 10:13         ` Vlastimil Babka (SUSE) [this message]
2026-04-08 15:13           ` Breno Leitao
2026-04-08 17:00             ` Breno Leitao
2026-04-02 12:43   ` Dmitry Ilvokhin
2026-04-02  7:18 ` Michal Hocko
2026-04-02 12:49 ` Matthew Wilcox
2026-04-02 13:26   ` Breno Leitao
