From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Breno Leitao <leitao@debian.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org, kas@kernel.org,
shakeel.butt@linux.dev, usama.arif@linux.dev,
kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Date: Wed, 8 Apr 2026 12:13:04 +0200
Message-ID: <a9129d39-f9dc-4f09-b951-203c8e28b600@kernel.org>
In-Reply-To: <adUhtrhU7c1TJQww@gmail.com>
On 4/7/26 17:39, Breno Leitao wrote:
> On Thu, Apr 02, 2026 at 06:33:17AM -0700, Breno Leitao wrote:
>> > >
>> > > Cool!
>> > >
>> > > I noticed __round_jiffies_relative() exists and the description looks like
>> > > it's meant for exactly this use case?
>> >
>> > On closer look, round_jiffies_relative() as used before your patch
>> > calls __round_jiffies_relative(j, raw_smp_processor_id()), so it's
>> > already doing this spread internally. You're also relying on
>> > smp_processor_id(), so it's not a matter of using a different cpu id.
>> >
>> > But your patch has better results, so why? I still think it's not doing
>> > what it intends - it gives every cpu a different interval length (up to
>> > twice the original length) rather than a skew. Is it that, or is the
>> > 3-jiffy per-cpu skew used in round_jiffies_common() insufficient? Or is
>> > it a bug in its skew implementation?
>> >
>> > Ideally once that's clear, the findings could be used to improve
>> > round_jiffies_common() and hopefully there's nothing here that's vmstat
>> > specific.
>>
>> Excellent observation. I believe there are two key differences:
>>
>> 1) The interval duration now varies per CPU. Specifically, vmstat_update()
>> is scheduled at sysctl_stat_interval*2 for the highest-numbered CPU with
>> my proposed change, rather than at a uniform sysctl_stat_interval across
>> all CPUs (as you raised in the first email).
>>
>> 2) round_jiffies_relative() applies a 3-jiffy shift per CPU, whereas
>> vmstat_spread_delay() distributes all CPUs across the full one-second
>> interval. (My tests were with HZ=1000.)
>>
>> I'll investigate this further to provide more concrete data.
>
> After further investigation, I can confirm that both factors mentioned
> above contribute to the performance improvement.
>
> However, we certainly don't want scenario (1) where the delay varies per
> CPU, resulting in the last CPU having vmstat_update() scheduled every
> 2 seconds instead of 1 second.
Indeed.
> I've implemented a patch following Dmitry's suggestion, and the
> performance gains are measurable.
>
> Here's my testing methodology:
>
> 1) Use ftrace to measure the execution time of refresh_cpu_vm_stats()
> * Applied a custom instrumentation patch [1]
>
> 2) Execute stress-ng:
> * stress-ng --vm 72 --vm-bytes 11256M --vm-method all --timeout 60s ; cat /sys/kernel/debug/tracing/trace
>
> 3) Parse the output using a Python script [2]
>
> While the results are not as dramatic as initially reported (since
> approach (1) performed well but was incorrect), the improvement is still
> substantial:
>
>
> ┌─────────┬────────────┬────────────┬───────┐
> │ Metric │ upstream* │ fix** │ Delta │
> ├─────────┼────────────┼────────────┼───────┤
> │ samples │ 36,981 │ 37,267 │ ~same │
> ├─────────┼────────────┼────────────┼───────┤
> │ avg │ 31,511 ns │ 21,337 ns │ -32% │
> ├─────────┼────────────┼────────────┼───────┤
> │ p50 │ 2,644 ns │ 2,925 ns │ ~same │
> ├─────────┼────────────┼────────────┼───────┤
> │ p99 │ 382,083 ns │ 304,357 ns │ -20% │
> ├─────────┼────────────┼────────────┼───────┤
> │ max │ 72.6 ms │ 16.0 ms │ -78% │
> └─────────┴────────────┴────────────┴───────┘
So you have 72 cpus, the vmstat interval is 1s, and what's the CONFIG_HZ?
If it's 1000, that's ~13 jiffies per cpu. Would changing the
round_jiffies_common() implementation to add 13 jiffies per cpu instead of 3
have the same effect?
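To put numbers on that (a quick sketch, not kernel code; HZ and the CPU
count are assumed from your report):

```python
HZ = 1000            # assumed CONFIG_HZ
NR_CPUS = 72         # CPUs in the reported test
interval = HZ        # sysctl_stat_interval defaults to 1 second

# round_jiffies_common()-style skew: 3 jiffies per cpu, so all 72
# timers land within the first 216 jiffies of the interval.
skew_3j = [(cpu * 3) % interval for cpu in range(NR_CPUS)]

# vmstat_spread_delay()-style spread: CPUs are distributed evenly
# over the whole interval, about 13 jiffies apart.
spread = [interval * cpu // NR_CPUS for cpu in range(NR_CPUS)]

print(max(skew_3j))           # 213: timers bunched near the boundary
print(max(spread))            # 986: timers cover the full interval
print(spread[1] - spread[0])  # 13 jiffies per cpu
```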
>
> * Upstream is based on linux-next commit f3e6330d7fe42 ("Add linux-next specific files for 20260407")
> ** "fix" contains the patch below:
>
> Link: https://github.com/leitao/linux/commit/ac200164df1bda45ee8504cc3db5bff5b696245e [1]
> Link: https://github.com/leitao/linux/commit/baa2ea6ea4c4c2b1df689de6db0a2a6f119e51be [2]
>
>
> commit 41b7aaa1a51f07fc1f0db0614d140fbca78463d3
> Author: Breno Leitao <leitao@debian.org>
> Date: Tue Apr 7 07:56:35 2026 -0700
>
> mm/vmstat: spread per-cpu vmstat work to reduce zone->lock contention
>
> vmstat_shepherd() queues all per-cpu vmstat_update work with zero delay,
> and vmstat_update() re-queues itself with round_jiffies_relative(), which
> clusters timers near the same second boundary due to the small per-CPU
> spread in round_jiffies_common(). On many-CPU systems this causes
> thundering-herd contention on zone->lock when multiple CPUs
> simultaneously call refresh_cpu_vm_stats() -> decay_pcp_high() ->
> free_pcppages_bulk().
>
> Introduce vmstat_spread_delay() to assign each CPU a unique offset
> distributed evenly across sysctl_stat_interval. The shepherd uses this
> when initially queuing per-cpu work, and vmstat_update re-queues with a
> plain sysctl_stat_interval to preserve the spread (round_jiffies_relative
> would snap CPUs back to the same boundary).
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
I think this approach could have the following problems:
- the initially spread delays can drift over time; there's nothing keeping
them in sync
- not using round_jiffies_relative() means firing at times different from
other timers that use the rounding, so this could work against the
power-saving effect of rounding
- it's a vmstat-specific workaround for a yet-unclear underlying
suboptimality that's likely not vmstat-specific
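The drift point can be illustrated with a toy model (the per-run latencies
below are invented constants, just to show the accumulation):

```python
HZ = 1000
interval = HZ  # sysctl_stat_interval at HZ=1000

# Two CPUs start 13 jiffies apart. Each re-queues itself a plain
# `interval` after the time it actually ran, so any per-invocation
# latency (workqueue delay, preemption) accumulates and nothing
# pulls the timers back toward a common boundary.
t = [0, 13]
latency = [2, 3]  # hypothetical extra jiffies per run, per cpu
for _ in range(1000):
    for cpu in range(2):
        t[cpu] += interval + latency[cpu]

print(t[1] - t[0])  # 1013: the initial 13-jiffy offset drifted by 1000
```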
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3704f6ca7a268..8d93eee3b1f75 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2040,6 +2040,22 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> }
> #endif /* CONFIG_PROC_FS */
>
> +/*
> + * Return a per-cpu initial delay that spreads vmstat_update work evenly
> + * across the stat interval, so that CPUs do not all fire at the same
> + * second boundary.
> + */
> +static unsigned long vmstat_spread_delay(int cpu)
> +{
> + unsigned long interval = sysctl_stat_interval;
> + unsigned int nr_cpus = num_online_cpus();
> +
> + if (nr_cpus <= 1)
> + return 0;
> +
> + return (interval * (cpu % nr_cpus)) / nr_cpus;
> +}
> +
> static void vmstat_update(struct work_struct *w)
> {
> if (refresh_cpu_vm_stats(true)) {
> @@ -2047,10 +2063,13 @@ static void vmstat_update(struct work_struct *w)
> * Counters were updated so we expect more updates
> * to occur in the future. Keep on running the
> * update worker thread.
> + * Avoid round_jiffies_relative() here -- it would snap
> + * every CPU back to the same second boundary, undoing
> + * the initial spread from vmstat_shepherd.
> */
> queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> this_cpu_ptr(&vmstat_work),
> - round_jiffies_relative(sysctl_stat_interval));
> + sysctl_stat_interval);
> }
> }
>
> @@ -2148,7 +2167,8 @@ static void vmstat_shepherd(struct work_struct *w)
> continue;
>
> if (!delayed_work_pending(dw) && need_update(cpu))
> - queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> + queue_delayed_work_on(cpu, mm_percpu_wq, dw,
> + vmstat_spread_delay(cpu));
> }
>
> cond_resched();