public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
From: Breno Leitao <leitao@debian.org>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	kas@kernel.org, shakeel.butt@linux.dev, usama.arif@linux.dev,
	 kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Date: Tue, 7 Apr 2026 08:39:00 -0700	[thread overview]
Message-ID: <adUhtrhU7c1TJQww@gmail.com> (raw)
In-Reply-To: <ac5urCFeEB9oyUiD@gmail.com>

On Thu, Apr 02, 2026 at 06:33:17AM -0700, Breno Leitao wrote:
> > >
> > > Cool!
> > >
> > > I noticed __round_jiffies_relative() exists and the description looks like
> > > it's meant for exactly this use case?
> >
> > On closer look, using round_jiffies_relative() as before your patch
> > means it's calling __round_jiffies_relative(j, raw_smp_processor_id())
> > so that's already doing this spread internally. You're also relying
> > smp_processor_id() so it's not about using a different cpu id.
> >
> > But your patch has better results, why? I still think it's not doing
> > what it intends - I think it makes every cpu have different interval
> > length (up to twice the original length), not skew. Is it that, or that
> > the 3 jiffies skew per cpu used in round_jiffies_common() is
> > insufficient? Or it a bug in its skew implementation?
> >
> > Ideally once that's clear, the findings could be used to improve
> > round_jiffies_common() and hopefully there's nothing here that's vmstat
> > specific.
>
> Excellent observation. I believe there are two key differences:
>
> 1) The interval duration now varies per CPU. Specifically, vmstat_update()
>    is scheduled at sysctl_stat_interval*2 for the highest CPU with my
>    proposed change, rather than a uniform sysctl_stat_interval across
>    all CPUs. (as you raised in the first email)
>
> 2) round_jiffies_relative() applies a 3-jiffies shift per CPU, whereas
>    vmstat_spread_delay distributes all CPUs across the full second
>    interval. (My tests were on HZ=1000)
>
> I'll investigate this further to provide more concrete data.

After further investigation, I can confirm that both factors mentioned
above contribute to the performance improvement.

However, we certainly don't want scenario (1) where the delay varies per
CPU, resulting in the last CPU having vmstat_update() scheduled every
2 seconds instead of 1 second.

I've implemented a patch following Dmitry's suggestion, and the
performance gains are measurable.

Here's my testing methodology:

1) Use ftrace to measure the execution time of refresh_cpu_vm_stats()
   * Applied a custom instrumentation patch [1]

2) Execute stress-ng:
   * stress-ng --vm 72 --vm-bytes 11256M --vm-method all --timeout 60s ; cat /sys/kernel/debug/tracing/trace

3) Parse the output using a Python script [2]

While the results are not as dramatic as initially reported (since
approach (1) was good but incorrect), the improvement is still
substantial:


  ┌─────────┬────────────┬────────────┬───────┐
  │ Metric  │ upstream*  │    fix**   │ Delta │
  ├─────────┼────────────┼────────────┼───────┤
  │ samples │ 36,981     │ 37,267     │ ~same │
  ├─────────┼────────────┼────────────┼───────┤
  │ avg     │ 31,511 ns  │ 21,337 ns  │ -32%  │
  ├─────────┼────────────┼────────────┼───────┤
  │ p50     │ 2,644 ns   │ 2,925 ns   │ ~same │
  ├─────────┼────────────┼────────────┼───────┤
  │ p99     │ 382,083 ns │ 304,357 ns │ -20%  │
  ├─────────┼────────────┼────────────┼───────┤
  │ max     │ 72.6 ms    │ 16.0 ms    │ -78%  │
  └─────────┴────────────┴────────────┴───────┘


* Upstream is based on linux-next commit f3e6330d7fe42  ("Add linux-next specific files for 20260407")
** "fix" contains the patch below:

Link: https://github.com/leitao/linux/commit/ac200164df1bda45ee8504cc3db5bff5b696245e [1]
Link: https://github.com/leitao/linux/commit/baa2ea6ea4c4c2b1df689de6db0a2a6f119e51be [2]


commit 41b7aaa1a51f07fc1f0db0614d140fbca78463d3
Author: Breno Leitao <leitao@debian.org>
Date:   Tue Apr 7 07:56:35 2026 -0700

    mm/vmstat: spread per-cpu vmstat work to reduce zone->lock contention

    vmstat_shepherd() queues all per-cpu vmstat_update work with zero delay,
    and vmstat_update() re-queues itself with round_jiffies_relative(), which
    clusters timers near the same second boundary due to the small per-CPU
    spread in round_jiffies_common(). On many-CPU systems this causes
    thundering-herd contention on zone->lock when multiple CPUs
    simultaneously call refresh_cpu_vm_stats() -> decay_pcp_high() ->
    free_pcppages_bulk().

    Introduce vmstat_spread_delay() to assign each CPU a unique offset
    distributed evenly across sysctl_stat_interval. The shepherd uses this
    when initially queuing per-cpu work, and vmstat_update re-queues with a
    plain sysctl_stat_interval to preserve the spread (round_jiffies_relative
    would snap CPUs back to the same boundary).

    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3704f6ca7a268..8d93eee3b1f75 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2040,6 +2040,22 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
 }
 #endif /* CONFIG_PROC_FS */

+/*
+ * Return a per-cpu initial delay that spreads vmstat_update work evenly
+ * across the stat interval, so that CPUs do not all fire at the same
+ * second boundary.
+ */
+static unsigned long vmstat_spread_delay(int cpu)
+{
+	unsigned long interval = sysctl_stat_interval;
+	unsigned int nr_cpus = num_online_cpus();
+
+	if (nr_cpus <= 1)
+		return 0;
+
+	return (interval * (cpu % nr_cpus)) / nr_cpus;
+}
+
 static void vmstat_update(struct work_struct *w)
 {
 	if (refresh_cpu_vm_stats(true)) {
@@ -2047,10 +2063,13 @@ static void vmstat_update(struct work_struct *w)
 		 * Counters were updated so we expect more updates
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
+		 * Avoid round_jiffies_relative() here -- it would snap
+		 * every CPU back to the same second boundary, undoing
+		 * the initial spread from vmstat_shepherd.
 		 */
 		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
 				this_cpu_ptr(&vmstat_work),
-				round_jiffies_relative(sysctl_stat_interval));
+				sysctl_stat_interval);
 	}
 }

@@ -2148,7 +2167,8 @@ static void vmstat_shepherd(struct work_struct *w)
 				continue;

 			if (!delayed_work_pending(dw) && need_update(cpu))
-				queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+				queue_delayed_work_on(cpu, mm_percpu_wq, dw,
+						      vmstat_spread_delay(cpu));
 		}

 		cond_resched();


  reply	other threads:[~2026-04-07 15:39 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
2026-04-01 14:25 ` Johannes Weiner
2026-04-01 14:39   ` Breno Leitao
2026-04-01 14:57     ` Johannes Weiner
2026-04-01 14:47 ` Breno Leitao
2026-04-01 15:01 ` Kiryl Shutsemau
2026-04-01 15:23 ` Usama Arif
2026-04-01 15:43   ` Breno Leitao
2026-04-01 15:50     ` Usama Arif
2026-04-01 15:52       ` Breno Leitao
2026-04-01 17:46 ` Vlastimil Babka (SUSE)
2026-04-02 12:40   ` Vlastimil Babka (SUSE)
2026-04-02 13:33     ` Breno Leitao
2026-04-07 15:39       ` Breno Leitao [this message]
2026-04-02 12:43   ` Dmitry Ilvokhin
2026-04-02  7:18 ` Michal Hocko
2026-04-02 12:49 ` Matthew Wilcox
2026-04-02 13:26   ` Breno Leitao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=adUhtrhU7c1TJQww@gmail.com \
    --to=leitao@debian.org \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=kas@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@suse.com \
    --cc=rppt@kernel.org \
    --cc=shakeel.butt@linux.dev \
    --cc=surenb@google.com \
    --cc=usama.arif@linux.dev \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox