public inbox for linux-kernel@vger.kernel.org
From: Breno Leitao <leitao@debian.org>
To: Usama Arif <usama.arif@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	 Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	 Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, kas@kernel.org,
	 shakeel.butt@linux.dev, kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Date: Wed, 1 Apr 2026 08:43:52 -0700
Message-ID: <ac08dcv31J5_lAHs@gmail.com>
In-Reply-To: <20260401152343.3294686-1-usama.arif@linux.dev>

On Wed, Apr 01, 2026 at 08:23:40AM -0700, Usama Arif wrote:
> On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:
>
> > vmstat_update uses round_jiffies_relative() when re-queuing itself,
> > which aligns all CPUs' timers to the same second boundary.  When many
> > CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> > free_pcppages_bulk() simultaneously, serializing on zone->lock and
> > hitting contention.
> >
> > Introduce vmstat_spread_delay() which distributes each CPU's
> > vmstat_update evenly across the stat interval instead of aligning them.
> >
> > This does not increase the number of timer interrupts — each CPU still
> > fires once per interval. The timers are simply staggered rather than
> > aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> > wake idle CPUs regardless of scheduling; the spread only affects CPUs
> > that are already active.
> >
> > `perf lock contention` shows 7.5x reduction in zone->lock contention
> > (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> > system under memory pressure.
> >
> > Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> > memory allocation bursts.  Lock contention was measured with:
> >
> >   perf lock contention -a -b -S free_pcppages_bulk
> >
> > Results with KASAN enabled:
> >
> >   free_pcppages_bulk contention (KASAN):
> >   +--------------+----------+----------+
> >   | Metric       | No fix   | With fix |
> >   +--------------+----------+----------+
> >   | Contentions  |      872 |      117 |
> >   | Total wait   | 199.43ms | 80.76ms  |
> >   | Max wait     |   4.19ms | 35.76ms  |
> >   +--------------+----------+----------+
> >
> > Results without KASAN:
> >
> >   free_pcppages_bulk contention (no KASAN):
> >   +--------------+----------+----------+
> >   | Metric       | No fix   | With fix |
> >   +--------------+----------+----------+
> >   | Contentions  |      240 |      133 |
> >   | Total wait   |  34.01ms | 24.61ms  |
> >   | Max wait     |   965us  |  1.35ms  |
> >   +--------------+----------+----------+
> >
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> >  mm/vmstat.c | 25 ++++++++++++++++++++++++-
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 2370c6fb1fcd..2e94bd765606 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> >  }
> >  #endif /* CONFIG_PROC_FS */
> >
> > +/*
> > + * Return a per-cpu delay that spreads vmstat_update work across the stat
> > + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> > + * timer to the same second boundary, causing a thundering-herd on
> > + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> > + * decay_pcp_high() -> free_pcppages_bulk().
> > + */
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > +	unsigned long interval = sysctl_stat_interval;
> > +	unsigned int nr_cpus = num_online_cpus();
> > +
> > +	if (nr_cpus <= 1)
> > +		return round_jiffies_relative(interval);
> > +
> > +	/*
> > +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > +	 * back to the same second boundary, defeating the spread.
> > +	 */
> > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> > +}
> > +
> >  static void vmstat_update(struct work_struct *w)
> >  {
> >  	if (refresh_cpu_vm_stats(true)) {
> > @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
> >  		 */
> >  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> >  				this_cpu_ptr(&vmstat_work),
> > -				round_jiffies_relative(sysctl_stat_interval));
> > +				vmstat_spread_delay());
>
> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
>
> vmstat_shepherd() still queues work with delay 0 on all CPUs that
> need_update() in its for_each_online_cpu() loop:
>
>       if (!delayed_work_pending(dw) && need_update(cpu))
>           queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
> simultaneously.
>
> Under sustained memory pressure on a large system, I think the shepherd
> fires every sysctl_stat_interval and could re-trigger the same lock
> contention?

Good point - incorporating similar spreading logic in vmstat_shepherd()
would indeed address the simultaneous queueing issue you've described.

Should I include this in a v2 of this patch, or would you prefer it as
a separate follow-up patch?
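
For illustration only, here is a minimal user-space sketch of the delay
arithmetic under discussion. model_spread_delay() mirrors the expression
in the quoted vmstat_spread_delay(); model_shepherd_delay() is a purely
hypothetical analogue for the shepherd's zero-delay queueing and is not
taken from any posted patch. The function names, the omission of
round_jiffies_relative() in the single-CPU case, and the example interval
of 1000 jiffies (HZ=1000 with a one-second stat interval) are assumptions.

```c
#include <stdio.h>

/*
 * User-space model of the requeue delay from the quoted patch.
 * 'interval' stands in for sysctl_stat_interval (in jiffies), 'cpu' for
 * smp_processor_id() and 'nr_cpus' for num_online_cpus().
 */
static unsigned long model_spread_delay(unsigned long interval,
					unsigned int cpu,
					unsigned int nr_cpus)
{
	/* The kernel code rounds here; the model returns the raw interval. */
	if (nr_cpus <= 1)
		return interval;

	/* Result lies in [interval, 2 * interval), one slot per CPU. */
	return interval + (interval * (cpu % nr_cpus)) / nr_cpus;
}

/*
 * HYPOTHETICAL: one way the shepherd could stagger its wakeups instead
 * of queueing every CPU with delay 0.  Result lies in [0, interval).
 */
static unsigned long model_shepherd_delay(unsigned long interval,
					  unsigned int cpu,
					  unsigned int nr_cpus)
{
	if (nr_cpus <= 1)
		return 0;

	return (interval * (cpu % nr_cpus)) / nr_cpus;
}

/* Print the modelled delays for every 'step'-th CPU. */
static void model_print_delays(unsigned long interval, unsigned int nr_cpus,
			       unsigned int step)
{
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu += step)
		printf("cpu %u: requeue %lu, shepherd %lu\n", cpu,
		       model_spread_delay(interval, cpu, nr_cpus),
		       model_shepherd_delay(interval, cpu, nr_cpus));
}
```

With interval = 1000 and 72 CPUs, model_spread_delay() gives 1000 for
CPU 0, 1500 for CPU 36 and 1986 for CPU 71, so consecutive CPUs land
roughly 14 jiffies apart rather than on the same second boundary.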

Thread overview: 17+ messages
2026-04-01 13:57 [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval Breno Leitao
2026-04-01 14:25 ` Johannes Weiner
2026-04-01 14:39   ` Breno Leitao
2026-04-01 14:57     ` Johannes Weiner
2026-04-01 14:47 ` Breno Leitao
2026-04-01 15:01 ` Kiryl Shutsemau
2026-04-01 15:23 ` Usama Arif
2026-04-01 15:43   ` Breno Leitao [this message]
2026-04-01 15:50     ` Usama Arif
2026-04-01 15:52       ` Breno Leitao
2026-04-01 17:46 ` Vlastimil Babka (SUSE)
2026-04-02 12:40   ` Vlastimil Babka (SUSE)
2026-04-02 13:33     ` Breno Leitao
2026-04-02 12:43   ` Dmitry Ilvokhin
2026-04-02  7:18 ` Michal Hocko
2026-04-02 12:49 ` Matthew Wilcox
2026-04-02 13:26   ` Breno Leitao