From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 2 Apr 2026 12:43:27 +0000
From: Dmitry Ilvokhin
To: "Vlastimil Babka (SUSE)"
Cc: Breno Leitao, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	"Liam R. Howlett", Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, kas@kernel.org,
	shakeel.butt@linux.dev, usama.arif@linux.dev, kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
References: <20260401-vmstat-v1-1-b68ce4a35055@debian.org>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Apr 01, 2026 at 07:46:35PM +0200, Vlastimil Babka (SUSE) wrote:

[...]

> > +/*
> > + * Return a per-cpu delay that spreads vmstat_update work across the stat
> > + * interval. Without this, round_jiffies_relative() aligns every CPU's
> > + * timer to the same second boundary, causing a thundering-herd on
> > + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> > + * decay_pcp_high() -> free_pcppages_bulk().
> > + */
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > +	unsigned long interval = sysctl_stat_interval;
> > +	unsigned int nr_cpus = num_online_cpus();
> > +
> > +	if (nr_cpus <= 1)
> > +		return round_jiffies_relative(interval);
> > +
> > +	/*
> > +	 * Spread per-cpu vmstat work evenly across the interval. Don't
> > +	 * use round_jiffies_relative() here -- it would snap every CPU
> > +	 * back to the same second boundary, defeating the spread.
> > +	 */
> > +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
>
> Hm doesn't this mean that lower id cpus will consistently fire in shorter
> intervals and higher id in longer intervals? What we want is same interval
> but differently offset, no?

Yes, I think that's a valid concern: this effectively skews the interval
rather than just introducing a phase offset. I initially thought this might
explain the increase in max wait, but it turns out the columns were just
swapped.

Spreading the initial scheduling and then requeueing with a constant
interval sounds like a reasonable alternative, e.g. below.

>From 56ed7e17b32f0a7ce433caed87650b0de8246c4e Mon Sep 17 00:00:00 2001
From: Dmitry Ilvokhin
Date: Thu, 2 Apr 2026 04:49:06 -0700
Subject: [PATCH] mm/vmstat: stagger per-cpu vmstat updates to avoid
 zone->lock contention

Fix by spreading the shepherd's initial wakeup across the stat interval
and requeueing vmstat_update with plain sysctl_stat_interval to preserve
the stagger. Every CPU still fires once per interval: same frequency,
different phase.

Signed-off-by: Dmitry Ilvokhin
---
 mm/vmstat.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2370c6fb1fcd..aee99786718a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2042,7 +2042,7 @@ static void vmstat_update(struct work_struct *w)
 		 */
 		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
 				this_cpu_ptr(&vmstat_work),
-				round_jiffies_relative(sysctl_stat_interval));
+				sysctl_stat_interval);
 	}
 }
 
@@ -2140,7 +2140,8 @@ static void vmstat_shepherd(struct work_struct *w)
 			continue;
 
 		if (!delayed_work_pending(dw) && need_update(cpu))
-			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+			queue_delayed_work_on(cpu, mm_percpu_wq, dw,
+					(sysctl_stat_interval * cpu) / nr_cpu_ids);
 	}
 
 	cond_resched();
-- 
2.52.0