From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from out-171.mta0.migadu.com (out-171.mta0.migadu.com [91.218.175.171])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id D457040DFCA
	for <linux-kernel@vger.kernel.org>; Wed,  1 Apr 2026 15:24:04 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.171
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1775057046; cv=none; b=ZdsELxM5W28fVV342xGAwsua7nZ0F3D3L2D5Ffg28x9IJINUIn+85xb0Susb+ViJmXvr8zCA30VXDJMENZdEqnstwLjy/QFAFzrF1YJjNmSaJeMzg0vjoe5344fNe/z/fa+++6XlfCgWtNtv9M53Dsp2mx+HsBxL+JY+UEfrYUc=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1775057046; c=relaxed/simple;
	bh=JcbY3Hnxun0BoJ/RrN3SuYMTnt8L4tjkF0NFwT1cORw=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type; b=gcJa2oA6ThQJ4OZJ1RxAKTJVm6Q1BsFp+s0Nivt2AMrXBTFl9KlwQJvfXLnfPs4vd2wb5KMiMxAsI2+Lp6HCgl7kU+NyyoYkN5Y9wRlGTYIDDIe9+CuJ0ybOBOrrSLeiPhrXorSJqjQM/O/vVFUPzC+6dCnRbtN7WWPrhIfy5/c=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=PKEmrArh; arc=none smtp.client-ip=91.218.175.171
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="PKEmrArh"
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1775057042;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=7bgKpzn8XxryGiVAnWC5zFvW5iObgW7oiW/vycCAsgs=;
	b=PKEmrArhr4S5xmd8rRqYloL1+Gt1Ok+SavD1QVhWSjom+r1bFhwk5YVaSqJ2OKwKJ2nT1X
	vC4I+46Ulmu69eV6yS6Z1Q3rWZC3oK1ErmO8ThVkAXLFbYQiS4wwX74dRLHo/84LYGxdQY
	uThg9XrqZVMC2zSzmzFIZOI51P7zJaw=
From: Usama Arif <usama.arif@linux.dev>
To: Breno Leitao <leitao@debian.org>
Cc: Usama Arif <usama.arif@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	kas@kernel.org,
	shakeel.butt@linux.dev,
	kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Date: Wed,  1 Apr 2026 08:23:40 -0700
Message-ID: <20260401152343.3294686-1-usama.arif@linux.dev>
In-Reply-To: <20260401-vmstat-v1-1-b68ce4a35055@debian.org>
References: 
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Migadu-Flow: FLOW_OUT

On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@debian.org> wrote:

> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary.  When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
> 
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
> 
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active.
> 
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
> 
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts.  Lock contention was measured with:
> 
>   perf lock contention -a -b -S free_pcppages_bulk
> 
> Results with KASAN enabled:
> 
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms | 80.76ms  |
>   | Max wait     |   4.19ms | 35.76ms  |
>   +--------------+----------+----------+
> 
> Results without KASAN:
> 
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms | 24.61ms  |
>   | Max wait     |   965us  |  1.35ms  |
>   +--------------+----------+----------+
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval.  Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval.  Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> +}
> +
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				vmstat_spread_delay());

This is awesome! Maybe the same spreading needs to be done in
vmstat_shepherd() as well?

vmstat_shepherd() still queues work with delay 0 on every CPU for which
need_update() returns true in its for_each_online_cpu() loop:

      if (!delayed_work_pending(dw) && need_update(cpu))
          queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

So when the shepherd fires, it kicks the vmstat workers of all dormant
CPUs that need an update at the same time.

I think the shepherd fires every sysctl_stat_interval, so under
sustained memory pressure on a large system it could re-trigger the
same zone->lock contention?
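
To make the idea concrete, here is a rough userspace model of how the
shepherd could reuse the same per-CPU offset. It is only a sketch and is
not tested against the kernel: shepherd_spread_delay() and dump_spread()
are hypothetical names that are not in the patch, and the interval of
250 jiffies used in the note below assumes HZ=250 with the default
one-second sysctl_stat_interval.

```c
/*
 * Userspace model only, not kernel code.  shepherd_spread_delay() is a
 * hypothetical helper and is not part of the posted patch.
 */
#include <assert.h>
#include <stdio.h>

/*
 * Per-CPU offset in jiffies, spread evenly across [0, interval).  This
 * is the offset term from vmstat_spread_delay() without the leading
 * "interval +", because the shepherd currently queues with delay 0.
 */
static unsigned long shepherd_spread_delay(unsigned long interval,
					   unsigned int cpu,
					   unsigned int nr_cpus)
{
	assert(nr_cpus > 0);

	if (nr_cpus == 1)
		return 0;

	return (interval * (cpu % nr_cpus)) / nr_cpus;
}

/* Print the delay each CPU would get if the shepherd loop used it. */
static void dump_spread(unsigned long interval, unsigned int nr_cpus)
{
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		printf("cpu %u: delay %lu jiffies\n", cpu,
		       shepherd_spread_delay(interval, cpu, nr_cpus));
}
```

With interval = 250 and 72 CPUs this gives 0 jiffies for CPU 0, 125 for
CPU 36 and 246 for CPU 71. In the shepherd loop the result would
presumably replace the 0 passed to queue_delayed_work_on(). The cost is
that a CPU picked up by the shepherd could run its update almost one
interval later than it does today, so I am not sure the trade-off is
acceptable.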
 
>  	}
>  }
>  
> 
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>
> 
>