From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8ede5e20-9309-4d1a-8f12-13603fd92014@linux.dev>
Date: Wed, 1 Apr 2026 16:50:03 +0100
From: Usama Arif
To: Breno Leitao
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kas@kernel.org, shakeel.butt@linux.dev, kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
In-Reply-To:
References: <20260401-vmstat-v1-1-b68ce4a35055@debian.org> <20260401152343.3294686-1-usama.arif@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Content-Language: en-GB

On 01/04/2026 18:43, Breno Leitao wrote:
> On Wed, Apr 01, 2026 at 08:23:40AM -0700, Usama Arif wrote:
>> On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao wrote:
>>
>>> vmstat_update uses round_jiffies_relative() when re-queuing itself,
>>> which aligns all CPUs' timers to the same second boundary. When many
>>> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
>>> free_pcppages_bulk() simultaneously, serializing on zone->lock and
>>> hitting contention.
>>>
>>> Introduce vmstat_spread_delay() which distributes each CPU's
>>> vmstat_update evenly across the stat interval instead of aligning them.
>>>
>>> This does not increase the number of timer interrupts — each CPU still
>>> fires once per interval. The timers are simply staggered rather than
>>> aligned.
>>> Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
>>> wake idle CPUs regardless of scheduling; the spread only affects CPUs
>>> that are already active.
>>>
>>> `perf lock contention` shows a 7.5x reduction in zone->lock contention
>>> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
>>> system under memory pressure.
>>>
>>> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
>>> memory allocation bursts. Lock contention was measured with:
>>>
>>>   perf lock contention -a -b -S free_pcppages_bulk
>>>
>>> Results with KASAN enabled:
>>>
>>> free_pcppages_bulk contention (KASAN):
>>> +--------------+----------+----------+
>>> | Metric       | No fix   | With fix |
>>> +--------------+----------+----------+
>>> | Contentions  | 872      | 117      |
>>> | Total wait   | 199.43ms | 80.76ms  |
>>> | Max wait     | 4.19ms   | 35.76ms  |
>>> +--------------+----------+----------+
>>>
>>> Results without KASAN:
>>>
>>> free_pcppages_bulk contention (no KASAN):
>>> +--------------+----------+----------+
>>> | Metric       | No fix   | With fix |
>>> +--------------+----------+----------+
>>> | Contentions  | 240      | 133      |
>>> | Total wait   | 34.01ms  | 24.61ms  |
>>> | Max wait     | 965us    | 1.35ms   |
>>> +--------------+----------+----------+
>>>
>>> Signed-off-by: Breno Leitao
>>> ---
>>>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>>>  1 file changed, 24 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 2370c6fb1fcd..2e94bd765606 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>>>  }
>>>  #endif /* CONFIG_PROC_FS */
>>>
>>> +/*
>>> + * Return a per-cpu delay that spreads vmstat_update work across the stat
>>> + * interval. Without this, round_jiffies_relative() aligns every CPU's
>>> + * timer to the same second boundary, causing a thundering-herd on
>>> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
>>> + * decay_pcp_high() -> free_pcppages_bulk().
>>> + */
>>> +static unsigned long vmstat_spread_delay(void)
>>> +{
>>> +	unsigned long interval = sysctl_stat_interval;
>>> +	unsigned int nr_cpus = num_online_cpus();
>>> +
>>> +	if (nr_cpus <= 1)
>>> +		return round_jiffies_relative(interval);
>>> +
>>> +	/*
>>> +	 * Spread per-cpu vmstat work evenly across the interval. Don't
>>> +	 * use round_jiffies_relative() here -- it would snap every CPU
>>> +	 * back to the same second boundary, defeating the spread.
>>> +	 */
>>> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
>>> +}
>>> +
>>>  static void vmstat_update(struct work_struct *w)
>>>  {
>>>  	if (refresh_cpu_vm_stats(true)) {
>>> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>>>  		 */
>>>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>>>  				this_cpu_ptr(&vmstat_work),
>>> -				round_jiffies_relative(sysctl_stat_interval));
>>> +				vmstat_spread_delay());
>>
>> This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?
>>
>> vmstat_shepherd() still queues work with delay 0 on all CPUs that
>> need_update() in its for_each_online_cpu() loop:
>>
>> 	if (!delayed_work_pending(dw) && need_update(cpu))
>> 		queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>>
>> So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
>> simultaneously.
>>
>> Under sustained memory pressure on a large system, I think the shepherd
>> fires every sysctl_stat_interval and could re-trigger the same lock
>> contention?
>
> Good point - incorporating similar spreading logic in vmstat_shepherd()
> would indeed address the simultaneous queueing issue you've described.
>
> Should I include this in a v2 of this patch, or would you prefer it as
> a separate follow-up patch?
I think it can be a separate follow-up patch, but no strong preference.

For this patch:

Acked-by: Usama Arif