DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
 t=1775057042;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 content-transfer-encoding:content-transfer-encoding:
 in-reply-to:in-reply-to:references:references;
 bh=7bgKpzn8XxryGiVAnWC5zFvW5iObgW7oiW/vycCAsgs=;
 b=PKEmrArhr4S5xmd8rRqYloL1+Gt1Ok+SavD1QVhWSjom+r1bFhwk5YVaSqJ2OKwKJ2nT1X
 vC4I+46Ulmu69eV6yS6Z1Q3rWZC3oK1ErmO8ThVkAXLFbYQiS4wwX74dRLHo/84LYGxdQY
 uThg9XrqZVMC2zSzmzFIZOI51P7zJaw=
From: Usama Arif
To: Breno Leitao
Cc: Usama Arif, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
 "Liam R. Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
 Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 kas@kernel.org, shakeel.butt@linux.dev, kernel-team@meta.com
Subject: Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
Date: Wed, 1 Apr 2026 08:23:40 -0700
Message-ID: <20260401152343.3294686-1-usama.arif@linux.dev>
In-Reply-To: <20260401-vmstat-v1-1-b68ce4a35055@debian.org>
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Migadu-Flow: FLOW_OUT

On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao wrote:

> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary. When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
>
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
>
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned.
> Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active
>
> `perf lock contention` shows 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
>
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts. Lock contention was measured with:
>
>   perf lock contention -a -b -S free_pcppages_bulk
>
> Results with KASAN enabled:
>
>   free_pcppages_bulk contention (KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      872 |      117 |
>   | Total wait   | 199.43ms |  80.76ms |
>   | Max wait     |   4.19ms |  35.76ms |
>   +--------------+----------+----------+
>
> Results without KASAN:
>
>   free_pcppages_bulk contention (no KASAN):
>   +--------------+----------+----------+
>   | Metric       | No fix   | With fix |
>   +--------------+----------+----------+
>   | Contentions  |      240 |      133 |
>   | Total wait   |  34.01ms |  24.61ms |
>   | Max wait     |    965us |   1.35ms |
>   +--------------+----------+----------+
>
> Signed-off-by: Breno Leitao
> ---
>  mm/vmstat.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
>  }
>  #endif /* CONFIG_PROC_FS */
>
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval. Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval. Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> +}
> +
>  static void vmstat_update(struct work_struct *w)
>  {
>  	if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
>  		 */
>  		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
>  				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				vmstat_spread_delay());

This is awesome! Maybe this needs to be done to vmstat_shepherd() as well?

vmstat_shepherd() still queues work with delay 0 on all CPUs that
need_update() in its for_each_online_cpu() loop:

	if (!delayed_work_pending(dw) && need_update(cpu))
		queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
simultaneously.

Under sustained memory pressure on a large system, I think the shepherd
fires every sysctl_stat_interval and could re-trigger the same lock
contention?

>  	}
>  }
>
>
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
>
> Best regards,
> --
> Breno Leitao
>
>
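
For reference, the arithmetic in the quoted vmstat_spread_delay() can be
checked outside the kernel. The following is only a userspace sketch, not
the patch itself: spread_delay() and print_spread() are hypothetical names,
the interval, CPU id and CPU count are passed as plain parameters instead
of being read from sysctl_stat_interval, smp_processor_id() and
num_online_cpus(), and the single-CPU branch returns the interval unrounded
because round_jiffies_relative() is not available in userspace.

```c
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for the quoted vmstat_spread_delay().
 * Same integer arithmetic as the patch, but with all inputs passed in.
 */
unsigned long spread_delay(unsigned long interval, unsigned int cpu,
			   unsigned int nr_cpus)
{
	/* The kernel version rounds here; this sketch leaves it unrounded. */
	if (nr_cpus <= 1)
		return interval;

	/* Base interval plus this CPU's share of one further interval. */
	return interval + (interval * (cpu % nr_cpus)) / nr_cpus;
}

/* Print the delay each CPU would end up requeueing itself with. */
void print_spread(unsigned long interval, unsigned int nr_cpus)
{
	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
		printf("cpu %u: delay %lu jiffies\n", cpu,
		       spread_delay(interval, cpu, nr_cpus));
}
```

With an interval of 1000 jiffies and 4 CPUs, spread_delay() gives 1000,
1250, 1500 and 1750 for CPUs 0 to 3, so under this formula every delay
falls between interval and just under 2 * interval.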