From: Breno Leitao
Date: Thu, 09 Apr 2026 05:26:36 -0700
Subject: [PATCH v2] mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
Message-Id: <20260409-vmstat-v2-1-e9d9a6db08ad@debian.org>
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
 Christoph Lameter
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, kas@kernel.org,
 shakeel.butt@linux.dev, usama.arif@linux.dev, kernel-team@meta.com,
 Breno Leitao

vmstat_shepherd uses delayed_work_pending() to check whether
vmstat_update is already scheduled for a given CPU before queuing it.
However, delayed_work_pending() only tests WORK_STRUCT_PENDING_BIT,
which is cleared the moment a worker thread picks up the work to
execute it.

This means that while vmstat_update is actively running on a CPU,
delayed_work_pending() returns false. If need_update() also returns
true at that point (per-cpu counters not yet zeroed mid-flush), the
shepherd queues a second invocation with delay=0, causing vmstat_update
to run again immediately after finishing.

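As a rough illustration of the paragraph above, the following user-space
C sketch models a work item with two flags. It is not kernel code:
struct model_work and every model_* helper are hypothetical names
invented for this example, and the real workqueue state machine is more
involved. The sketch only shows why a check of the pending flag alone
reports "not scheduled" while the callback is still running.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only; none of these names are kernel APIs. */
struct model_work {
	bool pending;	/* stands in for WORK_STRUCT_PENDING_BIT */
	bool running;	/* true while the callback is executing */
};

/* Queueing sets the pending flag; returns false if it was already set. */
static bool model_queue(struct model_work *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/* A worker picks the item up: pending is cleared before the callback runs. */
static void model_worker_start(struct model_work *w)
{
	assert(w->pending);
	w->pending = false;
	w->running = true;
}

/* The callback has returned. */
static void model_worker_finish(struct model_work *w)
{
	w->running = false;
}

/* Models the old shepherd condition: it looks at the pending flag only. */
static bool model_shepherd_would_queue(const struct model_work *w,
				       bool need_update)
{
	return !w->pending && need_update;
}
```

In this model, calling model_queue() and then model_worker_start()
leaves pending false and running true, so
model_shepherd_would_queue(&w, true) returns true and a second
model_queue() succeeds while the first callback is still executing.
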
On a 72-CPU system this race is readily observable: before the fix, many
CPUs show invocation gaps well below 500 jiffies (the minimum
round_jiffies_relative() can produce), with the most extreme cases
reaching 0 jiffies, i.e. vmstat_update called twice within the same
jiffy.

Fix this by replacing delayed_work_pending() with work_busy(), which
returns non-zero for both WORK_BUSY_PENDING (timer armed or work queued)
and WORK_BUSY_RUNNING (work currently executing). The shepherd now
correctly skips a CPU in all busy states.

After the fix, all sub-jiffy and most sub-100-jiffy gaps disappear. The
remaining early invocations have gaps in the 700–999 jiffy range,
attributable to round_jiffies_relative() aligning to a nearer
whole-second boundary rather than to this race.

Each spurious vmstat_update invocation has a measurable side effect:
refresh_cpu_vm_stats() calls decay_pcp_high() for every zone, which
drains idle per-CPU pages back to the buddy allocator via
free_pcppages_bulk(), taking the zone spinlock each time. Eliminating
the double-scheduling therefore reduces zone lock contention directly.

On a 72-CPU stress-ng workload measured with perf lock contention:

  free_pcppages_bulk contention count:  ~55% reduction
  free_pcppages_bulk total wait time:   ~57% reduction
  free_pcppages_bulk max wait time:     ~47% reduction

Note: work_busy() is inherently racy. Between the check and the
subsequent queue_delayed_work_on() call, vmstat_update can finish
execution, leaving the work neither pending nor running. In that narrow
window the shepherd can still queue a second invocation. After the fix,
this residual race is rare and produces only occasional small gaps, a
significant improvement over the systematic double-scheduling seen with
delayed_work_pending().

Fixes: 7b8da4c7f07774 ("vmstat: get rid of the ugly cpu_stat_off variable")
Signed-off-by: Breno Leitao
Reviewed-by: Vlastimil Babka (SUSE)
---
Changes in v2:
- Instead of changing the timings, do not double-schedule.
- Link to v1: https://patch.msgid.link/20260401-vmstat-v1-1-b68ce4a35055@debian.org
---
 mm/vmstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2370c6fb1fcd6..cc5fdc0d0f298 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2139,7 +2139,7 @@ static void vmstat_shepherd(struct work_struct *w)
 			if (cpu_is_isolated(cpu))
 				continue;
 
-			if (!delayed_work_pending(dw) && need_update(cpu))
+			if (!work_busy(&dw->work) && need_update(cpu))
 				queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 		}
 
---
base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
change-id: 20260401-vmstat-048e0feaf344

Best regards,
-- 
Breno Leitao
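
For readers who want to see the difference between the two checks side
by side, here is a minimal user-space sketch. It is a model under
stated assumptions, not kernel code: struct model_work, the
MODEL_BUSY_* constants and the model_* helpers are hypothetical names
made up for this example, and the bit values are chosen for the model
only. MODEL_BUSY_PENDING and MODEL_BUSY_RUNNING merely stand in for the
WORK_BUSY_PENDING and WORK_BUSY_RUNNING states named in the commit
message.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only; none of these names are kernel APIs. */
enum {
	MODEL_BUSY_PENDING = 1 << 0,	/* timer armed or work queued */
	MODEL_BUSY_RUNNING = 1 << 1,	/* callback currently executing */
};

struct model_work {
	bool pending;
	bool running;
};

/* Reports both states as a bitmask; zero means completely idle. */
static unsigned int model_work_busy(const struct model_work *w)
{
	unsigned int ret = 0;

	if (w->pending)
		ret |= MODEL_BUSY_PENDING;
	if (w->running)
		ret |= MODEL_BUSY_RUNNING;
	return ret;
}

/* Condition before the patch: only the pending flag is consulted. */
static bool model_old_check(const struct model_work *w, bool need_update)
{
	return !w->pending && need_update;
}

/* Condition after the patch: skip the CPU in any busy state. */
static bool model_new_check(const struct model_work *w, bool need_update)
{
	return !model_work_busy(w) && need_update;
}

/* The callback returns; the item is now neither pending nor running. */
static void model_worker_finish(struct model_work *w)
{
	assert(w->running);
	w->running = false;
}
```

With a work item in the running state, model_old_check() returns true
and model_new_check() returns false, which is the case the patch
closes. Once model_worker_finish() has been called, model_new_check()
returns true again; if that happens between the check and the queueing
call, it corresponds to the residual window described in the note
above.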