From: Mathieu Desnoyers
To: Andrew Morton
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, "Paul E. McKenney",
    Steven Rostedt, Masami Hiramatsu, Dennis Zhou, Tejun Heo,
    Christoph Lameter, Martin Liu, David Rientjes,
    christian.koenig@amd.com, Shakeel Butt, SeongJae Park, Michal Hocko,
    Johannes Weiner, Sweet Tea Dorminy, Lorenzo Stoakes,
    "Liam R . Howlett", Mike Rapoport, Suren Baghdasaryan,
    Vlastimil Babka, Christian Brauner, Wei Yang, David Hildenbrand,
    Miaohe Lin, Al Viro, linux-mm@kvack.org,
    linux-trace-kernel@vger.kernel.org, Yu Zhao, Roman Gushchin,
    Mateusz Guzik, Matthew Wilcox, Baolin Wang, Aboorva Devarajan
Subject: [PATCH v14 0/3] mm: Fix OOM killer inaccuracy on large many-core systems
Date: Mon, 12 Jan 2026 15:00:53 -0500
Message-Id: <20260112200056.1250404-1-mathieu.desnoyers@efficios.com>
X-Mailing-List: linux-trace-kernel@vger.kernel.org

Introduce hierarchical per-cpu counters and use them for per-mm RSS
tracking, which has become too inaccurate for OOM killer purposes on
large many-core systems.

The following RSS tracking issues were noted by Sweet Tea Dorminy [1],
which lead to picking the wrong tasks as OOM kill targets:

  Recently, several internal services had an RSS usage regression as
  part of a kernel upgrade.
  Previously, they were on a pre-6.2 kernel and were able to read RSS
  statistics in a backup watchdog process to monitor and decide if
  they'd overrun their memory budget. Now, however, a representative
  service with five threads, expected to use about a hundred MB of
  memory, on a 250-cpu machine had memory usage tens of megabytes
  different from the expected amount -- this constituted a significant
  percentage of inaccuracy, causing the watchdog to act.

  This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
  into percpu_counter") [1]. Previously, the memory error was bounded
  by 64*nr_threads pages, a very livable megabyte. Now, however, as a
  result of scheduler decisions moving the threads around the CPUs, the
  memory error could be as large as a gigabyte.

  This is a really tremendous inaccuracy for any few-threaded program
  on a large machine and impedes monitoring significantly. These stat
  counters are also used to make OOM killing decisions, so this
  additional inaccuracy could make a big difference in OOM situations
  -- either resulting in the wrong process being killed, or in less
  memory being returned from an OOM-kill than expected.

The approach proposed here is to replace the percpu_counter-based RSS
accounting with hierarchical per-cpu counters, which bound the
inaccuracy based on the system topology with O(N*logN).

Notable changes for v14:

- Change check_mm print format from %d to %ld (was folded into the
  wrong patch).

I've done moderate testing of this series on a 256-core VM with 128GB
RAM. Figuring out whether this indeed helps solve issues with real-life
workloads will require broader feedback from the community.

This series is based on v6.19-rc4, on top of the following two
preparation series:

https://lore.kernel.org/linux-mm/20251224173358.647691-1-mathieu.desnoyers@efficios.com/T/#t
https://lore.kernel.org/linux-mm/20251224173810.648699-1-mathieu.desnoyers@efficios.com/T/#t

Andrew, this series replaces v13, for testing in mm-new.

Thanks!

Mathieu

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
To: Andrew Morton
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Martin Liu
Cc: David Rientjes
Cc: christian.koenig@amd.com
Cc: Shakeel Butt
Cc: SeongJae Park
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Sweet Tea Dorminy
Cc: Lorenzo Stoakes
Cc: "Liam R . Howlett"
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Christian Brauner
Cc: Wei Yang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Al Viro
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao
Cc: Roman Gushchin
Cc: Mateusz Guzik
Cc: Matthew Wilcox
Cc: Baolin Wang
Cc: Aboorva Devarajan

Mathieu Desnoyers (3):
  lib: Introduce hierarchical per-cpu counters
  mm: Fix OOM killer inaccuracy on large many-core systems
  mm: Implement precise OOM killer task selection

 fs/proc/base.c                      |   2 +-
 include/linux/mm.h                  |  49 +-
 include/linux/mm_types.h            |  54 ++-
 include/linux/oom.h                 |  11 +-
 include/linux/percpu_counter_tree.h | 344 ++++++++++++++
 include/trace/events/kmem.h         |   2 +-
 init/main.c                         |   2 +
 kernel/fork.c                       |  22 +-
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 702 ++++++++++++++++++++++++++++
 mm/oom_kill.c                       |  84 +++-
 11 files changed, 1223 insertions(+), 50 deletions(-)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c

-- 
2.39.5
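
For readers unfamiliar with the idea, the hierarchical counter approach
described in the cover letter can be illustrated with a small
single-threaded user-space model. This is only a sketch under stated
assumptions, not the percpu_counter_tree API from this series: every
toy_* name is hypothetical, and the binary fan-out, the leaf batch of 4,
and the doubling of the batch at each level are choices made purely for
illustration. It also ignores concurrency entirely. Each leaf stands in
for one CPU; a node keeps a residual until its magnitude reaches that
level's batch, then carries it to its parent, so the root always holds
an approximation whose error is bounded by the sum of the per-node
batches.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy parameters, not taken from the patch series. */
#define NR_LEAVES  8                /* stand-in for the CPU count (power of two) */
#define NR_NODES   (2 * NR_LEAVES)  /* heap layout: node 1 is the root */
#define LEAF_BATCH 4                /* carry threshold at the leaf level */

struct toy_counter_tree {
	/* node[1] is the approximate total; others are unpropagated residuals. */
	long node[NR_NODES];
};

/* Add delta on behalf of leaf "cpu", carrying residuals upward in batches. */
static void toy_tree_add(struct toy_counter_tree *t, int cpu, long delta)
{
	int idx = NR_LEAVES + cpu;
	long batch = LEAF_BATCH;

	assert(cpu >= 0 && cpu < NR_LEAVES);
	while (idx > 1) {
		t->node[idx] += delta;
		if (labs(t->node[idx]) < batch)
			return;             /* residual stays at this level */
		delta = t->node[idx];       /* carry the whole residual up */
		t->node[idx] = 0;
		idx >>= 1;                  /* move to the parent node */
		batch <<= 1;                /* assumed: batch doubles per level */
	}
	t->node[1] += delta;
}

/* Cheap read: only the root is consulted. */
static long toy_tree_approx(const struct toy_counter_tree *t)
{
	return t->node[1];
}

/* Exact read: root plus every residual still held below it. */
static long toy_tree_precise(const struct toy_counter_tree *t)
{
	long sum = 0;
	int i;

	for (i = 1; i < NR_NODES; i++)
		sum += t->node[i];
	return sum;
}

/* Worst-case |precise - approx|: each non-root node holds less than its batch. */
static long toy_tree_max_error(void)
{
	long err = 0, batch = LEAF_BATCH;
	int nodes;

	for (nodes = NR_LEAVES; nodes > 1; nodes >>= 1, batch <<= 1)
		err += nodes * (batch - 1);
	return err;
}
```

In this model, a caller that needs an exact value pays for walking every
node, while the approximate read touches only the root. The worst-case
error depends only on the tree shape and the batch sizes, not on how
updates happen to be spread across the leaves.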