Re: [PATCH v16 2/3] mm: Improve RSS counter approximation accuracy for proc interfaces

From: Michal Hocko <mhocko@suse.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux.com>,
	Martin Liu <liumartin@google.com>,
	David Rientjes <rientjes@google.com>,
	christian.koenig@amd.com, Shakeel Butt <shakeel.butt@linux.dev>,
	SeongJae Park <sj@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Sweet Tea Dorminy <sweettea-kernel@dorminy.me>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <liam.howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Christian Brauner <brauner@kernel.org>,
	Wei Yang <richard.weiyang@gmail.com>,
	David Hildenbrand <david@redhat.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	Yu Zhao <yuzhao@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Mateusz Guzik <mjguzik@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Aboorva Devarajan <aboorvad@linux.ibm.com>
Subject: Re: [PATCH v16 2/3] mm: Improve RSS counter approximation accuracy for proc interfaces
Date: Wed, 14 Jan 2026 17:48:00 +0100
Message-ID: <aWfIwKzzIihhByJ9@tiehlicka>
In-Reply-To: <20260114145915.49926-3-mathieu.desnoyers@efficios.com>

On Wed 14-01-26 09:59:14, Mathieu Desnoyers wrote:
> Use hierarchical per-cpu counters for RSS tracking to improve the
> accuracy of per-mm RSS sum approximation on large many-core systems [1].
> This improves the accuracy of the RSS values returned by proc
> interfaces.
> 
> This is also a preparation step to introduce a 2-pass OOM killer task
> selection which leverages the approximation and accuracy ranges to
> quickly eliminate tasks which are outside of the range of the current
> selection, and thus reduce the latency introduced by execution of the
> OOM killer.
> 
> Here is a (possibly incomplete) list of the prior approaches that were
> used or proposed, along with their downsides:
> 
> 1) Per-thread rss tracking: large error on many-thread processes.
> 
> 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
>    increased system time in make test workloads [1]. Moreover, the
>    inaccuracy grows as O(n^2) with the number of CPUs.
> 
> 3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
>    error is high with systems that have lots of NUMA nodes (32 times
>    the number of NUMA nodes).
> 
> 4) Use a precise per-cpu counter sum for each counter value query:
>    requires iterating over every possible CPU for each sum, which
>    adds overhead (and thus increases OOM killer latency) on large
>    many-core systems running many processes.
> 
> The approach proposed here is to replace the per-cpu counters with
> hierarchical per-cpu counters, which bound the inaccuracy based on the
> system topology with O(N*logN).
> 
> * Testing results:
> 
> Test hardware: 2 sockets AMD EPYC 9654 96-Core Processor (384 logical CPUs total)
> 
> Methodology:
> 
> Comparing the current upstream implementation with the hierarchical
> counters is done by keeping both implementations wired up in parallel,
> and running a single-process, single-threaded program which hops
> randomly across CPUs in the system, calling mmap(2) and munmap(2) on
> random CPUs, keeping track of an array of allocated mappings, randomly
> choosing entries to either map or unmap.
> 
> get_mm_counter() is instrumented to compare the upstream counter
> approximation to the precise value, and print the delta when going over
> a given threshold. The delta of the hierarchical counter approximation
> to the precise value is also printed for comparison.
> 
> After a few minutes running this test, the upstream implementation
> counter approximation reaches a 1GB delta from the
> precise value, compared to 80MB delta with the hierarchical counter.
> The hierarchical counter provides a guaranteed maximum approximation
> inaccuracy of 192MB on that hardware topology.
> 
> * Fast path implementation comparison
> 
> The new inline percpu_counter_tree_add() uses a this_cpu_add_return()
> for the fast path (under a certain allocation size threshold).  Above
> that, it calls a slow path which "trickles up" the carry to upper level
> counters with atomic_add_return.
> 
> In comparison, the upstream counters implementation calls
> percpu_counter_add_batch which uses this_cpu_try_cmpxchg() on the fast
> path, and does a raw_spin_lock_irqsave above a certain threshold.
> 
> The hierarchical implementation is therefore expected to have less
> contention on mid-sized allocations than the upstream counters because
> the atomic counters tracking those bits are only shared across nearby
> CPUs. In comparison, the upstream counters immediately use a global
> spinlock when reaching the threshold.
> 
> * Benchmarks
> 
> The will-it-scale page_fault1 benchmarks are used to compare the
> upstream counters to the hierarchical counters, with hyperthreading
> disabled. The speedup is within the standard deviation of the upstream
> runs, so the overhead is not significant.
> 
>                                           upstream   hierarchical    speedup
> page_fault1_processes -s 100 -t 1           614783         615558      +0.1%
> page_fault1_threads -s 100 -t 1             612788         612447      -0.1%
> page_fault1_processes -s 100 -t 96        37994977       37932035      -0.2%
> page_fault1_threads -s 100 -t 96           2484130        2504860      +0.8%
> page_fault1_processes -s 100 -t 192       71262917       71118830      -0.2%
> page_fault1_threads -s 100 -t 192          2446437        2469296      +0.9%
> 
> This change depends on the following patch:
> "mm: Fix OOM killer inaccuracy on large many-core systems" [2]

As mentioned in the previous patch, it would be great to explicitly
state the memory cost of the new tracking data structure.

Other than that, this seems like a generally useful improvement for
larger systems, and it is my understanding that it adds almost no
overhead on low-end systems, correct?

-- 
Michal Hocko
SUSE Labs
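
For readers following the thread, here is a minimal userspace sketch of
the "trickle up the carry" idea described in the quoted changelog. It is
not the percpu_counter_tree code from the series: the hc_* names, the
two-level layout, and the batch sizes are all assumptions made for
illustration, and the sketch is single-threaded, so it uses plain
additions where the changelog mentions this_cpu_add_return() and
atomic_add_return().

```c
#include <stdlib.h>

/* All constants below are hypothetical, chosen only for illustration. */
#define HC_NR_CPUS   8    /* number of per-CPU leaf counters */
#define HC_GROUP_SZ  4    /* CPUs sharing one intermediate counter */
#define HC_NR_GROUPS (HC_NR_CPUS / HC_GROUP_SZ)
#define HC_BATCH_L0  32   /* carry threshold for a per-CPU counter */
#define HC_BATCH_L1  128  /* carry threshold for a group counter */

struct hc_counter {
	long cpu[HC_NR_CPUS];     /* leaves: updated on the fast path */
	long group[HC_NR_GROUPS]; /* intermediate level: nearby CPUs only */
	long global;              /* root: the cheap approximation */
};

/* Remove whole multiples of batch from *src and return them as carry.
 * C integer division truncates toward zero, so |*src| < batch afterwards. */
static long hc_carry_out(long *src, long batch)
{
	long carry = (*src / batch) * batch;

	*src -= carry;
	return carry;
}

/* Add delta on behalf of a CPU, propagating carries upward as needed. */
void hc_add(struct hc_counter *c, int cpu, long delta)
{
	long *g = &c->group[cpu / HC_GROUP_SZ];

	c->cpu[cpu] += delta;
	if (labs(c->cpu[cpu]) < HC_BATCH_L0)
		return;                       /* fast path: stays local */

	*g += hc_carry_out(&c->cpu[cpu], HC_BATCH_L0);
	if (labs(*g) < HC_BATCH_L1)
		return;                       /* carry absorbed by the group */

	c->global += hc_carry_out(g, HC_BATCH_L1);
}

/* Cheap read: only the root is consulted. */
long hc_approx(const struct hc_counter *c)
{
	return c->global;
}

/* Expensive read: sum every level. */
long hc_precise(const struct hc_counter *c)
{
	long sum = c->global;
	int i;

	for (i = 0; i < HC_NR_GROUPS; i++)
		sum += c->group[i];
	for (i = 0; i < HC_NR_CPUS; i++)
		sum += c->cpu[i];
	return sum;
}

/* Worst-case |precise - approx|: every lower-level counter sits just
 * under its carry threshold. */
long hc_max_error(void)
{
	return HC_NR_CPUS * (HC_BATCH_L0 - 1) +
	       HC_NR_GROUPS * (HC_BATCH_L1 - 1);
}
```

With these made-up parameters hc_max_error() is 8*31 + 2*127 = 502, and
the bound depends only on the shape of the tree and the batch sizes, not
on how many updates have been applied. That is the property the
changelog relies on when it quotes a guaranteed maximum inaccuracy for a
given hardware topology.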

Thread overview: 15+ messages
2026-01-14 14:59 [PATCH v16 0/3] Improve proc RSS accuracy and OOM killer latency Mathieu Desnoyers
2026-01-14 14:59 ` [PATCH v16 1/3] lib: Introduce hierarchical per-cpu counters Mathieu Desnoyers
2026-01-14 16:41   ` Michal Hocko
2026-01-14 19:19     ` Mathieu Desnoyers
2026-01-16 15:51       ` Michal Hocko
2026-01-26 16:34         ` Mathieu Desnoyers
2026-01-14 14:59 ` [PATCH v16 2/3] mm: Improve RSS counter approximation accuracy for proc interfaces Mathieu Desnoyers
2026-01-14 16:48   ` Michal Hocko [this message]
2026-01-14 19:21     ` Mathieu Desnoyers
2026-01-14 14:59 ` [PATCH v16 3/3] mm: Reduce latency of OOM killer task selection with 2-pass algorithm Mathieu Desnoyers
2026-01-14 17:06   ` Michal Hocko
2026-01-14 19:36     ` Mathieu Desnoyers
2026-01-16 15:55       ` Michal Hocko
2026-01-26 16:39         ` Mathieu Desnoyers
2026-01-26 17:47           ` Michal Hocko
