From: Michal Hocko <mhocko@suse.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux.com>,
	Martin Liu <liumartin@google.com>,
	David Rientjes <rientjes@google.com>,
	christian.koenig@amd.com, Shakeel Butt <shakeel.butt@linux.dev>,
	SeongJae Park <sj@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Sweet Tea Dorminy <sweettea-kernel@dorminy.me>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <liam.howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Christian Brauner <brauner@kernel.org>,
	Wei Yang <richard.weiyang@gmail.com>,
	David Hildenbrand <david@redhat.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
	Yu Zhao <yuzhao@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Mateusz Guzik <mjguzik@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Aboorva Devarajan <aboorvad@linux.ibm.com>
Subject: Re: [PATCH v13 2/3] mm: Fix OOM killer inaccuracy on large many-core systems
Date: Tue, 13 Jan 2026 10:24:40 +0100
Message-ID: <aWYPWNIv4lR2FpUZ@tiehlicka>
In-Reply-To: <b779c646-64c7-49f8-8847-8819227e3f1f@efficios.com>

On Mon 12-01-26 19:47:54, Mathieu Desnoyers wrote:
> On 2026-01-12 14:48, Michal Hocko wrote:
> > On Mon 12-01-26 14:37:49, Mathieu Desnoyers wrote:
> > > On 2026-01-12 03:42, Michal Hocko wrote:
> > > > Hi,
> > > > sorry to jump in this late but the timing of previous versions didn't
> > > > really work well for me.
> > > > 
> > > > On Sun 11-01-26 14:49:57, Mathieu Desnoyers wrote:
> > > > [...]
> > > > > Here is a (possibly incomplete) list of the prior approaches that were
> > > > > used or proposed, along with their downside:
> > > > > 
> > > > > 1) Per-thread rss tracking: large error on many-thread processes.
> > > > > 
> > > > > 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
> > > > >      increased system time in make test workloads [1]. Moreover, the
> > > > >      inaccuracy increases with O(n^2) with the number of CPUs.
> > > > > 
> > > > > 3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
> > > > >      error is high with systems that have lots of NUMA nodes (32 times
> > > > >      the number of NUMA nodes).
> > > > > 
> > > > > The approach proposed here is to replace this by the hierarchical
> > > > > per-cpu counters, which bounds the inaccuracy based on the system
> > > > > topology with O(N*logN).
> > > > 
> > > > The concept of hierarchical pcp counter is interesting and I am
> > > > definitely not opposed if there are more users that would benefit.
> > > > 
> > > >   From the OOM POV, IIUC the primary problem is that get_mm_counter
> > > > (percpu_counter_read_positive) is too imprecise on systems when the task
> > > > is moving around a large number of cpus. In the list of alternative
> > > > solutions I do not see percpu_counter_sum_positive to be mentioned.
> > > > oom_badness() is a really slow path and taking the slow path to
> > > > calculate a much more precise value seems acceptable. Have you
> > > > considered that option?
> > > I must admit I assumed that since there was already a mechanism in place
> > > to ensure it's not necessary to sum per-cpu counters when the oom killer
> > > is trying to select tasks, it must be because this
> > > 
> > >    O(nr_possible_cpus * nr_processes)
> > > 
> > > operation must be too slow for the oom killer requirements.
> > > 
> > > AFAIU, the oom killer is executed when the memory allocator fails to
> > > allocate memory, which can be within code paths which need to progress
> > > eventually. So even though it's a slow path compared to the allocator
> > > fast path, there must be at least _some_ expectations about it
> > > completing within a decent amount of time. What would that ballpark be ?
> > 
> > I do not think we have ever promised more than that the oom killer
> > will try to unlock the system blocked on memory shortage.
> > 
> > > To give an order of magnitude, I've tried modifying the upstream
> > > oom killer to use percpu_counter_sum_positive and compared it to
> > > the hierarchical approach:
> > > 
> > > AMD EPYC 9654 96-Core (2 sockets)
> > > Within a KVM, configured with 256 logical cpus.
> > > 
> > >                     nr_processes=40    nr_processes=10000
> > > Counter sum:            0.4 ms             81.0 ms
> > > HPCC with 2-pass:       0.3 ms              9.3 ms
> > 
> > These are peanuts for the global oom situations. We have had situations
> > when soft lockup detector triggered because of the process tree
> > traversal so adding 100ms is not really critical.
> > 
> > > So as we scale up the number of processes on large SMP systems,
> > > the latency caused by the oom killer task selection greatly
> > > increases with the counter sums compared with the hierarchical
> > > approach.
> > 
> > Yes, I am not really questioning the hierarchical approach will perform
> > much better but I am thinking of a good enough solution and calculating
> > the number might be just that stop gap solution (that would be also
> > suitable for stable tree backports). I am not ruling out improving on
> > top of that by a more clever solution like your hierarchical counters
> > approach. Especially if there are more benefits from that elsewhere.
> > 
> 
> Would you be OK with introducing changes in the following order ?
> 
> 1) Fix the OOM killer inaccuracy by using counter sum (iteration on all
>    cpu counters) in task selection. This may slow down the oom killer,
>    but would at least fix its current inaccuracy issues. This could be
>    backported to stable kernels.
> 
> 2) Introduce the hierarchical percpu counters on top, as an oom killer
>    task selection performance optimization (reduce latency of oom kill).
> 
> This way, (2) becomes purely a performance optimization, so it's easy
> to bisect and revert if it causes issues.

Yes, this makes more sense.

> I agree that bringing a fix along with a performance optimization within
> a single commit makes it hard to backport to stable, and tricky to
> revert if it causes problems.
> 
> As for finding other users of the hpcc, I have ideas, but not so much
> time available to try them out, as I'm pretty much doing this in my
> spare time.

I do understand this constraint and the motivation to have the OOM
situation addressed as a priority. I am pretty sure that if you see
issues in the OOM path then other consumers of get_mm_counter would be
affected as well, namely /proc/<pid>/stat. There might be others but I
can imagine that some of them are more performance than precision
sensitive.
All that being said, it seems that we need slow-and-precise and
fast-approximate interfaces to have an incremental path for other users
as well. Looking at patch 1, it seems there are interfaces available for
that. I think it would be great to call those out explicitly in the
high-level doc to give some guidance on what to use when, and with what
kind of expectations.
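
To illustrate the slow-and-precise vs fast-approximate distinction,
here is a minimal userspace sketch of a batched per-cpu counter. It is
only a model: the names model_counter, model_add, model_read_positive
and model_sum_positive, as well as the NR_CPUS and BATCH values, are
made up for this example, and the locking and real per-cpu accessors
of the kernel percpu_counter are left out. The approximate read looks
only at the shared count, roughly what percpu_counter_read_positive
does, while the precise read also walks every per-cpu slot, roughly
what percpu_counter_sum_positive does.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 256	/* assumption: same CPU count as the KVM guest quoted above */
#define BATCH   32	/* assumption: illustrative per-cpu batch threshold */

/* Hypothetical model of a batched per-cpu counter (no locking). */
struct model_counter {
	int64_t global;			/* shared count, updated only on batch overflow */
	int32_t local[NR_CPUS];		/* per-cpu deltas not yet folded into global */
};

/* Fast-path update: touch only the local slot until it reaches the batch. */
static void model_add(struct model_counter *c, unsigned int cpu, int32_t delta)
{
	assert(cpu < NR_CPUS);

	int32_t v = c->local[cpu] + delta;

	if (v >= BATCH || v <= -BATCH) {
		/* Fold the accumulated delta into the shared count. */
		c->global += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Fast and approximate: O(1), ignores deltas still held per cpu. */
static int64_t model_read_positive(const struct model_counter *c)
{
	return c->global > 0 ? c->global : 0;
}

/* Slow and precise: O(NR_CPUS), adds every per-cpu delta to the shared count. */
static int64_t model_sum_positive(const struct model_counter *c)
{
	int64_t sum = c->global;

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum > 0 ? sum : 0;
}
```

In this model the approximate read can be off by up to
NR_CPUS * (BATCH - 1) in either direction, which is why a task that has
been moving around many cpus can look much smaller (or larger) than it
really is. The precise read pays one pass over all cpus per counter,
which is the O(nr_possible_cpus * nr_processes) cost discussed above
when it is done for every candidate task.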

Thanks!
-- 
Michal Hocko
SUSE Labs

