From: Roman Gushchin <roman.gushchin@linux.dev>
To: Matt Bobrowski <mattbobrowski@google.com>
Cc: linux-kernel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Alexei Starovoitov <ast@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
Suren Baghdasaryan <surenb@google.com>,
David Rientjes <rientjes@google.com>,
Josh Don <joshdon@google.com>,
Chuyi Zhou <zhouchuyi@bytedance.com>,
cgroups@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
Date: Mon, 28 Apr 2025 17:24:21 +0000 [thread overview]
Message-ID: <aA-5xX10nXE2C2Dn@google.com> (raw)
In-Reply-To: <aA9bu7UJOCTQGk6L@google.com>
On Mon, Apr 28, 2025 at 10:43:07AM +0000, Matt Bobrowski wrote:
> On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote:
> > This patchset adds an ability to customize the out of memory
> > handling using bpf.
> >
> > It focuses on two parts:
> > 1) OOM handling policy,
> > 2) PSI-based OOM invocation.
> >
> > The idea to use bpf for customizing the OOM handling is not new, but
> > unlike the previous proposal [1], which augmented the existing task
> > ranking-based policy, this one tries to be as generic as possible and
> > leverage the full power of the modern bpf.
> >
> > It provides a generic hook which is called before the existing OOM
> > killer code and allows implementing any policy, e.g. picking a victim
> > task or memory cgroup or potentially even releasing memory in other
> > ways, e.g. deleting tmpfs files (the last one might require some
> > additional but relatively simple changes).
> >
> > The past attempt to implement memory-cgroup aware policy [2] showed
> > that there are multiple opinions on what the best policy is. As it's
> > highly workload-dependent and specific to a concrete way of organizing
> > workloads, the structure of the cgroup tree etc, a customizable
> > bpf-based implementation is preferable over a in-kernel implementation
> > with a dozen on sysctls.
> >
> > The second part is related to the fundamental question on when to
> > declare the OOM event. It's a trade-off between the risk of
> > unnecessary OOM kills and associated work losses and the risk of
> > infinite trashing and effective soft lockups. In the last few years
> > several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> > systemd-OOMd [4]). The common idea was to use userspace daemons to
> > implement custom OOM logic as well as rely on PSI monitoring to avoid
> > stalls. In this scenario the userspace daemon was supposed to handle
> > the majority of OOMs, while the in-kernel OOM killer worked as the
> > last resort measure to guarantee that the system would never deadlock
> > on the memory. But this approach creates additional infrastructure
> > churn: userspace OOM daemon is a separate entity which needs to be
> > deployed, updated, monitored. A completely different pipeline needs to
> > be built to monitor both types of OOM events and collect associated
> > logs. A userspace daemon is more restricted in terms on what data is
> > available to it. Implementing a daemon which can work reliably under a
> > heavy memory pressure in the system is also tricky.
> >
> > [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> > [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> > [3]: https://github.com/facebookincubator/oomd
> > [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> >
> > ----
> >
> > This is an RFC version, which is not intended to be merged in the current form.
> > Open questions/TODOs:
> > 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
> > It has to be able to return a value, to be sleepable (to use cgroup iterators)
> > and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
> > Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
> > arguments as trusted"), which is not safe. One option is to fake acquire/release
> > semantics for the oom_control pointer. Other option is to introduce a completely
> > new attachment or program type, similar to lsm hooks.
>
> Thinking out loud now, but rather than introducing and having a single
> BPF-specific function/interface, and BPF program for that matter,
> which can effectively be used to short-circuit steps from within
> out_of_memory(), why not introduce a
> tcp_congestion_ops/sched_ext_ops-like interface which essentially
> provides a multifaceted interface for controlling OOM killing
> (->select_bad_process, ->oom_kill_process, etc), optionally also from
> the context of a BPF program (BPF_PROG_TYPE_STRUCT_OPS)?
It's certainly an option and I thought about it. I don't think we need a bunch
of hooks though. This patchset adds 2 and they belong to completely different
subsystems (mm and sched/psi), so Idk how well they can be gathered
into a single struct ops. But maybe it's fine.
The only potentially new hook I can envision now is one to customize
the oom reporting.
Thanks for the suggestion!
next prev parent reply other threads:[~2025-04-28 17:24 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-28 3:36 [PATCH rfc 00/12] mm: BPF OOM Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 01/12] mm: introduce a bpf hook for OOM handling Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 02/12] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 03/12] bpf: treat fmodret tracing program's arguments as trusted Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 04/12] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 05/12] mm: introduce bpf kfuncs to deal with memcg pointers Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 06/12] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 07/12] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 08/12] bpf: selftests: bpf OOM handler test Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 09/12] sched: psi: bpf hook to handle psi events Roman Gushchin
2025-04-28 6:11 ` kernel test robot
2025-04-30 0:28 ` Suren Baghdasaryan
2025-04-30 0:58 ` Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc Roman Gushchin
2025-04-29 11:46 ` Michal Hocko
2025-04-29 21:31 ` Roman Gushchin
2025-04-30 7:27 ` Michal Hocko
2025-04-30 14:53 ` Roman Gushchin
2025-05-05 8:08 ` Michal Hocko
2025-04-28 3:36 ` [PATCH rfc 11/12] bpf: selftests: introduce open_cgroup_file() helper Roman Gushchin
2025-04-28 3:36 ` [PATCH rfc 12/12] bpf: selftests: psi handler test Roman Gushchin
2025-04-28 10:43 ` [PATCH rfc 00/12] mm: BPF OOM Matt Bobrowski
2025-04-28 17:24 ` Roman Gushchin [this message]
2025-04-29 1:56 ` Kumar Kartikeya Dwivedi
2025-04-29 15:42 ` Roman Gushchin
2025-05-02 17:26 ` Song Liu
2025-04-29 11:42 ` Michal Hocko
2025-04-29 14:44 ` Roman Gushchin
2025-04-29 21:56 ` Suren Baghdasaryan
2025-04-29 22:17 ` Roman Gushchin
2025-04-29 23:01 ` Suren Baghdasaryan
2025-04-29 22:44 ` Suren Baghdasaryan
2025-04-29 23:01 ` Roman Gushchin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aA-5xX10nXE2C2Dn@google.com \
--to=roman.gushchin@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=cgroups@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=joshdon@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mattbobrowski@google.com \
--cc=mhocko@kernel.org \
--cc=rientjes@google.com \
--cc=shakeel.butt@linux.dev \
--cc=surenb@google.com \
--cc=zhouchuyi@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.