From: Michal Hocko <mhocko@suse.com>
To: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Shakeel Butt <shakeel.butt@linux.dev>,
Johannes Weiner <hannes@cmpxchg.org>,
Andrii Nakryiko <andrii@kernel.org>,
JP Kobryn <inwardvessel@gmail.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org, bpf@vger.kernel.org,
Martin KaFai Lau <martin.lau@kernel.org>,
Song Liu <song@kernel.org>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
Date: Mon, 3 Nov 2025 20:00:09 +0100 [thread overview]
Message-ID: <aQj7uRjz668NNrm_@tiehlicka> (raw)
In-Reply-To: <875xbsglra.fsf@linux.dev>
On Sun 02-11-25 13:36:25, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
>
> > On Mon 27-10-25 16:17:09, Roman Gushchin wrote:
> >> Introduce a bpf struct ops for implementing custom OOM handling
> >> policies.
> >>
> >> It's possible to load one bpf_oom_ops for the system and one
> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
> >> corresponding BPF OOM handlers are executed until some memory is
> >> freed. If no memory is freed, the kernel OOM killer is invoked.
> >
> > Do you have any usecase in mind where parent memcg oom handler decides
> > to not kill or cannot kill anything and hand over upwards in the
> > hierarchy?
>
> I believe that in most cases bpf handlers will handle ooms themselves,
> but because strictly speaking I don't have control over what bpf
> programs do or do not, the kernel should provide the fallback mechanism.
> This is a common practice with bpf, e.g. sched_ext falls back to
> CFS/EEVDF in case something is wrong.
We do have fallback mechanism - the kernel oom handling. For that we do
not need to pass to parent handler. Please not that I am not opposing
this but I would like to understand thinking behind and hopefully start
with a simpler model and then extend it later than go with a more
complex one initially and then corner ourselves with weird side
effects.
> Specifically to OOM case, I believe someone might want to use bpf
> programs just for monitoring/collecting some information, without
> trying to actually free some memory.
>
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which expected to return 1 if it was able to free some memory and 0
> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory. If both are set, OOM is
> >> considered handled, otherwise the next OOM handler in the chain
> >> (e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
> >> killer) is executed.
> >
> > Could you explain why do we need both? Why is not bpf_memory_freed
> > return value sufficient?
>
> Strictly speaking, bpf_memory_freed should be enough, but because
> bpf programs have to return an int and there is no additional cost
> to add this option (pass to next or in-kernel oom handler), I thought
> it's not a bad idea. If you feel strongly otherwise, I can ignore
> the return value on rely on bpf_memory_freed only.
No, I do not feel strongly one way or the other but I would like to
understand thinking behind that. My slight preference would be to have a
single return status that clearly describe the intention. If you want to
have more flexible chaining semantic then an enum { IGNORED, HANDLED,
PASS_TO_PARENT, ...} would be both more flexible, extensible and easier
to understand.
> >> The bpf_handle_out_of_memory() callback program is sleepable to enable
> >> using iterators, e.g. cgroup iterators. The callback receives struct
> >> oom_control as an argument, so it can determine the scope of the OOM
> >> event: if this is a memcg-wide or system-wide OOM.
> >
> > This could be tricky because it might introduce a subtle and hard to
> > debug lock dependency chain. lock(a); allocation() -> oom -> lock(a).
> > Sleepable locks should be only allowed in trylock mode.
>
> Agree, but it's achieved by controlling the context where oom can be
> declared (e.g. in bpf_psi case it's done from a work context).
but out_of_memory is any sleepable context. So this is a real problem.
> >> The callback is executed just before the kernel victim task selection
> >> algorithm, so all heuristics and sysctls like panic on oom,
> >> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
> >> are respected.
> >
> > I guess you meant to say and sysctl_panic_on_oom.
>
> Yep, fixed.
> >
> >> BPF OOM struct ops provides the handle_cgroup_offline() callback
> >> which is good for releasing struct ops if the corresponding cgroup
> >> is gone.
> >
> > What kind of synchronization is expected between handle_cgroup_offline
> > and bpf_handle_out_of_memory?
>
> You mean from a user's perspective?
I mean from bpf handler writer POV
> E.g. can these two callbacks run in
> parallel? Currently yes, but it's a good question, I haven't thought
> about it, maybe it's better to synchronize them.
> Internally both rely on srcu to pin bpf_oom_ops in memory.
This should be really documented.
--
Michal Hocko
SUSE Labs
next prev parent reply other threads:[~2025-11-03 19:00 UTC|newest]
Thread overview: 86+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 01/23] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
2025-10-28 15:57 ` Roman Gushchin
2025-10-29 18:01 ` Song Liu
2025-10-29 20:26 ` Roman Gushchin
2025-10-30 17:22 ` Roman Gushchin
2025-10-30 18:03 ` Song Liu
2025-10-30 18:19 ` Amery Hung
2025-10-30 19:06 ` Roman Gushchin
2025-10-30 21:34 ` Song Liu
2025-10-30 22:42 ` Martin KaFai Lau
2025-10-30 23:14 ` Roman Gushchin
2025-10-31 0:05 ` Song Liu
2025-10-30 22:19 ` bpf_st_ops and cgroups. Was: " Alexei Starovoitov
2025-10-30 23:24 ` Roman Gushchin
2025-10-31 3:03 ` Yafang Shao
2025-10-31 6:14 ` Song Liu
2025-10-31 11:35 ` Yafang Shao
2025-10-31 17:37 ` Alexei Starovoitov
2025-10-29 18:14 ` Tejun Heo
2025-10-29 20:25 ` Roman Gushchin
2025-10-29 20:36 ` Tejun Heo
2025-10-29 21:18 ` Song Liu
2025-10-29 21:27 ` Tejun Heo
2025-10-29 21:37 ` Song Liu
2025-10-29 21:45 ` Tejun Heo
2025-10-30 4:32 ` Song Liu
2025-10-30 16:13 ` Tejun Heo
2025-10-30 17:56 ` Song Liu
2025-10-29 21:53 ` Roman Gushchin
2025-10-29 22:43 ` Alexei Starovoitov
2025-10-29 22:53 ` Tejun Heo
2025-10-29 23:53 ` Alexei Starovoitov
2025-10-30 0:03 ` Tejun Heo
2025-10-30 0:16 ` Alexei Starovoitov
2025-10-30 6:33 ` Yafang Shao
2025-10-29 21:04 ` Song Liu
2025-10-30 0:43 ` Martin KaFai Lau
2025-10-27 23:17 ` [PATCH v2 03/23] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 04/23] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
2025-10-31 8:32 ` Michal Hocko
2025-10-27 23:17 ` [PATCH v2 05/23] mm: declare memcg_page_state_output() in memcontrol.h Roman Gushchin
2025-10-31 8:34 ` Michal Hocko
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
2025-10-27 23:57 ` bot+bpf-ci
2025-10-28 17:45 ` Alexei Starovoitov
2025-10-28 18:42 ` Roman Gushchin
2025-10-28 22:07 ` Alexei Starovoitov
2025-10-28 22:56 ` Roman Gushchin
2025-10-28 21:33 ` Song Liu
2025-10-28 23:24 ` Roman Gushchin
2025-10-30 0:20 ` Martin KaFai Lau
2025-10-30 5:57 ` Yafang Shao
2025-10-30 14:26 ` Roman Gushchin
2025-10-31 9:02 ` Michal Hocko
2025-11-02 21:36 ` Roman Gushchin
2025-11-03 19:00 ` Michal Hocko [this message]
2025-11-04 1:45 ` Roman Gushchin
2025-11-04 8:18 ` Michal Hocko
2025-11-04 18:14 ` Roman Gushchin
2025-11-04 19:22 ` Michal Hocko
2026-01-12 14:54 ` Matt Bobrowski
2026-01-12 17:20 ` Roman Gushchin
2026-01-12 18:20 ` Matt Bobrowski
2025-10-27 23:17 ` [PATCH v2 07/23] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
2025-10-31 9:05 ` Michal Hocko
2025-11-02 21:09 ` Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
2025-10-28 16:10 ` Roman Gushchin
2025-10-28 17:12 ` Alexei Starovoitov
2025-10-28 18:03 ` Chris Mason
2025-10-28 18:32 ` Roman Gushchin
2025-10-28 17:42 ` Tejun Heo
2025-10-28 18:12 ` Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 09/23] mm: introduce bpf_get_root_mem_cgroup() BPF kfunc Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
2025-10-28 16:16 ` Roman Gushchin
2025-10-31 9:08 ` Michal Hocko
2025-10-31 9:31 ` [PATCH v2 00/23] mm: BPF OOM Michal Hocko
2025-10-31 16:48 ` Lance Yang
2025-11-02 20:53 ` Roman Gushchin
2025-11-03 18:18 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aQj7uRjz668NNrm_@tiehlicka \
--to=mhocko@suse.com \
--cc=akpm@linux-foundation.org \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=cgroups@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=inwardvessel@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=martin.lau@kernel.org \
--cc=memxor@gmail.com \
--cc=roman.gushchin@linux.dev \
--cc=shakeel.butt@linux.dev \
--cc=song@kernel.org \
--cc=surenb@google.com \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.