From: Roman Gushchin <roman.gushchin@linux.dev>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org, bpf@vger.kernel.org,
Suren Baghdasaryan <surenb@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.com>,
David Rientjes <rientjes@google.com>,
Matt Bobrowski <mattbobrowski@google.com>,
Song Liu <song@kernel.org>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 00/14] mm: BPF OOM
Date: Wed, 20 Aug 2025 17:01:00 -0700
Message-ID: <87ldndmtkz.fsf@linux.dev>
In-Reply-To: <h2bmsuk7iq7i6hphp7vbaxndawwgjz42mhfntlcc2yt4u6but6@7xlre5c56xlq> (Shakeel Butt's message of "Wed, 20 Aug 2025 14:06:03 -0700")
Shakeel Butt <shakeel.butt@linux.dev> writes:
> On Mon, Aug 18, 2025 at 10:01:22AM -0700, Roman Gushchin wrote:
>> This patchset adds the ability to customize out-of-memory
>> handling using bpf.
>>
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
>>
>> The idea to use bpf for customizing the OOM handling is not new, but
>> unlike the previous proposal [1], which augmented the existing task
>> ranking policy, this one tries to be as generic as possible and
>> leverage the full power of the modern bpf.
>>
>> It provides a generic interface which is called before the existing OOM
>> killer code and allows implementing any policy, e.g. picking a victim
>> task or memory cgroup or potentially even releasing memory in other
>> ways, e.g. deleting tmpfs files (the last one might require some
>> additional but relatively simple changes).
>
> The releasing memory part is really interesting and useful. I can see
> much more reliable and targeted oom reaping with this approach.
>
>>
>> The past attempt to implement a memory-cgroup aware policy [2] showed
>> that there are multiple opinions on what the best policy is. As it's
>> highly workload-dependent and specific to a concrete way of organizing
>> workloads, the structure of the cgroup tree etc,
>
> and user space policies: e.g. Google has very clear priorities among
> concurrently running workloads, while many other users do not.
>
>> a customizable
>> bpf-based implementation is preferable to an in-kernel implementation
>> with a dozen sysctls.
>
> +1
>
>>
>> The second part is related to the fundamental question of when to
>> declare the OOM event. It's a trade-off between the risk of
>> unnecessary OOM kills and associated work losses and the risk of
>> infinite thrashing and effective soft lockups. In the last few years
>> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
>> systemd-OOMd [4]
>
> and Android's LMKD (https://source.android.com/docs/core/perf/lmkd) uses
> PSI too.
>
>> ). The common idea was to use userspace daemons to
>> implement custom OOM logic as well as rely on PSI monitoring to avoid
>> stalls. In this scenario the userspace daemon was supposed to handle
>> the majority of OOMs, while the in-kernel OOM killer worked as the
>> last resort measure to guarantee that the system would never deadlock
>> on memory. But this approach creates additional infrastructure
>> churn: a userspace OOM daemon is a separate entity which needs to be
>> deployed, updated, monitored. A completely different pipeline needs to
>> be built to monitor both types of OOM events and collect associated
>> logs. A userspace daemon is more restricted in terms of what data is
>> available to it. Implementing a daemon which can work reliably under
>> heavy memory pressure in the system is also tricky.
>
> Thanks for raising this; it is really challenging on a very aggressively
> overcommitted system. The userspace oom-killer needs cpu (or scheduling)
> and memory guarantees as it needs to run and collect stats to decide who
> to kill. Even with that, it can still get stuck on some global kernel
> locks (I remember seeing at Google their userspace oom-killer, which
> was a thread in borglet, stuck on the cgroup mutex or a kernfs lock or
> something). Anyway, I see a lot of potential in this BPF-based
> oom-killer.
>
> Orthogonally I am wondering if we can enable actions other than killing.
> For example some workloads might prefer to get frozen or migrated away
> instead of being killed.
Absolutely, handling PSI events in the kernel (via BPF) opens up a broad
range of possibilities: e.g. we can tune cgroup knobs, freeze/unfreeze
tasks, remove tmpfs files, promote/demote memory to other tiers, etc.
I was also thinking about tuning readahead based on memory
pressure.
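
To make the freeze/unfreeze option a bit more concrete, here is a rough
userspace sketch of the action itself. It is not part of this patchset and
says nothing about how a BPF handler would trigger it; it only assumes the
cgroup v2 freezer files (cgroup.freeze and the "frozen" key in
cgroup.events). The function names and any cgroup path passed in are
hypothetical.

```python
import os


def set_cgroup_frozen(cgroup_dir: str, frozen: bool) -> None:
    """Freeze or thaw all tasks in a cgroup v2 directory.

    Writes "1" (freeze) or "0" (thaw) to <cgroup_dir>/cgroup.freeze.
    """
    with open(os.path.join(cgroup_dir, "cgroup.freeze"), "w") as f:
        f.write("1" if frozen else "0")


def is_cgroup_frozen(cgroup_dir: str) -> bool:
    """Return True if cgroup.events reports "frozen 1".

    cgroup.events is a flat-keyed file: one "key value" pair per line.
    """
    with open(os.path.join(cgroup_dir, "cgroup.events")) as f:
        for line in f:
            key, _, value = line.partition(" ")
            if key == "frozen":
                return value.strip() == "1"
    return False
```

Freezing is asynchronous, so a caller would typically write to
cgroup.freeze first and then check (or wait on) cgroup.events before
assuming the workload has actually stopped.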
Thanks!