From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-179.mta1.migadu.com (out-179.mta1.migadu.com [95.215.58.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7B7AE1C8630 for ; Tue, 29 Apr 2025 22:17:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.179 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745965081; cv=none; b=QRcnbvcr2APC7dXzV8sZIoV3igkteNQrYeDxSQe22192BwwUkAdAkCZBX0cksKp8yWlVYUyY6K55jOwajlujwsHEm7ADp42U6s+XbFOd3/R/ZSxVmBjU3BgT3nibJt4OQ48rSzzbKp2fbgtGdWOhKKbH626QnquAR7l15tIcvCw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745965081; c=relaxed/simple; bh=wDpIvk141ZV01NhrIY+JkCqDNYoHd9B0SSs3egxRMkc=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=TMQA+vJ0l/Q6n1h9EAbq4RUdlIR+djwuYcHXvWRGq2oZg2h+EYmNh/u2/54lpRcwuUqxpzvZ4VocrZAyXJAUAAob2j0X1PB2j6+y3Z1o9kD6oHBgD85AJfFUNYc11LTdSNBVtguPwMwfOSsH+Y3N8sj6ztgxXvwxr4iNS2rVvsk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=G6qOR9Nm; arc=none smtp.client-ip=95.215.58.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="G6qOR9Nm" Date: Tue, 29 Apr 2025 22:17:41 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1745965076; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=VQJrRrVaFnBo4XocryJt5anmF0O/r9hN8T59g9L9+bs=; b=G6qOR9Nmd2hZRkjRJSwbebk2ojGNBZjKTHw9Cbq30iktCdJKq2cJddDCl+7skE3ZxiG5Ol YkVX0BUIC+d/kVdjEu6Xx0fbr0tLyechP3c0pYVUSxH1kX1oi4tPMOWBz+N/Zt0i7VM5gM p6eYvL0BqmG+5/n7pnQRRr/WLiA56vU= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Roman Gushchin To: Suren Baghdasaryan Cc: Michal Hocko , linux-kernel@vger.kernel.org, Andrew Morton , Alexei Starovoitov , Johannes Weiner , Shakeel Butt , David Rientjes , Josh Don , Chuyi Zhou , cgroups@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org Subject: Re: [PATCH rfc 00/12] mm: BPF OOM Message-ID: References: <20250428033617.3797686-1-roman.gushchin@linux.dev> <87selrrpqz.fsf@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Migadu-Flow: FLOW_OUT On Tue, Apr 29, 2025 at 02:56:31PM -0700, Suren Baghdasaryan wrote: > On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin wrote: > > > > Michal Hocko writes: > > > > > On Mon 28-04-25 03:36:05, Roman Gushchin wrote: > > >> This patchset adds an ability to customize the out of memory > > >> handling using bpf. > > >> > > >> It focuses on two parts: > > >> 1) OOM handling policy, > > >> 2) PSI-based OOM invocation. > > >> > > >> The idea to use bpf for customizing the OOM handling is not new, but > > >> unlike the previous proposal [1], which augmented the existing task > > >> ranking-based policy, this one tries to be as generic as possible and > > >> leverage the full power of the modern bpf. > > >> > > >> It provides a generic hook which is called before the existing OOM > > >> killer code and allows implementing any policy, e.g. picking a victim > > >> task or memory cgroup or potentially even releasing memory in other > > >> ways, e.g. deleting tmpfs files (the last one might require some > > >> additional but relatively simple changes). > > > > > > Makes sense to me. I still have a slight concern though. We have 3 > > > different oom handlers smashed into a single one with special casing > > > involved. This is manageable (although not great) for the in kernel > > > code but I am wondering whether we should do better for BPF based OOM > > > implementations. Would it make sense to have different callbacks for > > > cpuset, memcg and global oom killer handlers? > > > > Yes, it's certainly possible. If we go struct_ops path, we can even > > have both the common hook which handles all types of OOM's and separate > > hooks for each type. The user then can choose what's more convenient. > > Good point. > > > > > > > > I can see you have already added some helper functions to deal with > > > memcgs but I do not see anything to iterate processes or find a process to > > > kill etc. Is that functionality generally available (sorry I am not > > > really familiar with BPF all that much so please bear with me)? > > > > Yes, task iterator is available since v6.7: > > https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/ > > > > > > > > I like the way how you naturalely hooked into existing OOM primitives > > > like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are > > > you waiting for a first user that needs to implement oom victim > > > synchronization or do you plan to integrate that into tasks iterators? > > > > It can be implemented in bpf directly, but I agree that it probably > > deserves at least an example in the test or a separate in-kernel helper. > > In-kernel helper is probably a better idea. > > > > > I am mostly asking because it is exactly these kind of details that > > > make the current in kernel oom handler quite complex and it would be > > > great if custom ones do not have to reproduce that complexity and only > > > focus on the high level policy. > > > > Totally agree. > > > > > > > >> The second part is related to the fundamental question on when to > > >> declare the OOM event. It's a trade-off between the risk of > > >> unnecessary OOM kills and associated work losses and the risk of > > >> infinite trashing and effective soft lockups. In the last few years > > >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or > > >> systemd-OOMd [4]). The common idea was to use userspace daemons to > > >> implement custom OOM logic as well as rely on PSI monitoring to avoid > > >> stalls. > > > > > > This makes sense to me as well. I have to admit I am not fully familiar > > > with PSI integration into sched code but from what I can see the > > > evaluation is done on regular bases from the worker context kicked off > > > from the scheduler code. There shouldn't be any locking constrains which > > > is good. Is there any risk if the oom handler took too long though? > > > > It's a good question. In theory yes, it can affect the timing of other > > PSI events. An option here is to move it into a separate work, however > > I'm not sure if it worth the added complexity. I actually tried this > > approach in an earlier version of this patchset, but the problem was > > that the code for scheduling this work should be dynamically turned > > on/off when a bpf program is attached/detached, otherwise it's an > > obvious cpu overhead. > > It's doable, but Idk if it's justified. > > > > > > > > Also an important question. I can see selftests which are using the > > > infrastructure. But have you tried to implement a real OOM handler with > > > this proposed infrastructure? > > > > Not yet. Given the size and complexity of the infrastructure of my > > current employer, it's not a short process. But we're working on it. > > Hi Roman, > This might end up being very useful for Android. Since we have a > shared current employer, we might be able to provide an earlier test > environment for this concept on Android and speed up development of a > real OOM handler. I'll be following the development of this patchset > and will see if we can come up with an early prototype for testing. Hi Suren, Sounds great, thank you!