Re: OOM kill of privileged processes when exhausting a single NUMA node

From: Michal Hocko <mhocko@suse.com>
To: Pedro Falcato <pfalcato@suse.de>
Cc: Felix Abecassis <fabecassis@nvidia.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Zi Yan <ziy@nvidia.com>, John Hubbard <jhubbard@nvidia.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>
Subject: Re: OOM kill of privileged processes when exhausting a single NUMA node
Date: Fri, 27 Jun 2025 10:17:47 +0200
Message-ID: <aF5Tq19v0aspvqox@tiehlicka>
In-Reply-To: <btenbxjm4eheurdl2oexxy4h5diphmfy5cugiscfv6nljqhfki@2xxhbwtfslsj>

On Fri 27-06-25 00:21:57, Pedro Falcato wrote:
> On Thu, Jun 26, 2025 at 10:27:36PM +0000, Felix Abecassis wrote:
> > Hello linux-mm team,
> > 
> > I have found an interesting behavior in the Linux kernel: an unprivileged user
> > with access to user namespaces can cause privileged processes to be killed due
> > to an OOM situation on a single NUMA node, even if the system has plenty of
> > memory available on other NUMA nodes.
> > 
> > This might lead to a local denial of service in some situations, so please
> > review and let me know if the current behavior is expected.
> > 
> > The steps are simple:
> > 1. Use a Linux system with multiple NUMA nodes
> > 2. Enable unprivileged user namespaces (often distro dependent)
> > 3. As an unprivileged user, create a user namespace + mount namespace
> >    and mount a tmpfs bound to NUMA node 1
> > 4. Attempt to fill the tmpfs with more data than it can possibly store
> > 5. The OOM killer will kill a significant number of system daemons
> >    (UID 0).

Our OOM handling really cannot deal with this, because we cannot simply
remove persistent (even if only boot-time scoped) data. Even if we
managed to kill the task that consumed an excessive amount of tmpfs
data, the data itself would stay in place with the current
implementation. Changing that behavior would require defining disposable
tmpfs mounts and making userspace aware of them. Otherwise we would be
introducing active data corruption bugs.
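
To illustrate the point that killing the writer does not release the
data, here is a minimal Python sketch. It assumes /dev/shm is a writable
tmpfs mount (common on Linux, but not guaranteed) and falls back to the
default temporary directory otherwise; the helper name
tmpfs_data_survives_kill is hypothetical and only a small, bounded
amount of data is written.

```python
import os
import subprocess
import sys
import tempfile


def tmpfs_data_survives_kill(size_bytes=1 << 20):
    """Let a child process write a file, SIGKILL the child, and return
    the size of the file that is still present afterwards."""
    # Assumption: /dev/shm is tmpfs. If it is missing or not writable,
    # the demo still runs, just not on a memory-backed filesystem.
    use_shm = os.path.isdir("/dev/shm") and os.access("/dev/shm", os.W_OK)
    base = "/dev/shm" if use_shm else None

    # The child writes the file, reports completion, then idles so that
    # it is still alive when the parent kills it.
    writer = (
        "import sys, time\n"
        "with open(sys.argv[1], 'wb') as f:\n"
        "    f.write(b'x' * int(sys.argv[2]))\n"
        "print('done', flush=True)\n"
        "time.sleep(60)\n"
    )

    with tempfile.TemporaryDirectory(dir=base) as workdir:
        path = os.path.join(workdir, "blob")
        child = subprocess.Popen(
            [sys.executable, "-c", writer, path, str(size_bytes)],
            stdout=subprocess.PIPE,
        )
        child.stdout.readline()  # block until the child has finished writing
        child.kill()             # SIGKILL on POSIX, like an OOM kill would send
        child.wait()
        child.stdout.close()
        # The writer is gone, but the file contents are still there.
        return os.path.getsize(path)


if __name__ == "__main__":
    print(tmpfs_data_survives_kill(4096))
```

The file keeps its full size after the writer is gone, which is why
killing tasks does not reclaim memory pinned by tmpfs files.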

> I somewhat agree that this is somewhat unintended tmpfs behavior, but you can
> (probably) pull this off in other ways:

Well, it is a filesystem and as such we do not allow data corruption.
In the same way, we do not simply remove data on ENOSPC. This
filesystem just happens to be backed by memory rather than real
storage.

> - use set_mempolicy()/mbind to bind to a NUMA node and use a big mmap() mapping
> - just use a lot of memory
> 
> and it's not limited to NUMA either.

Right, there are ways to deplete memory, and therefore it is generally
recommended to contain untrusted users with memory cgroups and make sure
they cannot exhaust any specific resource. NUMA topology makes that more
complicated because it adds to the resource constraints, as pointed out
in the example below (a hard limit larger than a single NUMA node while
tmpfs is configured to consume the full NUMA node).
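
As a rough illustration of that constraint, the following Python sketch
compares a cgroup v2 hard limit with the size of one NUMA node. It
assumes the usual text formats of /sys/devices/system/node/nodeN/meminfo
and of the cgroup v2 memory.max file; the function names are
hypothetical, and the check is deliberately simplistic (it ignores
memory already used on the node by other groups).

```python
import re


def parse_node_meminfo(text):
    """Parse per-node meminfo text into {field: bytes}.

    Lines look like 'Node 1 MemTotal:  1048576 kB'. Lines without a kB
    unit (for example huge page counters) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        m = re.match(r"Node\s+\d+\s+(\S+):\s+(\d+)\s*kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2)) * 1024
    return fields


def parse_memory_max(text):
    """Parse cgroup v2 memory.max: 'max' means no limit (None)."""
    text = text.strip()
    return None if text == "max" else int(text)


def node_can_be_exhausted(memory_max, node_total):
    """True if the hard limit alone cannot stop the group from filling
    a node of node_total bytes (no limit, or limit >= node size)."""
    return memory_max is None or memory_max >= node_total


if __name__ == "__main__":
    # Sample inputs; on a real system these would be read from sysfs
    # and from the cgroup directory of the untrusted user.
    sample = "Node 1 MemTotal:       1048576 kB\nNode 1 MemFree:         524288 kB\n"
    node_total = parse_node_meminfo(sample)["MemTotal"]
    for limit_text in ("max", "536870912", "2147483648"):
        limit = parse_memory_max(limit_text)
        print(limit_text, node_can_be_exhausted(limit, node_total))
```

A hard limit below the node size is only a conservative bound here; a
real policy would also have to account for what else lives on the node.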

My experience with unprivileged user namespaces is limited but I would
say that you need some policy built on top if you want to allow
arbitrary tmpfs mounts.
-- 
Michal Hocko
SUSE Labs
