From: Al Viro <viro@zeniv.linux.org.uk>
To: Kiryl Shutsemau <kas@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Hugh Dickins <hughd@google.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: Orphan filesystems after mount namespace destruction and tmpfs "leak"
Date: Mon, 2 Feb 2026 18:43:56 +0000
Message-ID: <20260202184356.GD3183987@ZenIV>
In-Reply-To: <aYDjHJstnz2V-ZZg@thinkstation>

On Mon, Feb 02, 2026 at 05:50:30PM +0000, Kiryl Shutsemau wrote:

> In the Meta fleet, we saw a problem where destroying a container didn't
> lead to freeing the shmem memory attributed to a tmpfs mounted inside
> that container. It triggered an OOM when a new container attempted to
> start.
> 
> Investigation has shown that this happened because a process outside of
> the container kept a file from the tmpfs mapped. The mapped file is
> small (4k), but it holds all the contents of the tmpfs (~47GiB) from
> being freed.
> 
> When a tmpfs filesystem is mounted inside a mount namespace (e.g., a
> container), and a process outside that namespace holds an open file
> descriptor to a file on that tmpfs, the tmpfs superblock remains in
> kernel memory indefinitely after:
> 
> 1. All processes inside the mount namespace have exited.
> 2. The mount namespace has been destroyed.
> 3. The tmpfs is no longer visible in any mount namespace.

Yes?  That's precisely what should happen as long as something's opened
on a filesystem.

> The superblock persists with mnt_ns = NULL in its mount structures,
> keeping all tmpfs contents pinned in memory until the external file
> descriptor is closed.

Yes.

> The problem is not specific to tmpfs, but for filesystems with backing
> storage, the memory impact is not as severe since the page cache is
> reclaimable.
> 
> The obvious solution to the problem is "Don't do that": the file should
> be unmapped/closed upon container destruction.

Or remove the junk there from time to time, if you don't want it to stay
until the filesystem shutdown...

> But I wonder if the kernel can/should do better here? Currently, this
> scenario is hard to diagnose. It looks like a leak of shmem pages.
> 
> Also, I wonder if the current behavior can lead to data loss on a
> filesystem with backing storage:
>  - The mount namespace where my USB stick was mounted is gone.
>  - The USB stick is no longer mounted anywhere.
>  - I can pull the USB stick out.
>  - Oops, someone was writing there: corruption/data loss.
> 
> I am not sure what a possible solution would be here. I can only think
> of blocking exit(2) for the last process in the namespace until all
> filesystems are cleanly unmounted, but that is not very informative
> either.

That's insane - if nothing else, the process that holds the sucker
opened may very well be waiting for the one you've blocked.

You are getting exactly what you asked for - same as you would on
lazy umount, for that matter.

A filesystem may be active without being attached to any namespace;
that is intentional behaviour.  What's more, it _is_ visible to
ustat(2), as well as to lsof(1) and similar userland tools when an
open file is keeping it busy.
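
As a rough illustration of the lsof(1)-style check mentioned above, here
is a minimal sketch (not something posted in this thread) that compares
the device numbers of open files against the devices listed in the
current namespace's mountinfo.  The function names and the approach are
assumptions made for illustration.  It only looks at file descriptors,
so a file that is merely mapped (as in the original report) would need a
similar pass over /proc/<pid>/maps.  Expect false positives from memfd
files and from filesystems such as btrfs whose st_dev may differ from
the device shown in mountinfo.

```python
import os
import stat


def mounted_devices(mountinfo_text):
    """Return the set of (major, minor) pairs listed in mountinfo text.

    The third whitespace-separated field of each /proc/<pid>/mountinfo
    line is the "major:minor" device number of the mounted filesystem.
    """
    devs = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        major, sep, minor = fields[2].partition(":")
        if sep and major.isdigit() and minor.isdigit():
            devs.add((int(major), int(minor)))
    return devs


def unattached_open_files(mounted, proc_root="/proc"):
    """List (pid, fd, target, dev) for open regular files or directories
    whose device is not in `mounted`.

    Hypothetical helper: pipes, sockets and other non-file descriptors
    are skipped, since they live on internal pseudo-filesystems that
    never appear in mountinfo.
    """
    hits = []
    pids = sorted(filter(str.isdigit, os.listdir(proc_root)), key=int)
    for pid in pids:
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                st = os.stat(link)          # follows the fd link
                target = os.readlink(link)  # name as the kernel reports it
            except OSError:
                continue
            if not (stat.S_ISREG(st.st_mode) or stat.S_ISDIR(st.st_mode)):
                continue
            dev = (os.major(st.st_dev), os.minor(st.st_dev))
            if dev not in mounted:
                hits.append((int(pid), fd, target, dev))
    return hits
```

To try it, read /proc/self/mountinfo, pass its contents to
mounted_devices(), and hand the resulting set to
unattached_open_files().  Running as root is needed to see other users'
descriptors.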


Thread overview: 6+ messages
2026-02-02 17:50 Orphan filesystems after mount namespace destruction and tmpfs "leak" Kiryl Shutsemau
2026-02-02 18:43 ` Al Viro [this message]
2026-02-02 19:43   ` Kiryl Shutsemau
2026-02-02 20:03 ` Askar Safin
2026-02-03 14:58 ` Christian Brauner
2026-02-04 17:04   ` Theodore Tso
