Re: [PATCH 0/2] mount: add OPEN_TREE

public inbox for initramfs@vger.kernel.org

* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
       [not found] <20251229-work-empty-namespace-v1-0-bfb24c7b061f@kernel.org>
@ 2026-01-19 17:11 ` Askar Safin
  2026-01-19 19:05   ` Andy Lutomirski
  0 siblings, 1 reply; 9+ messages in thread
From: Askar Safin @ 2026-01-19 17:11 UTC
  To: brauner
  Cc: amir73il, cyphar, jack, jlayton, josef, linux-fsdevel, viro,
	Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
	Menglong Dong, linux-kernel, initramfs, containers, linux-api,
	news, lwn, Jonathan Corbet, Rob Landley, emily, Christoph Hellwig

Christian Brauner <brauner@kernel.org>:
> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> returning a file descriptor referring to that mount tree
> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> to a new mount namespace. In that new mount namespace the copied mount
> tree has been mounted on top of a copy of the real rootfs.

I want to point out the security benefits of this.

[[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
I like them, and I think they should get wider exposure. ]]

If this patchset ([1]) and [2] both land (both are in "next" now and will
likely be submitted to mainline soon) and "nullfs_rootfs" is passed on the
kernel command line, then a mount namespace created by
open_tree(OPEN_TREE_NAMESPACE) will usually contain exactly 2 mounts: nullfs
and whatever was passed to open_tree(OPEN_TREE_NAMESPACE).
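
To make the "exactly 2 mounts" claim concrete, the mounts visible in a
namespace can be counted by reading a mountinfo file. The Python sketch below
is only an illustration: the Mount tuple, parse_mountinfo(), read_mounts() and
the sample text are made up for this example (in particular, how a nullfs
mount would actually appear in mountinfo is an assumption), and read_mounts()
needs a readable procfs, for instance /proc/<pid>/mountinfo of a process that
lives inside the namespace.

```python
from typing import List, NamedTuple


class Mount(NamedTuple):
    mount_id: int
    parent_id: int
    mount_point: str
    fstype: str


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse the contents of a mountinfo file (format described in proc(5))."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Before " - ": mount ID, parent ID, major:minor, root, mount point,
        # mount options, then zero or more optional fields.
        # After " - ": filesystem type, mount source, super options.
        # Spaces inside paths are escaped as \040, so split() is safe here.
        head, _, tail = line.partition(" - ")
        fields = head.split()
        mounts.append(
            Mount(int(fields[0]), int(fields[1]), fields[4], tail.split()[0])
        )
    return mounts


def read_mounts(path: str = "/proc/self/mountinfo") -> List[Mount]:
    """Read and parse a mountinfo file; requires procfs to be mounted."""
    with open(path) as f:
        return parse_mountinfo(f.read())


# Hypothetical mountinfo for a namespace holding only nullfs plus one tmpfs
# tree mounted on top of it. The exact nullfs line is an assumption.
SAMPLE = (
    "20 19 0:2 / / ro - nullfs none ro\n"
    "30 20 0:27 / / rw,relatime - tmpfs tmpfs rw\n"
)

if __name__ == "__main__":
    for m in parse_mountinfo(SAMPLE):
        print(m.mount_id, m.parent_id, m.mount_point, m.fstype)
```

Running len(read_mounts()) inside a namespace is then a quick way to see
whether anything besides the root mount and the copied tree is present.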

This means that even if an attacker somehow manages to unmount its root and
get access to the underlying mounts, the only underlying thing they will
get is nullfs.

This also means that other mounts are not merely hidden in the new namespace,
they are entirely absent. This prevents the attacks discussed in [3] and [4].

This also means that (assuming we have both [1] and [2] and "nullfs_rootfs"
is passed) there is no longer a hidden writable mount shared by all containers
and potentially available to attackers. This is the concern raised in [5]:

> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> actually _be_ a filesystem. Even with your "fix", containers could communicate
> with each _other_ through it if it becomes accessible. If a container can get
> access to an empty initramfs and write into it, it can ask/answer the question
> "Are there any other containers on this machine running stux24" and then coordinate.

Note: as far as I understand, all actual security bugs are already fixed in
the kernel, runc and similar tools. Still, [1] and [2] reduce the chance of
similar bugs in the future, and this is a very good thing.

Also: [1] and [2] are pretty big changes to how mount namespaces work, so
I added more people and lists to CC.

This mail is a reply to [1].

[1] https://lore.kernel.org/all/20251229-work-empty-namespace-v1-0-bfb24c7b061f@kernel.org/
[2] https://lore.kernel.org/all/20260112-work-immutable-rootfs-v2-0-88dd1c34a204@kernel.org/

[3] https://lore.kernel.org/all/rxh6knvencwjajhgvdgzmrkwmyxwotu3itqyreun3h2pmaujhr@snhuqoq44kkf/
[4] https://github.com/opencontainers/runc/pull/1962
[5] https://lore.kernel.org/all/cec90924-e7ec-377c-fb02-e0f25ab9db73@landley.net/

-- 
Askar Safin


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 17:11 ` [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE Askar Safin
@ 2026-01-19 19:05   ` Andy Lutomirski
  2026-01-19 22:21     ` Jeff Layton
  0 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2026-01-19 19:05 UTC
  To: Askar Safin
  Cc: brauner, amir73il, cyphar, jack, jlayton, josef, linux-fsdevel,
	viro, Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
	Menglong Dong, linux-kernel, initramfs, containers, linux-api,
	news, lwn, Jonathan Corbet, Rob Landley, emily, Christoph Hellwig

On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Christian Brauner <brauner@kernel.org>:
> > Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> > OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> > returning a file descriptor referring to that mount tree
> > OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> > to a new mount namespace. In that new mount namespace the copied mount
> > tree has been mounted on top of a copy of the real rootfs.
>
> I want to point at security benefits of this.
>
> [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
> I like them, and I think they should get wider exposure. ]]
>
> If this patchset ([1]) and [2] both land (they are both in "next" now and
> likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
> command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
> usually contain exactly 2 mounts: nullfs and whatever was passed to
> open_tree(OPEN_TREE_NAMESPACE).
>
> This means that even if attacker somehow is able to unmount its root and
> get access to underlying mounts, then the only underlying thing they will
> get is nullfs.
>
> Also this means that other mounts are not only hidden in new namespace, they
> are fully absent. This prevents attacks discussed here: [3], [4].
>
> Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
> is passed), there is no anymore hidden writable mount shared by all containers,
> potentially available to attackers. This is concern raised in [5]:
>
> > You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> > actually _be_ a filesystem. Even with your "fix", containers could communicate
> > with each _other_ through it if it becomes accessible. If a container can get
> > access to an empty initramfs and write into it, it can ask/answer the question
> > "Are there any other containers on this machine running stux24" and then coordinate.

I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
path that gives it sensible behavior should be conditional like this.
Either make it *always* mount on top of nullfs (regardless of boot
options) or find some way to have it actually be the root.  I assume
the latter is challenging for some reason.

--Andy


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 19:05   ` Andy Lutomirski
@ 2026-01-19 22:21     ` Jeff Layton
  2026-01-21 10:20       ` Christian Brauner
                         ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Jeff Layton @ 2026-01-19 22:21 UTC
  To: Andy Lutomirski, Askar Safin
  Cc: brauner, amir73il, cyphar, jack, josef, linux-fsdevel, viro,
	Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
	Menglong Dong, linux-kernel, initramfs, containers, linux-api,
	news, lwn, Jonathan Corbet, Rob Landley, emily, Christoph Hellwig

On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
> On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
> > 
> > Christian Brauner <brauner@kernel.org>:
> > > Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> > > OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> > > returning a file descriptor referring to that mount tree
> > > OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> > > to a new mount namespace. In that new mount namespace the copied mount
> > > tree has been mounted on top of a copy of the real rootfs.
> > 
> > I want to point at security benefits of this.
> > 
> > [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
> > I like them, and I think they should get wider exposure. ]]
> > 
> > If this patchset ([1]) and [2] both land (they are both in "next" now and
> > likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
> > command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
> > usually contain exactly 2 mounts: nullfs and whatever was passed to
> > open_tree(OPEN_TREE_NAMESPACE).
> > 
> > This means that even if attacker somehow is able to unmount its root and
> > get access to underlying mounts, then the only underlying thing they will
> > get is nullfs.
> > 
> > Also this means that other mounts are not only hidden in new namespace, they
> > are fully absent. This prevents attacks discussed here: [3], [4].
> > 
> > Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
> > is passed), there is no anymore hidden writable mount shared by all containers,
> > potentially available to attackers. This is concern raised in [5]:
> > 
> > > You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> > > actually _be_ a filesystem. Even with your "fix", containers could communicate
> > > with each _other_ through it if it becomes accessible. If a container can get
> > > access to an empty initramfs and write into it, it can ask/answer the question
> > > "Are there any other containers on this machine running stux24" and then coordinate.
> 
> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
> path that gives it sensible behavior should be conditional like this.
> Either make it *always* mount on top of nullfs (regardless of boot
> options) or find some way to have it actually be the root.  I assume
> the latter is challenging for some reason.
> 

I think that's the plan. I suggested the same to Christian last week,
and he was amenable to removing the option and just always doing a
nullfs_rootfs mount.

We think that older runtimes should still "just work" with this scheme.
Out of an abundance of caution, we _might_ want a command-line option
to make it go back to the old way, in case we find some userland stuff that
doesn't like this for some reason, but hopefully we won't even need
that.
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 22:21     ` Jeff Layton
@ 2026-01-21 10:20       ` Christian Brauner
  2026-01-21 18:00       ` Andy Lutomirski
  2026-01-21 19:56       ` Rob Landley
  2 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2026-01-21 10:20 UTC
  To: Jeff Layton
  Cc: Andy Lutomirski, Askar Safin, amir73il, cyphar, jack, josef,
	linux-fsdevel, viro, Lennart Poettering, David Howells,
	Zhang Yunkai, cgel.zte, Menglong Dong, linux-kernel, initramfs,
	containers, linux-api, news, lwn, Jonathan Corbet, Rob Landley,
	emily, Christoph Hellwig

On Mon, Jan 19, 2026 at 05:21:30PM -0500, Jeff Layton wrote:
> On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
> > On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
> > > 
> > > Christian Brauner <brauner@kernel.org>:
> > > > Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> > > > OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> > > > returning a file descriptor referring to that mount tree
> > > > OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> > > > to a new mount namespace. In that new mount namespace the copied mount
> > > > tree has been mounted on top of a copy of the real rootfs.
> > > 
> > > I want to point at security benefits of this.
> > > 
> > > [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
> > > I like them, and I think they should get wider exposure. ]]
> > > 
> > > If this patchset ([1]) and [2] both land (they are both in "next" now and
> > > likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
> > > command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
> > > usually contain exactly 2 mounts: nullfs and whatever was passed to
> > > open_tree(OPEN_TREE_NAMESPACE).
> > > 
> > > This means that even if attacker somehow is able to unmount its root and
> > > get access to underlying mounts, then the only underlying thing they will
> > > get is nullfs.
> > > 
> > > Also this means that other mounts are not only hidden in new namespace, they
> > > are fully absent. This prevents attacks discussed here: [3], [4].
> > > 
> > > Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
> > > is passed), there is no anymore hidden writable mount shared by all containers,
> > > potentially available to attackers. This is concern raised in [5]:
> > > 
> > > > You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> > > > actually _be_ a filesystem. Even with your "fix", containers could communicate
> > > > with each _other_ through it if it becomes accessible. If a container can get
> > > > access to an empty initramfs and write into it, it can ask/answer the question
> > > > "Are there any other containers on this machine running stux24" and then coordinate.
> > 
> > I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
> > path that gives it sensible behavior should be conditional like this.
> > Either make it *always* mount on top of nullfs (regardless of boot
> > options) or find some way to have it actually be the root.  I assume
> > the latter is challenging for some reason.
> > 
> 
> I think that's the plan. I suggested the same to Christian last week,
> and he was amenable to removing the option and just always doing a
> nullfs_rootfs mount.

Whether or not the underlying mount is nullfs is irrelevant. If it's not
nullfs but a regular tmpfs it works just as well. If it has any locked
overmounts the new rootfs will become locked as well, and similarly if
it is owned by a new userns.


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 22:21     ` Jeff Layton
  2026-01-21 10:20       ` Christian Brauner
@ 2026-01-21 18:00       ` Andy Lutomirski
  2026-01-23 10:23         ` Christian Brauner
  2026-01-21 19:56       ` Rob Landley
  2 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2026-01-21 18:00 UTC
  To: Jeff Layton
  Cc: Askar Safin, brauner, amir73il, cyphar, jack, josef,
	linux-fsdevel, viro, Lennart Poettering, David Howells,
	Yunkai Zhang, cgel.zte, Menglong Dong, linux-kernel, initramfs,
	containers, linux-api, news, lwn, Jonathan Corbet, Rob Landley,
	emily, Christoph Hellwig

> On Jan 19, 2026, at 2:21 PM, Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
>>> On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
>>>
>>> Christian Brauner <brauner@kernel.org>:
>>>> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
>>>> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
>>>> returning a file descriptor referring to that mount tree
>>>> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
>>>> to a new mount namespace. In that new mount namespace the copied mount
>>>> tree has been mounted on top of a copy of the real rootfs.
>>>
>>> I want to point at security benefits of this.
>>>
>>> [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
>>> I like them, and I think they should get wider exposure. ]]
>>>
>>> If this patchset ([1]) and [2] both land (they are both in "next" now and
>>> likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
>>> command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
>>> usually contain exactly 2 mounts: nullfs and whatever was passed to
>>> open_tree(OPEN_TREE_NAMESPACE).
>>>
>>> This means that even if attacker somehow is able to unmount its root and
>>> get access to underlying mounts, then the only underlying thing they will
>>> get is nullfs.
>>>
>>> Also this means that other mounts are not only hidden in new namespace, they
>>> are fully absent. This prevents attacks discussed here: [3], [4].
>>>
>>> Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
>>> is passed), there is no anymore hidden writable mount shared by all containers,
>>> potentially available to attackers. This is concern raised in [5]:
>>>
>>>> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
>>>> actually _be_ a filesystem. Even with your "fix", containers could communicate
>>>> with each _other_ through it if it becomes accessible. If a container can get
>>>> access to an empty initramfs and write into it, it can ask/answer the question
>>>> "Are there any other containers on this machine running stux24" and then coordinate.
>>
>> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
>> path that gives it sensible behavior should be conditional like this.
>> Either make it *always* mount on top of nullfs (regardless of boot
>> options) or find some way to have it actually be the root.  I assume
>> the latter is challenging for some reason.
>>
>
> I think that's the plan. I suggested the same to Christian last week,
> and he was amenable to removing the option and just always doing a
> nullfs_rootfs mount.
>
> We think that older runtimes should still "just work" with this scheme.
> Out of an abundance of caution, we _might_ want a command-line option
> to make it go back to old way, in case we find some userland stuff that
> doesn't like this for some reason, but hopefully we won't even need
> that.

What I mean is: even if for some reason the kernel is running in a
mode where the *initial* rootfs is a real fs, I think it would be nice
for OPEN_TREE_NAMESPACE to use nullfs.


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-21 18:00       ` Andy Lutomirski
@ 2026-01-23 10:23         ` Christian Brauner
  2026-01-24 10:13           ` Askar Safin
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2026-01-23 10:23 UTC
  To: Andy Lutomirski
  Cc: Jeff Layton, Askar Safin, amir73il, cyphar, jack, josef,
	linux-fsdevel, viro, Lennart Poettering, David Howells,
	Yunkai Zhang, cgel.zte, Menglong Dong, linux-kernel, initramfs,
	containers, linux-api, news, lwn, Jonathan Corbet, Rob Landley,
	emily, Christoph Hellwig

On Wed, Jan 21, 2026 at 10:00:19AM -0800, Andy Lutomirski wrote:
> > On Jan 19, 2026, at 2:21 PM, Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Mon, 2026-01-19 at 11:05 -0800, Andy Lutomirski wrote:
> >>> On Mon, Jan 19, 2026 at 10:56 AM Askar Safin <safinaskar@gmail.com> wrote:
> >>>
> >>> Christian Brauner <brauner@kernel.org>:
> >>>> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> >>>> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> >>>> returning a file descriptor referring to that mount tree
> >>>> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> >>>> to a new mount namespace. In that new mount namespace the copied mount
> >>>> tree has been mounted on top of a copy of the real rootfs.
> >>>
> >>> I want to point at security benefits of this.
> >>>
> >>> [[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
> >>> I like them, and I think they should get wider exposure. ]]
> >>>
> >>> If this patchset ([1]) and [2] both land (they are both in "next" now and
> >>> likely will be submitted to mainline soon) and "nullfs_rootfs" is passed on
> >>> command line, then mount namespace created by open_tree(OPEN_TREE_NAMESPACE) will
> >>> usually contain exactly 2 mounts: nullfs and whatever was passed to
> >>> open_tree(OPEN_TREE_NAMESPACE).
> >>>
> >>> This means that even if attacker somehow is able to unmount its root and
> >>> get access to underlying mounts, then the only underlying thing they will
> >>> get is nullfs.
> >>>
> >>> Also this means that other mounts are not only hidden in new namespace, they
> >>> are fully absent. This prevents attacks discussed here: [3], [4].
> >>>
> >>> Also this means that (assuming we have both [1] and [2] and "nullfs_rootfs"
> >>> is passed), there is no anymore hidden writable mount shared by all containers,
> >>> potentially available to attackers. This is concern raised in [5]:
> >>>
> >>>> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> >>>> actually _be_ a filesystem. Even with your "fix", containers could communicate
> >>>> with each _other_ through it if it becomes accessible. If a container can get
> >>>> access to an empty initramfs and write into it, it can ask/answer the question
> >>>> "Are there any other containers on this machine running stux24" and then coordinate.
> >>
> >> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
> >> path that gives it sensible behavior should be conditional like this.
> >> Either make it *always* mount on top of nullfs (regardless of boot
> >> options) or find some way to have it actually be the root.  I assume
> >> the latter is challenging for some reason.
> >>
> >
> > I think that's the plan. I suggested the same to Christian last week,
> > and he was amenable to removing the option and just always doing a
> > nullfs_rootfs mount.
> >
> > We think that older runtimes should still "just work" with this scheme.
> > Out of an abundance of caution, we _might_ want a command-line option
> > to make it go back to old way, in case we find some userland stuff that
> > doesn't like this for some reason, but hopefully we won't even need
> > that.
> 
> What I mean is: even if for some reason the kernel is running in a
> mode where the *initial* rootfs is a real fs, I think it would be nice
> for OPEN_TREE_NAMESPACE to use nullfs.

The current patchset makes nullfs unconditional. Since each mount
namespace creates a new copy of the namespace root of the namespace it
was created from, all mount namespaces have nullfs as their namespace root.
So every OPEN_TREE_NAMESPACE/FSMOUNT_NAMESPACE will be mounted on top of
nullfs as we always take the namespace root. If we have to make nullfs
conditional then yes, we could still do that, although it would be ugly
in various ways.

I would love to keep nullfs unconditional because it means I can wipe a
whole class of MNT_LOCKED nonsense from the face of the earth
afterwards.


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-23 10:23         ` Christian Brauner
@ 2026-01-24 10:13           ` Askar Safin
  0 siblings, 0 replies; 9+ messages in thread
From: Askar Safin @ 2026-01-24 10:13 UTC
  To: Christian Brauner
  Cc: Andy Lutomirski, Jeff Layton, amir73il, cyphar, jack, josef,
	linux-fsdevel, viro, Lennart Poettering, David Howells,
	Yunkai Zhang, cgel.zte, Menglong Dong, linux-kernel, initramfs,
	containers, linux-api, news, lwn, Jonathan Corbet, Rob Landley,
	Christoph Hellwig

On Fri, Jan 23, 2026 at 1:23 PM Christian Brauner <brauner@kernel.org> wrote:
> The current patchset makes nullfs unconditional. As each mount

Oops, I missed that "fs: use nullfs unconditionally as the real
rootfs" is present in vfs.all.

-- 
Askar Safin


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-19 22:21     ` Jeff Layton
  2026-01-21 10:20       ` Christian Brauner
  2026-01-21 18:00       ` Andy Lutomirski
@ 2026-01-21 19:56       ` Rob Landley
  2026-02-19 23:42         ` Askar Safin
  2 siblings, 1 reply; 9+ messages in thread
From: Rob Landley @ 2026-01-21 19:56 UTC
  To: Jeff Layton, Andy Lutomirski, Askar Safin
  Cc: amir73il, cyphar, jack, josef, linux-fsdevel, viro,
	Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
	Menglong Dong, linux-kernel, initramfs, containers, linux-api,
	news, lwn, Jonathan Corbet, emily, Christoph Hellwig

>>>> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
>>>> actually _be_ a filesystem. Even with your "fix", containers could communicate
>>>> with each _other_ through it if it becomes accessible. If a container can get
>>>> access to an empty initramfs and write into it, it can ask/answer the question
>>>> "Are there any other containers on this machine running stux24" and then coordinate.

Or you could just make the ROOT= codepath remount the empty initramfs -o 
ro like some switch_root implementations do. If the PID 1 you launch 
isn't in initramfs, don't leave initramfs writeable. That seems unlikely 
to break userspace.

(Having permissions to remount initramfs but _not_ having already 
"cracked root" seems... a bit funky? You have waaaaay more faith in 
security modules than I do...)
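
For reference, the userspace form of such a read-only remount is a single
mount(2) call with MS_REMOUNT | MS_RDONLY. The Python sketch below goes
through ctypes; the helper names remount_flags() and remount_readonly() are
made up for this example, the call needs CAP_SYS_ADMIN, and it only shows what
a switch_root-style tool might do. It is not the in-kernel ROOT= codepath,
which would have to do the equivalent internally.

```python
import ctypes
import os

# Flag values as defined for mount(2) in <sys/mount.h>.
MS_RDONLY = 1
MS_REMOUNT = 32


def remount_flags(readonly: bool = True) -> int:
    """Build the mountflags argument for a remount of an existing mount."""
    return MS_REMOUNT | (MS_RDONLY if readonly else 0)


def remount_readonly(target: str = "/") -> None:
    """Remount the filesystem at `target` read-only (needs CAP_SYS_ADMIN).

    Note: a plain MS_REMOUNT replaces the existing mount flags, so other
    flags such as nosuid would have to be passed again to be preserved.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [
        ctypes.c_char_p,  # source (ignored for MS_REMOUNT)
        ctypes.c_char_p,  # target
        ctypes.c_char_p,  # filesystemtype (ignored for MS_REMOUNT)
        ctypes.c_ulong,   # mountflags
        ctypes.c_void_p,  # data
    ]
    libc.mount.restype = ctypes.c_int
    if libc.mount(None, os.fsencode(target), None, remount_flags(), None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)
```

A tool that is about to switch to the real root could call
remount_readonly() on the old root before doing so; failures surface as
OSError with the errno reported by the kernel.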

>> I think this new OPEN_TREE_NAMESPACE is nifty, but I don't think the
>> path that gives it sensible behavior should be conditional like this.
>> Either make it *always* mount on top of nullfs (regardless of boot
>> options) or find some way to have it actually be the root.  I assume
>> the latter is challenging for some reason.
> 
> I think that's the plan. I suggested the same to Christian last week,
> and he was amenable to removing the option and just always doing a
> nullfs_rootfs mount.

Since 2013, initramfs might be ramfs or tmpfs depending on 
circumstances. Adding a third option for it to be nullfs when there's no 
cpio.gz extracted into it seems reasonable. (You can always mount a 
tmpfs _over_ it if you need that later, it's writeable so a PID 1 
launched in it has workspace.)

That said, if you are changing the semantics, right now we switch_root 
from initramfs instead of pivot_root because initramfs couldn't be 
unmounted. With this change would pivot_root become the mechanism for 
initramfs too? (If the cpio.gz recipient wasn't actually rootfs but was 
an overmount the way ROOT= does it.)

Aside: it would be nice if inaccessible mount points could automatically 
be garbage collected. There's already some "lazy umount" plumbing that 
does that when explicitly requested to, but last I checked there were 
cases that didn't get caught. It's been a while though, might already 
have been fixed. Presumably initramfs would always get pinned because 
it's PID 0's / reference...

Also, could you guys make CONFIG_DEVTMPFS_MOUNT work with initramfs? 
I've posted patches for that on and off since 2017, most recent one's 
probably 
https://landley.net/bin/mkroot/0.8.13/linux-patches/0003-Wire-up-CONFIG_DEVTMPFS_MOUNT-to-initramfs.patch 
(tested on a 6.17 kernel).

> We think that older runtimes should still "just work" with this scheme.
> Out of an abundance of caution, we _might_ want a command-line option
> to make it go back to old way, in case we find some userland stuff that
> doesn't like this for some reason, but hopefully we won't even need
> that.

I assume it will break stuff, but I also assume the systems it breaks 
will never upgrade to a 7.x kernel because the kernel itself would 
consume all available memory before launching PID 1.

Rob


* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
  2026-01-21 19:56       ` Rob Landley
@ 2026-02-19 23:42         ` Askar Safin
  0 siblings, 0 replies; 9+ messages in thread
From: Askar Safin @ 2026-02-19 23:42 UTC
  To: rob; +Cc: containers, initramfs, linux-api, linux-fsdevel, linux-kernel

Rob Landley <rob@landley.net>:
> Also, could you guys make CONFIG_DEVTMPFS_MOUNT work with initramfs?

I did something similar:
https://lore.kernel.org/initramfs/20260219210312.3468980-1-safinaskar@gmail.com/T/#u

Does this solve your problem?

-- 
Askar Safin

