[RFC] Null Namespaces

Linux userland API discussions

* [RFC] Null Namespaces
@ 2026-06-24 22:51 John Ericson
  2026-06-24 23:06 ` Andy Lutomirski
  2026-06-24 23:12 ` Al Viro
  0 siblings, 2 replies; 16+ messages in thread
From: John Ericson @ 2026-06-24 22:51 UTC (permalink / raw)
  To: Li Chen, Cong Wang, Christian Brauner, linux-arch
  Cc: linux-kernel, linux-fsdevel, linux-api, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Jan Kara, Jonathan Corbet,
	Shuah Khan, Alexander Viro, Kees Cook, Sergei Zimmerman,
	Farid Zakaria

Hello, I am hoping to discuss an idea I've had for a while, that I am
calling "null namespaces" that has become more relevant with some recent
other discussions. First I'll discuss null namespaces in general terms,
and then I'll link those recent discussions and relate null namespaces
to them.

### Null namespaces

The essence of null namespaces is trying to give processes as little
ambient authority as possible, so they are lighter weight and allowed to
do even less than fully unshared processes today.

Namespaces as they exist today are frequently described as an isolation
mechanism, but I think this is the conflation of two different things.
*Removing* a new process from its parent's namespaces unquestionably is
increasing isolation --- no disagreement there. But putting the process
in new namespaces is something else; I would call it supporting
"delusions of grandeur" of that process. For example, namespaces allow a
process to do mounts, have `CAP_SYS_ADMIN`, create network interfaces,
look up other processes by PID, etc.

Conceptually, to remove a process from one ambient authority scope (the
very name "namespaces" indicates they are about ambient authority)
should not require putting it in some ambient authority scope. Just
because, for example, the process cannot see one mount tree, doesn't
mean it needs to see another.

Here's what I am thinking would happen concretely:

First, the simpler cases:

#### Null mount namespace

- requires:

  - null root file system: absolute paths don't work.

  - null current working directory: relative paths with traditional,
    non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.

- All operations relating to the "ambient" mount tree don't work.

- `*at` operations with a file descriptor do work.

- The new fd-based mount APIs with detached mounts do work, modulo
  the calling process having enough permissions (as usual).
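
To make the fd-relative style concrete, here is a minimal userland sketch
of file access that never touches the ambient root or CWD. It is an
illustration only, not part of the draft patches; the helper name
`roundtrip_at` is made up for this example, and it assumes the caller
was handed a directory file descriptor by whoever spawned it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `msg` to `name` inside the directory that
 * `dirfd` refers to, read it back into `buf`, then remove the file.
 * Only fd-relative calls are used: no absolute paths, no AT_FDCWD.
 * Returns the number of bytes read back, or -1 on error.
 */
ssize_t roundtrip_at(int dirfd, const char *name, const char *msg,
                     char *buf, size_t buflen)
{
    /* An absolute name would make openat() ignore dirfd entirely. */
    assert(name[0] != '/');

    int fd = openat(dirfd, name,
                    O_CREAT | O_TRUNC | O_WRONLY | O_CLOEXEC, 0600);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, msg, strlen(msg));
    close(fd);
    if (written != (ssize_t)strlen(msg))
        return -1;

    fd = openat(dirfd, name, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, buflen);
    close(fd);

    unlinkat(dirfd, name, 0);
    return n;
}
```

Code written this way should keep working in a process with a null root
and null CWD, since every lookup starts from a descriptor the process
already holds.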

#### Null network namespace

- No network interfaces

- No abstract Unix sockets

#### Null IPC namespace

- cannot create or look up either type of message queue

#### Null UTS namespace

- no hostname or domainname: `uname`, `gethostname`/`sethostname`, and the
  related `/proc/sys/kernel` sysctls all fail.

#### Null user namespace

- Process has no user or group ids

- All uid/gid-based authorization lookups return "denied"

- Operations we don't want to fail (like `fstat`) can report -1 /
  "nobody" IDs.

Note how in each of these, the notion of a "single" null namespace
"existing" or not is degenerate --- processes with a null namespace
field are as isolated from one another (along the axis that namespace
regulates) as they are from processes in other namespaces. It is truly
a minimal permission level, and (as we shall see) cheap too, because it
is just a null pointer in `task_struct`.

Then for the nested ones --- PID and cgroup --- we cannot have quite a
null namespace in the same sense, because it is an important property
that these namespaces are hierarchical up to the root namespaces.
Instead of having a disjoint null namespace, we need a null namespace
with a parent.

#### Null PID namespace

- cannot look up other processes by PID

- current process ID lookup fails

- current process's parent process ID lookup fails

- current process still assigned IDs in parent PID namespaces, per usual

#### Null cgroup namespace

- Process still can have resources restricted according to parent cgroup

- Process unaware of cgroup hierarchy though --- blind to who/how it is
  constrained

In these cases, we cannot just implement this with a null pointer,
because we still need a valid parent namespace. However, we shouldn't
need any info *but* the parent namespace. A pointer paired with a bool
should implement this: the bool says whether the pointer names the
parent of a null namespace or a namespace the task is actually a member
of, and some sort of helper gets the parent namespace in either case
(since an actual namespace has its parent).
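
As a rough sketch of that pointer-plus-bool idea, the following toy
model compiles in userland. All names here (`toy_ns`, `toy_ns_ref`,
`toy_ns_member`, `toy_ns_parent`) are invented for illustration and are
not taken from the kernel or from the draft patches; `toy_ns` merely
stands in for a hierarchical namespace such as a PID namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a hierarchical namespace. */
struct toy_ns {
    struct toy_ns *parent;  /* NULL only for the root namespace */
    int level;
};

/*
 * Hypothetical per-task reference.
 *   is_null == false: the task is a member of *ns.
 *   is_null == true:  the task is in a null namespace whose parent is *ns.
 */
struct toy_ns_ref {
    struct toy_ns *ns;
    bool is_null;
};

/* The namespace the task is a member of, or NULL for a null namespace. */
struct toy_ns *toy_ns_member(struct toy_ns_ref ref)
{
    return ref.is_null ? NULL : ref.ns;
}

/* The parent namespace, obtained uniformly for both kinds of reference. */
struct toy_ns *toy_ns_parent(struct toy_ns_ref ref)
{
    assert(ref.ns != NULL);  /* even a null namespace needs a parent */
    return ref.is_null ? ref.ns : ref.ns->parent;
}
```

Code that only needs the parent (for example, to assign the task an ID
there) would call `toy_ns_parent` and never care which case it is in,
while lookups from inside the task would see `toy_ns_member` return
NULL and fail.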

Finally there is the time namespace. Conceptually a null time namespace
is simple enough --- you cannot look up the time! --- but the
implementation is a bit more complex to get right because of the vDSO
for certain timing operations.

### General Motivation

Why am I so interested in this stuff?

Firstly it is because I have always been interested in a more strictly
object-capability-based userland, and projects like
Capsicum/CloudABI/WASI. I think going all in on file descriptors is
generally the direction that Linux has been going in, and it creates a
genuinely better programming model than the traditional Unix one with
all its ambient authority, and the TOCTOU and other issues that attend
it.

Today's container idioms and the "delusions of grandeur" that namespaces
provide are great for retrofitting existing software to run in a more
isolated environment. But I don't want that to be the ceiling of our
ambitions. Especially in this age of LLM refactoring, it is very easy to
get both new and existing software to abide by the more limited set of
operations that null-namespace processes are allowed. And the
modifications that that entails (more `openat`, more socket activation,
etc.) make that software (in my view) simply *better* --- I would want
it to work that way with or without these constraints forcing the issue.

Secondly, and more concretely/imminently as a Nix developer, I am very
interested in the performance and overhead of process isolation. It is
very much my ambition to move Nix into the Bazel/Buck space of ever more
numerous and fine-grained atomic build steps (i.e. small compilation
units, not "packages"), but to do this *without* sacrificing Nix's
strong sandboxing guarantees, which make our build plans so
self-contained and onboarding new Nix users correspondingly easy.

I think this "null namespace" sandboxing will likely be simpler and more
performant than creating and destroying a bunch of regular namespaces
for each compilation unit. And while it will no doubt take some compiler
/ other tool patching to fix up any assumptions that get in the way of
running processes with so few permissions, I am happy to take a stab at
that too. Nix is, after all, for "tool-assisted yak shaves", as someone
put it --- patching GCC / Clang / whatever and then rebuilding the world is
something we are quite good at.

Lastly, I'll add that the traditional way people have thought about
things like Capsicum/CloudABI is custom personalities/seccomp rules, but
IMO trying to tackle the massive UAPI surface area so shallowly is ugly
and unmaintainable. Nulling out namespace fields in `task_struct`,
conversely, attacks the problem at its core, much more elegantly, and
makes it easy to handle both current *and future* syscalls in a
minimally invasive and maintainable manner.

### Null namespaces and process spawning

Why bring this up now?

Recently [1], Li Chen took a stab at the venerable old goal of making a
better process spawning UAPI than fork/clone + exec. I am quite excited
to see this happen, as it generally dovetails very nicely with the
object capability goals I described above. (E.g. making it performant
and idiomatic to opt in to, rather than opt out of, sharing file
descriptors with a child process is very good for a world where all
resource/privilege sharing is done with file descriptors.)

One problem with clone that hasn't yet come up is that its defaults are
not good from a security perspective: sharing by default, with unsharing
as the opt-in, means that one must remember and take active measures to
ensure that child processes get *fewer* privileges. This is very bad ---
secure practices mean that the "lazy programmer" and the "smallest
program" must always err on the side of giving the child process *fewer*
privileges. This is the only way economics and the "principle of least
privilege" will work together, rather than against each other (and
economics is quite likely to win when they are working against each
other).

The reason that clone *doesn't* work that way is, of course,
performance: it would be wasteful to unshare and create new namespaces
when they are just going to be thrown away because the user wants to
share after all.

Null namespaces I think elegantly work around this performance/security
trade-off, while also avoiding the need for gazillion-parameter syscalls
like clone. This is because, as the most secure option, and a cheap
option, they are the rightful default for a new process creation API.

1. When an "embryonic" (under construction, not yet ready to be
   scheduled) task is first created, it should have all null namespaces.

2. Separate syscalls (`io_uring` exists for batching, we don't need to
   reinvent an ad-hoc batch solution) can exist for setting the
   namespaces on the process, where either "sharing" (use parent process
   namespace) or "unsharing" (use fresh namespace, usually derived from
   the parent process namespace but perhaps derived from a different
   one) are choices that can be opted into instead of the null namespace
   default.

3. After all state is initialized (arguments, environment variables,
   file descriptors, namespaces, etc.), the process can be "birthed",
   and submitted as ready to be scheduled.
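
The three steps above can be modelled in a few lines of userland C. To
be clear, this is a toy: no such syscalls exist today, and every name
below (`demo_task`, `demo_task_embryo`, `demo_task_share`,
`demo_task_birth`) is hypothetical. It only illustrates the
"null by default, sharing by opt-in" logic, not an actual kernel
interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_ns_kind {
    DEMO_NS_MNT, DEMO_NS_NET, DEMO_NS_IPC, DEMO_NS_UTS, DEMO_NS_COUNT
};

struct demo_ns { int id; };

struct demo_task {
    struct demo_ns *ns[DEMO_NS_COUNT];  /* NULL means "null namespace" */
    bool runnable;
};

/* Step 1: an embryonic task starts out with every namespace null. */
struct demo_task demo_task_embryo(void)
{
    struct demo_task t = { .runnable = false };  /* ns[] zero-initialized */
    return t;
}

/* Step 2: explicitly opt in to sharing one namespace with the parent. */
int demo_task_share(struct demo_task *child, const struct demo_task *parent,
                    enum demo_ns_kind kind)
{
    assert(kind < DEMO_NS_COUNT);
    if (child->runnable)
        return -1;  /* too late: the task has already been birthed */
    child->ns[kind] = parent->ns[kind];
    return 0;
}

/* Step 3: all state is set up; mark the task ready to be scheduled. */
void demo_task_birth(struct demo_task *child)
{
    child->runnable = true;
}
```

A caller that forgets step 2 entirely ends up with the most restricted
child, which is the point: the lazy path is also the least-privilege
path.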

This design is very natural to me, but its full naturality is *only*
available with the null namespace option. Otherwise we are stuck in a
place of no good defaults, and the "builder pattern" seems more awkward.

Also in [2], I bring up a design for unix sockets without the file
system or the "abstract" socket namespace, and how I want to avoid both
in order to firmly rule out TOCTOU and other ambient authority issues. I
think those arguments stand on their own, but the possibility of a null
network namespace sharpens the issue: it forces the `O_PATH` FD stuff I
discuss to be the only viable option.

### Implementation

I've "LLM'd" out some draft patches [3] for this. I'm not submitting
them because I still need to review and test them, and I don't want
(currently, pre those steps) low-quality slop to tarnish this proposal.
What this initial exploration did, however, confirm for me is that these
changes should be quite lightweight to implement. (Also, what I propose
is slightly different from my implementation draft in a few cases where
I think the design I proposed here is better than my draft
implementation.)

If the discussion here starts moving towards consensus, I'll clean up
and rework those patches along the lines of the consensus. Ideally I
would submit them one at a time, I figure, since the implementations for
different namespaces are necessarily changes to different subsystems.

Cheers!

John

[1]: https://lore.kernel.org/all/20260528095235.2491226-1-me@linux.beauty/

[2]: https://lore.kernel.org/all/455281ec-3ee1-4f27-989b-c239f0690d8b@app.fastmail.com/

[3]: https://github.com/Ericson2314/linux/commits/null-namespace

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] Null Namespaces
  2026-06-24 22:51 [RFC] Null Namespaces John Ericson
@ 2026-06-24 23:06 ` Andy Lutomirski
  2026-06-24 23:20   ` Andy Lutomirski
  2026-06-24 23:12 ` Al Viro
  1 sibling, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2026-06-24 23:06 UTC (permalink / raw)
  To: John Ericson
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, linux-kernel,
	linux-fsdevel, linux-api, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Kees Cook, Sergei Zimmerman, Farid Zakaria

On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:
>
> Hello, I am hoping to discuss an idea I've had for a while, that I am
> calling "null namespaces" that has become more relevant with some recent
> other discussions. First I'll discuss null namespaces in general terms,
> and then I'll link those recent discussions and relate null namespaces
> to them.
>
> ### Null namespaces
>
> The essence of null namespaces is trying to give processes as little
> ambient authority as possible, so they are lighter weight and allowed to
> do even less than fully unshared processes today.
>
> Namespaces as they exist today are frequently described as an isolation
> mechanism, but I think this is the conflation of two different things.
> *Removing* a new process from its parent's namespaces unquestionably is
> increasing isolation --- no disagreement there. But putting the process
> in new namespaces is something else; I would call it supporting
> "delusions of grandeur" of that process. For example, namespaces allow a
> process to do mounts, have `CAP_SYS_ADMIN`, create network interfaces,
> look up other processes by PID, etc.
>
> Conceptually, to remove a process from one ambient authority scope (the
> very name "namespaces" indicates they are about ambient authority)
> should not require putting it in some ambient authority scope. Just
> because, for example, the process cannot see one mount tree, doesn't
> mean it needs to see another.

I think I like this, but some comments:

>
> Here's what I am thinking would happen concretely:
>
> First, the simpler cases:
>
> #### Null mount namespace
>
> - requires:
>
>   - null root file system: absolute paths don't work.
>
>   - null current working directory: relative paths with traditional,
>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.

It's perfectly valid to cd to a directory that does not belong to
one's namespace.  We have fchdir.  What's wrong with letting it
continue working?

Regardless of that, the current directory either needs to be a
directory or to be nothing at all, and if we support the latter, we
need to figure out what /proc will show.

> #### Null user namespace

A user namespace is kind of about how *non-current* uids and gids work
for the process and how it perceives its own uid and gid and not so
much about what uid and gid it has when accessing outside resources.
So...

>
> - Process has no user or group ids

What does that mean?  What does ps show?



Maybe the way to go is to implement the ones that have clearer
semantics and to defer the others.

* Re: [RFC] Null Namespaces
  2026-06-24 23:06 ` Andy Lutomirski
@ 2026-06-24 23:20   ` Andy Lutomirski
  2026-06-24 23:53     ` John Ericson
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2026-06-24 23:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Ericson, Li Chen, Cong Wang, Christian Brauner, linux-arch,
	linux-kernel, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Kees Cook, Sergei Zimmerman, Farid Zakaria

On Wed, Jun 24, 2026 at 4:06 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:

> >   - null current working directory: relative paths with traditional,
> >     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
>
> It's perfectly valid to cd to a directory that does not belong to
> one's namespace.  We have fchdir.  What's wrong with letting it
> continue working?
>
> Regardless of that, the current directory either needs to be a
> directory or to be nothing at all, and if we support the latter, we
> need to figure out what /proc will show.

Thinking about this more: I think that handling CWD might actually be
a prerequisite for the series and has little to do with namespaces.
Maybe try adding, as a standalone feature, the ability to have a null
CWD.  Define semantics and see what the implementation looks like.

Then, if you add null namespaces, you could optionally make
transitioning to a null namespace set a null CWD.  Or those features
could be orthogonal.

* Re: [RFC] Null Namespaces
  2026-06-24 23:20   ` Andy Lutomirski
@ 2026-06-24 23:53     ` John Ericson
  2026-06-25  1:10       ` Al Viro
  0 siblings, 1 reply; 16+ messages in thread
From: John Ericson @ 2026-06-24 23:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, LKML,
	linux-fsdevel, linux-api, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jan Kara, Jonathan Corbet, Shuah Khan, Al Viro, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Wed, Jun 24, 2026, at 7:20 PM, Andy Lutomirski wrote:
> I think I like this, but some comments:

Thanks, that's really nice to hear!

While arguably this is just the culmination of a direction Linux has
been going in for a while, it could also be seen as a very "out there"
idea. That at least one person likes the rough sound of things makes me
feel a lot better!

> On Wed, Jun 24, 2026 at 4:06 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Jun 24, 2026 at 3:52 PM John Ericson <mail@johnericson.me> wrote:
>
> > >   - null current working directory: relative paths with traditional,
> > >     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> >
> > It's perfectly valid to cd to a directory that does not belong to
> > one's namespace.  We have fchdir.  What's wrong with letting it
> > continue working?
> >
> > Regardless of that, the current directory either needs to be a
> > directory or to be nothing at all, and if we support the latter, we
> > need to figure out what /proc will show.
>
> Thinking about this more: I think that handling CWD might actually be
> a prerequisite for the series and has little to do with namespaces.
> Maybe try adding, as a standalone feature, the ability to have a null
> CWD.  Define semantics and see what the implementation looks like.
>
> Then, if you add null namespaces, you could optionally make
> transitioning to a null namespace set a null CWD.  Or those features
> could be orthogonal.

Hehe, I had the same thought after working on the filesystem patches,
along with the analogous thought for the root filesystem. It had been so
long since I had done a `chroot` without also doing a mount namespace
`unshare` --- despite the former being much older --- that I had
forgotten this separation of concerns.

My apologies for forgetting to include this insight in the original
email.

> Maybe the way to go is to implement the ones that have clearer
> semantics and to defer the others.

I would much prefer this, actually.

I wanted to discuss a bit about each type of namespace to indicate that
this is a concept I think works across the board --- it wouldn't be such
a good solution for the process spawning API if it was only applicable
to some but not all namespace types. But the truth is that I have
thought about the FS cases the most, as I think you have picked up on.

If there is interest in landing

  1. null CWD
  2. null root fs
  3. null mount namespace

in isolation, and then returning to the other namespaces to iron out
their details, that would be fantastic. It would be much nicer for me to
get some momentum that way, without having to design everything all at
once first before getting to implement anything.

John

* Re: [RFC] Null Namespaces
  2026-06-24 23:53     ` John Ericson
@ 2026-06-25  1:10       ` Al Viro
  2026-06-25  3:41         ` John Ericson
  0 siblings, 1 reply; 16+ messages in thread
From: Al Viro @ 2026-06-25  1:10 UTC (permalink / raw)
  To: John Ericson
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Wed, Jun 24, 2026 at 07:53:53PM -0400, John Ericson wrote:
> I wanted to discuss a bit about each type of namespace to indicate that
> this is a concept I think works across the board --- it wouldn't be such
> a good solution for the process spawning API if it was only applicable
> to some but not all namespace types. But the truth is that I have
> thought about the FS cases the most, as I think you have picked up on.
> 
> If there is interest in landing
> 
>   1. null CWD
>   2. null root fs
>   3. null mount namespace
> 
> in isolation, and then returning to the other namespaces to iron out
> their details, that would be fantastic. It would be much nicer for me to
> get some momentum that way, without having to design everything all at
> once first before getting to implement anything.

Please, start with explaining what, in your opinion, a mount namespace _is_,
and where does "mount X is attached at path P relative to mount Y" belong.

What's the fundamental difference between CWD and any open descriptor for
a directory?  Why does it make sense to ban the former, but allow the
equivalents done via the latter?

* Re: [RFC] Null Namespaces
  2026-06-25  1:10       ` Al Viro
@ 2026-06-25  3:41         ` John Ericson
  2026-06-25 15:51           ` Andy Lutomirski
  2026-06-26  0:15           ` Al Viro
  0 siblings, 2 replies; 16+ messages in thread
From: John Ericson @ 2026-06-25  3:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

Ah, I started replying to your first email, but this is better, this
gets to the heart of the matter. Please don't mind me responding to your
two questions in reverse.

On Wed, Jun 24, 2026, at 9:10 PM, Al Viro wrote:
> What's the fundamental difference between CWD and any open descriptor
> for a directory?  Why does it make sense to ban the former, but allow
> the equivalents done via the latter?

Yes! These two notions are very close --- but that's the *problem*, not
a reason to not care about the existence of the CWD and root FS. I want
to get rid of CWD in my processes not because it is fundamentally
different (it isn't), but because it is superfluous.

If one is capability-minded like me, it's a bad mistake that we ever had
this "working directory" notion to begin with, and yet another example
of the folks at Bell Labs sticking something in the kernel that was
really only needed by the shell, and that could have just been done in
userland.

The current working directory, roughly, is *just* some global state
holding a directory file descriptor. But I don't want that global state.
If I am writing my userland program (that is not a shell), I would not
create the global variable. I do not appreciate the fact that the kernel
foists that state upon me whether I like it or not.
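
A small sketch of the "CWD is just a hidden global directory fd" view,
using only existing calls (`fchdir`, `stat`, `fstatat`). The function
name `cwd_matches_dirfd` is made up for this example. It checks that a
relative name resolves to the same file whether it goes through the
process-wide CWD or through an explicit descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if `name` resolves to the same file via the CWD (after
 * fchdir(dirfd)) as via an explicit fstatat(dirfd, ...), 0 if not, and
 * -1 on error. Side effect: the process-wide CWD is changed.
 */
int cwd_matches_dirfd(int dirfd, const char *name)
{
    struct stat via_cwd, via_fd;

    /* An absolute name would bypass both the CWD and dirfd. */
    assert(name != NULL && name[0] != '/');

    if (fchdir(dirfd) < 0)                      /* set the global state */
        return -1;
    if (stat(name, &via_cwd) < 0)               /* implicit: uses the CWD */
        return -1;
    if (fstatat(dirfd, name, &via_fd, 0) < 0)   /* explicit: uses dirfd */
        return -1;

    return via_cwd.st_dev == via_fd.st_dev &&
           via_cwd.st_ino == via_fd.st_ino;
}
```

The explicit form needs no process-wide state at all, which is why a
program that already passes descriptors around loses nothing if the CWD
slot is empty.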

Now obviously we cannot have a giant breaking change removing the notion
of a current working directory altogether. But we can allow individual
processes which don't want it to opt out, and that is what nulling out
these fields (and updating the path resolution code to cope with that)
allows.

There is no loss of expressive power doing this, because one can (and
should!) just use the `*at` and file descriptors. But there is, however,
the imposition of discipline. The programmer (or coding agent) is
encouraged to do everything with file descriptors rather than path
concatenations etc., because they need to use `*at` anyways, and then
voilà, without browbeating anyone in security seminars or code review, a
bunch of TOCTOU issues disappear simply because doing the right thing is
now the path of least resistance.

> Please, start with explaining what, in your opinion, a mount namespace
> _is_, and where does "mount X is attached at path P relative to mount
> Y" belong.

Let's take a pathological example:

- Process A has `/foo` bind-mounted at `/bar/foo`

- Process B has `/bar` without that bind mount, and `/foo` mounted at
  `/baz/foo`, as is possible because it is in a different mount
  namespace.

If A opens `/bar/foo`, and sends it over (via socket) to B, and then B
does `openat(recv_fd, "..")`, B will get `/bar`, not `/baz`. This is
because `..` is resolved according to the mount referenced in the open
file. (This is, by the way, very good! Directory file descriptors would
be perilous to use if this were not the case!)
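
Here is a small sketch of the weaker, easily checked half of that claim:
`..` relative to a directory descriptor depends on where the descriptor
points, not on the caller's CWD. It does not exercise the
cross-mount-namespace case described above, which needs privileges to
set up. The helper names `open_parent_of` and `same_file` are made up
for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open ".." relative to dirfd. Returns a new O_PATH descriptor for the
 * parent directory, or -1 on error.
 */
int open_parent_of(int dirfd)
{
    /* Reject AT_FDCWD: the point is to avoid the ambient CWD. */
    assert(dirfd >= 0);
    return openat(dirfd, "..", O_PATH | O_DIRECTORY | O_CLOEXEC);
}

/* Returns 1 if both descriptors refer to the same file, 0 if not, -1 on error. */
int same_file(int a, int b)
{
    struct stat sa, sb;

    if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

For example, after opening a subdirectory of some directory `outer` and
then calling `chdir("/")`, `open_parent_of` on the subdirectory
descriptor should still yield `outer`.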

The moral of the story is that "mount X is attached at path P relative
to mount Y" is information accessed in the mounts themselves (maybe via
their containing mount namespace, per the `mnt_ns` field, or maybe not,
I am not sure, but it is immaterial). In contrast, the mount namespace
of the *opening* task (`current->nsproxy->mnt_ns`, and current is B)
doesn't matter at all for this purpose.

I am not on a crusade against `struct mnt_namespace` in general; I am
just trying to null out `(struct nsproxy)::mnt_ns` in particular. (This
is just as I am not on a crusade against `struct path`, just `root` and
`pwd` of `struct fs_struct`.)

These days, `current->nsproxy->mnt_ns` is, to me, first and foremost,
there for the legacy mount API. Again, just like our CWD example above,
this is mostly just global state.

The new mount API drastically [^1] reduces the need for it, since it
allows referring to mounts explicitly via file descriptors. That's OK!
The argument is the same as the above --- I am *not* trying to limit
what can be done if one has all the right files open with the right
perms. I am just trying to limit what works out of the box --- to reduce
the default set of privileges, *especially* where the resources involved
are implicit and/or stateful.

[^1]: It doesn't *quite* eliminate the need for `nsproxy->mnt_ns`
    entirely, since (as I understand it, from reading the `move_mount`
    man page) it is still used for some authorization checks, since
    `O_PATH` file descriptors do not grant privileges other than mere
    discoverability. But that's a problem that could be solved later
    with an `O_MOUNT` option analogous to `O_RDONLY` or `O_WRONLY`. In
    the meantime, I am perfectly happy if my processes with null mount
    namespaces get `move_mount` permission errors.

* Re: [RFC] Null Namespaces
  2026-06-25  3:41         ` John Ericson
@ 2026-06-25 15:51           ` Andy Lutomirski
  2026-06-25 18:21             ` John Ericson
  2026-06-26  0:15           ` Al Viro
  1 sibling, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2026-06-25 15:51 UTC (permalink / raw)
  To: John Ericson
  Cc: Al Viro, Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Wed, Jun 24, 2026 at 8:41 PM John Ericson <mail@johnericson.me> wrote:
>
> Ah, I started replying to your first email, but this is better, this
> gets to the heart of the matter. Please don't mind me responding to your
> two questions in reverse.
>
> On Wed, Jun 24, 2026, at 9:10 PM, Al Viro wrote:
> > What's the fundamental difference between CWD and any open descriptor
> > for a directory?  Why does it make sense to ban the former, but allow
> > the equivalents done via the latter?
>
> Yes! These two notions are very close --- but that's the *problem*, not
> a reason to not care about the existence of the CWD and root FS. I want
> to get rid of CWD in my processes not because it is fundamentally
> different (it isn't), but because it is superfluous.
>
> If one is capability-minded like me, it's a bad mistake that we ever had
> this "working directory" notion to begin with, and yet another example
> of the folks at Bell Labs sticking something in the kernel that was
> really only needed by the shell, and that could have just been done in
> userland.
>
> The current working directory, roughly, is *just* some global state
> holding a directory file descriptor. But I don't want that global state.
> If I am writing my userland program (that is not a shell), I would not
> create the global variable. I do not appreciate the fact that the kernel
> foists that state upon me whether I like it or not.
>
> Now obviously we cannot have a giant breaking change removing the notion
> of a current working directory altogether. But we can allow individual
> processes which don't want it to opt out, and that is what nulling out
> these fields (and updating the path resolution code to cope with that)
> allows.
>
> There is no loss of expressive power doing this, because one can (and
> should!) just use the `*at` and file descriptors. But there is, however,
> the imposition of discipline. The programmer (or coding agent) is
> encouraged to do everything with file descriptors rather than path
> concatenations etc., because they need to use `*at` anyways, and then
> voilà, without browbeating anyone in security seminars or code review, a
> bunch of TOCTOU issues disappear simply because doing the right thing is
> now the path of least resistance.
>
> > Please, start with explaining what, in your opinion, a mount namespace
> > _is_, and where does "mount X is attached at path P relative to mount
> > Y" belong.
>
> Let's take a pathological example:
>
> - Process A has `/foo` bind-mounted at `/bar/foo`
>
> - Process B has `/bar` without that bind mount, and `/foo` mounted at
>   `/baz/foo`, as is possible because it is in a different mount
>   namespace.
>
> If A opens `/bar/foo`, and sends it over (via socket) to B, and then B
> does `openat(recv_fd, "..")`, B will get `/bar`, not `/baz`. This is
> because `..` is resolved according to the mount referenced in the open
> file. (This is, by the way, very good! Directory file descriptors would
> be perilous to use if this were not the case!)
>
> The moral of the story is that "mount X is attached at path P relative
> to mount Y" is information accessed in the mounts themselves (maybe via
> their containing mount namespace, per the `mnt_ns` field, or maybe not,
> I am not sure, but it is immaterial). In contrast, the mount namespace
> of the *opening* task (`current->nsproxy->mnt_ns`, and current is B)
> doesn't matter at all for this purpose.

It's sort of a combination -- read the data structures :)  Other than
the propagation part, they're really not that bad.

In any event, I think this discussion is sort of immaterial to the
proposed API change.  No one is about to remove the concept of a mount
namespace.  But maybe it makes sense to have a way to have a task that
doesn't actually belong to a mount namespace.  A mount namespace is
certainly going to exist.

There will definitely be subtleties.  For example, what happens if a
task with "no mount namespace" tries to do OPEN_TREE_CLONE?  In some
logical sense it ought to work but it ought to be impossible to
actually mount the resulting tree anywhere, but this risks running
afoul of all kinds of checks.  Maybe you get a whole new mount
namespace (that does not become your current mnt_ns) if you
OPEN_TREE_CLONE?

This stuff is complex and it probably makes more sense to keep changes simple.

* Re: [RFC] Null Namespaces
  2026-06-25 15:51           ` Andy Lutomirski
@ 2026-06-25 18:21             ` John Ericson
  0 siblings, 0 replies; 16+ messages in thread
From: John Ericson @ 2026-06-25 18:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Li Chen, Cong Wang, Christian Brauner, linux-arch, LKML,
	linux-fsdevel, linux-api, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Thu, Jun 25, 2026, at 11:51 AM, Andy Lutomirski wrote:
> On Wed, Jun 24, 2026 at 8:41 PM John Ericson <mail@johnericson.me> wrote:
>
> It's sort of a combination -- read the data structures :)  Other than
> the propagation part, they're really not that bad.

Are you saying path resolution *does* depend on the mount namespace that
the task belongs to? I certainly hope not! I did look over the data
structures along with my patches and I didn't see an example of this ---
just path resolution depending on the CWD and root directories (as one
would expect it to).

> In any event, I think this discussion is sort of immaterial to the
> proposed API change.  No one is about to remove the concept of a mount
> namespace.  But maybe it makes sense to have a way to have a task that
> doesn't actually belong to a mount namespace.  A mount namespace is
> certainly going to exist.

I am not sure if that is addressed more to Al or me? I certainly do
agree with all that, in any case. Mount namespaces are absolutely here
to stay, and I'm just trying to make a process that does not belong to
one; that's exactly correct. Sorry if my motivation by way of historical
analysis veered off topic.

> There will definitely be subtleties.  For example, what happens if a
> task with "no mount namespace" tries to do OPEN_TREE_CLONE?  In some
> logical sense it ought to work but it ought to be impossible to
> actually mount the resulting tree anywhere, but this risks running
> afoul of all kinds of checks.  Maybe you get a whole new mount
> namespace (that does not become your current mnt_ns) if you
> OPEN_TREE_CLONE?
>
> This stuff is complex and it probably makes more sense to keep changes simple.

Yes it is subtle; I definitely don't claim to fully understand the
permission model with mount namespace modifications yet, for one. Should
we switch gears to just discussing the null CWD and root directories,
then, and return to mount namespaces later?

I have started to rework my patch series accordingly, so I have a new
draft first patch for just that, before changing anything else. I could
(after some testing) submit that next; it's pretty small.

John


* Re: [RFC] Null Namespaces
  2026-06-25  3:41         ` John Ericson
  2026-06-25 15:51           ` Andy Lutomirski
@ 2026-06-26  0:15           ` Al Viro
  2026-06-26 16:26             ` John Ericson
  1 sibling, 1 reply; 16+ messages in thread
From: Al Viro @ 2026-06-26  0:15 UTC (permalink / raw)
  To: John Ericson
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Wed, Jun 24, 2026 at 11:41:07PM -0400, John Ericson wrote:

> The current working directory, roughly, is *just* some global state
> holding a directory file descriptor.

So's the descriptor table; what's the difference?

> But I don't want that global state.

Don't use it, then... out of curiosity, does that extend to stdout et al.?

> If I am writing my userland program (that is not a shell), I would not
> create the global variable. I do not appreciate the fact that the kernel
> foists that state upon me whether I like it or not.

<wry> Kernel will have to live without your appreciation, I suppose. </wry>

> Now obviously we cannot have a giant breaking change removing the notion
> of a current working directory altogether. But we can allow individual
> processes which don't want it to opt out, and that is what nulling out
> these fields (and updating the path resolution code to cope with that)
> allows.
> 
> There is no loss of expressive power doing this, because one can (and
> should!) just use the `*at` and file descriptors. But there is, however,
> the imposition of discipline.

So supply a library of your own and try to convince people to use it
instead of libc.  You'll have to anyway, seeing that a large and
hard-to-predict part of libc will be non-functional.  Which syscalls
are used by your library is entirely up to you.

Would that kind of thing, added kernel-side, assist the development of such a
library?  Maybe, but I wouldn't bet too much on that - if you start from
scratch, you can trivially verify that you don't even attempt a given
set of syscalls and if you use libc as a starting point, you get to
debug all the failure exits you've added...

> The programmer (or coding agent) is
> encouraged to do everything with file descriptors rather than path
> concatenations etc., because they need to use `*at` anyways, and then
> voilà, without browbeating anyone in security seminars or code review, a
> bunch of TOCTOU issues disappear simply because doing the right thing is
> now the path of least resistance.

I'm sorry, but the path of least resistance is picking a snippet from google
that will implement open(), etc., on top of your setup and using it.
_Especially_ if coding agents are going to be involved, precisely because
they'll do a convincing simulation of human duhveloper's behaviour, i.e.
"cut'n'paste it from the net".


* Re: [RFC] Null Namespaces
  2026-06-26  0:15           ` Al Viro
@ 2026-06-26 16:26             ` John Ericson
  0 siblings, 0 replies; 16+ messages in thread
From: John Ericson @ 2026-06-26 16:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Andy Lutomirski, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Thu, Jun 25, 2026, at 8:15 PM, Al Viro wrote:
> On Wed, Jun 24, 2026 at 11:41:07PM -0400, John Ericson wrote:
>
> > But I don't want that global state.
>
> Don't use it, then... out of curiosity, does that extend to stdout et al.?

Good question; it turns out I like the standard streams much better!

First of all, the standard streams are just an idiom --- there is
nothing actually special about file descriptors 0, 1, and 2. That's a
clean design --- the kernel doesn't need to know about userspace idioms.

Second of all, if you don't want any of those, you can just close 'em!
You can't do that with the cwd, however. It's stuck open.

Ideally `*at` would have been with us from the beginning, and, say, file
descriptor 3 would have been the "current working directory" merely by
convention.

> Would that kind of thing, added kernel-side, assist the development of such a
> library?  Maybe, but I wouldn't bet too much on that - if you start from
> scratch, you can trivially verify that you don't even attempt a given
> set of syscalls and if you use libc as a starting point, you get to
> debug all the failure exits you've added...

First of all, I am trying to change what processes are allowed to do,
and this includes programs I did not write. A libc-based solution is the
program cooperating with its own sandboxing; this is not a solution for
running arbitrary programs which may not be trusted in a restricted
manner.

Second of all, this would be very laborious in practice, because we're
talking not about what syscalls the program uses, but about what data is
passed in those syscalls. Any program that consumes arbitrary user input
(like shell utilities) might receive an absolute or relative path, and
so it would have to manually check for that, lest the user input "trick"
the program into using the root dir and cwd it is trying to ignore.

Making a few tiny edits in the kernel path resolution logic to allow for
these null fields is much more practical than defending a much broader
perimeter in userspace.

> > The programmer (or coding agent) is
> > encouraged to do everything with file descriptors rather than path
> > concatenations etc., because they need to use `*at` anyways, and then
> > voilà, without browbeating anyone in security seminars or code review, a
> > bunch of TOCTOU issues disappear simply because doing the right thing is
> > now the path of least resistance.
>
> I'm sorry, but the path of least resistance is picking a snippet from google
> that will implement open(), etc., on top of your setup and using it.
> _Especially_ if coding agents are going to be involved, precisely because
> they'll do a convincing simulation of human duhveloper's behaviour, i.e.
> "cut'n'paste it from the net".

We agree! But this is precisely why it is important to make these things
fail. Mindless Stack Overflow cut'n'pasters (human or agent) still run
their program to make sure it works. Making the thing you don't want
them to do *actually fail* creates sufficiently strong and incremental
feedback that they will end up doing the right thing.

> > The current working directory, roughly, is *just* some global state
> > holding a directory file descriptor.
>
> So's the descriptor table; what's the difference?

Now that I've responded to everything else, I can answer this in
summary:

- File descriptors can be closed; cwd and root cannot be.

- File descriptors need to be explicitly used in syscalls. The cwd and
  root are implicitly used (in too many different syscalls to make
  syscall-level auditing practical) based on the sort of path string
  argument to the syscall, without the program's explicit consent.

John


* Re: [RFC] Null Namespaces
  2026-06-24 22:51 [RFC] Null Namespaces John Ericson
  2026-06-24 23:06 ` Andy Lutomirski
@ 2026-06-24 23:12 ` Al Viro
  2026-06-25 21:00   ` H. Peter Anvin
  1 sibling, 1 reply; 16+ messages in thread
From: Al Viro @ 2026-06-24 23:12 UTC (permalink / raw)
  To: John Ericson
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, linux-kernel,
	linux-fsdevel, linux-api, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Wed, Jun 24, 2026 at 06:51:47PM -0400, John Ericson wrote:

> #### Null mount namespace
> 
> - requires:
> 
>   - null root file system: absolute paths don't work.
> 
>   - null current working directory: relative paths with traditional,
>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> 
> - All operations relating to the "ambient" mount tree don't work.
> 
> - `*at` operations with a file descriptor do work.

Huh?  The last bit looks like it contradicts the previous one - if you have
an opened directory in a mount from some namespace, those `*at` operations
with that descriptor *will* be seeing the mount tree of that namespace,
whatever the hell is "ambient" supposed to mean.  Either that, or you
will be exposing whatever's overmounted in that mount, which is a huge
can of worms.


* Re: [RFC] Null Namespaces
  2026-06-24 23:12 ` Al Viro
@ 2026-06-25 21:00   ` H. Peter Anvin
  2026-06-25 21:50     ` John Ericson
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2026-06-25 21:00 UTC (permalink / raw)
  To: Al Viro, John Ericson
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, linux-kernel,
	linux-fsdevel, linux-api, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On 2026-06-24 16:12, Al Viro wrote:
> On Wed, Jun 24, 2026 at 06:51:47PM -0400, John Ericson wrote:
> 
>> #### Null mount namespace
>>
>> - requires:
>>
>>   - null root file system: absolute paths don't work.
>>
>>   - null current working directory: relative paths with traditional,
>>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
>>
>> - All operations relating to the "ambient" mount tree don't work.
>>
>> - `*at` operations with a file descriptor do work.
> 
> Huh?  The last bit looks like it contradicts the previous one - if you have
> an opened directory in a mount from some namespace, those `*at` operations
> with that descriptor *will* be seeing the mount tree of that namespace,
> whatever the hell is "ambient" supposed to mean.  Either that, or you
> will be exposing whatever's overmounted in that mount, which is a huge
> can of worms.

It seems to me that this is really no different *in practice* to having an
empty mount namespace, no? You might still be able to stat("/") and get a
d--------- result, but how does that actually affect anything?

The big thing with a lot of this is that introducing a null case can really
complicate things all over the place, and since this is very likely to be only
a niche use case, it kind of screams to me like it has the potential to become
an attack surface like any other rarely used code in the kernel...

	-hpa



* Re: [RFC] Null Namespaces
  2026-06-25 21:00   ` H. Peter Anvin
@ 2026-06-25 21:50     ` John Ericson
  2026-06-25 23:09       ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: John Ericson @ 2026-06-25 21:50 UTC (permalink / raw)
  To: H. Peter Anvin, Al Viro
  Cc: Li Chen, Cong Wang, Christian Brauner, linux-arch, LKML,
	linux-fsdevel, linux-api, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Thu, Jun 25, 2026, at 5:00 PM, H. Peter Anvin wrote:
> On 2026-06-24 16:12, Al Viro wrote:
> > On Wed, Jun 24, 2026 at 06:51:47PM -0400, John Ericson wrote:
> >
> >> #### Null mount namespace
> >>
> >> - requires:
> >>
> >>   - null root file system: absolute paths don't work.
> >>
> >>   - null current working directory: relative paths with traditional,
> >>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> >>
> >> - All operations relating to the "ambient" mount tree don't work.
> >>
> >> - `*at` operations with a file descriptor do work.
> >
> > Huh?  The last bit looks like it contradicts the previous one - if you have
> > an opened directory in a mount from some namespace, those `*at` operations
> > with that descriptor *will* be seeing the mount tree of that namespace,
> > whatever the hell is "ambient" supposed to mean.  Either that, or you
> > will be exposing whatever's overmounted in that mount, which is a huge
> > can of worms.
>
> It seems to me that this is really no different *in practice* to having an
> empty mount namespace, no? You might still be able to stat("/") and get a
> d--------- result, but how does that actually affect anything?

The argument against just having an empty, immutable root directory and
calling it a day is the tie-in with a new process-spawning API discussed
near the bottom of my original email. I want to have nice secure
defaults, rather than forcing the programmer to remember to unshare, but
I also don't want to degrade performance by speculatively creating new
empty mount namespaces that might just be thrown away. Null fields alone
get us both --- security and good performance.

> The big thing with a lot of this is that introducing a null case can really
> complicate things all over the place, and since this is very likely to be only
> a niche use case, it kind of screams to me like it has the potential to become
> an attack surface like any other rarely used code in the kernel...

I understand and am sympathetic to this line of reasoning, but I think
it is important to look at the patch in question (which I suppose I
should soon submit) to weigh the competing concerns.

The kernel rightfully has consolidated path resolution in a few key
places as much as possible -- the internal `struct path` does not suffer
from these issues. I barely modify those places to support null root and
CWD, and because of that consolidation, we shouldn't expect new places
to crop up in the future. (Duplicative path resolution logic is a bad
idea whether or not we have a nascent, little-used NULL-cwd/root code
path.) Therefore, I think existing code review, even among people
totally ignorant of this feature, will protect us --- the vast majority
of code will just be working with `struct path`, and be totally
unaffected by this change.

Moreover, every new feature starts rarely used. This is to me a
judicious anti-feature (removing state, making more things fail) that
should be quite intuitive to those developing for Linux, given the
prominence of things like WASI, and I will do what I can in the Nix
ecosystem to try to get it widely used in short order. Just guessing
from its design, this ought to be something other ecosystems, like
Android, are also interested in.

John


* Re: [RFC] Null Namespaces
  2026-06-25 21:50     ` John Ericson
@ 2026-06-25 23:09       ` Andy Lutomirski
  2026-06-26  8:27         ` David Laight
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2026-06-25 23:09 UTC (permalink / raw)
  To: John Ericson
  Cc: H. Peter Anvin, Al Viro, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Thu, Jun 25, 2026 at 2:53 PM John Ericson <mail@johnericson.me> wrote:
>
> On Thu, Jun 25, 2026, at 5:00 PM, H. Peter Anvin wrote:
> > On 2026-06-24 16:12, Al Viro wrote:
> > > On Wed, Jun 24, 2026 at 06:51:47PM -0400, John Ericson wrote:
> > >
> > >> #### Null mount namespace
> > >>
> > >> - requires:
> > >>
> > >>   - null root file system: absolute paths don't work.
> > >>
> > >>   - null current working directory: relative paths with traditional,
> > >>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> > >>
> > >> - All operations relating to the "ambient" mount tree don't work.
> > >>
> > >> - `*at` operations with a file descriptor do work.
> > >
> > > Huh?  The last bit looks like it contradicts the previous one - if you have
> > > an opened directory in a mount from some namespace, those `*at` operations
> > > with that descriptor *will* be seeing the mount tree of that namespace,
> > > whatever the hell is "ambient" supposed to mean.  Either that, or you
> > > will be exposing whatever's overmounted in that mount, which is a huge
> > > can of worms.
> >
> > It seems to me that this is really no different *in practice* to having an
> > empty mount namespace, no? You might still be able to stat("/") and get a
> > d--------- result, but how does that actually affect anything?
>
> The argument against just having an empty, immutable root directory and
> calling it a day is the tie-in with a new process-spawning API discussed
> near the bottom of my original email. I want to have nice secure
> defaults, rather than forcing the programmer to remember to unshare, but
> I also don't want to degrade performance by speculatively creating new
> empty mount namespaces that might just be thrown away. Null fields alone
> get us both --- security and good performance.

This seems like a false dichotomy.  There's such a thing as a singleton.

In fact, we have this spiffy nullfs_fs_get_tree.  It seems relatively
straightforward to have an API to get an fd to the singleton nullfs,
and the default for a newly spawned process could even be to have cwd
pointing at nullfs.

root is still harder, because of the shadowing issue.  I think I
proposed, ages ago, relaxing the chroot rules so that, at least under
certain circumstances (e.g. the task is not already chrooted) an
unprivileged task could chroot.  chrooting to nullfs seems like a
somewhat useful operation.

I can imagine more complex schemes to allow even a chrooted process to
safely start acting as though their root is nullfs, but that would be
potentially fairly nasty.  *Maybe* everything would work if there was
a root-for-dotdot and a separate root-for-absolute-paths, and
nameidata->root could point to the former, but I'm certainly not
willing to say that I think this would work with any confidence at
all.

--Andy


* Re: [RFC] Null Namespaces
  2026-06-25 23:09       ` Andy Lutomirski
@ 2026-06-26  8:27         ` David Laight
  2026-06-26 17:23           ` John Ericson
  0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2026-06-26  8:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Ericson, H. Peter Anvin, Al Viro, Li Chen, Cong Wang,
	Christian Brauner, linux-arch, LKML, linux-fsdevel, linux-api,
	Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

On Thu, 25 Jun 2026 16:09:58 -0700
Andy Lutomirski <luto@kernel.org> wrote:

> On Thu, Jun 25, 2026 at 2:53 PM John Ericson <mail@johnericson.me> wrote:
> >
> > On Thu, Jun 25, 2026, at 5:00 PM, H. Peter Anvin wrote:  
> > > On 2026-06-24 16:12, Al Viro wrote:  
> > > > On Wed, Jun 24, 2026 at 06:51:47PM -0400, John Ericson wrote:
> > > >  
> > > >> #### Null mount namespace
> > > >>
> > > >> - requires:
> > > >>
> > > >>   - null root file system: absolute paths don't work.
> > > >>
> > > >>   - null current working directory: relative paths with traditional,
> > > >>     non-`*at` system calls (and `*at` ones using `AT_FDCWD`) don't work.
> > > >>
> > > >> - All operations relating to the "ambient" mount tree don't work.
> > > >>
> > > >> - `*at` operations with a file descriptor do work.  
> > > >
> > > > Huh?  The last bit looks like it contradicts the previous one - if you have
> > > > an opened directory in a mount from some namespace, those `*at` operations
> > > > with that descriptor *will* be seeing the mount tree of that namespace,
> > > > whatever the hell is "ambient" supposed to mean.  Either that, or you
> > > > will be exposing whatever's overmounted in that mount, which is a huge
> > > > can of worms.  
> > >
> > > It seems to me that this is really no different *in practice* to having an
> > > empty mount namespace, no? You might still be able to stat("/") and get a
> > > d--------- result, but how does that actually affect anything?  
> >
> > The argument against just having an empty, immutable root directory and
> > calling it a day is the tie-in with a new process-spawning API discussed
> > near the bottom of my original email. I want to have nice secure
> > defaults, rather than forcing the programmer to remember to unshare, but
> > I also don't want to degrade performance by speculatively creating new
> > empty mount namespaces that might just be thrown away. Null fields alone
> > get us both --- security and good performance.  
> 
> This seems like a false dichotomy.  There's such a thing as a singleton.
> 
> In fact, we have this spiffy nullfs_fs_get_tree.  It seems relatively
> straightforward to have an API to get an fd to the singleton nullfs,
> and the default for a newly spawned process could even be to have cwd
> pointing at nullfs.
> 
> root is still harder, because of the shadowing issue.  I think I
> proposed, ages ago, relaxing the chroot rules so that, at least under
> certain circumstances (e.g. the task is not already chrooted) an
> unprivileged task could chroot.  chrooting to nullfs seems like a
> somewhat useful operation.
> 
> I can imagine more complex schemes to allow even a chrooted process to
> safely start acting as though their root is nullfs, but that would be
> potentially fairly nasty.  *Maybe* everything would work if there was
> a root-for-dotdot and a separate root-for-absolute-paths, and
> nameidata->root could point to the former, but I'm certainly not
> willing to say that I think this would work with any confidence at
> all.

You'd also need to sort out the 'pwd' mess.
The kernel inode always has its real parent, inside a chroot the scan stops
when the inode is the same as that of the base of the chroot.
But faff about with namespaces (IIRC I was doing an unshare to get out of
a network namespace) and that comparison can fail (if the chroot base isn't
a mount point) - so "../.." can go all the way back to the real root rather
than stopping at the base of the chroot (as you would expect).

	David


> 
> --Andy
> 



* Re: [RFC] Null Namespaces
  2026-06-26  8:27         ` David Laight
@ 2026-06-26 17:23           ` John Ericson
  0 siblings, 0 replies; 16+ messages in thread
From: John Ericson @ 2026-06-26 17:23 UTC (permalink / raw)
  To: David Laight, Andy Lutomirski
  Cc: H. Peter Anvin, Al Viro, Li Chen, Cong Wang, Christian Brauner,
	linux-arch, LKML, linux-fsdevel, linux-api, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Jan Kara, Jonathan Corbet, Shuah Khan, Kees Cook,
	Sergei Zimmerman, Farid Zakaria

I am replying to both Andy and David in a single email --- hope that is
not confusing.

On Thu, Jun 25, 2026, at 7:09 PM, Andy Lutomirski wrote:
> On Thu, Jun 25, 2026 at 2:53 PM John Ericson <mail@johnericson.me> wrote:
> >
> > The argument against just having an empty, immutable root directory and
> > calling it a day is the tie-in with a new process-spawning API discussed
> > near the bottom of my original email. I want to have nice secure
> > defaults, rather than forcing the programmer to remember to unshare, but
> > I also don't want to degrade performance by speculatively creating new
> > empty mount namespaces that might just be thrown away. Null fields alone
> > get us both --- security and good performance.
>
> This seems like a false dichotomy.  There's such a thing as a singleton.
>
> In fact, we have this spiffy nullfs_fs_get_tree.  It seems relatively
> straightforward to have an API to get an fd to the singleton nullfs,
> and the default for a newly spawned process could even be to have cwd
> pointing at nullfs.

Ah! This is the first I am learning about the new nullfs. OK yes I agree
this gives us both properties, since it is truly immutably empty.

I still have a slight preference for something that also makes
statting/opening/etc. of `/` itself fail, but this is otherwise good ---
there's no denying it.

> root is still harder, because of the shadowing issue.  I think I
> proposed, ages ago, relaxing the chroot rules so that, at least under
> certain circumstances (e.g. the task is not already chrooted) an
> unprivileged task could chroot.  chrooting to nullfs seems like a
> somewhat useful operation.
>
> I can imagine more complex schemes to allow even a chrooted process to
> safely start acting as though their root is nullfs, but that would be
> potentially fairly nasty.  *Maybe* everything would work if there was
> a root-for-dotdot and a separate root-for-absolute-paths, and
> nameidata->root could point to the former, but I'm certainly not
> willing to say that I think this would work with any confidence at
> all.

I really like these ideas!

- Splitting the two uses of root sounds great. Even more generally (at
  least as a thought experiment, I don't like the O(n) performance), one
  can imagine a set of paths one must not `cd ..` past. Conceptually, I
  feel optimistic that inserting another boundary path into the set on
  every `chroot` makes it safe.

- In the original "real root", the "root for .." field could be null,
  since no `..` check is actually needed. Then, if we only want to have
  a single "root for .." (to avoid the O(n)), only the initial
  assignment of it from null to non-null would be unprivileged --- this
  would implement your "task is not already chrooted" idea. Subsequent
  assignment would still be privileged since we are replacing, not
  extending our "set". (The nullable single path means we have 0 or 1
  paths in our set.)
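
A toy model of the single nullable "root for .." variant, purely as a
thought experiment. Everything here is hypothetical: `dir_id` stands in
for a kernel `struct path`, and `task_roots`, `may_ascend`, and
`set_dotdot_root` are names invented for this sketch, not existing
kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a kernel `struct path`; here just an opaque id. */
typedef unsigned long dir_id;

struct task_roots {
    bool   has_dotdot_root; /* false models the "null" field */
    dir_id dotdot_root;     /* boundary that ".." must not cross */
};

/* May ".." be followed upward from `current`? With a null boundary
 * there is nothing to check. */
bool may_ascend(const struct task_roots *t, dir_id current)
{
    assert(t != NULL);
    return !(t->has_dotdot_root && t->dotdot_root == current);
}

/* Set the boundary. The first assignment (null to non-null) only adds
 * a restriction, so it is allowed without privilege. Replacing an
 * existing boundary could lift a restriction, so it needs privilege.
 * Returns 0 on success, -1 if refused. */
int set_dotdot_root(struct task_roots *t, dir_id dir, bool privileged)
{
    assert(t != NULL);
    if (t->has_dotdot_root && !privileged)
        return -1;
    t->has_dotdot_root = true;
    t->dotdot_root = dir;
    return 0;
}
```

The O(n) generalization would replace the single field with a set and
make every insertion unprivileged, since inserting never removes a
boundary.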

----

On Fri, Jun 26, 2026, at 4:27 AM, David Laight wrote:
>
> You'd also need to sort out the 'pwd' mess.
> The kernel inode always has its real parent, inside a chroot the scan stops
> when the inode is the same as that of the base of the chroot.
> But faff about with namespaces (IIRC I was doing an unshare to get out of
> a network namespace) and that comparison can fail (if the chroot base isn't
> a mount point) - so "../.." can go all the way back to the real root rather
> than stopping at the base of the chroot (as you would expect).
>
> David

I did get the impression that the `..` check is...rather fragile. I am
also thinking that a global setting like `openat2`'s `RESOLVE_BENEATH`
to make `..` never work would be useful; then all manner of chrooting is
trivially safe, because you cannot go up regardless!

----

Given the state of the discussion, I'll go submit my null cwd and root
patch momentarily. The nullfs alternative is quite compelling; to the
extent that I do prefer making the root operations fail as I said above,
I think my best shot is demonstrating that this patch is so small and
lightweight that this slight benefit is paid for by the simplicity of
the implementation.

John

