[RFC] pivot_root(2) races

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC] pivot_root(2) races
@ 2026-02-09  0:34 Al Viro
  2026-02-09  5:49 ` Linus Torvalds
  2026-02-12 13:23 ` Askar Safin
  0 siblings, 2 replies; 42+ messages in thread
From: Al Viro @ 2026-02-09  0:34 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Christian Brauner, Jan Kara, H. Peter Anvin,
	Werner Almesberger

	pivot_root(2) semantics is unique in one respect: this is the
only syscall that changes current directory and root of other processes.

	All other syscalls affect that only for the caller and the
threads that happen to share its fs_struct (i.e. if CLONE_FS had been
used all the way back to common ancestor).

	pivot_root(), OTOH, goes over all threads in the system and every
thread that had the same root as the caller gets switched to new root;
ditto for current directories.

	AFAICS, the original rationale had been about the kernel threads
that would otherwise keep the old root busy.  These days that could've
been dealt with much easier, but that behaviour is cast in stone -
it's been 25 years since Werner had brought that thing into the tree
(2.3.41-pre4), and the time for objections is long gone.

	Unfortunately, the way it's been done (all the way since the
original posting) is racy.  If pivot_root() is called while another
thread is in the middle of fork(), it will not see the fs_struct of
the child to be.  The race window is from the call of copy_fs() to the
point where copy_process() finally decides that everything is ready and
grabs tasklist_lock.  There's enough blocking allocations in that range
to make it an issue even on UP, without any kind of preemption, etc. -
that race had been there since the very beginning.

	If pivot_root() comes before that window, child will get the root
and current directory already switched; if it comes after the window, the
child will have its root and current directory switched by pivot_root()
itself.  If it's in the window, though, the child ends up born chrooted
into wherever the original root has been moved to by pivot_root().

	Similar races exist for other syscalls that create new fs_struct
instances (unshare(2) and setns(2)), with the same underlying mechanism -
embryonic fs_struct instances are missed by the function that does that
switchover (chroot_fs_refs()).	Making those instances visible to it is
possible; I've tried to play with that idea, but that leads to really
disgusting code and adds a cost to each fork(), all for the sake of a
rarely used syscall.  IMO it's a non-starter.

	If it was just the fork() alone, we could deal with that simply
by delaying the copying of ->fs->{root,pwd} until copy_process() has
grabbed tasklist_lock.  Unfortunately, the things are more complicated.

	First of all, there's CLONE_NEWNS to deal with.  It does flip
the root and current directory of the caller from locations in original
namespace to the corresponding locations in the copy.  These days it's
done by passing child's fs_struct all the way to copy_mnt_ns(), where we
use that fs_struct to pick the original locations from *and* to store
the new locations into.  That, of course, opens the same pivot_root()
race; child's fs_struct is invisible to it, and if it has happened
between copy_fs() and copy_mnt_ns() we'll get the new namespace with
mount tree reflecting the changes from pivot_root() and child chrooted
into the subtree where the original root had been moved to.  pivot_root()
done to the old namespace after that point is not an issue (we get the
same result as if clone(CLONE_NEWNS) had won the race) and pivot_root()
to the new namespace is not possible until the child becomes visible
to chroot_fs_refs().

	Note that fs_struct passed to copy_mnt_ns() serves two purposes -
we need the original locations to calculate the new ones and we need
some way to report those locations to the rest of the system.  The former
role should be served by current->fs; for the latter I would prefer to
give it a pointer to pair of struct path, so that setting the child's
fs_struct would be done by the caller.  We could keep using child's
fs_struct for that (it will always be an embryonic instance - CLONE_NEWNS
is mutually exclusive with CLONE_FS), but that makes for considerably
messier cleanup logics in copy_process().

	Doing that closes the race for all clone(2) variants - we add
a two-element array of struct path in copy_process(), initialize it with
all NULLs and pass it to copy_namespaces() instead of child's ->fs.
Then, once we have grabbed the tasklist_lock, we do the following:
	struct fs_struct *fs = current->fs;
	struct fs_struct *new_fs = p->fs;
	read_seqlock_excl(&fs->seq);
	if (fs != new_fs) {
		new_fs->root = likely(!path[0].mnt) ? fs->root : path[0];
		new_fs->pwd = likely(!path[1].mnt) ? fs->root : path[1];
		path_get(&fs->root);
		path_get(&fs->pwd);
	} else {
		fs->users++;
	}
	read_sequnlock_excl(&fs->seq);
for switchover.  copy_fs() would allocate the new fs_struct (in !CLONE_FS case,
that is), but do no copying of pwd/root into it.  Failure exits would not use
exit_fs() - that has a side benefit of using exit_fs() only for current; new
fs_struct would be freed directly and references in path[] would be dropped
by path_put(), success of failure.

	unshare(2) also has a similar race, with similar solution:
* copy pwd/root from original to replacement fs_struct in the same scope
where we change current->fs
* pass a pair of struct path to unshare_nsproxy_namespaces() instead of
giving it fs_struct
* use exact same logics as in copy_process() for filling ->root and ->pwd
of new fs_struct (that gives a side benefit in unshare_nsproxy_namespaces() -
instead of the
kernel/nsproxy.c:226:   *new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
kernel/nsproxy.c:227:                                    new_fs ? new_fs : current->fs);
we simply pass struct path array as-is)

	Other callers of create_new_namespaces() can simply pass NULL instead of
bothering with struct path pairs - they never get CLONE_NEWNS in flags, so...

	At that point we are left with two callers of copy_fs_struct().  One
is unshare_fs_struct(), with only one user (knfsd thread setup).  It does have
the same race with pivot_root(), not that it had been likely, but that race
is trivially closed in the same way as with fork() et.al. - just delay copying
until we are about to switch current->fs.

	The remaining caller is something entirely different - it's prepare_nsset().
There the embryonic fs_struct is not going end up as any task's ->fs; it's used
only to get the root of namespace we are joining - mntns_install() sets it, and
later it goes into current->fs->{root,pwd} once we are sure that joining other
namespaces won't fail (if it's only CLONE_NEWNS, nothing gets allocated and the
damn thing goes straight into current->fs->{root,pwd}).

	It's also racy; unlike copy_mnt_ns(), there's no exclusion whatsoever
wrt pivot_root(2) (cloning namespace obviously needs it to stabilize the mount
tree it's copying).  Even the case of pure CLONE_NEWNS is racy - it switches
fs->{root,pwd} to whatever overmounts the namespace root and each of those is
done under fs->seq, but that's two scopes and neither covers finding that overmount,
so chroot_fs_refs() might come and mess the things up.  If it's more than just
CLONE_NEWNS and we end up with the damn thing stored in the embryonic fs_struct,
the race window includes joining the other namespaces as well.

	Another problem with "CLONE_NEWS + something else" case is that we
end up screwing the logics in mntns_install() that tried to reject the case
of shared current->fs; embryo isn't shared, so the test in there passes just
fine and it's not repeated in the caller, so it's possible to have one of
the fs_struct-sharing threads join a namespace, switching the root and current
directory for the rest without having their namespace switched.

	TBH, looking at that one I'd say that we move finding the namespace
root to separate helper and have _that_ switch root and current directory of
the caller, with commit_nsset() using it.  The interesting part is whether
we want to deal with the possibility of errors at that point...  It *can*,
in theory, happen, but only if the namespace root is overmounted by a mount
trap and stepping onto it ends up failing.  That's an insane setup, of course,
but...

	Comments?

PS: Werner and hpa Cc'd as the folks involved in introducing pivot_root(2) in
the first place.  See https://lkml.org/lkml/2000/1/25/111; thanks to Sasha Levin
for finding that thread, BTW - lore doesn't seem to have it and google not just
fails to find it, their "AI" had fed me an impressive gaslighting session,
complete with inventing inexistent l-k postings from me, claiming that syscall
in question had been introduced by one Al Viro ;-/  Telling it that no such
postings exist on any of the suggested sites got the expected waffling, request
for URL of specific posting has ended up with "https://lore.kernel.org", once
we got through the difference between the site name and clickable link...
Pity whoever tries to use that shite for "research".

For the record - pivot_root(2) is from Werner Almesberger, with suggestions from
Linus and Peter; I hadn't been involved at all.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-09  0:34 [RFC] pivot_root(2) races Al Viro
@ 2026-02-09  5:49 ` Linus Torvalds
  2026-02-09  5:53   ` H. Peter Anvin
  2026-02-09  6:34   ` Al Viro
  2026-02-12 13:23 ` Askar Safin
  1 sibling, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2026-02-09  5:49 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Christian Brauner, Jan Kara, H. Peter Anvin,
	Werner Almesberger

On Sun, 8 Feb 2026 at 16:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         AFAICS, the original rationale had been about the kernel threads
> that would otherwise keep the old root busy.

I don't think it was even about just kernel threads, it was about the
fact that pivot_root was done early, but after other user space things
could have been started.

Of course, now it's used much more widely than the original "handle
initial root switching in user space"

>         Unfortunately, the way it's been done (all the way since the
> original posting) is racy.  If pivot_root() is called while another
> thread is in the middle of fork(), it will not see the fs_struct of
> the child to be.

I think that what is much more serious than races is the *non*racy behavior.

Maybe I'm missing something, but it looks like anybody can just switch
things around for _other_ namespaces if they have CAP_SYS_ADMIN in
_their_ namespace. It's just using may_mount()", which i sabout the
permission to modify the locall namespace.

I probably am missing something, and just took a very quick look, and
am missing some check for "only change processes we have permission to
change".

         Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-09  5:49 ` Linus Torvalds
@ 2026-02-09  5:53   ` H. Peter Anvin
  2026-02-09  6:34   ` Al Viro
  1 sibling, 0 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-09  5:53 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: linux-fsdevel, Christian Brauner, Jan Kara, Werner Almesberger

On February 8, 2026 9:49:40 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sun, 8 Feb 2026 at 16:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>         AFAICS, the original rationale had been about the kernel threads
>> that would otherwise keep the old root busy.
>
>I don't think it was even about just kernel threads, it was about the
>fact that pivot_root was done early, but after other user space things
>could have been started.
>
>Of course, now it's used much more widely than the original "handle
>initial root switching in user space"
>
>>         Unfortunately, the way it's been done (all the way since the
>> original posting) is racy.  If pivot_root() is called while another
>> thread is in the middle of fork(), it will not see the fs_struct of
>> the child to be.
>
>I think that what is much more serious than races is the *non*racy behavior.
>
>Maybe I'm missing something, but it looks like anybody can just switch
>things around for _other_ namespaces if they have CAP_SYS_ADMIN in
>_their_ namespace. It's just using may_mount()", which i sabout the
>permission to modify the locall namespace.
>
>I probably am missing something, and just took a very quick look, and
>am missing some check for "only change processes we have permission to
>change".
>
>         Linus

Kernel threads were absolutely the motivation early on. pivot_root() itself was expected to be run by the ramdisk pre-init which would then chroot() itself.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-09  5:49 ` Linus Torvalds
  2026-02-09  5:53   ` H. Peter Anvin
@ 2026-02-09  6:34   ` Al Viro
  2026-02-09  6:44     ` Linus Torvalds
  1 sibling, 1 reply; 42+ messages in thread
From: Al Viro @ 2026-02-09  6:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Christian Brauner, Jan Kara, H. Peter Anvin,
	Werner Almesberger

On Sun, Feb 08, 2026 at 09:49:40PM -0800, Linus Torvalds wrote:
> On Sun, 8 Feb 2026 at 16:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         AFAICS, the original rationale had been about the kernel threads
> > that would otherwise keep the old root busy.
> 
> I don't think it was even about just kernel threads, it was about the
> fact that pivot_root was done early, but after other user space things
> could have been started.
> 
> Of course, now it's used much more widely than the original "handle
> initial root switching in user space"
> 
> >         Unfortunately, the way it's been done (all the way since the
> > original posting) is racy.  If pivot_root() is called while another
> > thread is in the middle of fork(), it will not see the fs_struct of
> > the child to be.
> 
> I think that what is much more serious than races is the *non*racy behavior.
> 
> Maybe I'm missing something, but it looks like anybody can just switch
> things around for _other_ namespaces if they have CAP_SYS_ADMIN in
> _their_ namespace. It's just using may_mount()", which i sabout the
> permission to modify the locall namespace.
> 
> I probably am missing something, and just took a very quick look, and
> am missing some check for "only change processes we have permission to
> change".

Not really - look at those check_mnt() in pivot_root(2).
static inline int check_mnt(const struct mount *mnt)
{
        return mnt->mnt_ns == current->nsproxy->mnt_ns;
}

SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
                const char __user *, put_old)
{
	...
        if (!check_mnt(root_mnt) || !check_mnt(new_mnt))
		return -EINVAL;

IOW, try to do that to another namespace and you'll get -EINVAL,
no matter what permissions you might have in your namespace
(or globally, for that matter).

may_mount() check is "if you don't have CAP_SYS_ADMIN in your namespace,
you are not getting anywhere at all"; check_mount() ones - "... and
it would better be your namespace you are about to change"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-09  6:34   ` Al Viro
@ 2026-02-09  6:44     ` Linus Torvalds
  2026-02-09 11:53       ` Christian Brauner
                         ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Linus Torvalds @ 2026-02-09  6:44 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Christian Brauner, Jan Kara, H. Peter Anvin,
	Werner Almesberger

On Sun, 8 Feb 2026 at 22:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Not really - look at those check_mnt() in pivot_root(2).
> static inline int check_mnt(const struct mount *mnt)
> {
>         return mnt->mnt_ns == current->nsproxy->mnt_ns;
> }
>
> SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
>                 const char __user *, put_old)
> {
>         ...
>         if (!check_mnt(root_mnt) || !check_mnt(new_mnt))
>                 return -EINVAL;
>
> IOW, try to do that to another namespace and you'll get -EINVAL,
> no matter what permissions you might have in your namespace
> (or globally, for that matter).

It's more that you can affect *processes* in another namespace if I
read things right. Not other processes' namespaces, but basically
processes that you have no business trying to change...

Yes, both the old and new root need to be in your own namespace, but
imagine that you are a process in some random container, and let's say
that root (the *real* root in the init namespace) is looking at your
container state.

IOW, imagine that I'm system root, and I've naively done a "cd
/proc/<pid>/cwd" to look at the state of some sucker, and now...

Am I mis-reading things entirely, or can a random process in that
container (that has mount permissions in that thing) basically do
pivot_root(), and in the process change the CWD of that root process
that just happens to be looking at that container state?

I'm just naively looking at that for_each_process_thread() loop that does that

                hits += replace_path(&fs->pwd, old_root, new_root);

but the keyword here is "naively".

Is there some other check that I'm missing?

             Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-09  6:44     ` Linus Torvalds
@ 2026-02-09 11:53       ` Christian Brauner
  2026-02-12 17:17       ` Askar Safin
  2026-02-12 19:22       ` Al Viro
  2 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2026-02-09 11:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, Christian Brauner, Jan Kara,
	H. Peter Anvin, Werner Almesberger

On Sun, Feb 08, 2026 at 10:44:31PM -0800, Linus Torvalds wrote:
> On Sun, 8 Feb 2026 at 22:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Not really - look at those check_mnt() in pivot_root(2).
> > static inline int check_mnt(const struct mount *mnt)
> > {
> >         return mnt->mnt_ns == current->nsproxy->mnt_ns;
> > }
> >
> > SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
> >                 const char __user *, put_old)
> > {
> >         ...
> >         if (!check_mnt(root_mnt) || !check_mnt(new_mnt))
> >                 return -EINVAL;
> >
> > IOW, try to do that to another namespace and you'll get -EINVAL,
> > no matter what permissions you might have in your namespace
> > (or globally, for that matter).
> 
> It's more that you can affect *processes* in another namespace if I
> read things right. Not other processes' namespaces, but basically
> processes that you have no business trying to change...
> 
> Yes, both the old and new root need to be in your own namespace, but
> imagine that you are a process in some random container, and let's say
> that root (the *real* root in the init namespace) is looking at your
> container state.
> 
> IOW, imagine that I'm system root, and I've naively done a "cd
> /proc/<pid>/cwd" to look at the state of some sucker, and now...
> 
> Am I mis-reading things entirely, or can a random process in that
> container (that has mount permissions in that thing) basically do
> pivot_root(), and in the process change the CWD of that root process
> that just happens to be looking at that container state?
> 
> I'm just naively looking at that for_each_process_thread() loop that does that
> 
>                 hits += replace_path(&fs->pwd, old_root, new_root);
> 
> but the keyword here is "naively".
> 
> Is there some other check that I'm missing?

Funny, I was looking at this just a very short while ago and have
written a lengthy mail about this to Jeff here on-list. And yes, I think
your reading is right. Let me add the braindump I did for Jeff (very
container centric):

  The main problem with pivot_root() is not just that it moves the
  old rootfs to any other location on the new rootfs it also takes the
  tasklist read lock and walks all processes on the system trying to find
  any process that uses the old rootfs as its fs root or its pwd and then
  rechroots and repwds all of these processes into the new rootfs.

  But for 90% of the use-cases (containers) that is not needed. When the
  container's mount namespace and rootfs are setup the task creating that
  container is the only task that is using the old rootfs and that task
  could very well just rechroot itself after it unmounted the old rootfs.

  So in essence pivot_root() holds tasklist lock and walks all tasks on
  the systems for no reason. If the user has a beefy and busy machine with
  lots of processes coming and going each pivot_root() penalizes the whole
  system.

I have a patchset sitting in my tree for the 7.1 cycle that I want to
merge that is a better alternative to pivot_root() in most cases -
definitely the container cases.

It just enables MOVE_MOUNT_BENEATH which I added a few years ago to work
with the rootfs. This means one can stuff a new rootfs under the current
rootfs, unmount the old rootfs and chroot and chdir into the new rootfs
and then be done. This works for all container use-cases where you don't
need to care about any global kernel threads picking up the updated
root. You don't need chroot_fs_refs() for that at all.

With MOVE_MOUNT_BENEATH being able to mount beneath the rootfs you get
the following benefits:

* completely avoids the tasklist locking and rechroot/repwd logic

* the pivot_root(".", ".") trick where the old rootfs is mounted on top
  of the new rootfs is for free

* MOVE_MOUNT_BENEATH works with detached mounts which means one can
  assemble a (container) rootfs and then swap the old rootfs with the
  new rootfs

Fwiw, there's also work I sent for this cycle that allows to completely
sidestep pivot_root() already for containers. I'm pretty sure that
sooner or later with that work unshare(CLONE_NEWNS) will be gone from
most container times completely because what we did scales a lot better.

I think even for the case where init pivot's root from the initramfs
the pivot_root() system call isn't really needed anymore because iirc
CLONE_FS guarantees that any change to its root and pwd is visible in
kernel threads. So as long as init isn't stupid and does
unshare(CLONE_FS) and watches what other tasks are created just doing
MOVE_MOUNT_BENEATH might be enough. But idk. I don't think it matters
calling pivot_root() during boot so for that case I don't care. But for
containers I for sure care as the current situation is just a
performance issue for no good reason.

Fwiw, for the 7.1 cycle I also have a patchset for
unshare(UNSHARE_EMPTY_MNTNS) and also for clone3() which creates a
completely empty mount namespaces with a catatonic/immutable mount as
its root getting rid of a bunch of the issues mentioned earlier as well.
It's simply not needed for most modern container setups to copy the
whole mount namespace. It's just wasteful as the rootfs setup can now
happen way before the container is ever created via the new mount api.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-09  0:34 [RFC] pivot_root(2) races Al Viro
  2026-02-09  5:49 ` Linus Torvalds
@ 2026-02-12 13:23 ` Askar Safin
  2026-02-12 19:25   ` Al Viro
  1 sibling, 1 reply; 42+ messages in thread
From: Askar Safin @ 2026-02-12 13:23 UTC (permalink / raw)
  To: viro; +Cc: christian, hpa, jack, linux-fsdevel, torvalds, werner

Al Viro <viro@zeniv.linux.org.uk>:
> The interesting part is whether
> we want to deal with the possibility of errors at that point...

You mean that finding topmost overmount of root of namespace may fail?

Current code in mntns_install indeed can fail, at least theoretically:

https://elixir.bootlin.com/linux/v6.19-rc5/source/fs/namespace.c#L6243

But I argue that this line is unnessesary and should be simply replaced
with call to topmost_overmount (which, as well as I understand, cannot
fail).

As an additional benefit, this will allow us to remove LOOKUP_DOWN
(this line is the only user of it).

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-09  6:44     ` Linus Torvalds
  2026-02-09 11:53       ` Christian Brauner
@ 2026-02-12 17:17       ` Askar Safin
  2026-02-12 19:11         ` Linus Torvalds
  2026-02-12 19:22       ` Al Viro
  2 siblings, 1 reply; 42+ messages in thread
From: Askar Safin @ 2026-02-12 17:17 UTC (permalink / raw)
  To: torvalds; +Cc: christian, hpa, jack, linux-fsdevel, viro, werner, Aleksa Sarai

Linus Torvalds <torvalds@linux-foundation.org>:
> IOW, imagine that I'm system root, and I've naively done a "cd
> /proc/<pid>/cwd" to look at the state of some sucker, and now...
> 
> Am I mis-reading things entirely, or can a random process in that
> container (that has mount permissions in that thing) basically do
> pivot_root(), and in the process change the CWD of that root process
> that just happens to be looking at that container state?

Yes, exactly. I just tested it. In the end of this letter you will find
the code.

I tested on 6.12.48, but I'm nearly sure this applies to later versions, too.

In my opinion this is a bug. We should make pivot_root change cwd and root
for processes in the same mount and user namespace only, not all processes
on the system. (And possibly also require "can ptrace" etc.)

This bug is yet another way for a container to mess with container runtime
(so I CC Sarai).

Here is how my code works. I assume we start as UID 0 (in any user namespace).
Then I create child and in that child I change UID to 1, then in child I do
unshare (CLONE_NEWUSER | CLONE_NEWNS).

Then in parent (which still runs as root) I do cd to /proc/$CHILD/cwd.

Then in child I do pivot_root.

And then parent sees that its cwd changed.

The most important part is here:

    ASSERT (access ("tmp/pivot_root", F_OK) == 0);
    ASSERT (sleep (2) == 0); // Wait for "pivot_root" in child
    ASSERT (access ("tmp/pivot_root", F_OK) != 0);

We wait for pivot_root to happen, and then (using "access") we see that
our cwd in fact changed.

The program should print "OK" and nothing else. If it printed "OK", this
will mean that we indeed can mess with outer processes.

So we see that pivot_root can change cwd/root across users, user namespaces
and mount namespaces.

-- 
Askar Safin

// Run this program as root (uid 0) in any user namespace
// CLONE_NEWUSER should be enabled on the system (it is disabled in some distros)

// You have two options to run this program. First as real root:
// $ sudo ./prog
// Second: as root inside a user namespace:
// $ unshare --map-users auto --map-groups auto -r ./prog
// (make sure you have "newuidmap" installed)

// Written by me without LLMs...
// Public domain

// The program should print "OK" and nothing else. This will mean that processes
// inside container indeed can change cwd of outer processes using pivot_root

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>

#define ASSERT(cond)       do { if (!(cond)) { fprintf (stderr, "%d: %s: failed\n", __LINE__, #cond); exit (EXIT_FAILURE); } } while (0)
#define ASSERT_ERRNO(cond) do { if (!(cond)) { fprintf (stderr, "%d: %s: %m\n",     __LINE__, #cond); exit (EXIT_FAILURE); } } while (0)

int
main (void)
{
    ASSERT_ERRNO (chdir ("/") == 0);

    // From previous runs
    if (rmdir ("/tmp/pivot_root/new-root") == -1 && errno != ENOENT)
        {
            fprintf (stderr, "Cannot remove tmp dirs from previous run");
            exit (EXIT_FAILURE);
        }
    if (rmdir ("/tmp/pivot_root") == -1 && errno != ENOENT)
        {
            fprintf (stderr, "Cannot remove tmp dirs from previous run");
            exit (EXIT_FAILURE);
        }

    ASSERT (getuid () == 0);
    ASSERT (getgid () == 0);

    pid_t child = fork ();

    ASSERT_ERRNO (child != -1);

    if (child == 0)
        {
            ASSERT_ERRNO (setgid (1) == 0);
            ASSERT_ERRNO (setuid (1) == 0);

            ASSERT_ERRNO (mkdir ("/tmp/pivot_root", 0700) == 0);

            ASSERT_ERRNO (unshare (CLONE_NEWUSER | CLONE_NEWNS) == 0);

            ASSERT (sleep (2) == 0); // Wait for parent to setup uid_map, etc

            ASSERT_ERRNO (mkdir ("/tmp/pivot_root/new-root", 0700) == 0);
            ASSERT_ERRNO (mount ("tmpfs", "/tmp/pivot_root/new-root", "tmpfs", 0, NULL) == 0);
            ASSERT_ERRNO (mkdir ("/tmp/pivot_root/new-root/put-old", 0700) == 0);
            ASSERT_ERRNO (syscall (SYS_pivot_root, "/tmp/pivot_root/new-root", "/tmp/pivot_root/new-root/put-old") == 0);
            exit (0);
        }

    ASSERT (sleep (1) == 0); // Wait for child to do unshare

    // See "man 7 user_namespaces" about these /proc/self/uid_map, etc
    {
        char ss[1000];
        ASSERT (snprintf (ss, sizeof ss, "/proc/%lld/uid_map", (long long)child) > 0);
        int fd = open (ss, O_WRONLY);
        ASSERT_ERRNO (fd != -1);
        char s[] = "0 1 1";
        ASSERT (write (fd, s, strlen (s)) == (ssize_t)strlen (s));
        ASSERT_ERRNO (close (fd) == 0);
    }
    {
        char ss[1000];
        ASSERT (snprintf (ss, sizeof ss, "/proc/%lld/setgroups", (long long)child) > 0);
        int fd = open (ss, O_WRONLY);
        ASSERT_ERRNO (fd != -1);
        ASSERT (write (fd, "deny", strlen ("deny")) == (ssize_t)strlen ("deny"));
        ASSERT_ERRNO (close (fd) == 0);
    }
    {
        char ss[1000];
        ASSERT (snprintf (ss, sizeof ss, "/proc/%lld/gid_map", (long long)child) > 0);
        int fd = open (ss, O_WRONLY);
        ASSERT_ERRNO (fd != -1);
        char s[] = "0 1 1";
        ASSERT (write (fd, s, strlen (s)) == (ssize_t)strlen (s));
        ASSERT_ERRNO (close (fd) == 0);
    }
    {
        char s[1000];
        ASSERT (snprintf (s, sizeof s, "/proc/%lld/cwd", (long long)child) > 0);
        ASSERT_ERRNO (chdir (s) == 0);
    }

    // These 3 lines are the most important part
    ASSERT (access ("tmp/pivot_root", F_OK) == 0);
    ASSERT (sleep (2) == 0); // Wait for "pivot_root" in child
    ASSERT (access ("tmp/pivot_root", F_OK) != 0);

    {
        int status;
        ASSERT_ERRNO (waitpid (child, &status, 0) != -1);
        ASSERT (WIFEXITED (status));
        ASSERT (WEXITSTATUS (status) == 0);
    }

    printf ("OK\n");

    return 0;
}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-12 17:17       ` Askar Safin
@ 2026-02-12 19:11         ` Linus Torvalds
  2026-02-12 19:31           ` H. Peter Anvin
  2026-02-13 13:46           ` Aleksa Sarai
  0 siblings, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2026-02-12 19:11 UTC (permalink / raw)
  To: Askar Safin
  Cc: christian, hpa, jack, linux-fsdevel, viro, werner, Aleksa Sarai

On Thu, 12 Feb 2026 at 09:17, Askar Safin <safinaskar@gmail.com> wrote:
>
> In my opinion this is a bug. We should make pivot_root change cwd and root
> for processes in the same mount and user namespace only, not all processes
> on the system. (And possibly also require "can ptrace" etc.)

Yeah, I think adding a few more tests to that

                fs = p->fs;
                if (fs) {

check in chroot_fs_refs() is called for.

Maybe just make it a helper function that returns 'struct fs_struct'
if replacing things is appropriate.  But yes, I think "can ptrace" is
the thing to check.

Of course, somebody who actually sets up containers and knows how
those things use pivot_root() today should check the rules.

               Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-09  6:44     ` Linus Torvalds
  2026-02-09 11:53       ` Christian Brauner
  2026-02-12 17:17       ` Askar Safin
@ 2026-02-12 19:22       ` Al Viro
  2026-02-13 17:34         ` Askar Safin
  2 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2026-02-12 19:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Christian Brauner, Jan Kara, H. Peter Anvin,
	Werner Almesberger

On Sun, Feb 08, 2026 at 10:44:31PM -0800, Linus Torvalds wrote:

> 
> Am I mis-reading things entirely, or can a random process in that
> container (that has mount permissions in that thing) basically do
> pivot_root(), and in the process change the CWD of that root process
> that just happens to be looking at that container state?

They can.  But then they can do other fun things to the environment
there, so naive root process walking in might be in for really
unpleasant things.  Creating use of mount --move, for example.  Or
umount -l, or...

We could restrict the set of those who could be flipped, but I doubt
that "could ptrace" is workable - that would exclude all kernel threads,
and that could easily break existing setups in hard-to-recover ways.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-12 13:23 ` Askar Safin
@ 2026-02-12 19:25   ` Al Viro
  0 siblings, 0 replies; 42+ messages in thread
From: Al Viro @ 2026-02-12 19:25 UTC (permalink / raw)
  To: Askar Safin; +Cc: christian, hpa, jack, linux-fsdevel, torvalds, werner

On Thu, Feb 12, 2026 at 04:23:45PM +0300, Askar Safin wrote:
> Al Viro <viro@zeniv.linux.org.uk>:
> > The interesting part is whether
> > we want to deal with the possibility of errors at that point...
> 
> You mean that finding topmost overmount of root of namespace may fail?
> 
> Current code in mntns_install indeed can fail, at least theoretically:
> 
> https://elixir.bootlin.com/linux/v6.19-rc5/source/fs/namespace.c#L6243
> 
> But I argue that this line is unnessesary and should be simply replaced
> with call to topmost_overmount (which, as well as I understand, cannot
> fail).

The pathological case to consider would be something like e.g. NFS root with
referral right at the place we are mounting...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-12 19:11         ` Linus Torvalds
@ 2026-02-12 19:31           ` H. Peter Anvin
  2026-02-13  9:51             ` Aleksa Sarai
  2026-02-13 17:47             ` Askar Safin
  2026-02-13 13:46           ` Aleksa Sarai
  1 sibling, 2 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-12 19:31 UTC (permalink / raw)
  To: Linus Torvalds, Askar Safin
  Cc: christian, jack, linux-fsdevel, viro, werner, Aleksa Sarai

On February 12, 2026 11:11:55 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Thu, 12 Feb 2026 at 09:17, Askar Safin <safinaskar@gmail.com> wrote:
>>
>> In my opinion this is a bug. We should make pivot_root change cwd and root
>> for processes in the same mount and user namespace only, not all processes
>> on the system. (And possibly also require "can ptrace" etc.)
>
>Yeah, I think adding a few more tests to that
>
>                fs = p->fs;
>                if (fs) {
>
>check in chroot_fs_refs() is called for.
>
>Maybe just make it a helper function that returns 'struct fs_struct'
>if replacing things is appropriate.  But yes, I think "can ptrace" is
>the thing to check.
>
>Of course, somebody who actually sets up containers and knows how
>those things use pivot_root() today should check the rules.
>
>               Linus

It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)

I have gotten a feeling that pivot_root() is used today mostly due to convenience rather than need.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-12 19:31           ` H. Peter Anvin
@ 2026-02-13  9:51             ` Aleksa Sarai
  2026-02-13 17:47             ` Askar Safin
  1 sibling, 0 replies; 42+ messages in thread
From: Aleksa Sarai @ 2026-02-13  9:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Askar Safin, christian, jack, linux-fsdevel, viro,
	werner

[-- Attachment #1: Type: text/plain, Size: 1792 bytes --]

On 2026-02-12, H. Peter Anvin <hpa@zytor.com> wrote:
> On February 12, 2026 11:11:55 AM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >On Thu, 12 Feb 2026 at 09:17, Askar Safin <safinaskar@gmail.com> wrote:
> >>
> >> In my opinion this is a bug. We should make pivot_root change cwd and root
> >> for processes in the same mount and user namespace only, not all processes
> >> on the system. (And possibly also require "can ptrace" etc.)
> >
> >Yeah, I think adding a few more tests to that
> >
> >                fs = p->fs;
> >                if (fs) {
> >
> >check in chroot_fs_refs() is called for.
> >
> >Maybe just make it a helper function that returns 'struct fs_struct'
> >if replacing things is appropriate.  But yes, I think "can ptrace" is
> >the thing to check.
> >
> >Of course, somebody who actually sets up containers and knows how
> >those things use pivot_root() today should check the rules.
> >
> >               Linus
> 
> It would be interesting to see how much would break if pivot_root()
> was restricted (with kernel threads parked in nullfs safely out of the
> way.)
> 
> I have gotten a feeling that pivot_root() is used today mostly due to
> convenience rather than need.

Speaking as probably the heaviest user of pivot_root(2) on modern
systems (container runtimes), we still need to use it today to configure
mount namespaces for containers in a sane way.

But (as Christian mentioned elsewhere), the new OPEN_TREE_NAMESPACE
stuff opens the door to eliminating this requirement for container
runtimes by sidestepping the whole "clone a mntns and its mounts that
you need to blow away later" dance that we currently do with
unshare(CLONE_NEWNS) + pivot_root().

-- 
Aleksa Sarai
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-12 19:11         ` Linus Torvalds
  2026-02-12 19:31           ` H. Peter Anvin
@ 2026-02-13 13:46           ` Aleksa Sarai
  2026-02-13 15:03             ` H. Peter Anvin
  1 sibling, 1 reply; 42+ messages in thread
From: Aleksa Sarai @ 2026-02-13 13:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, christian, hpa, jack, linux-fsdevel, viro, werner

[-- Attachment #1: Type: text/plain, Size: 1564 bytes --]

On 2026-02-12, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Thu, 12 Feb 2026 at 09:17, Askar Safin <safinaskar@gmail.com> wrote:
> >
> > In my opinion this is a bug. We should make pivot_root change cwd and root
> > for processes in the same mount and user namespace only, not all processes
> > on the system. (And possibly also require "can ptrace" etc.)
> 
> Yeah, I think adding a few more tests to that
> 
>                 fs = p->fs;
>                 if (fs) {
> 
> check in chroot_fs_refs() is called for.
> 
> Maybe just make it a helper function that returns 'struct fs_struct'
> if replacing things is appropriate.  But yes, I think "can ptrace" is
> the thing to check.
> 
> Of course, somebody who actually sets up containers and knows how
> those things use pivot_root() today should check the rules.

For containers, we don't actually care about chroot_fs_refs() for other
processes because we are the only process in the mount namespace when
setting it up.

In fact, I'd honestly prefer if chroot_fs_refs() would only apply to (at
most) the processes in the same mount namespace -- the fact this can
leak to outside of the container is an anti-feature from my perspective.
(But the new OPEN_TREE_NAMESPACE stuff means we might be able to avoid
pivot_root(2) entirely soon. Happy days!)

I think the init(rd) people will care more -- my impression this was
only really needed because of the initrd switch (to change the root of
kthreads to the new root)?

-- 
Aleksa Sarai
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 13:46           ` Aleksa Sarai
@ 2026-02-13 15:03             ` H. Peter Anvin
  2026-02-13 17:47               ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-13 15:03 UTC (permalink / raw)
  To: Aleksa Sarai, Linus Torvalds
  Cc: Askar Safin, christian, jack, linux-fsdevel, viro, werner

On February 13, 2026 5:46:34 AM PST, Aleksa Sarai <cyphar@cyphar.com> wrote:
>On 2026-02-12, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Thu, 12 Feb 2026 at 09:17, Askar Safin <safinaskar@gmail.com> wrote:
>> >
>> > In my opinion this is a bug. We should make pivot_root change cwd and root
>> > for processes in the same mount and user namespace only, not all processes
>> > on the system. (And possibly also require "can ptrace" etc.)
>> 
>> Yeah, I think adding a few more tests to that
>> 
>>                 fs = p->fs;
>>                 if (fs) {
>> 
>> check in chroot_fs_refs() is called for.
>> 
>> Maybe just make it a helper function that returns 'struct fs_struct'
>> if replacing things is appropriate.  But yes, I think "can ptrace" is
>> the thing to check.
>> 
>> Of course, somebody who actually sets up containers and knows how
>> those things use pivot_root() today should check the rules.
>
>For containers, we don't actually care about chroot_fs_refs() for other
>processes because we are the only process in the mount namespace when
>setting it up.
>
>In fact, I'd honestly prefer if chroot_fs_refs() would only apply to (at
>most) the processes in the same mount namespace -- the fact this can
>leak to outside of the container is an anti-feature from my perspective.
>(But the new OPEN_TREE_NAMESPACE stuff means we might be able to avoid
>pivot_root(2) entirely soon. Happy days!)
>
>I think the init(rd) people will care more -- my impression this was
>only really needed because of the initrd switch (to change the root of
>kthreads to the new root)?
>

That was the original motivation, yes. The real question is if they are anytime out there abusing it in other ways...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-12 19:22       ` Al Viro
@ 2026-02-13 17:34         ` Askar Safin
  2026-02-13 22:28           ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Askar Safin @ 2026-02-13 17:34 UTC (permalink / raw)
  To: viro; +Cc: christian, hpa, jack, linux-fsdevel, torvalds, werner,
	Aleksa Sarai

Al Viro <viro@zeniv.linux.org.uk>:
> We could restrict the set of those who could be flipped, but I doubt
> that "could ptrace" is workable - that would exclude all kernel threads,
> and that could easily break existing setups in hard-to-recover ways.

No. Kernel threads share cwd and root with init, because we create PID 1
with CLONE_FS:

https://elixir.bootlin.com/linux/v6.19-rc5/source/init/main.c#L722

Brauner already said this earlier in this thread:
> I think even for the case where init pivot's root from the initramfs
> the pivot_root() system call isn't really needed anymore because iirc
> CLONE_FS guarantees that any change to its root and pwd is visible in
> kernel threads

Currently classic initrd is not used, and nullfs is not yet mainstream,
so all distros switch to new root during initialization of system using
"mount --move". And this works perfectly thanks to mentioned CLONE_FS.

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 15:03             ` H. Peter Anvin
@ 2026-02-13 17:47               ` Linus Torvalds
  2026-02-13 18:27                 ` Askar Safin
  2026-02-13 20:08                 ` H. Peter Anvin
  0 siblings, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2026-02-13 17:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Aleksa Sarai, Askar Safin, christian, jack, linux-fsdevel, viro,
	werner

[-- Attachment #1: Type: text/plain, Size: 650 bytes --]

On Fri, 13 Feb 2026 at 07:04, H. Peter Anvin <hpa@zytor.com> wrote:
>
> On February 13, 2026 5:46:34 AM PST, Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> >I think the init(rd) people will care more -- my impression this was
> >only really needed because of the initrd switch (to change the root of
> >kthreads to the new root)?
>
> That was the original motivation, yes. The real question is if they are anytime out there abusing it in other ways...

I guess we could just try it.

How does something like this feel to people?

I also changed the name from 'chroot' to 'pivot'. Comments?

Entirely untested in every way possible.

            Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2668 bytes --]

 fs/fs_struct.c | 53 +++++++++++++++++++++++++++++++++--------------------
 fs/internal.h  |  2 +-
 fs/namespace.c |  2 +-
 3 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 394875d06fd6..76a46fd78602 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -52,30 +52,43 @@ static inline int replace_path(struct path *p, const struct path *old, const str
 	return 1;
 }
 
-void chroot_fs_refs(const struct path *old_root, const struct path *new_root)
+static int pivot_process_refs(struct task_struct * p, const struct path *old_root, const struct path *new_root)
 {
-	struct task_struct *g, *p;
+	int count = 0;
 	struct fs_struct *fs;
+
+	task_lock(p);
+	fs = p->fs;
+	if (fs) {
+		int hits = 0;
+		write_seqlock(&fs->seq);
+		hits += replace_path(&fs->root, old_root, new_root);
+		hits += replace_path(&fs->pwd, old_root, new_root);
+		while (hits--) {
+			count++;
+			path_get(new_root);
+		}
+		write_sequnlock(&fs->seq);
+	}
+	task_unlock(p);
+	return count;
+}
+
+void pivot_fs_refs(const struct path *old_root, const struct path *new_root)
+{
 	int count = 0;
 
-	read_lock(&tasklist_lock);
-	for_each_process_thread(g, p) {
-		task_lock(p);
-		fs = p->fs;
-		if (fs) {
-			int hits = 0;
-			write_seqlock(&fs->seq);
-			hits += replace_path(&fs->root, old_root, new_root);
-			hits += replace_path(&fs->pwd, old_root, new_root);
-			while (hits--) {
-				count++;
-				path_get(new_root);
-			}
-			write_sequnlock(&fs->seq);
-		}
-		task_unlock(p);
-	}
-	read_unlock(&tasklist_lock);
+	/* Special case: 'init' changes everything */
+	if (current ==  &init_task) {
+		struct task_struct *g, *p;
+
+		read_lock(&tasklist_lock);
+		for_each_process_thread(g, p)
+			count += pivot_process_refs(p, old_root, new_root);
+		read_unlock(&tasklist_lock);
+	} else
+		count = pivot_process_refs(current, old_root, new_root);
+
 	while (count--)
 		path_put(old_root);
 }
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa09..2615401e8c66 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -99,7 +99,7 @@ int show_path(struct seq_file *m, struct dentry *root);
 /*
  * fs_struct.c
  */
-extern void chroot_fs_refs(const struct path *, const struct path *);
+extern void pivot_fs_refs(const struct path *, const struct path *);
 
 /*
  * file_table.c
diff --git a/fs/namespace.c b/fs/namespace.c
index a67cbe42746d..b01749683439 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4681,7 +4681,7 @@ int path_pivot_root(struct path *new, struct path *old)
 	unlock_mount_hash();
 	mnt_notify_add(root_mnt);
 	mnt_notify_add(new_mnt);
-	chroot_fs_refs(&root, new);
+	pivot_fs_refs(&root, new);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-12 19:31           ` H. Peter Anvin
  2026-02-13  9:51             ` Aleksa Sarai
@ 2026-02-13 17:47             ` Askar Safin
  2026-02-13 20:27               ` H. Peter Anvin
  1 sibling, 1 reply; 42+ messages in thread
From: Askar Safin @ 2026-02-13 17:47 UTC (permalink / raw)
  To: hpa
  Cc: christian, cyphar, jack, linux-fsdevel, safinaskar, torvalds,
	viro, werner

"H. Peter Anvin" <hpa@zytor.com>:
> It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)

As well as I understand, kernel threads need to follow real root directory,
because they sometimes load firmware from /lib/firmware and call
user mode helpers, such as modprobe.

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 17:47               ` Linus Torvalds
@ 2026-02-13 18:27                 ` Askar Safin
  2026-02-13 18:39                   ` Linus Torvalds
  2026-02-13 18:42                   ` H. Peter Anvin
  2026-02-13 20:08                 ` H. Peter Anvin
  1 sibling, 2 replies; 42+ messages in thread
From: Askar Safin @ 2026-02-13 18:27 UTC (permalink / raw)
  To: torvalds
  Cc: christian, cyphar, hpa, jack, linux-fsdevel, safinaskar, viro,
	werner

Linus Torvalds <torvalds@linux-foundation.org>:
> +	/* Special case: 'init' changes everything */
> +	if (current ==  &init_task) {

pivot_root is not used by real inits now.

pivot_root was actively used by inits in classic initrd epoch, but
initrd is not used anymore.

pivot_root cannot be used by inits to switch from initramfs (in 6.19) because one
cannot unmount or move root of namespace (so everybody moves new root on
top of old using "mount --move").

Very recently (in 7.0) pivot_root became viable to switch from initramfs, because
of Brauner's nullfs patches. But distros didn't start to use this yet.

So, inits now don't use pivot_root.

So there is no need to special case it.

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 18:27                 ` Askar Safin
@ 2026-02-13 18:39                   ` Linus Torvalds
  2026-02-13 20:00                     ` H. Peter Anvin
  2026-02-14 12:15                     ` Christian Brauner
  2026-02-13 18:42                   ` H. Peter Anvin
  1 sibling, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2026-02-13 18:39 UTC (permalink / raw)
  To: Askar Safin; +Cc: christian, cyphar, hpa, jack, linux-fsdevel, viro, werner

On Fri, 13 Feb 2026 at 10:27, Askar Safin <safinaskar@gmail.com> wrote:
>
> pivot_root was actively used by inits in classic initrd epoch, but
> initrd is not used anymore.

Well, debian code search does find it being used in systemd, although
I didn't then chase down how it is used.

Of course, Dbian code search also pointed out to me that we have a
"pivot_root" thing in util-linux, so apparently some older things used
an external program and that "&init_task" check wouldn't work anyway.

             Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 18:27                 ` Askar Safin
  2026-02-13 18:39                   ` Linus Torvalds
@ 2026-02-13 18:42                   ` H. Peter Anvin
  1 sibling, 0 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-13 18:42 UTC (permalink / raw)
  To: Askar Safin, torvalds
  Cc: christian, cyphar, jack, linux-fsdevel, safinaskar, viro, werner

On February 13, 2026 10:27:32 AM PST, Askar Safin <safinaskar@gmail.com> wrote:
>Linus Torvalds <torvalds@linux-foundation.org>:
>> +	/* Special case: 'init' changes everything */
>> +	if (current ==  &init_task) {
>
>pivot_root is not used by real inits now.
>
>pivot_root was actively used by inits in classic initrd epoch, but
>initrd is not used anymore.
>
>pivot_root cannot be used by inits to switch from initramfs (in 6.19) because one
>cannot unmount or move root of namespace (so everybody moves new root on
>top of old using "mount --move").
>
>Very recently (in 7.0) pivot_root became viable to switch from initramfs, because
>of Brauner's nullfs patches. But distros didn't start to use this yet.
>
>So, inits now don't use pivot_root.
>
>So there is no need to special case it.
>

The question isn't if most use it, but what's fraction, if any, is legacy.

Not sure how to find out, other than maybe introduce a pivot_root2() or equivalent library call if it is now possible to do without the system call and deprecating the old one, alternatively making the legacy behavior a configuration opt-in, possibly a "once only".

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 18:39                   ` Linus Torvalds
@ 2026-02-13 20:00                     ` H. Peter Anvin
  2026-02-14 15:31                       ` Askar Safin
  2026-02-14 12:15                     ` Christian Brauner
  1 sibling, 1 reply; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-13 20:00 UTC (permalink / raw)
  To: Linus Torvalds, Askar Safin
  Cc: christian, cyphar, jack, linux-fsdevel, viro, werner

On February 13, 2026 10:39:58 AM PST, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>On Fri, 13 Feb 2026 at 10:27, Askar Safin <safinaskar@gmail.com> wrote:
>>
>> pivot_root was actively used by inits in classic initrd epoch, but
>> initrd is not used anymore.
>
>Well, debian code search does find it being used in systemd, although
>I didn't then chase down how it is used.
>
>Of course, Dbian code search also pointed out to me that we have a
>"pivot_root" thing in util-linux, so apparently some older things used
>an external program and that "&init_task" check wouldn't work anyway.
>
>             Linus

Yeah, no doubt. The real question is how much it actually matters as long as
there aren't kernel threads in the way.

I think I wrote a pivot_root(8) for klibc too (* which I now note has a harder
to use equivalent in the girl of nolibc in the kennel tree ;) *)

Either way, the documented way to use pivot_root(8) is not to rely on it to
change the actual process root at all [the same caveats are supposed to apply
to pivot_root(2), but was not written down in that man page:

>        Note that, depending on the implementation of pivot_root, root and
>        current working directory of the caller may or may not change. The
>        following is a sequence for invoking pivot_root that works in either
>        case, assuming that pivot_root and chroot are in the current PATH:
> 
>            cd new_root
>            pivot_root . put_old
>            exec chroot . command
> 
>        Note that chroot must be available under the old root and under the
>        new root, because pivot_root may or may not have implicitly changed
>        the root directory of the shell.
> 
>        Note that exec chroot changes the running executable, which is
>        necessary if the old root directory should be unmounted afterwards.
>        Also note that standard input, output, and error may still point to a
>        device on the old root file system, keeping it busy. They can easily
>        be changed when invoking chroot (see below; note the absence of
>        leading slashes to make it work whether pivot_root has changed the
>        shell’s root or not).


This was the originally intended operation, and the "chrooting all processes"
(for some definition of "all") was specifically to deal with the problem of
kernel threads holding the old root busy.

The description for how to deal with an NFS mount makes it even more clear
that the chrooting of other processes is NOT an architectural promise:

>            ifconfig lo 127.0.0.1 up   # for portmap
>            # configure Ethernet or such
>            portmap   # for lockd (implicitly started by mount)
>            mount -o ro 10.0.0.1:/my_root /mnt
>            killall portmap   # portmap keeps old root busy
>            cd /mnt
>            pivot_root . old_root
>            exec chroot . sh -c 'umount /old_root; exec /sbin/init' \
>              <dev/console >dev/console 2>&1

So if any real-life user of pivot_root(2|8) actually followed these
guidelines, things should Just Work[TM].

That, of course, is a big open question.

	-hpa


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 17:47               ` Linus Torvalds
  2026-02-13 18:27                 ` Askar Safin
@ 2026-02-13 20:08                 ` H. Peter Anvin
  1 sibling, 0 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-13 20:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Aleksa Sarai, Askar Safin, christian, jack, linux-fsdevel, viro,
	werner

On 2026-02-13 09:47, Linus Torvalds wrote:
> On Fri, 13 Feb 2026 at 07:04, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On February 13, 2026 5:46:34 AM PST, Aleksa Sarai <cyphar@cyphar.com> wrote:
>>>
>>> I think the init(rd) people will care more -- my impression this was
>>> only really needed because of the initrd switch (to change the root of
>>> kthreads to the new root)?
>>
>> That was the original motivation, yes. The real question is if they are anytime out there abusing it in other ways...
> 
> I guess we could just try it.
> 
> How does something like this feel to people?
> 
> I also changed the name from 'chroot' to 'pivot'. Comments?
> 
> Entirely untested in every way possible.
> 

Before doing anything else I would like to see if we can simply get rid of
that behavior completely, regardless of which process is running it. As I
posted in another email, it was always recommended to not rely on this
behavior since the beginning -- whether or not everyone listened is a whole
other ball of wax.

So to be correct: it was Werner who wrote pivot_root(8) for klibc as well,
this is literally the entire file:

> /* Change the root file system */
> 
> /* Written 2000 by Werner Almesberger */
> 
> #include <stdio.h>
> #include <sys/mount.h>
> 
> int main(int argc, const char **argv)
> {
>         if (argc != 3) {
>                 fprintf(stderr, "Usage: %s new_root put_old\n", argv[0]);
>                 return 1;
>         }
>         if (pivot_root(argv[1], argv[2]) < 0) {
>                 perror("pivot_root");
>                 return 1;
>         }
>         return 0;
> }

kinit from klibc never used pivot_root(), as it was specifically intended to
exist in the world of initramfs in rootfs and replace any further in-kernel
mounting code.

It was definitely a mistake to make initramfs == rootfs. rootfs should have
been nullfs, which is being fixed now, finally.

	-hpa

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 17:47             ` Askar Safin
@ 2026-02-13 20:27               ` H. Peter Anvin
  2026-02-13 20:35                 ` H. Peter Anvin
  2026-02-13 22:25                 ` Al Viro
  0 siblings, 2 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-13 20:27 UTC (permalink / raw)
  To: Askar Safin
  Cc: christian, cyphar, jack, linux-fsdevel, torvalds, viro, werner

On 2026-02-13 09:47, Askar Safin wrote:
> "H. Peter Anvin" <hpa@zytor.com>:
>> It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)
> 
> As well as I understand, kernel threads need to follow real root directory,
> because they sometimes load firmware from /lib/firmware and call
> user mode helpers, such as modprobe.
> 

If they are parked in nullfs, which is always overmounted by the global root,
that should Just Work[TM]. Path resolution based on that directory should
follow the mount point unless I am mistaken (which is possible, the Linux vfs
has changed a lot since the last time I did a deep dive.)

	-hpa


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 20:27               ` H. Peter Anvin
@ 2026-02-13 20:35                 ` H. Peter Anvin
  2026-02-13 22:25                 ` Al Viro
  1 sibling, 0 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-13 20:35 UTC (permalink / raw)
  To: Askar Safin
  Cc: christian, cyphar, jack, linux-fsdevel, torvalds, viro, werner

On 2026-02-13 12:27, H. Peter Anvin wrote:
> On 2026-02-13 09:47, Askar Safin wrote:
>> "H. Peter Anvin" <hpa@zytor.com>:
>>> It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)
>>
>> As well as I understand, kernel threads need to follow real root directory,
>> because they sometimes load firmware from /lib/firmware and call
>> user mode helpers, such as modprobe.
>>
> 
> If they are parked in nullfs, which is always overmounted by the global root,
> that should Just Work[TM]. Path resolution based on that directory should
> follow the mount point unless I am mistaken (which is possible, the Linux vfs
> has changed a lot since the last time I did a deep dive.)
> 

If that doesn't work, then it can be dealt with by resolving the pathname in
the namespace of the init process *at the time it needs to resolve the path*,
as opposed to having to cache a pointer to the root.

The init process is inherently special anyway, and it isn't like the
additional overhead will be significant for these kinds of heavyweight events.

	-hpa


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 20:27               ` H. Peter Anvin
  2026-02-13 20:35                 ` H. Peter Anvin
@ 2026-02-13 22:25                 ` Al Viro
  2026-02-13 23:00                   ` H. Peter Anvin
  1 sibling, 1 reply; 42+ messages in thread
From: Al Viro @ 2026-02-13 22:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Askar Safin, christian, cyphar, jack, linux-fsdevel, torvalds,
	werner

On Fri, Feb 13, 2026 at 12:27:46PM -0800, H. Peter Anvin wrote:
> On 2026-02-13 09:47, Askar Safin wrote:
> > "H. Peter Anvin" <hpa@zytor.com>:
> >> It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)
> > 
> > As well as I understand, kernel threads need to follow real root directory,
> > because they sometimes load firmware from /lib/firmware and call
> > user mode helpers, such as modprobe.
> > 
> 
> If they are parked in nullfs, which is always overmounted by the global root,
> that should Just Work[TM]. Path resolution based on that directory should
> follow the mount point unless I am mistaken (which is possible, the Linux vfs
> has changed a lot since the last time I did a deep dive.)

You are, and it had always been that way.  We do *not* follow mounts at
the starting point.  /../lib would work, /lib won't.  I'd love to deal with
that wart, but that would break early boot on unknown number of boxen and
breakage that early is really unpleasant to debug.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 17:34         ` Askar Safin
@ 2026-02-13 22:28           ` Al Viro
  2026-02-14 16:16             ` Askar Safin
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2026-02-13 22:28 UTC (permalink / raw)
  To: Askar Safin
  Cc: christian, hpa, jack, linux-fsdevel, torvalds, werner,
	Aleksa Sarai

On Fri, Feb 13, 2026 at 08:34:27PM +0300, Askar Safin wrote:
> Al Viro <viro@zeniv.linux.org.uk>:
> > We could restrict the set of those who could be flipped, but I doubt
> > that "could ptrace" is workable - that would exclude all kernel threads,
> > and that could easily break existing setups in hard-to-recover ways.
> 
> No. Kernel threads share cwd and root with init, because we create PID 1
> with CLONE_FS:
> 
> https://elixir.bootlin.com/linux/v6.19-rc5/source/init/main.c#L722

knfsd does not.  There's an explicit "give me a separate fs_struct" since
it want ->umask to be independent.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 22:25                 ` Al Viro
@ 2026-02-13 23:00                   ` H. Peter Anvin
  2026-02-13 23:41                     ` Al Viro
                                       ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-13 23:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Askar Safin, christian, cyphar, jack, linux-fsdevel, torvalds,
	werner

On February 13, 2026 2:25:21 PM PST, Al Viro <viro@zeniv.linux.org.uk> wrote:
>On Fri, Feb 13, 2026 at 12:27:46PM -0800, H. Peter Anvin wrote:
>> On 2026-02-13 09:47, Askar Safin wrote:
>> > "H. Peter Anvin" <hpa@zytor.com>:
>> >> It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)
>> > 
>> > As well as I understand, kernel threads need to follow real root directory,
>> > because they sometimes load firmware from /lib/firmware and call
>> > user mode helpers, such as modprobe.
>> > 
>> 
>> If they are parked in nullfs, which is always overmounted by the global root,
>> that should Just Work[TM]. Path resolution based on that directory should
>> follow the mount point unless I am mistaken (which is possible, the Linux vfs
>> has changed a lot since the last time I did a deep dive.)
>
>You are, and it had always been that way.  We do *not* follow mounts at
>the starting point.  /../lib would work, /lib won't.  I'd love to deal with
>that wart, but that would break early boot on unknown number of boxen and
>breakage that early is really unpleasant to debug.

Well, it ought to be easy to make the kernel implicitly prefix /../ for kernel-upcall pathnames, which is more or less the same concept as, but should be a lot simpler than, looking up the init process root.

So if needing to do a side trip via /.. is the worst thing then I have a hard time seeing a problem.

Incidentally, how is . treated?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 23:41                     ` Al Viro
@ 2026-02-13 23:40                       ` H. Peter Anvin
  0 siblings, 0 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-13 23:40 UTC (permalink / raw)
  To: Al Viro
  Cc: Askar Safin, christian, cyphar, jack, linux-fsdevel, torvalds,
	werner

On February 13, 2026 3:41:47 PM PST, Al Viro <viro@zeniv.linux.org.uk> wrote:
>On Fri, Feb 13, 2026 at 03:00:49PM -0800, H. Peter Anvin wrote:
>
>> Incidentally, how is . treated?
>
>Explicit no-op at any point of pathname.  ".." is
>	while dentry == root(mount) && (mount,dentry) != nd->root
>		(mount,dentry) = under(mount, dentry)
>	if (mount,dentry) != nd->root
>		dentry = parent(dentry)
>	step_into(dentry)
>and crossing into overmounts (if any) happens in step_into().
>
>See handle_dots() and its callers...

Ok, I was just curious.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 23:00                   ` H. Peter Anvin
@ 2026-02-13 23:41                     ` Al Viro
  2026-02-13 23:40                       ` H. Peter Anvin
  2026-02-14 12:42                     ` Christian Brauner
  2026-02-14 16:20                     ` Askar Safin
  2 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2026-02-13 23:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Askar Safin, christian, cyphar, jack, linux-fsdevel, torvalds,
	werner

On Fri, Feb 13, 2026 at 03:00:49PM -0800, H. Peter Anvin wrote:

> Incidentally, how is . treated?

Explicit no-op at any point of pathname.  ".." is
	while dentry == root(mount) && (mount,dentry) != nd->root
		(mount,dentry) = under(mount, dentry)
	if (mount,dentry) != nd->root
		dentry = parent(dentry)
	step_into(dentry)
and crossing into overmounts (if any) happens in step_into().

See handle_dots() and its callers...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 18:39                   ` Linus Torvalds
  2026-02-13 20:00                     ` H. Peter Anvin
@ 2026-02-14 12:15                     ` Christian Brauner
  2026-02-14 16:18                       ` Linus Torvalds
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Brauner @ 2026-02-14 12:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Askar Safin, christian, cyphar, hpa, jack, linux-fsdevel, viro,
	werner

On Fri, Feb 13, 2026 at 10:39:58AM -0800, Linus Torvalds wrote:
> On Fri, 13 Feb 2026 at 10:27, Askar Safin <safinaskar@gmail.com> wrote:
> >
> > pivot_root was actively used by inits in classic initrd epoch, but
> > initrd is not used anymore.
> 
> Well, debian code search does find it being used in systemd, although
> I didn't then chase down how it is used.

I've explained that in my other mail. It's used to setup containers just
like any other container runtime does. Everytime you setup a container
pivot_root() is used. systemd also uses it to try and figure out whether
pivot_root() works in the initramfs and falls back to move_mount()
otherwise.

But my point has been: we don't need it anymore. As soon as we have
MOVE_MOUNT_BENEATH working with the rootfs we can just treat
pivot_root() as deprecated and point out that this should be used.

systemd/container just mounts the new rootfs beneath the old rootfs.
systemd/container fchdir()s into the new rootfs, unmounts the old one
then chroots into the new one and done. That also works in the initramfs
as CLONE_FS guarantees that every kernel thread will see the updated
rootfs. No pivot_root() needed.

> Of course, Dbian code search also pointed out to me that we have a
> "pivot_root" thing in util-linux, so apparently some older things used
> an external program and that "&init_task" check wouldn't work anyway.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 23:00                   ` H. Peter Anvin
  2026-02-13 23:41                     ` Al Viro
@ 2026-02-14 12:42                     ` Christian Brauner
  2026-02-15  0:48                       ` H. Peter Anvin
  2026-02-14 16:20                     ` Askar Safin
  2 siblings, 1 reply; 42+ messages in thread
From: Christian Brauner @ 2026-02-14 12:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Al Viro, Askar Safin, christian, cyphar, jack, linux-fsdevel,
	torvalds, werner

On Fri, Feb 13, 2026 at 03:00:49PM -0800, H. Peter Anvin wrote:
> On February 13, 2026 2:25:21 PM PST, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >On Fri, Feb 13, 2026 at 12:27:46PM -0800, H. Peter Anvin wrote:
> >> On 2026-02-13 09:47, Askar Safin wrote:
> >> > "H. Peter Anvin" <hpa@zytor.com>:
> >> >> It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)
> >> > 
> >> > As well as I understand, kernel threads need to follow real root directory,
> >> > because they sometimes load firmware from /lib/firmware and call
> >> > user mode helpers, such as modprobe.
> >> > 
> >> 
> >> If they are parked in nullfs, which is always overmounted by the global root,
> >> that should Just Work[TM]. Path resolution based on that directory should
> >> follow the mount point unless I am mistaken (which is possible, the Linux vfs
> >> has changed a lot since the last time I did a deep dive.)
> >
> >You are, and it had always been that way.  We do *not* follow mounts at
> >the starting point.  /../lib would work, /lib won't.  I'd love to deal with
> >that wart, but that would break early boot on unknown number of boxen and
> >breakage that early is really unpleasant to debug.
> 
> Well, it ought to be easy to make the kernel implicitly prefix /../ for kernel-upcall pathnames, which is more or less the same concept as, but should be a lot simpler than, looking up the init process root.

I don't think parking kernel threads unconditionally in nullfs is going
to work. This will not just break firmware loading it will also break
coredump handling and a bunch of other stuff that relies on root based
lookup.

I think introducing all this new machinery just to improve
pivot_root()'s broken semantics is pointless. Let's just let it die. We
have all the tools to avoid it ready. OPEN_TREE_NAMESPACE for containers
so pivot_root() isn't needed at all anymore for that case and
MOVE_MOUNT_BENEATH for the rootfs for v7.1 and then even if someone
wanted to replace the rootfs that whole chroot_fs_refs() dance is not
needed at all anymore.

The only reason to do it would be to make sure that no one accidently
pins the old rootfs anymore but that's not a strong argument anyway:

- If done during boot it's pointless because most of the times there's
  exactly one process running and CLONE_FS will guaratee that kernel
  threads pick up the rootfs change as well.

- If done during container setup it's especially useless because again
  only the process setting up the container will be around.

- It doesn't at all deal with file descriptors that pin the old rootfs
  which is the much more likely case.

If anyone actually does pivot_root() on a live system in the initial
user namespace with a full userspace running work without introducing
all kinds of breakage they should probably reexamine some design
decisions.

I don't think we need to fix it I think we need to make it unused and I
think that's possible as I tried to argue.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 20:00                     ` H. Peter Anvin
@ 2026-02-14 15:31                       ` Askar Safin
  2026-02-15  0:52                         ` H. Peter Anvin
  0 siblings, 1 reply; 42+ messages in thread
From: Askar Safin @ 2026-02-14 15:31 UTC (permalink / raw)
  To: hpa; +Cc: christian, cyphar, jack, linux-fsdevel, torvalds, viro, werner

"H. Peter Anvin" <hpa@zytor.com>:
> Either way, the documented way to use pivot_root(8) is not to rely on it to
> change the actual process root at all [the same caveats are supposed to apply
> to pivot_root(2), but was not written down in that man page:

Unfortunately, pivot_root(2) (as opposed to pivot_root(8)) contains
effectively contrary thing:

> For many years, this manual page carried the following text:
>
>     pivot_root() may or may not change the current root and the current
>     working directory of any processes or threads which use the old root
>     directory. The caller of pivot_root() must ensure that processes with
>     root or current working directory at the old root operate correctly
>     in either case. An easy way to ensure this is to change their root
>     and current working directory to new_root before invoking pivot_root().
>
> This text, written before the system call implementation was even
> finalized in the kernel, was probably intended to warn users at that time
> that the implementation might change before final release. However, the
> behavior stated in DESCRIPTION has remained consistent since this system
> call was first implemented and will not change now.

( https://manpages.debian.org/unstable/manpages-dev/pivot_root.2.en.html ).

So effectively this pivot_root(2) note cancels pivot_root(8) note.

Note: I link here to manpages.debian.org as opposed to man7.org, because
manpages.debian.org usually contains newer versions of mans.

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 22:28           ` Al Viro
@ 2026-02-14 16:16             ` Askar Safin
  0 siblings, 0 replies; 42+ messages in thread
From: Askar Safin @ 2026-02-14 16:16 UTC (permalink / raw)
  To: viro; +Cc: christian, cyphar, hpa, jack, linux-fsdevel, torvalds, werner

Al Viro <viro@zeniv.linux.org.uk>:
> knfsd does not.  There's an explicit "give me a separate fs_struct" since
> it want ->umask to be independent.

As well as I understand, knfsd is NFS server.
NFS server starts when the system is already up and running (as opposed to
NFS client), and all early rootfs transitions already done.

So this should not matter.

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-14 12:15                     ` Christian Brauner
@ 2026-02-14 16:18                       ` Linus Torvalds
  2026-02-14 17:40                         ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2026-02-14 16:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Askar Safin, christian, cyphar, hpa, jack, linux-fsdevel, viro,
	werner

On Sat, 14 Feb 2026 at 04:15, Christian Brauner <brauner@kernel.org> wrote:
>
> But my point has been: we don't need it anymore.

I don't think that makes much of a difference. We'd still need to have
pivot_root() around for the legacy case, and I do think we want to
make sure it can't be used as an attack vector (perhaps not directly,
but by possibly confusing other people).

Not that you should use containers as security boundaries anyway, but
I do think the current behavior needs to be limited. Because it's
dangerous.

Maybe just limiting it by namespace is ok.

Because even if the "white hat" users stop using pivot_root, we'd keep
it around for legacy - and we want to limit the damage.

            Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-13 23:00                   ` H. Peter Anvin
  2026-02-13 23:41                     ` Al Viro
  2026-02-14 12:42                     ` Christian Brauner
@ 2026-02-14 16:20                     ` Askar Safin
  2026-02-15  0:49                       ` H. Peter Anvin
  2 siblings, 1 reply; 42+ messages in thread
From: Askar Safin @ 2026-02-14 16:20 UTC (permalink / raw)
  To: hpa; +Cc: christian, cyphar, jack, linux-fsdevel, torvalds, viro, werner

"H. Peter Anvin" <hpa@zytor.com>:
> Well, it ought to be easy to make the kernel implicitly prefix /../ for kernel-upcall pathnames

If some kernel thread executes user mode helper /../sbin/modprobe , then
the thread will execute right binary, but still with wrong root (so modprobe
program itself will likely misbehave).

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-14 16:18                       ` Linus Torvalds
@ 2026-02-14 17:40                         ` Al Viro
  2026-02-17  8:35                           ` Christian Brauner
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2026-02-14 17:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Askar Safin, christian, cyphar, hpa, jack,
	linux-fsdevel, werner

On Sat, Feb 14, 2026 at 08:18:55AM -0800, Linus Torvalds wrote:
> On Sat, 14 Feb 2026 at 04:15, Christian Brauner <brauner@kernel.org> wrote:
> >
> > But my point has been: we don't need it anymore.
> 
> I don't think that makes much of a difference. We'd still need to have
> pivot_root() around for the legacy case, and I do think we want to
> make sure it can't be used as an attack vector (perhaps not directly,
> but by possibly confusing other people).
> 
> Not that you should use containers as security boundaries anyway, but
> I do think the current behavior needs to be limited. Because it's
> dangerous.
> 
> Maybe just limiting it by namespace is ok.
> 
> Because even if the "white hat" users stop using pivot_root, we'd keep
> it around for legacy - and we want to limit the damage.

Indeed.  Let's backtrack a bit:
1) pivot_root() screwing around with root/pwd is unfortunate, but it's not
going away and neither is pivot_root() itself - not for several years,
at the very least.
2) having the set of affected threads limited shouldn't be a problem, as
long as we don't get too enthisiastic about that (relying upon the
assumptions about kernel threads getting picked along with init is
a bad idea, IMO).
3) walking into a container and expecting that mount tree won't get fucked
under you is a Bloody Bad Idea(tm) even without pivot_root() in the mix.
Sure, having pwd changed under you is still nice to avoid, but that's far
from the only thing to be cautious about.  That doesn't affect the
desirability of (2).
4) as long as we change pwd and root of _any_ threads that are not aware of
pivot_root() being done, we ought to have that change atomic wrt fork();
if child gets spawned by any such thread, the resulting state should not
depend upon the timinig of fork() vs. pivot_root().  The same goes for
unshare(2) (and while we are at it, in absense of pivot_root(2) and other
mount-related activity a caller of unshare(2) that gets ENOMEM can reasonably
expect their mount tree to remain unchanged).

Does anybody have objections to the above?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-14 12:42                     ` Christian Brauner
@ 2026-02-15  0:48                       ` H. Peter Anvin
  2026-02-17  8:37                         ` Christian Brauner
  0 siblings, 1 reply; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-15  0:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Askar Safin, christian, cyphar, jack, linux-fsdevel,
	torvalds, werner

On February 14, 2026 4:42:32 AM PST, Christian Brauner <brauner@kernel.org> wrote:
>On Fri, Feb 13, 2026 at 03:00:49PM -0800, H. Peter Anvin wrote:
>> On February 13, 2026 2:25:21 PM PST, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> >On Fri, Feb 13, 2026 at 12:27:46PM -0800, H. Peter Anvin wrote:
>> >> On 2026-02-13 09:47, Askar Safin wrote:
>> >> > "H. Peter Anvin" <hpa@zytor.com>:
>> >> >> It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)
>> >> > 
>> >> > As well as I understand, kernel threads need to follow real root directory,
>> >> > because they sometimes load firmware from /lib/firmware and call
>> >> > user mode helpers, such as modprobe.
>> >> > 
>> >> 
>> >> If they are parked in nullfs, which is always overmounted by the global root,
>> >> that should Just Work[TM]. Path resolution based on that directory should
>> >> follow the mount point unless I am mistaken (which is possible, the Linux vfs
>> >> has changed a lot since the last time I did a deep dive.)
>> >
>> >You are, and it had always been that way.  We do *not* follow mounts at
>> >the starting point.  /../lib would work, /lib won't.  I'd love to deal with
>> >that wart, but that would break early boot on unknown number of boxen and
>> >breakage that early is really unpleasant to debug.
>> 
>> Well, it ought to be easy to make the kernel implicitly prefix /../ for kernel-upcall pathnames, which is more or less the same concept as, but should be a lot simpler than, looking up the init process root.
>
>I don't think parking kernel threads unconditionally in nullfs is going
>to work. This will not just break firmware loading it will also break
>coredump handling and a bunch of other stuff that relies on root based
>lookup.
>
>I think introducing all this new machinery just to improve
>pivot_root()'s broken semantics is pointless. Let's just let it die. We
>have all the tools to avoid it ready. OPEN_TREE_NAMESPACE for containers
>so pivot_root() isn't needed at all anymore for that case and
>MOVE_MOUNT_BENEATH for the rootfs for v7.1 and then even if someone
>wanted to replace the rootfs that whole chroot_fs_refs() dance is not
>needed at all anymore.
>
>The only reason to do it would be to make sure that no one accidently
>pins the old rootfs anymore but that's not a strong argument anyway:
>
>- If done during boot it's pointless because most of the times there's
>  exactly one process running and CLONE_FS will guaratee that kernel
>  threads pick up the rootfs change as well.
>
>- If done during container setup it's especially useless because again
>  only the process setting up the container will be around.
>
>- It doesn't at all deal with file descriptors that pin the old rootfs
>  which is the much more likely case.
>
>If anyone actually does pivot_root() on a live system in the initial
>user namespace with a full userspace running work without introducing
>all kinds of breakage they should probably reexamine some design
>decisions.
>
>I don't think we need to fix it I think we need to make it unused and I
>think that's possible as I tried to argue.

You missed the bit that the kernel tasks would use /.. to get to the "real" root.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-14 16:20                     ` Askar Safin
@ 2026-02-15  0:49                       ` H. Peter Anvin
  0 siblings, 0 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-15  0:49 UTC (permalink / raw)
  To: Askar Safin
  Cc: christian, cyphar, jack, linux-fsdevel, torvalds, viro, werner

On February 14, 2026 8:20:39 AM PST, Askar Safin <safinaskar@gmail.com> wrote:
>"H. Peter Anvin" <hpa@zytor.com>:
>> Well, it ought to be easy to make the kernel implicitly prefix /../ for kernel-upcall pathnames
>
>If some kernel thread executes user mode helper /../sbin/modprobe , then
>the thread will execute right binary, but still with wrong root (so modprobe
>program itself will likely misbehave).
>

You're right. What is really needed is chroot("/..") before the exec. 

I was thinking of file accesses like firmware.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-14 15:31                       ` Askar Safin
@ 2026-02-15  0:52                         ` H. Peter Anvin
  0 siblings, 0 replies; 42+ messages in thread
From: H. Peter Anvin @ 2026-02-15  0:52 UTC (permalink / raw)
  To: Askar Safin
  Cc: christian, cyphar, jack, linux-fsdevel, torvalds, viro, werner

On February 14, 2026 7:31:43 AM PST, Askar Safin <safinaskar@gmail.com> wrote:
>"H. Peter Anvin" <hpa@zytor.com>:
>> Either way, the documented way to use pivot_root(8) is not to rely on it to
>> change the actual process root at all [the same caveats are supposed to apply
>> to pivot_root(2), but was not written down in that man page:
>
>Unfortunately, pivot_root(2) (as opposed to pivot_root(8)) contains
>effectively contrary thing:
>
>> For many years, this manual page carried the following text:
>>
>>     pivot_root() may or may not change the current root and the current
>>     working directory of any processes or threads which use the old root
>>     directory. The caller of pivot_root() must ensure that processes with
>>     root or current working directory at the old root operate correctly
>>     in either case. An easy way to ensure this is to change their root
>>     and current working directory to new_root before invoking pivot_root().
>>
>> This text, written before the system call implementation was even
>> finalized in the kernel, was probably intended to warn users at that time
>> that the implementation might change before final release. However, the
>> behavior stated in DESCRIPTION has remained consistent since this system
>> call was first implemented and will not change now.
>
>( https://manpages.debian.org/unstable/manpages-dev/pivot_root.2.en.html ).
>
>So effectively this pivot_root(2) note cancels pivot_root(8) note.
>
>Note: I link here to manpages.debian.org as opposed to man7.org, because
>manpages.debian.org usually contains newer versions of mans.
>

Sigh.

The difference between "descriptive" documentation and "prescriptive" documentation — what constraints actually apply to the interface contract.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-14 17:40                         ` Al Viro
@ 2026-02-17  8:35                           ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2026-02-17  8:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Askar Safin, christian, cyphar, hpa, jack,
	linux-fsdevel, werner

On Sat, Feb 14, 2026 at 05:40:30PM +0000, Al Viro wrote:
> On Sat, Feb 14, 2026 at 08:18:55AM -0800, Linus Torvalds wrote:
> > On Sat, 14 Feb 2026 at 04:15, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > But my point has been: we don't need it anymore.
> > 
> > I don't think that makes much of a difference. We'd still need to have
> > pivot_root() around for the legacy case, and I do think we want to
> > make sure it can't be used as an attack vector (perhaps not directly,
> > but by possibly confusing other people).
> > 
> > Not that you should use containers as security boundaries anyway, but
> > I do think the current behavior needs to be limited. Because it's
> > dangerous.
> > 
> > Maybe just limiting it by namespace is ok.
> > 
> > Because even if the "white hat" users stop using pivot_root, we'd keep
> > it around for legacy - and we want to limit the damage.
> 
> Indeed.  Let's backtrack a bit:
> 1) pivot_root() screwing around with root/pwd is unfortunate, but it's not
> going away and neither is pivot_root() itself - not for several years,
> at the very least.
> 2) having the set of affected threads limited shouldn't be a problem, as
> long as we don't get too enthisiastic about that (relying upon the
> assumptions about kernel threads getting picked along with init is
> a bad idea, IMO).
> 3) walking into a container and expecting that mount tree won't get fucked
> under you is a Bloody Bad Idea(tm) even without pivot_root() in the mix.
> Sure, having pwd changed under you is still nice to avoid, but that's far
> from the only thing to be cautious about.  That doesn't affect the
> desirability of (2).

It's also not how people do it anyway.

> 4) as long as we change pwd and root of _any_ threads that are not aware of
> pivot_root() being done, we ought to have that change atomic wrt fork();
> if child gets spawned by any such thread, the resulting state should not
> depend upon the timinig of fork() vs. pivot_root().  The same goes for
> unshare(2) (and while we are at it, in absense of pivot_root(2) and other
> mount-related activity a caller of unshare(2) that gets ENOMEM can reasonably
> expect their mount tree to remain unchanged).
> 
> Does anybody have objections to the above?

Sounds good to me.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] pivot_root(2) races
  2026-02-15  0:48                       ` H. Peter Anvin
@ 2026-02-17  8:37                         ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2026-02-17  8:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Al Viro, Askar Safin, christian, cyphar, jack, linux-fsdevel,
	torvalds, werner

On Sat, Feb 14, 2026 at 04:48:17PM -0800, H. Peter Anvin wrote:
> On February 14, 2026 4:42:32 AM PST, Christian Brauner <brauner@kernel.org> wrote:
> >On Fri, Feb 13, 2026 at 03:00:49PM -0800, H. Peter Anvin wrote:
> >> On February 13, 2026 2:25:21 PM PST, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >> >On Fri, Feb 13, 2026 at 12:27:46PM -0800, H. Peter Anvin wrote:
> >> >> On 2026-02-13 09:47, Askar Safin wrote:
> >> >> > "H. Peter Anvin" <hpa@zytor.com>:
> >> >> >> It would be interesting to see how much would break if pivot_root() was restricted (with kernel threads parked in nullfs safely out of the way.)
> >> >> > 
> >> >> > As well as I understand, kernel threads need to follow real root directory,
> >> >> > because they sometimes load firmware from /lib/firmware and call
> >> >> > user mode helpers, such as modprobe.
> >> >> > 
> >> >> 
> >> >> If they are parked in nullfs, which is always overmounted by the global root,
> >> >> that should Just Work[TM]. Path resolution based on that directory should
> >> >> follow the mount point unless I am mistaken (which is possible, the Linux vfs
> >> >> has changed a lot since the last time I did a deep dive.)
> >> >
> >> >You are, and it had always been that way.  We do *not* follow mounts at
> >> >the starting point.  /../lib would work, /lib won't.  I'd love to deal with
> >> >that wart, but that would break early boot on unknown number of boxen and
> >> >breakage that early is really unpleasant to debug.
> >> 
> >> Well, it ought to be easy to make the kernel implicitly prefix /../ for kernel-upcall pathnames, which is more or less the same concept as, but should be a lot simpler than, looking up the init process root.
> >
> >I don't think parking kernel threads unconditionally in nullfs is going
> >to work. This will not just break firmware loading it will also break
> >coredump handling and a bunch of other stuff that relies on root based
> >lookup.
> >
> >I think introducing all this new machinery just to improve
> >pivot_root()'s broken semantics is pointless. Let's just let it die. We
> >have all the tools to avoid it ready. OPEN_TREE_NAMESPACE for containers
> >so pivot_root() isn't needed at all anymore for that case and
> >MOVE_MOUNT_BENEATH for the rootfs for v7.1 and then even if someone
> >wanted to replace the rootfs that whole chroot_fs_refs() dance is not
> >needed at all anymore.
> >
> >The only reason to do it would be to make sure that no one accidently
> >pins the old rootfs anymore but that's not a strong argument anyway:
> >
> >- If done during boot it's pointless because most of the times there's
> >  exactly one process running and CLONE_FS will guaratee that kernel
> >  threads pick up the rootfs change as well.
> >
> >- If done during container setup it's especially useless because again
> >  only the process setting up the container will be around.
> >
> >- It doesn't at all deal with file descriptors that pin the old rootfs
> >  which is the much more likely case.
> >
> >If anyone actually does pivot_root() on a live system in the initial
> >user namespace with a full userspace running work without introducing
> >all kinds of breakage they should probably reexamine some design
> >decisions.
> >
> >I don't think we need to fix it I think we need to make it unused and I
> >think that's possible as I tried to argue.
> 
> You missed the bit that the kernel tasks would use /.. to get to the "real" root.

My point was rather just that I don't know if we want to have yet
another set of helpers just for this. Sure, we can do it but is
this something that's really urgent right now. I'm not so sure.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2026-02-17  8:37 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-09  0:34 [RFC] pivot_root(2) races Al Viro
2026-02-09  5:49 ` Linus Torvalds
2026-02-09  5:53   ` H. Peter Anvin
2026-02-09  6:34   ` Al Viro
2026-02-09  6:44     ` Linus Torvalds
2026-02-09 11:53       ` Christian Brauner
2026-02-12 17:17       ` Askar Safin
2026-02-12 19:11         ` Linus Torvalds
2026-02-12 19:31           ` H. Peter Anvin
2026-02-13  9:51             ` Aleksa Sarai
2026-02-13 17:47             ` Askar Safin
2026-02-13 20:27               ` H. Peter Anvin
2026-02-13 20:35                 ` H. Peter Anvin
2026-02-13 22:25                 ` Al Viro
2026-02-13 23:00                   ` H. Peter Anvin
2026-02-13 23:41                     ` Al Viro
2026-02-13 23:40                       ` H. Peter Anvin
2026-02-14 12:42                     ` Christian Brauner
2026-02-15  0:48                       ` H. Peter Anvin
2026-02-17  8:37                         ` Christian Brauner
2026-02-14 16:20                     ` Askar Safin
2026-02-15  0:49                       ` H. Peter Anvin
2026-02-13 13:46           ` Aleksa Sarai
2026-02-13 15:03             ` H. Peter Anvin
2026-02-13 17:47               ` Linus Torvalds
2026-02-13 18:27                 ` Askar Safin
2026-02-13 18:39                   ` Linus Torvalds
2026-02-13 20:00                     ` H. Peter Anvin
2026-02-14 15:31                       ` Askar Safin
2026-02-15  0:52                         ` H. Peter Anvin
2026-02-14 12:15                     ` Christian Brauner
2026-02-14 16:18                       ` Linus Torvalds
2026-02-14 17:40                         ` Al Viro
2026-02-17  8:35                           ` Christian Brauner
2026-02-13 18:42                   ` H. Peter Anvin
2026-02-13 20:08                 ` H. Peter Anvin
2026-02-12 19:22       ` Al Viro
2026-02-13 17:34         ` Askar Safin
2026-02-13 22:28           ` Al Viro
2026-02-14 16:16             ` Askar Safin
2026-02-12 13:23 ` Askar Safin
2026-02-12 19:25   ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox