Possible bug with open between unshare(CLONE

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

* Possible bug with open between unshare(CLONE_NEWNS) calls
@ 2025-01-15 18:56 Boris Burkov
  2025-01-16  4:14 ` Al Viro
  2025-01-16 10:46 ` Christian Brauner
  0 siblings, 2 replies; 9+ messages in thread
From: Boris Burkov @ 2025-01-15 18:56 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: daan.j.demeyer

Hello,

If we run the following C code:

unshare(CLONE_NEWNS);
int fd = open("/dev/loop0", O_RDONLY)
unshare(CLONE_NEWNS);

Then after the second unshare, the mount hierarchy created by the first
unshare is fully dereferenced and gets torn down, leaving the file
pointed to by fd with a broken dentry.

Specifically, subsequent calls to d_path on its path resolve to
"/loop0". I was able to confirm this with drgn, and it has caused an
unexpected failure in mkosi/systemd-repart attempting to mount a btrfs
filesystem through such an fd, since btrfs uses d_path to resolve the
source device file path fully.

I confirmed that this is definitely due to the first unshare mount
namespace going away by:
1. printks/bpftrace the copy_root path in the kernel
2. rewriting my test program to fork after the first unshare to keep
that namespace referenced. In this case, the fd is not broken after the
second unshare.

My question is:
Is this expected behavior with respect to mount reference counts and
namespace teardown?

If I mount a filesystem and have a running program with an open file
descriptor in that filesystem, I would expect unmounting that filesystem
to fail with EBUSY, so it stands to reason that the automatic unmount
that happens from tearing down the mount namespace of the first unshare
should respect similar semantics and either return EBUSY or at least
have the lazy umount behavior and not wreck the still referenced mount
objects.

If this behavior seems like a bug to people better versed in the
expected behavior of namespaces, I would be happy to work on a fix.

Thanks,
Boris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible bug with open between unshare(CLONE_NEWNS) calls
  2025-01-15 18:56 Possible bug with open between unshare(CLONE_NEWNS) calls Boris Burkov
@ 2025-01-16  4:14 ` Al Viro
  2025-01-16  4:52   ` Boris Burkov
  2025-01-16 10:46 ` Christian Brauner
  1 sibling, 1 reply; 9+ messages in thread
From: Al Viro @ 2025-01-16  4:14 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-fsdevel, daan.j.demeyer

On Wed, Jan 15, 2025 at 10:56:08AM -0800, Boris Burkov wrote:
> Hello,
> 
> If we run the following C code:
> 
> unshare(CLONE_NEWNS);
> int fd = open("/dev/loop0", O_RDONLY)
> unshare(CLONE_NEWNS);
> 
> Then after the second unshare, the mount hierarchy created by the first
> unshare is fully dereferenced and gets torn down, leaving the file
> pointed to by fd with a broken dentry.

No, it does not.  dentry is just fine and so's mount - it is not
attached to anything, but it's alive and well.

> Specifically, subsequent calls to d_path on its path resolve to
> "/loop0".

> My question is:
> Is this expected behavior with respect to mount reference counts and
> namespace teardown?

Yes.  Again, mount is still alive; it is detached, but that's it.

> If I mount a filesystem and have a running program with an open file
> descriptor in that filesystem, I would expect unmounting that filesystem
> to fail with EBUSY, so it stands to reason that the automatic unmount
> that happens from tearing down the mount namespace of the first unshare
> should respect similar semantics and either return EBUSY or at least
> have the lazy umount behavior and not wreck the still referenced mount
> objects.

Lazy umount is precisely what is happening.  Referenced mount object is
there as long as it is referenced.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible bug with open between unshare(CLONE_NEWNS) calls
  2025-01-16  4:14 ` Al Viro
@ 2025-01-16  4:52   ` Boris Burkov
  2025-01-16  5:12     ` Al Viro
  0 siblings, 1 reply; 9+ messages in thread
From: Boris Burkov @ 2025-01-16  4:52 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, daan.j.demeyer

On Thu, Jan 16, 2025 at 04:14:59AM +0000, Al Viro wrote:
> On Wed, Jan 15, 2025 at 10:56:08AM -0800, Boris Burkov wrote:
> > Hello,
> > 
> > If we run the following C code:
> > 
> > unshare(CLONE_NEWNS);
> > int fd = open("/dev/loop0", O_RDONLY)
> > unshare(CLONE_NEWNS);
> > 
> > Then after the second unshare, the mount hierarchy created by the first
> > unshare is fully dereferenced and gets torn down, leaving the file
> > pointed to by fd with a broken dentry.
> 
> No, it does not.  dentry is just fine and so's mount - it is not
> attached to anything, but it's alive and well.
> 
> > Specifically, subsequent calls to d_path on its path resolve to
> > "/loop0".
> 
> > My question is:
> > Is this expected behavior with respect to mount reference counts and
> > namespace teardown?
> 
> Yes.  Again, mount is still alive; it is detached, but that's it.
> 
> > If I mount a filesystem and have a running program with an open file
> > descriptor in that filesystem, I would expect unmounting that filesystem
> > to fail with EBUSY, so it stands to reason that the automatic unmount
> > that happens from tearing down the mount namespace of the first unshare
> > should respect similar semantics and either return EBUSY or at least
> > have the lazy umount behavior and not wreck the still referenced mount
> > objects.
> 
> Lazy umount is precisely what is happening.  Referenced mount object is
> there as long as it is referenced.

Thank you for your reply and explanations.

So in your opinion, what is the bug here?

btrfs started using d_path and checking that the device source file was
in /dev, to avoid putting nonsense like /proc/self/fd/3 into the mount
table, where it makes userspace fall over. 
(https://bugzilla.suse.com/show_bug.cgi?id=1230641)

I'd be loathe to call the userspace program hitting the
'unshare; open; unshare' sequence buggy, as we don't fail any of the
syscalls in a particularly sensible way. And if you use unshare -m, you
now have to vet the program you call doesn't use unshare itself?

You've taught me that d_path is working as intended in the face of the
namespace lifetime, so we can't rely on it to produce the "real"
(original?) path, in general.

So, to me, that leaves the bug as
"btrfs shouldn't assume/validate that device files will be in /dev."

We can do the d_path resolution thing anyway to cover the common case,
in the bugzilla, but shouldn't fail on something like /loop0 when that
is what we get out of d_path?

Boris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible bug with open between unshare(CLONE_NEWNS) calls
  2025-01-16  4:52   ` Boris Burkov
@ 2025-01-16  5:12     ` Al Viro
  0 siblings, 0 replies; 9+ messages in thread
From: Al Viro @ 2025-01-16  5:12 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-fsdevel, daan.j.demeyer

On Wed, Jan 15, 2025 at 08:52:41PM -0800, Boris Burkov wrote:
> So in your opinion, what is the bug here?
> 
> btrfs started using d_path and checking that the device source file was
> in /dev, to avoid putting nonsense like /proc/self/fd/3 into the mount
> table, where it makes userspace fall over. 
>
> (https://bugzilla.suse.com/show_bug.cgi?id=1230641)
> 
> I'd be loathe to call the userspace program hitting the
> 'unshare; open; unshare' sequence buggy, as we don't fail any of the
> syscalls in a particularly sensible way. And if you use unshare -m, you
> now have to vet the program you call doesn't use unshare itself?
> 
> You've taught me that d_path is working as intended in the face of the
> namespace lifetime, so we can't rely on it to produce the "real"
> (original?) path, in general.
> 
> So, to me, that leaves the bug as
> "btrfs shouldn't assume/validate that device files will be in /dev."
> 
> We can do the d_path resolution thing anyway to cover the common case,
> in the bugzilla, but shouldn't fail on something like /loop0 when that
> is what we get out of d_path?

You are asking for a pathname associated with an open file on a mount
that is not within your namespace.  "The path from the root of whatever
namespace it's in starts with /dev" is an odd predicate to check.

Note that the same namespace may have a very different meaning in your
namespace, so...  I'd say that predicate is very likely not doing what
your userland expects anyway.

What is that code trying to do?  is_good_path() looks very... misguided.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible bug with open between unshare(CLONE_NEWNS) calls
  2025-01-15 18:56 Possible bug with open between unshare(CLONE_NEWNS) calls Boris Burkov
  2025-01-16  4:14 ` Al Viro
@ 2025-01-16 10:46 ` Christian Brauner
  2025-01-16 21:09   ` Qu Wenruo
  1 sibling, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2025-01-16 10:46 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-fsdevel, daan.j.demeyer

On Wed, Jan 15, 2025 at 10:56:08AM -0800, Boris Burkov wrote:
> Hello,
> 
> If we run the following C code:
> 
> unshare(CLONE_NEWNS);
> int fd = open("/dev/loop0", O_RDONLY)
> unshare(CLONE_NEWNS);
> 
> Then after the second unshare, the mount hierarchy created by the first
> unshare is fully dereferenced and gets torn down, leaving the file
> pointed to by fd with a broken dentry.
> 
> Specifically, subsequent calls to d_path on its path resolve to
> "/loop0". I was able to confirm this with drgn, and it has caused an
> unexpected failure in mkosi/systemd-repart attempting to mount a btrfs
> filesystem through such an fd, since btrfs uses d_path to resolve the
> source device file path fully.
> 
> I confirmed that this is definitely due to the first unshare mount
> namespace going away by:
> 1. printks/bpftrace the copy_root path in the kernel
> 2. rewriting my test program to fork after the first unshare to keep
> that namespace referenced. In this case, the fd is not broken after the
> second unshare.
> 
> 
> My question is:
> Is this expected behavior with respect to mount reference counts and
> namespace teardown?
> 
> If I mount a filesystem and have a running program with an open file
> descriptor in that filesystem, I would expect unmounting that filesystem
> to fail with EBUSY, so it stands to reason that the automatic unmount
> that happens from tearing down the mount namespace of the first unshare
> should respect similar semantics and either return EBUSY or at least
> have the lazy umount behavior and not wreck the still referenced mount
> objects.
> 
> If this behavior seems like a bug to people better versed in the
> expected behavior of namespaces, I would be happy to work on a fix.

It's expected as Al already said. And is_good_dev_path()
looks pretty hacky...

Wouldn't something like:

bool is_devtmpfs(const struct super_block *sb)
{
        return sb->s_type == &dev_fs_type;
}

and then:

        ret = kern_path(dev_path, 0, &path);
        if (ret)
                goto out;

	if (is_devtmpfs(path->mnt->mnt_sb))
		// something something

be enough? Or do you specifically need to care where devtmpfs is
mounted? The current check means that anything that mounts devtmpfs
somewhere other than /dev would fail that check.

Of course, any standard Linux distribution will mount devtmpfs at /dev
so it probably won't matter in practice. And contains may make /dev a
tmpfs mount and bind-mount device nodes in from the host's devtmpfs so
that would work too with this check.

In other words, I don't get why the /dev prefix check gets you anything?
If you just verify that the device node is located on devtmpfs you
should be good.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible bug with open between unshare(CLONE_NEWNS) calls
  2025-01-16 10:46 ` Christian Brauner
@ 2025-01-16 21:09   ` Qu Wenruo
  2025-01-16 21:29     ` Al Viro
  2025-01-20 15:37     ` Christian Brauner
  0 siblings, 2 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-01-16 21:09 UTC (permalink / raw)
  To: Christian Brauner, Boris Burkov; +Cc: linux-fsdevel, daan.j.demeyer



在 2025/1/16 21:16, Christian Brauner 写道:
> On Wed, Jan 15, 2025 at 10:56:08AM -0800, Boris Burkov wrote:
>> Hello,
>>
>> If we run the following C code:
>>
>> unshare(CLONE_NEWNS);
>> int fd = open("/dev/loop0", O_RDONLY)
>> unshare(CLONE_NEWNS);
>>
>> Then after the second unshare, the mount hierarchy created by the first
>> unshare is fully dereferenced and gets torn down, leaving the file
>> pointed to by fd with a broken dentry.
>>
>> Specifically, subsequent calls to d_path on its path resolve to
>> "/loop0". I was able to confirm this with drgn, and it has caused an
>> unexpected failure in mkosi/systemd-repart attempting to mount a btrfs
>> filesystem through such an fd, since btrfs uses d_path to resolve the
>> source device file path fully.
>>
>> I confirmed that this is definitely due to the first unshare mount
>> namespace going away by:
>> 1. printks/bpftrace the copy_root path in the kernel
>> 2. rewriting my test program to fork after the first unshare to keep
>> that namespace referenced. In this case, the fd is not broken after the
>> second unshare.
>>
>>
>> My question is:
>> Is this expected behavior with respect to mount reference counts and
>> namespace teardown?
>>
>> If I mount a filesystem and have a running program with an open file
>> descriptor in that filesystem, I would expect unmounting that filesystem
>> to fail with EBUSY, so it stands to reason that the automatic unmount
>> that happens from tearing down the mount namespace of the first unshare
>> should respect similar semantics and either return EBUSY or at least
>> have the lazy umount behavior and not wreck the still referenced mount
>> objects.
>>
>> If this behavior seems like a bug to people better versed in the
>> expected behavior of namespaces, I would be happy to work on a fix.
>
> It's expected as Al already said. And is_good_dev_path()
> looks pretty hacky...
>
> Wouldn't something like:
>
> bool is_devtmpfs(const struct super_block *sb)
> {
>          return sb->s_type == &dev_fs_type;
> }
>
> and then:
>
>          ret = kern_path(dev_path, 0, &path);
>          if (ret)
>                  goto out;
>
> 	if (is_devtmpfs(path->mnt->mnt_sb))
> 		// something something
>
> be enough? Or do you specifically need to care where devtmpfs is
> mounted? The current check means that anything that mounts devtmpfs
> somewhere other than /dev would fail that check.

That above checks looks good.

>
> Of course, any standard Linux distribution will mount devtmpfs at /dev
> so it probably won't matter in practice. And contains may make /dev a
> tmpfs mount and bind-mount device nodes in from the host's devtmpfs so
> that would work too with this check.
>
> In other words, I don't get why the /dev prefix check gets you anything?
> If you just verify that the device node is located on devtmpfs you
> should be good.

The original problem is that we can get very weird device path, like
'/proc/<pid>/<fd>' or any blockdev node created by the end user, as
mount source, which can cause various problems in mount_info for end users.

Although after v6.8 it looks like there are some other black magics
involved to prevent such block device being passed in.
I tried the same custom block device node, it always resolves to
"/dev/mapper/test-scratch1" in my case (and not even "/dev/dm-3").


However there is still another problem, related to get_canonical_dev_path().

As it still goes d_path(), it will return the path inside the namespace.
Which can be very different from root namespace.

So I'm wondering if we should even bother the device path resolution at
all inside btrfs?
Or the latest fsconfig API is already resolving the path correctly?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible bug with open between unshare(CLONE_NEWNS) calls
  2025-01-16 21:09   ` Qu Wenruo
@ 2025-01-16 21:29     ` Al Viro
  2025-01-16 21:42       ` Qu Wenruo
  2025-01-20 15:37     ` Christian Brauner
  1 sibling, 1 reply; 9+ messages in thread
From: Al Viro @ 2025-01-16 21:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Christian Brauner, Boris Burkov, linux-fsdevel, daan.j.demeyer

On Fri, Jan 17, 2025 at 07:39:09AM +1030, Qu Wenruo wrote:

> The original problem is that we can get very weird device path, like
> '/proc/<pid>/<fd>' or any blockdev node created by the end user, as
> mount source, which can cause various problems in mount_info for end users.

You do realize that different namespaces may very well have the same
pathname resolve to different things, right?  So "userland can't open
a device pathname it sees in /proc/self/mountinfo" was not going to
be solved that way anyway...

While we are at it, it is entirely possible to have a trimmed-down
ramfs with the minimal set of static device nodes mounted on
/dev in user's namespace, with not a block device in sight - despite
having a bunch of local filesystems mounted.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible bug with open between unshare(CLONE_NEWNS) calls
  2025-01-16 21:29     ` Al Viro
@ 2025-01-16 21:42       ` Qu Wenruo
  0 siblings, 0 replies; 9+ messages in thread
From: Qu Wenruo @ 2025-01-16 21:42 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, Boris Burkov, linux-fsdevel, daan.j.demeyer



在 2025/1/17 07:59, Al Viro 写道:
> On Fri, Jan 17, 2025 at 07:39:09AM +1030, Qu Wenruo wrote:
>
>> The original problem is that we can get very weird device path, like
>> '/proc/<pid>/<fd>' or any blockdev node created by the end user, as
>> mount source, which can cause various problems in mount_info for end users.
>
> You do realize that different namespaces may very well have the same
> pathname resolve to different things, right?  So "userland can't open
> a device pathname it sees in /proc/self/mountinfo" was not going to
> be solved that way anyway...
>
> While we are at it, it is entirely possible to have a trimmed-down
> ramfs with the minimal set of static device nodes mounted on
> /dev in user's namespace, with not a block device in sight - despite
> having a bunch of local filesystems mounted.

So it just means, we will have weird names in mountinfo, and we can only
accept that?

If some weird programs (exactly the one mentioned in 7e06de7c83a7
("btrfs: canonicalize the device path before adding it")) really choose
to do stupid things, we have no way to prevent it from happening, and
can only blame the program?

Thanks,
Qu


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible bug with open between unshare(CLONE_NEWNS) calls
  2025-01-16 21:09   ` Qu Wenruo
  2025-01-16 21:29     ` Al Viro
@ 2025-01-20 15:37     ` Christian Brauner
  1 sibling, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2025-01-20 15:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Boris Burkov, linux-fsdevel, daan.j.demeyer

On Fri, Jan 17, 2025 at 07:39:09AM +1030, Qu Wenruo wrote:
> 
> 
> 在 2025/1/16 21:16, Christian Brauner 写道:
> > On Wed, Jan 15, 2025 at 10:56:08AM -0800, Boris Burkov wrote:
> > > Hello,
> > > 
> > > If we run the following C code:
> > > 
> > > unshare(CLONE_NEWNS);
> > > int fd = open("/dev/loop0", O_RDONLY)
> > > unshare(CLONE_NEWNS);
> > > 
> > > Then after the second unshare, the mount hierarchy created by the first
> > > unshare is fully dereferenced and gets torn down, leaving the file
> > > pointed to by fd with a broken dentry.
> > > 
> > > Specifically, subsequent calls to d_path on its path resolve to
> > > "/loop0". I was able to confirm this with drgn, and it has caused an
> > > unexpected failure in mkosi/systemd-repart attempting to mount a btrfs
> > > filesystem through such an fd, since btrfs uses d_path to resolve the
> > > source device file path fully.
> > > 
> > > I confirmed that this is definitely due to the first unshare mount
> > > namespace going away by:
> > > 1. printks/bpftrace the copy_root path in the kernel
> > > 2. rewriting my test program to fork after the first unshare to keep
> > > that namespace referenced. In this case, the fd is not broken after the
> > > second unshare.
> > > 
> > > 
> > > My question is:
> > > Is this expected behavior with respect to mount reference counts and
> > > namespace teardown?
> > > 
> > > If I mount a filesystem and have a running program with an open file
> > > descriptor in that filesystem, I would expect unmounting that filesystem
> > > to fail with EBUSY, so it stands to reason that the automatic unmount
> > > that happens from tearing down the mount namespace of the first unshare
> > > should respect similar semantics and either return EBUSY or at least
> > > have the lazy umount behavior and not wreck the still referenced mount
> > > objects.
> > > 
> > > If this behavior seems like a bug to people better versed in the
> > > expected behavior of namespaces, I would be happy to work on a fix.
> > 
> > It's expected as Al already said. And is_good_dev_path()
> > looks pretty hacky...
> > 
> > Wouldn't something like:
> > 
> > bool is_devtmpfs(const struct super_block *sb)
> > {
> >          return sb->s_type == &dev_fs_type;
> > }
> > 
> > and then:
> > 
> >          ret = kern_path(dev_path, 0, &path);
> >          if (ret)
> >                  goto out;
> > 
> > 	if (is_devtmpfs(path->mnt->mnt_sb))
> > 		// something something
> > 
> > be enough? Or do you specifically need to care where devtmpfs is
> > mounted? The current check means that anything that mounts devtmpfs
> > somewhere other than /dev would fail that check.
> 
> That above checks looks good.
> 
> > 
> > Of course, any standard Linux distribution will mount devtmpfs at /dev
> > so it probably won't matter in practice. And contains may make /dev a
> > tmpfs mount and bind-mount device nodes in from the host's devtmpfs so
> > that would work too with this check.
> > 
> > In other words, I don't get why the /dev prefix check gets you anything?
> > If you just verify that the device node is located on devtmpfs you
> > should be good.
> 
> The original problem is that we can get very weird device path, like
> '/proc/<pid>/<fd>' or any blockdev node created by the end user, as
> mount source, which can cause various problems in mount_info for end users.
> 
> Although after v6.8 it looks like there are some other black magics
> involved to prevent such block device being passed in.
> I tried the same custom block device node, it always resolves to
> "/dev/mapper/test-scratch1" in my case (and not even "/dev/dm-3").
> 
> 
> However there is still another problem, related to get_canonical_dev_path().
> 
> As it still goes d_path(), it will return the path inside the namespace.
> Which can be very different from root namespace.

I consider the source device as shown in mountinfo a hint and not more.
It's entirely possible that someone does:

mount /dev/sdd4 /mnt
mv /dev/sdd4 /dev/foo
mv /dev/sdd3 /dev/sdd4

That just usually doesn't happen because userspace isn't generally
stupid and devtmpfs device nodes are mostly managed by the kernel.

But in containers you can easily have an equivalent scenario and
whatever shows up in /dev in the container can be something completely
different than what you see in mountinfo.

So really, relying on the path name is overall pretty useless.

These mismatches between mountinfo and whatever is in /dev isn't
anything new. Containers may use devpts devices that belong to the host
devpts instance while also having a separate devpts instance mounted in
the container. And the device names just accidently match.

The container/host needs to take care to validate that the provided pty
device it uses does actually belong to the devpts instance of the
container/host and not to another instance to avoid being tricked into
opening and allocating device nodes in another devpts instance.

> 
> So I'm wondering if we should even bother the device path resolution at
> all inside btrfs?

I think that's misguided and will always lead to weird issues.

> Or the latest fsconfig API is already resolving the path correctly?

It wouldn't help with the above examples.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-01-20 15:37 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-15 18:56 Possible bug with open between unshare(CLONE_NEWNS) calls Boris Burkov
2025-01-16  4:14 ` Al Viro
2025-01-16  4:52   ` Boris Burkov
2025-01-16  5:12     ` Al Viro
2025-01-16 10:46 ` Christian Brauner
2025-01-16 21:09   ` Qu Wenruo
2025-01-16 21:29     ` Al Viro
2025-01-16 21:42       ` Qu Wenruo
2025-01-20 15:37     ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox