* Persisting mounts between 'ip netns' invocations
@ 2023-09-28 8:29 Toke Høiland-Jørgensen
2023-09-28 9:54 ` Nicolas Dichtel
0 siblings, 1 reply; 9+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-09-28 8:29 UTC (permalink / raw)
To: netdev, bpf; +Cc: David Ahern, Christian Brauner
Hi everyone
I recently ran into this problem again, and so I figured I'd ask if
anyone has any good idea how to solve it:
When running a command through 'ip netns exec', iproute2 will
"helpfully" create a new mount namespace and remount /sys inside it,
AFAICT to make sure /sys/class/net/* refers to the right devices inside
the namespace. This makes sense, but unfortunately it has the side
effect that no mount commands executed inside the ns persist. In
particular, this makes it difficult to work with bpffs; even when
mounting a bpffs inside the ns, it will disappear along with the
namespace as soon as the process exits.
To illustrate:
  # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
  # ip netns exec <nsname> ls /sys/fs/bpf
  <nothing>
This happens because namespaces are cleaned up as soon as they have no
processes, unless they are persisted by some other means. For the
network namespace itself, iproute2 will bind mount /proc/self/ns/net to
/var/run/netns/<nsname> (in the root mount namespace) to persist the
namespace. I tried implementing something similar for the mount
namespace, but that doesn't work; I can't manually bind mount the 'mnt'
ns reference either:
  # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
  mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
         dmesg(1) may have more information after failed mount system call.
When running strace on that mount command, it seems the move_mount()
syscall returns EINVAL, which, AFAICT, is because the mount namespace
file references itself as its namespace, which means it can't be
bind-mounted into the containing mount namespace.
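Not part of the original message: a minimal sketch for checking which mount namespace a command ends up in, assuming a Linux /proc that exposes the ns/mnt links. The helper names mntns_id and same_mntns are made up for illustration.

```shell
#!/bin/sh
# Print the mount-namespace identity of a process, e.g. "mnt:[4026531841]".
# Takes a PID or "self" (the default); relies only on /proc.
mntns_id() {
    readlink "/proc/${1:-self}/ns/mnt"
}

# Succeed only if both identity strings are non-empty and equal,
# i.e. they name the same mount namespace.
same_mntns() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}
```

Run as root, with a hypothetical network namespace called testns:

  outer=$(mntns_id self)
  inner=$(ip netns exec testns readlink /proc/self/ns/mnt)
  same_mntns "$outer" "$inner" || echo "separate mount namespace"

The comparison is against the caller's own namespace on purpose: the identifier of a namespace that has already been freed may be reused, so comparing two sequential invocations with each other is not a reliable check.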
So, my question is, how to overcome this limitation? I know it's
possible to get a reference to the namespace of a running process, but
there is no guarantee there are any processes running inside the
namespace (hence the persisting bind mount for the netns). So is there
some other way to persist the mount namespace reference, so we can pick
it back up on the next 'ip netns' invocation?
Hoping someone has a good idea :)
-Toke
* Re: Persisting mounts between 'ip netns' invocations
2023-09-28 8:29 Persisting mounts between 'ip netns' invocations Toke Høiland-Jørgensen
@ 2023-09-28 9:54 ` Nicolas Dichtel
2023-09-28 16:17 ` Christian Brauner
2023-09-29 15:00 ` Eric W. Biederman
0 siblings, 2 replies; 9+ messages in thread
From: Nicolas Dichtel @ 2023-09-28 9:54 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, netdev, bpf, Eric W. Biederman
Cc: David Ahern, Christian Brauner
+ Eric
On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
> Hi everyone
>
> I recently ran into this problem again, and so I figured I'd ask if
> anyone has any good idea how to solve it:
>
> When running a command through 'ip netns exec', iproute2 will
> "helpfully" create a new mount namespace and remount /sys inside it,
> AFAICT to make sure /sys/class/net/* refers to the right devices inside
> the namespace. This makes sense, but unfortunately it has the side
> effect that no mount commands executed inside the ns persist. In
> particular, this makes it difficult to work with bpffs; even when
> mounting a bpffs inside the ns, it will disappear along with the
> namespace as soon as the process exits.
>
> To illustrate:
>
> # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
> # ip netns exec <nsname> ls /sys/fs/bpf
> <nothing>
>
> This happens because namespaces are cleaned up as soon as they have no
> processes, unless they are persisted by some other means. For the
> network namespace itself, iproute2 will bind mount /proc/self/ns/net to
> /var/run/netns/<nsname> (in the root mount namespace) to persist the
> namespace. I tried implementing something similar for the mount
> namespace, but that doesn't work; I can't manually bind mount the 'mnt'
> ns reference either:
>
> # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
> mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
> dmesg(1) may have more information after failed mount system call.
>
> When running strace on that mount command, it seems the move_mount()
> syscall returns EINVAL, which, AFAICT, is because the mount namespace
> file references itself as its namespace, which means it can't be
> bind-mounted into the containing mount namespace.
>
> So, my question is, how to overcome this limitation? I know it's
> possible to get a reference to the namespace of a running process, but
> there is no guarantee there is any processes running inside the
> namespace (hence the persisting bind mount for the netns). So is there
> some other way to persist the mount namespace reference, so we can pick
> it back up on the next 'ip netns' invocation?
>
> Hoping someone has a good idea :)
We ran into similar problems. The only solution we found was to use nsenter
instead of 'ip netns exec'.
To be able to bind mount a mount namespace on a file, the directory of this file
should be private. For example:
  mkdir -p /run/foo
  mount --make-rshared /
  mount --bind /run/foo /run/foo
  mount --make-private /run/foo
  touch /run/foo/ns
  unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
      read -r pid &&
          mount --bind /proc/$pid/ns/mnt /run/foo/ns
  }
  nsenter --mount=/run/foo/ns ls /
But this doesn't work under 'ip netns exec'.
Regards,
Nicolas
* Re: Persisting mounts between 'ip netns' invocations
2023-09-28 9:54 ` Nicolas Dichtel
@ 2023-09-28 16:17 ` Christian Brauner
2023-09-28 18:21 ` Toke Høiland-Jørgensen
2023-09-29 15:00 ` Eric W. Biederman
1 sibling, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2023-09-28 16:17 UTC (permalink / raw)
To: Nicolas Dichtel
Cc: Toke Høiland-Jørgensen, netdev, bpf, Eric W. Biederman,
David Ahern
On Thu, Sep 28, 2023 at 11:54:23AM +0200, Nicolas Dichtel wrote:
> + Eric
>
> On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
> > Hi everyone
> >
> > I recently ran into this problem again, and so I figured I'd ask if
> > anyone has any good idea how to solve it:
> >
> > When running a command through 'ip netns exec', iproute2 will
> > "helpfully" create a new mount namespace and remount /sys inside it,
> > AFAICT to make sure /sys/class/net/* refers to the right devices inside
> > the namespace. This makes sense, but unfortunately it has the side
> > effect that no mount commands executed inside the ns persist. In
> > particular, this makes it difficult to work with bpffs; even when
> > mounting a bpffs inside the ns, it will disappear along with the
> > namespace as soon as the process exits.
> >
> > To illustrate:
> >
> > # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
> > # ip netns exec <nsname> ls /sys/fs/bpf
> > <nothing>
> >
> > This happens because namespaces are cleaned up as soon as they have no
> > processes, unless they are persisted by some other means. For the
> > network namespace itself, iproute2 will bind mount /proc/self/ns/net to
> > /var/run/netns/<nsname> (in the root mount namespace) to persist the
> > namespace. I tried implementing something similar for the mount
> > namespace, but that doesn't work; I can't manually bind mount the 'mnt'
> > ns reference either:
> >
> > # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
> > mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
> > dmesg(1) may have more information after failed mount system call.
> >
> > When running strace on that mount command, it seems the move_mount()
> > syscall returns EINVAL, which, AFAICT, is because the mount namespace
> > file references itself as its namespace, which means it can't be
> > bind-mounted into the containing mount namespace.
> >
> > So, my question is, how to overcome this limitation? I know it's
> > possible to get a reference to the namespace of a running process, but
> > there is no guarantee there is any processes running inside the
> > namespace (hence the persisting bind mount for the netns). So is there
> > some other way to persist the mount namespace reference, so we can pick
> > it back up on the next 'ip netns' invocation?
> >
> > Hoping someone has a good idea :)
> We ran into similar problems. The only solution we found was to use nsenter
> instead of 'ip netns exec'.
>
> To be able to bind mount a mount namespace on a file, the directory of this file
> should be private. For example:
>
> mkdir -p /run/foo
> mount --make-rshared /
> mount --bind /run/foo /run/foo
> mount --make-private /run/foo
> touch /run/foo/ns
> unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
> read -r pid &&
> mount --bind /proc/$pid/ns/mnt /run/foo/ns
> }
> nsenter --mount=/run/foo/ns ls /
>
> But this doesn't work under 'ip netns exec'.
Afaiu, each ip netns exec invocation allocates a new mount namespace.
If you run multiple concurrent ip netns exec commands and leave them
around then they all get a separate mount namespace. Not sure what the
design behind that was. So even if you could persist the mount namespace
of one there's still no way for ip netns exec to pick that up iiuc.
So imho, the solution is to change ip netns exec to persist a mount
namespace and network namespace pair. unshare does this easily via:
  sudo mkdir /run/mntns
  sudo mount --bind /run/mntns /run/mntns
  sudo mount --make-slave /run/mntns

  sudo mkdir /run/netns

  sudo touch /run/mntns/mnt1
  sudo touch /run/netns/net1

  sudo unshare --mount=/run/mntns/mnt1 --net=/run/netns/net1 true
So I'd probably patch iproute2.
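As a rough sketch of how a pair persisted this way could be picked up again, assuming the reference files from the unshare example above exist, something like the following could wrap nsenter. The helper name in_ns_pair and the MNT_REF, NET_REF and DRY_RUN variables are hypothetical; actually entering the namespaces needs root.

```shell
#!/bin/sh
# Hypothetical helper: run a command inside a persisted mount/net
# namespace pair created with "unshare --mount=FILE --net=FILE".
# The default paths follow the unshare example; override via environment.
MNT_REF=${MNT_REF:-/run/mntns/mnt1}
NET_REF=${NET_REF:-/run/netns/net1}

in_ns_pair() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only print the command line that would be executed.
        printf '%s\n' "nsenter --mount=$MNT_REF --net=$NET_REF -- $*"
    else
        nsenter "--mount=$MNT_REF" "--net=$NET_REF" -- "$@"
    fi
}
```

With that, the bpffs case from the start of the thread would presumably look like this:

  in_ns_pair mount -t bpf bpf /sys/fs/bpf
  in_ns_pair bpftool map pin id 2 /sys/fs/bpf/mymap
  in_ns_pair ls /sys/fs/bpf

Unlike 'ip netns exec', this sketch does not remount /sys, so /sys/class/net inside the pair would need separate handling.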
* Re: Persisting mounts between 'ip netns' invocations
2023-09-28 16:17 ` Christian Brauner
@ 2023-09-28 18:21 ` Toke Høiland-Jørgensen
2023-09-29 8:26 ` Nicolas Dichtel
0 siblings, 1 reply; 9+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-09-28 18:21 UTC (permalink / raw)
To: Christian Brauner, Nicolas Dichtel
Cc: netdev, bpf, Eric W. Biederman, David Ahern
Christian Brauner <brauner@kernel.org> writes:
> On Thu, Sep 28, 2023 at 11:54:23AM +0200, Nicolas Dichtel wrote:
>> + Eric
>>
>> On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
>> > Hi everyone
>> >
>> > I recently ran into this problem again, and so I figured I'd ask if
>> > anyone has any good idea how to solve it:
>> >
>> > When running a command through 'ip netns exec', iproute2 will
>> > "helpfully" create a new mount namespace and remount /sys inside it,
>> > AFAICT to make sure /sys/class/net/* refers to the right devices inside
>> > the namespace. This makes sense, but unfortunately it has the side
>> > effect that no mount commands executed inside the ns persist. In
>> > particular, this makes it difficult to work with bpffs; even when
>> > mounting a bpffs inside the ns, it will disappear along with the
>> > namespace as soon as the process exits.
>> >
>> > To illustrate:
>> >
>> > # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
>> > # ip netns exec <nsname> ls /sys/fs/bpf
>> > <nothing>
>> >
>> > This happens because namespaces are cleaned up as soon as they have no
>> > processes, unless they are persisted by some other means. For the
>> > network namespace itself, iproute2 will bind mount /proc/self/ns/net to
>> > /var/run/netns/<nsname> (in the root mount namespace) to persist the
>> > namespace. I tried implementing something similar for the mount
>> > namespace, but that doesn't work; I can't manually bind mount the 'mnt'
>> > ns reference either:
>> >
>> > # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
>> > mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
>> > dmesg(1) may have more information after failed mount system call.
>> >
>> > When running strace on that mount command, it seems the move_mount()
>> > syscall returns EINVAL, which, AFAICT, is because the mount namespace
>> > file references itself as its namespace, which means it can't be
>> > bind-mounted into the containing mount namespace.
>> >
>> > So, my question is, how to overcome this limitation? I know it's
>> > possible to get a reference to the namespace of a running process, but
>> > there is no guarantee there is any processes running inside the
>> > namespace (hence the persisting bind mount for the netns). So is there
>> > some other way to persist the mount namespace reference, so we can pick
>> > it back up on the next 'ip netns' invocation?
>> >
>> > Hoping someone has a good idea :)
>> We ran into similar problems. The only solution we found was to use nsenter
>> instead of 'ip netns exec'.
>>
>> To be able to bind mount a mount namespace on a file, the directory of this file
>> should be private. For example:
>>
>> mkdir -p /run/foo
>> mount --make-rshared /
>> mount --bind /run/foo /run/foo
>> mount --make-private /run/foo
>> touch /run/foo/ns
>> unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
>> read -r pid &&
>> mount --bind /proc/$pid/ns/mnt /run/foo/ns
>> }
>> nsenter --mount=/run/foo/ns ls /
>>
>> But this doesn't work under 'ip netns exec'.
>
> Afaiu, each ip netns exec invocation allocates a new mount namespace.
> If you run multiple concurrent ip netns exec command and leave them
> around then they all get a separate mount namespace. Not sure what the
> design behind that was. So even if you could persist the mount namespace
> of one there's still no way for ip netns exec to pick that up iiuc.
>
> So imho, the solution is to change ip netns exec to persist a mount
> namespace and netns namespace pair. unshare does this easily via:
>
> sudo mkdir /run/mntns
> sudo mount --bind /run/mntns /run/mntns
> sudo mount --make-slave /run/mntns
>
> sudo mkdir /run/netns
>
> sudo touch /run/mntns/mnt1
> sudo touch /run/netns/net1
>
> sudo unshare --mount=/run/mntns/mnt1 --net=/run/netns/net1 true
>
> So I'd probably patch iproute2.
Patching iproute2 is what I'm trying to do - sorry if that wasn't clear :)
However, I couldn't get it to work. I think it's probably because I was
missing the bind-to-self/--make-slave dance on the containing folder, as
Nicolas pointed out. Will play around with that a bit more, thanks for
the pointers both of you!
-Toke
* Re: Persisting mounts between 'ip netns' invocations
2023-09-28 18:21 ` Toke Høiland-Jørgensen
@ 2023-09-29 8:26 ` Nicolas Dichtel
2023-09-29 9:25 ` Christian Brauner
0 siblings, 1 reply; 9+ messages in thread
From: Nicolas Dichtel @ 2023-09-29 8:26 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Christian Brauner
Cc: netdev, bpf, Eric W. Biederman, David Ahern
On 28/09/2023 at 20:21, Toke Høiland-Jørgensen wrote:
> Christian Brauner <brauner@kernel.org> writes:
>
>> On Thu, Sep 28, 2023 at 11:54:23AM +0200, Nicolas Dichtel wrote:
>>> + Eric
>>>
>>> On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
>>>> Hi everyone
>>>>
>>>> I recently ran into this problem again, and so I figured I'd ask if
>>>> anyone has any good idea how to solve it:
>>>>
>>>> When running a command through 'ip netns exec', iproute2 will
>>>> "helpfully" create a new mount namespace and remount /sys inside it,
>>>> AFAICT to make sure /sys/class/net/* refers to the right devices inside
>>>> the namespace. This makes sense, but unfortunately it has the side
>>>> effect that no mount commands executed inside the ns persist. In
>>>> particular, this makes it difficult to work with bpffs; even when
>>>> mounting a bpffs inside the ns, it will disappear along with the
>>>> namespace as soon as the process exits.
>>>>
>>>> To illustrate:
>>>>
>>>> # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
>>>> # ip netns exec <nsname> ls /sys/fs/bpf
>>>> <nothing>
>>>>
>>>> This happens because namespaces are cleaned up as soon as they have no
>>>> processes, unless they are persisted by some other means. For the
>>>> network namespace itself, iproute2 will bind mount /proc/self/ns/net to
>>>> /var/run/netns/<nsname> (in the root mount namespace) to persist the
>>>> namespace. I tried implementing something similar for the mount
>>>> namespace, but that doesn't work; I can't manually bind mount the 'mnt'
>>>> ns reference either:
>>>>
>>>> # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
>>>> mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
>>>> dmesg(1) may have more information after failed mount system call.
>>>>
>>>> When running strace on that mount command, it seems the move_mount()
>>>> syscall returns EINVAL, which, AFAICT, is because the mount namespace
>>>> file references itself as its namespace, which means it can't be
>>>> bind-mounted into the containing mount namespace.
>>>>
>>>> So, my question is, how to overcome this limitation? I know it's
>>>> possible to get a reference to the namespace of a running process, but
>>>> there is no guarantee there is any processes running inside the
>>>> namespace (hence the persisting bind mount for the netns). So is there
>>>> some other way to persist the mount namespace reference, so we can pick
>>>> it back up on the next 'ip netns' invocation?
>>>>
>>>> Hoping someone has a good idea :)
>>> We ran into similar problems. The only solution we found was to use nsenter
>>> instead of 'ip netns exec'.
>>>
>>> To be able to bind mount a mount namespace on a file, the directory of this file
>>> should be private. For example:
>>>
>>> mkdir -p /run/foo
>>> mount --make-rshared /
>>> mount --bind /run/foo /run/foo
>>> mount --make-private /run/foo
>>> touch /run/foo/ns
>>> unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
>>> read -r pid &&
>>> mount --bind /proc/$pid/ns/mnt /run/foo/ns
>>> }
>>> nsenter --mount=/run/foo/ns ls /
>>>
>>> But this doesn't work under 'ip netns exec'.
>>
>> Afaiu, each ip netns exec invocation allocates a new mount namespace.
>> If you run multiple concurrent ip netns exec command and leave them
>> around then they all get a separate mount namespace. Not sure what the
>> design behind that was. So even if you could persist the mount namespace
>> of one there's still no way for ip netns exec to pick that up iiuc.
>>
>> So imho, the solution is to change ip netns exec to persist a mount
>> namespace and netns namespace pair. unshare does this easily via:
>>
>> sudo mkdir /run/mntns
>> sudo mount --bind /run/mntns /run/mntns
>> sudo mount --make-slave /run/mntns
>>
>> sudo mkdir /run/netns
>>
>> sudo touch /run/mntns/mnt1
>> sudo touch /run/netns/net1
>>
>> sudo unshare --mount=/run/mntns/mnt1 --net=/run/netns/net1 true
I fear that creating a new mount ns for each net ns will introduce more problems.
>>
>> So I'd probably patch iproute2.
>
> Patching iproute2 is what I'm trying to do - sorry if that wasn't clear :)
>
> However, I couldn't get it to work. I think it's probably because I was
> missing the bind-to-self/--make-slave dance on the containing folder, as
> Nicolas pointed out. Will play around with that a bit more, thanks for
> the pointers both of you!
The fundamental problem is that the remount of /sys should not be propagated to
the parent mount ns (and in fact the /etc remount also).
You will have to choose between 'propagating the new mount points to the parent
mount ns' and 'having the right view of /sys (ie the /sys corresponding to the
current netns)'.
Maybe this could be done via a new command, something like 'ip netns light-exec'
(which will be equivalent to 'nsenter --net=/run/netns/foo').
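To make the idea concrete, here is a hedged sketch of what such a command could boil down to as a shell function. 'light-exec' does not exist in iproute2; the function name netns_light_exec and the NETNS_DIR variable are invented for this illustration, and the nsenter call itself needs root.

```shell
#!/bin/sh
# Hypothetical stand-in for the proposed "ip netns light-exec": join only
# the network namespace and leave mounts (including /sys) untouched.
# NETNS_DIR defaults to the /run/netns path mentioned above.
NETNS_DIR=${NETNS_DIR:-/run/netns}

netns_light_exec() {
    if [ "$#" -lt 2 ]; then
        echo "usage: netns_light_exec NAME COMMAND [ARGS...]" >&2
        return 2
    fi
    name=$1
    shift
    if [ ! -e "$NETNS_DIR/$name" ]; then
        echo "no such netns reference: $NETNS_DIR/$name" >&2
        return 1
    fi
    # No new mount namespace is created, so mounts made by COMMAND land
    # in the caller's mount namespace and persist after it exits.
    nsenter "--net=$NETNS_DIR/$name" -- "$@"
}
```

For example, 'netns_light_exec foo ip link' would list the devices of netns foo, while /sys/class/net would still show the caller's view.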
FWIW, here is a nice doc about mount subtleties:
https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
Regards,
Nicolas
* Re: Persisting mounts between 'ip netns' invocations
2023-09-29 8:26 ` Nicolas Dichtel
@ 2023-09-29 9:25 ` Christian Brauner
2023-09-29 9:45 ` Nicolas Dichtel
0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2023-09-29 9:25 UTC (permalink / raw)
To: Nicolas Dichtel
Cc: Toke Høiland-Jørgensen, netdev, bpf, Eric W. Biederman,
David Ahern
> I fear that creating a new mount ns for each net ns will introduce more problems.
Not sure if we're talking past each other but that is what's happening
now. Each new ip netns exec invocation will allocate a _new_ mount
namespace. In other words, if you have 300 ip netns exec commands
running then there will be 300 individual mount namespaces active.
What I tried to say is that ip netns exec could be changed to
_optionally_ allocate a prepared mount namespace that is shared between
ip netns exec commands. And yeah, that would need to be a new command
line addition to ip netns exec.
* Re: Persisting mounts between 'ip netns' invocations
2023-09-29 9:25 ` Christian Brauner
@ 2023-09-29 9:45 ` Nicolas Dichtel
2023-09-29 21:23 ` David Laight
0 siblings, 1 reply; 9+ messages in thread
From: Nicolas Dichtel @ 2023-09-29 9:45 UTC (permalink / raw)
To: Christian Brauner
Cc: Toke Høiland-Jørgensen, netdev, bpf, Eric W. Biederman,
David Ahern
On 29/09/2023 at 11:25, Christian Brauner wrote:
>> I fear that creating a new mount ns for each net ns will introduce more problems.
>
> Not sure if we're talking past each other but that is what's happening
> now. Each new ip netns exec invocation will allocate a _new_ mount
> namespace. In other words, if you have 300 ip netns exec commands
> running then there will be 300 individual mount namespaces active.
>
> What I tried to say is that ip netns exec could be changed to
> _optionally_ allocate a prepared mount namespace that is shared between
> ip netns exec commands. And yeah, that would need to be a new command
> line addition to ip netns exec.
Ok, you talked about changing 'ip netns exec', not adding an option, thus I
thought that you suggested adding this unconditionally ;-)
I was asking myself how to propagate mount points between the parent and 'ip
netns exec' (both ways), but this may be a different use case from Toke's.
Regards,
Nicolas
* Re: Persisting mounts between 'ip netns' invocations
2023-09-28 9:54 ` Nicolas Dichtel
2023-09-28 16:17 ` Christian Brauner
@ 2023-09-29 15:00 ` Eric W. Biederman
1 sibling, 0 replies; 9+ messages in thread
From: Eric W. Biederman @ 2023-09-29 15:00 UTC (permalink / raw)
To: Nicolas Dichtel
Cc: Toke Høiland-Jørgensen, netdev, bpf, David Ahern,
Christian Brauner
Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
> + Eric
>
> On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
>> Hi everyone
>>
>> I recently ran into this problem again, and so I figured I'd ask if
>> anyone has any good idea how to solve it:
>>
>> When running a command through 'ip netns exec', iproute2 will
>> "helpfully" create a new mount namespace and remount /sys inside it,
>> AFAICT to make sure /sys/class/net/* refers to the right devices inside
>> the namespace. This makes sense, but unfortunately it has the side
>> effect that no mount commands executed inside the ns persist. In
>> particular, this makes it difficult to work with bpffs; even when
>> mounting a bpffs inside the ns, it will disappear along with the
>> namespace as soon as the process exits.
>>
>> To illustrate:
>>
>> # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
>> # ip netns exec <nsname> ls /sys/fs/bpf
>> <nothing>
>>
>> This happens because namespaces are cleaned up as soon as they have no
>> processes, unless they are persisted by some other means. For the
>> network namespace itself, iproute2 will bind mount /proc/self/ns/net to
>> /var/run/netns/<nsname> (in the root mount namespace) to persist the
>> namespace. I tried implementing something similar for the mount
>> namespace, but that doesn't work; I can't manually bind mount the 'mnt'
>> ns reference either:
>>
>> # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
>> mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
>> dmesg(1) may have more information after failed mount system call.
>>
>> When running strace on that mount command, it seems the move_mount()
>> syscall returns EINVAL, which, AFAICT, is because the mount namespace
>> file references itself as its namespace, which means it can't be
>> bind-mounted into the containing mount namespace.
>>
>> So, my question is, how to overcome this limitation? I know it's
>> possible to get a reference to the namespace of a running process, but
>> there is no guarantee there is any processes running inside the
>> namespace (hence the persisting bind mount for the netns). So is there
>> some other way to persist the mount namespace reference, so we can pick
>> it back up on the next 'ip netns' invocation?
>>
>> Hoping someone has a good idea :)
> We ran into similar problems. The only solution we found was to use nsenter
> instead of 'ip netns exec'.
>
> To be able to bind mount a mount namespace on a file, the directory of this file
> should be private. For example:
>
> mkdir -p /run/foo
> mount --make-rshared /
> mount --bind /run/foo /run/foo
> mount --make-private /run/foo
> touch /run/foo/ns
> unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
> read -r pid &&
> mount --bind /proc/$pid/ns/mnt /run/foo/ns
> }
> nsenter --mount=/run/foo/ns ls /
>
> But this doesn't work under 'ip netns exec'.
My goal in writing "ip netns exec" was to be a compatibility layer for
applications that are not aware of multiple network namespaces.
My gut says to recommend you stop using the compatibility shim and have
your applications become network namespace aware (as it appears they
already partially are).
Beyond that I can not give advice unless I understand why you are
attempting to persist mounts that depend upon the network namespace.
Eric
* RE: Persisting mounts between 'ip netns' invocations
2023-09-29 9:45 ` Nicolas Dichtel
@ 2023-09-29 21:23 ` David Laight
0 siblings, 0 replies; 9+ messages in thread
From: David Laight @ 2023-09-29 21:23 UTC (permalink / raw)
To: 'nicolas.dichtel@6wind.com', Christian Brauner
Cc: Toke Høiland-Jørgensen, netdev@vger.kernel.org,
bpf@vger.kernel.org, Eric W. Biederman, David Ahern
From: Nicolas Dichtel
> Sent: 29 September 2023 10:46
>
> On 29/09/2023 at 11:25, Christian Brauner wrote:
> >> I fear that creating a new mount ns for each net ns will introduce more problems.
> >
> > Not sure if we're talking past each other but that is what's happening
> > now. Each new ip netns exec invocation will allocate a _new_ mount
> > namespace. In other words, if you have 300 ip netns exec commands
> > running then there will be 300 individual mount namespaces active.
> >
> > What I tried to say is that ip netns exec could be changed to
> > _optionally_ allocate a prepared mount namespace that is shared between
> > ip netns exec commands. And yeah, that would need to be a new command
> > line addition to ip netns exec.
>
> Ok, you talked about changing 'ip netns exec', not adding an option, thus I
> thought that you suggested adding this unconditionally ;-)
>
> I was asking myself how to propagate mount points between the parent and 'ip
> netns exec' (both way), but this may be another use case than Toke's use case.
I had a different problem.
I have a system with two network namespaces (to separate public and
private network data) and some programs that really want to read
the '/sys/class/net' nodes in both namespaces - so I'd like to
have the 'namespace' copy mounted at the same time on a different
mount point.
But I'm not at all sure that is possible.
(It is possible to pass an open directory fd through 'ip netns exec'
but that is all a bit kludgy.)
Clearing all the mounts also 'lost' the root of a chroot (if not
an actual mount point) causing 'pwd -P' (etc) to generate the full
path instead of the chroot-relative one.
David