From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Christian Brauner <brauner@kernel.org>,
Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
"Eric W. Biederman" <ebiederm@xmission.com>,
David Ahern <dsahern@kernel.org>
Subject: Re: Persisting mounts between 'ip netns' invocations
Date: Thu, 28 Sep 2023 20:21:28 +0200
Message-ID: <87il7ucg5z.fsf@toke.dk>
In-Reply-To: <20230928-geldbeschaffung-gekehrt-81ed7fba768d@brauner>
Christian Brauner <brauner@kernel.org> writes:
> On Thu, Sep 28, 2023 at 11:54:23AM +0200, Nicolas Dichtel wrote:
>> + Eric
>>
>> On 28/09/2023 at 10:29, Toke Høiland-Jørgensen wrote:
>> > Hi everyone
>> >
>> > I recently ran into this problem again, and so I figured I'd ask if
>> > anyone has any good idea how to solve it:
>> >
>> > When running a command through 'ip netns exec', iproute2 will
>> > "helpfully" create a new mount namespace and remount /sys inside it,
>> > AFAICT to make sure /sys/class/net/* refers to the right devices inside
>> > the namespace. This makes sense, but unfortunately it has the side
>> > effect that no mount commands executed inside the ns persist. In
>> > particular, this makes it difficult to work with bpffs; even when
>> > mounting a bpffs inside the ns, it will disappear along with the
>> > namespace as soon as the process exits.
>> >
>> > To illustrate:
>> >
>> > # ip netns exec <nsname> bpftool map pin id 2 /sys/fs/bpf/mymap
>> > # ip netns exec <nsname> ls /sys/fs/bpf
>> > <nothing>
>> >
>> > This happens because namespaces are cleaned up as soon as they have no
>> > processes, unless they are persisted by some other means. For the
>> > network namespace itself, iproute2 will bind mount /proc/self/ns/net to
>> > /var/run/netns/<nsname> (in the root mount namespace) to persist the
>> > namespace. I tried implementing something similar for the mount
>> > namespace, but that doesn't work; I can't manually bind mount the 'mnt'
>> > ns reference either:
>> >
>> > # mount -o bind /proc/104444/ns/mnt /var/run/netns/mnt/testns
>> > mount: /run/netns/mnt/testns: wrong fs type, bad option, bad superblock on /proc/104444/ns/mnt, missing codepage or helper program, or other error.
>> > dmesg(1) may have more information after failed mount system call.
>> >
>> > When running strace on that mount command, it seems the move_mount()
>> > syscall returns EINVAL, which, AFAICT, is because the mount namespace
>> > file references itself as its namespace, which means it can't be
>> > bind-mounted into the containing mount namespace.
>> >
>> > So, my question is, how to overcome this limitation? I know it's
>> > possible to get a reference to the namespace of a running process, but
>> > there is no guarantee there are any processes running inside the
>> > namespace (hence the persisting bind mount for the netns). So is there
>> > some other way to persist the mount namespace reference, so we can pick
>> > it back up on the next 'ip netns' invocation?
>> >
>> > Hoping someone has a good idea :)
>> We ran into similar problems. The only solution we found was to use nsenter
>> instead of 'ip netns exec'.
>>
>> To bind mount a mount namespace onto a file, the directory containing that file
>> must be a private mount. For example:
>>
>> mkdir -p /run/foo
>> mount --make-rshared /
>> mount --bind /run/foo /run/foo
>> mount --make-private /run/foo
>> touch /run/foo/ns
>> unshare --mount --propagation=slave -- sh -c 'yes $$ 2>/dev/null' | {
>>     read -r pid &&
>>     mount --bind /proc/$pid/ns/mnt /run/foo/ns
>> }
>> nsenter --mount=/run/foo/ns ls /
>>
>> But this doesn't work under 'ip netns exec'.
>
> Afaiu, each ip netns exec invocation allocates a new mount namespace.
> If you run multiple concurrent ip netns exec commands and leave them
> around then they all get a separate mount namespace. Not sure what the
> design behind that was. So even if you could persist the mount namespace
> of one there's still no way for ip netns exec to pick that up iiuc.
>
> So imho, the solution is to change ip netns exec to persist a mount
> namespace and network namespace pair. unshare does this easily via:
>
> sudo mkdir /run/mntns
> sudo mount --bind /run/mntns /run/mntns
> sudo mount --make-slave /run/mntns
>
> sudo mkdir /run/netns
>
> sudo touch /run/mntns/mnt1
> sudo touch /run/netns/net1
>
> sudo unshare --mount=/run/mntns/mnt1 --net=/run/netns/net1 true
>
> So I'd probably patch iproute2.
Patching iproute2 is what I'm trying to do - sorry if that wasn't clear :)
However, I couldn't get it to work. I think it's probably because I was
missing the bind-to-self/--make-slave dance on the containing folder, as
Nicolas pointed out. Will play around with that a bit more, thanks for
the pointers both of you!
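
For reference, a rough sketch of how the two suggestions above might fit
together, outside of iproute2. Everything named here is an assumption for
illustration: the /run/mntns directory, the persist_ns and enter_ns helper
names, and the DRY_RUN switch are hypothetical, and I picked --make-private
(as in Nicolas's example) rather than --make-slave. It only relies on the
--mount=<file> and --net=<file> options of unshare(1) and nsenter(1):

```shell
#!/bin/sh
# Sketch only: persist a mount namespace next to a named network namespace,
# then re-enter both later. Directory layout and helper names are hypothetical.

MNTNS_DIR=${MNTNS_DIR:-/run/mntns}
NETNS_DIR=${NETNS_DIR:-/run/netns}

# With DRY_RUN=1, print each command instead of running it (no root needed).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# persist_ns NAME: create a mount/net namespace pair and pin both to files.
# Intended to be called once per name; calling it again stacks another
# bind mount on the directory.
persist_ns() {
    name=$1
    run mkdir -p "$MNTNS_DIR" "$NETNS_DIR"
    # Bind the directory onto itself and stop propagation into it, so that
    # a mount namespace file can be bind-mounted inside it.
    run mount --bind "$MNTNS_DIR" "$MNTNS_DIR"
    run mount --make-private "$MNTNS_DIR"
    # The bind-mount targets must exist as regular files.
    run touch "$MNTNS_DIR/$name" "$NETNS_DIR/$name"
    # unshare bind-mounts the new namespaces onto the given files before
    # the command ("true") exits, which keeps them alive afterwards.
    run unshare --mount="$MNTNS_DIR/$name" --net="$NETNS_DIR/$name" true
}

# enter_ns NAME CMD...: run CMD inside the persisted namespace pair.
enter_ns() {
    name=$1
    shift
    run nsenter --mount="$MNTNS_DIR/$name" --net="$NETNS_DIR/$name" "$@"
}
```

With something like this, `persist_ns testns` followed by repeated
`enter_ns testns <cmd>` calls should all see the same mount namespace, so a
bpffs mounted in one invocation would still be there in the next. Unlike
'ip netns exec', nsenter does not remount /sys, so that would have to be done
once inside the persisted namespace.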
-Toke