All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Ahern <dsahern@gmail.com>,
	Stephen Hemminger <stephen@networkplumber.org>,
	netdev@vger.kernel.org,
	Nicolas Dichtel <nicolas.dichtel@6wind.com>,
	Christian Brauner <brauner@kernel.org>,
	David Laight <David.Laight@ACULAB.COM>
Subject: Re: [RFC PATCH iproute2-next 0/5] Persisting of mount namespaces along with network namespaces
Date: Wed, 11 Oct 2023 15:49:40 +0200	[thread overview]
Message-ID: <87r0m1xo97.fsf@toke.dk> (raw)
In-Reply-To: <87y1gajne4.fsf@email.froward.int.ebiederm.org>

"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>
>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>
>>>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>>>
>>>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>
>>> There are not many places to look so something like this is probably sufficient:
>>>
>>> # stat all of the possible/probable mount points and see if there is
>>> # something mounted there.  If so recursive bind whatever is there onto
>>> # the new /sys
>>>
>>> for dir in /old/sys/fs/* /old/sys/kernel/*; do
>>> 	if [ $(stat --format '%d' "$dir") = $(stat --format '%d' "$dir/") ; then
>>
>> What is this comparison supposed to do? I couldn't find any directories
>> in /sys/fs/* where this was *not* true, regardless of whether there's a
>> mount there or not.
>
> Bah.  I think I got my logic scrambled.  I can only get it to work
> by comparing the filesystems device on /sys/fs to the device on
> /sys/fs/cgroup etc.
>
> The idea is that st_dev changes between filesystems.  So you can detect
> a filesystem change based on st_dev.
>
> I thought the $dir vs $dir/ would have allowed stating the underlying
> directory verses the mount, but apparently my memory go that one wrong.
>
> Which makes my command actually something like:
>
> 	sys_dev=$(stat --format='%d' /sys)
>
> 	for dir in /old/sys/fs/* /old/sys/kernel/*; do
> 		if [ $(stat --format '%d' "$dir") -ne $sys_dev ] ; then
>                 	echo $dir is a mount point
>                 fi
> 	done

Ah, right that makes sense! I thought I was missing something when I
couldn't get your other example to work...

>>>>> Or is their a reason that bpffs should be per network namespace?
>>>>
>>>> Well, I first ran into this issue because of a bug report to
>>>> xdp-tools/libxdp about things not working correctly in network
>>>> namespaces:
>>>>
>>>> https://github.com/xdp-project/xdp-tools/issues/364
>>>>
>>>> And libxdp does assume that there's a separate bpffs per network
>>>> namespace: it persists things into the bpffs that is tied to the network
>>>> devices in the current namespace. So if the bpffs is shared, an
>>>> application running inside the network namespace could access XDP
>>>> programs loaded in the root namespace. I don't know, but suspect, that
>>>> such assumptions would be relatively common in networking BPF programs
>>>> that use pinning (the pinning support in libbpf and iproute2 itself at
>>>> least have the same leaking problem if the bpffs is shared).
>>>
>>> Are the names of the values truly network namespace specific?
>>>
>>> I did not see any mention of the things that are persisted in the ticket
>>> you pointed me at, and unfortunately I am not familiar with xdp.
>>>
>>> Last I looked until all of the cpu side channels are closed it is
>>> unfortunately unsafe to load ebpf programs with anything less than
>>> CAP_SYS_ADMIN (aka with permission to see and administer the entire
>>> system).  So from a system point of view I really don't see a
>>> fundamental danger from having a global /sys/fs/bpf.
>>>
>>> If there are name conflicts in /sys/fs/bpf because of duplicate names in
>>> different network namespaces I can see that being a problem.
>>
>> Yeah, you're right that someone loading a BPF program generally has
>> permissions enough that they can break out of any containment if they
>> want, but applications do make assumptions about the contents of the
>> pinning directory that can lead to conflicts.
>>
>> A couple of examples:
>>
>> - libxdp will persist files in /sys/fs/bpf/dispatch-$ifindex-$prog_id
>>
>> - If someone sets the 'pinning' attribute on a map definition in a BPF
>>   file, libbpf will pin those files in /sys/fs/bpf/$map_name
>>
>> The first one leads to obvious conflicts if shared across network
>> namespaces because of ifindex collisions. The second one leads to
>> potential false sharing of state across what are supposed to be
>> independent networking domains (e.g., if the bpffs is shared, loading
>> xdp-filter inside a namespace will share the state with another instance
>> loaded in another namespace, which would no doubt be surprising).
>
> Sigh.  So non-default network namespaces can't use /sys/fs/bpf,
> because of silly userspace assumptions.  So the entries need to be
> namespaced to prevent conflicts.

Yup, basically.

>>> At that point the name conflicts either need to be fixed or we
>>> fundamentally need to have multiple mount points for bpffs.
>>> Probably under something like /run/netns-mounts/NAME/.
>>>
>>> With ip netns updated to mount the appropriate filesystem.
>>
>> I don't think it's feasible to fix the conflicts; they've been around
>> for a while and are part of application API in some cases. Plus, we
>> don't know of all BPF-using applications.
>>
>> We could have 'ip' manage separate bpffs mounts per namespace and
>> bind-mount them into each netns (I think that's what you're suggesting),
>> but that would basically achieve the same thing as the mountns
>> persisting I am proposing in this series, but only as a special case for
>> bpffs. So why not do the more flexible thing and persist the whole
>> mountns (so applications inside the namespace can actually mount
>> additional things and have them stick around)? The current behaviour
>> seems very surprising...
>
> I don't like persisting the entire mount namespace because it is hard
> for a system administrator to see, it is difficult for something outside
> of that mount namespace to access, and it is as easy to persist a
> mistake as it is to persist something deliberate.
>
> My proposal:
>
> On "ip netns add NAME"
> - create the network namespace and mount it at /run/netns/NAME
> - mount the appropriate sysfs at /run/netns-mounts/NAME/sys
> - mount the appropriate bpffs at /run/netns-mounts/NAME/sys/fs/bpf
>
> On "ip netns delete NAME"
> - umount --recursive /run/netns-mounts/NAME
> - unlink /run/netns-mounts/NAME
> - cleanup /run/netns/NAME as we do today.
>
> On "ip netns exec NAME"
> - Walk through /run/netns-mounts/NAME like we do /etc/netns/NAME/
>   and perform bind mounts.

If we setup the full /sys hierarchy in /run/netns-mounts/NAME this
basically becomes a single recursive bind mount, doesn't it?

What about if we also include bind mounts from the host namespace into
that separate /sys instance? Will those be included into a recursive
bind into /sys inside the mount-ns, or will we have to walk the tree and
do separate bind mounts for each directory?

Anyway, this scheme sounds like it'll solve the issue I was trying to
address so I don't mind doing it this way. I'll try it out and respin
the patch series.

>>> Mount propagation is a way to configure a mount namespace (before
>>> creating a new one) that will cause mounts created in the first mount
>>> namespace to be created in it's children, and cause mounts created in
>>> the children to be created in the parent (depending on how things are
>>> configured).
>>>
>>> It is not my favorite feature (it makes locking of mount namespaces
>>> terrible) and it is probably too clever by half, unfortunately systemd
>>> started enabling mount propagation by default, so we are stuck with it.
>>
>> Right. AFAICT the current iproute2 code explicitly tries to avoid that
>> when creating a mountns (it does a 'mount --make-rslave /'); so you're
>> saying we should change that?
>
> If it makes sense.
>
> I believe I added the 'mount --make-rslave /' because otherwise all
> mount activity was propagating back, and making a mess.  Especially when
> I was unmounting /sys.
>
> I am not a huge fan of mount propagation it has lots of surprising
> little details that need to be set just right, to not cause problems.

Ah, you were talking about propagation from inside the mountns to
outside? Didn't catch that at first...

> With my proposal above I think we could in some carefully chosen
> places enable mount propagation without problem.

One thing that comes to mind would be that if we create persistent /sys
instances in /run/netns-mounts per the above, it would make sense for
any modifications done inside the netns to be propagated back to the
mount in /run; is this possible with a bind mount? Not sure I quite
understand how propagation would work in this case (since it would be a
separate (bind) mount point inside the namespace).

> But I would really like to see an application that is performing
> mounts inside of "ip netns exec" to see how it matters.

Two examples come to mind:

- I believe there are some applications that will mount a private bpffs
  instance for their own use case. Not sure if those applications switch
  in and out of namespaces, though, and if they do whether they are
  namespace-aware themselves

- Interactive use ('ip netns exec $SHELL'), which I sometimes use for
  testing various things. I've mostly had issues with bpffs in this
  setting, though, so if we solve that as per the above, maybe that's
  not needed.

> Code without concrete real world test use cases tends to get things
> wrong.

Heh, amen to that :)

-Toke


  reply	other threads:[~2023-10-11 13:49 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-09 18:27 [RFC PATCH iproute2-next 0/5] Persisting of mount namespaces along with network namespaces Toke Høiland-Jørgensen
2023-10-09 18:27 ` [RFC PATCH iproute2-next 1/5] ip: Mount netns in child process instead of from inside the new namespace Toke Høiland-Jørgensen
2023-10-09 18:27 ` [RFC PATCH iproute2-next 2/5] ip: Split out code creating namespace mount dir so it can be reused Toke Høiland-Jørgensen
2023-10-09 18:27 ` [RFC PATCH iproute2-next 3/5] lib/namespace: Factor out code for reuse Toke Høiland-Jørgensen
2023-10-09 18:27 ` [RFC PATCH iproute2-next 4/5] ip: Also create and persist mount namespace when creating netns Toke Høiland-Jørgensen
2023-10-09 18:27 ` [RFC PATCH iproute2-next 5/5] lib/namespace: Also mount a bpffs instance inside new mount namespaces Toke Høiland-Jørgensen
2023-10-09 20:32 ` [RFC PATCH iproute2-next 0/5] Persisting of mount namespaces along with network namespaces Eric W. Biederman
2023-10-09 22:03   ` Toke Høiland-Jørgensen
2023-10-10  0:14     ` Eric W. Biederman
2023-10-10 13:38       ` Toke Høiland-Jørgensen
2023-10-10 19:19         ` Eric W. Biederman
2023-10-11 13:49           ` Toke Høiland-Jørgensen [this message]
2023-10-11 14:55             ` Eric W. Biederman
2023-10-11 15:03               ` Toke Høiland-Jørgensen
2023-10-10  8:42   ` David Laight
2023-10-10 19:32     ` Eric W. Biederman
2023-10-10 21:51       ` David Laight

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r0m1xo97.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=David.Laight@ACULAB.COM \
    --cc=brauner@kernel.org \
    --cc=dsahern@gmail.com \
    --cc=ebiederm@xmission.com \
    --cc=netdev@vger.kernel.org \
    --cc=nicolas.dichtel@6wind.com \
    --cc=stephen@networkplumber.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.