From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 85DE5200A3 for ; Wed, 11 Oct 2023 13:49:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Ai0CuMO8" Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8B9F4B0 for ; Wed, 11 Oct 2023 06:49:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1697032186; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=INw5XogOJcd4veMjumqnQXf6RoJ3YrteHvnOFI2NkLQ=; b=Ai0CuMO8YnkaUgewDSRFiaMrNLxUJX4x3kvx2+GJaMEZ7gr81Al5nSBA9Aw1ItqJjTTReQ VdSqL38RWjlJw1b/Coh1TmoHN8KRruoNoFoN90VavS3fwLLDZJDL8nihKGcINbVXO5bZQK DtG1uvz4DowBJgsb95EoaAovduh+DUA= Received: from mail-ej1-f71.google.com (mail-ej1-f71.google.com [209.85.218.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-551-78Runx58NGiggBGlQM3v7Q-1; Wed, 11 Oct 2023 09:49:43 -0400 X-MC-Unique: 78Runx58NGiggBGlQM3v7Q-1 Received: by mail-ej1-f71.google.com with SMTP id a640c23a62f3a-9b98d8f6bafso548997866b.1 for ; Wed, 11 Oct 2023 06:49:43 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1697032182; x=1697636982; h=content-transfer-encoding:mime-version:message-id:date:references :in-reply-to:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=INw5XogOJcd4veMjumqnQXf6RoJ3YrteHvnOFI2NkLQ=; b=N/mpwCeRyuSzxdSiIXGgMN3E41rdmVk8fcrpX01sOAx+uU2GhTDaOugeIJBRePGG/9 vUDg+4IlvXMaT++yJHNoVuYhheBpA3gVCxE9sOxvhmgS+Tu7QGe67QI5BASbm41imAPN qQLYZgS48JNkOe43ymB49dQfx3WV1CZnx0SvSfEbduksOikC2k14ZXXMaREWvVJMknZt YucGBRsZQIFSvdkpmeIFpmoQNwjuu/DXqCkpSU8IQqhqkl+lF9HZh1brZiIxqghrfuBg CPMroV4FG7PDCo7UIOywP3rPLiONIvHhz2ocr+ypj33YkgcQfi/sUHWEc9DUqato+Vr7 ObqA== X-Gm-Message-State: AOJu0YyUEZPt3sRksqJizdJC2X4Qn1XadVpbGYuRoMD4GzNIVh4DGD+9 Pm9Qn0dOk8E4ZU9bdclZgSLA4cvenY6Vhyojp8uVHR/DNk55S/4ZuSYxv1SbuQQKRRLTjJXaqH9 gbkwnG/SuMYKNIKr/ X-Received: by 2002:a17:906:cc2:b0:9ae:82b4:e306 with SMTP id l2-20020a1709060cc200b009ae82b4e306mr18748956ejh.62.1697032182010; Wed, 11 Oct 2023 06:49:42 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEBu4En6qZsKj/MxfyUJzvr0GzkkgkYKgW/mhkduSGCGGyHf11PpO65hI2r7MK3H73325Tc5g== X-Received: by 2002:a17:906:cc2:b0:9ae:82b4:e306 with SMTP id l2-20020a1709060cc200b009ae82b4e306mr18748931ejh.62.1697032181539; Wed, 11 Oct 2023 06:49:41 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id gh18-20020a170906e09200b009786c8249d6sm9953435ejb.175.2023.10.11.06.49.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 11 Oct 2023 06:49:41 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 7B6B3E5866F; Wed, 11 Oct 2023 15:49:40 +0200 (CEST) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: "Eric W. Biederman" Cc: David Ahern , Stephen Hemminger , netdev@vger.kernel.org, Nicolas Dichtel , Christian Brauner , David Laight Subject: Re: [RFC PATCH iproute2-next 0/5] Persisting of mount namespaces along with network namespaces In-Reply-To: <87y1gajne4.fsf@email.froward.int.ebiederm.org> References: <20231009182753.851551-1-toke@redhat.com> <877cnvtu37.fsf@email.froward.int.ebiederm.org> <87jzrvzc5v.fsf@toke.dk> <87ttqznxjm.fsf@email.froward.int.ebiederm.org> <878r8azjgd.fsf@toke.dk> <87y1gajne4.fsf@email.froward.int.ebiederm.org> X-Clacks-Overhead: GNU Terry Pratchett Date: Wed, 11 Oct 2023 15:49:40 +0200 Message-ID: <87r0m1xo97.fsf@toke.dk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, RCVD_IN_DNSWL_BLOCKED,RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net "Eric W. Biederman" writes: > Toke H=C3=B8iland-J=C3=B8rgensen writes: > >> "Eric W. Biederman" writes: >> >>> Toke H=C3=B8iland-J=C3=B8rgensen writes: >>> >>>> "Eric W. Biederman" writes: >>>> >>>>> Toke H=C3=B8iland-J=C3=B8rgensen writes: >>> >>> There are not many places to look so something like this is probably su= fficient: >>> >>> # stat all of the possible/probable mount points and see if there is >>> # something mounted there. If so recursive bind whatever is there onto >>> # the new /sys >>> >>> for dir in /old/sys/fs/* /old/sys/kernel/*; do >>> if [ $(stat --format '%d' "$dir") =3D $(stat --format '%d' "$dir/") ; = then >> >> What is this comparison supposed to do? I couldn't find any directories >> in /sys/fs/* where this was *not* true, regardless of whether there's a >> mount there or not. > > Bah. I think I got my logic scrambled. I can only get it to work > by comparing the filesystems device on /sys/fs to the device on > /sys/fs/cgroup etc. > > The idea is that st_dev changes between filesystems. So you can detect > a filesystem change based on st_dev. > > I thought the $dir vs $dir/ would have allowed stating the underlying > directory verses the mount, but apparently my memory go that one wrong. > > Which makes my command actually something like: > > sys_dev=3D$(stat --format=3D'%d' /sys) > > for dir in /old/sys/fs/* /old/sys/kernel/*; do > if [ $(stat --format '%d' "$dir") -ne $sys_dev ] ; then > echo $dir is a mount point > fi > done Ah, right that makes sense! I thought I was missing something when I couldn't get your other example to work... >>>>> Or is their a reason that bpffs should be per network namespace? >>>> >>>> Well, I first ran into this issue because of a bug report to >>>> xdp-tools/libxdp about things not working correctly in network >>>> namespaces: >>>> >>>> https://github.com/xdp-project/xdp-tools/issues/364 >>>> >>>> And libxdp does assume that there's a separate bpffs per network >>>> namespace: it persists things into the bpffs that is tied to the netwo= rk >>>> devices in the current namespace. So if the bpffs is shared, an >>>> application running inside the network namespace could access XDP >>>> programs loaded in the root namespace. I don't know, but suspect, that >>>> such assumptions would be relatively common in networking BPF programs >>>> that use pinning (the pinning support in libbpf and iproute2 itself at >>>> least have the same leaking problem if the bpffs is shared). >>> >>> Are the names of the values truly network namespace specific? >>> >>> I did not see any mention of the things that are persisted in the ticket >>> you pointed me at, and unfortunately I am not familiar with xdp. >>> >>> Last I looked until all of the cpu side channels are closed it is >>> unfortunately unsafe to load ebpf programs with anything less than >>> CAP_SYS_ADMIN (aka with permission to see and administer the entire >>> system). So from a system point of view I really don't see a >>> fundamental danger from having a global /sys/fs/bpf. >>> >>> If there are name conflicts in /sys/fs/bpf because of duplicate names in >>> different network namespaces I can see that being a problem. >> >> Yeah, you're right that someone loading a BPF program generally has >> permissions enough that they can break out of any containment if they >> want, but applications do make assumptions about the contents of the >> pinning directory that can lead to conflicts. >> >> A couple of examples: >> >> - libxdp will persist files in /sys/fs/bpf/dispatch-$ifindex-$prog_id >> >> - If someone sets the 'pinning' attribute on a map definition in a BPF >> file, libbpf will pin those files in /sys/fs/bpf/$map_name >> >> The first one leads to obvious conflicts if shared across network >> namespaces because of ifindex collisions. The second one leads to >> potential false sharing of state across what are supposed to be >> independent networking domains (e.g., if the bpffs is shared, loading >> xdp-filter inside a namespace will share the state with another instance >> loaded in another namespace, which would no doubt be surprising). > > Sigh. So non-default network namespaces can't use /sys/fs/bpf, > because of silly userspace assumptions. So the entries need to be > namespaced to prevent conflicts. Yup, basically. >>> At that point the name conflicts either need to be fixed or we >>> fundamentally need to have multiple mount points for bpffs. >>> Probably under something like /run/netns-mounts/NAME/. >>> >>> With ip netns updated to mount the appropriate filesystem. >> >> I don't think it's feasible to fix the conflicts; they've been around >> for a while and are part of application API in some cases. Plus, we >> don't know of all BPF-using applications. >> >> We could have 'ip' manage separate bpffs mounts per namespace and >> bind-mount them into each netns (I think that's what you're suggesting), >> but that would basically achieve the same thing as the mountns >> persisting I am proposing in this series, but only as a special case for >> bpffs. So why not do the more flexible thing and persist the whole >> mountns (so applications inside the namespace can actually mount >> additional things and have them stick around)? The current behaviour >> seems very surprising... > > I don't like persisting the entire mount namespace because it is hard > for a system administrator to see, it is difficult for something outside > of that mount namespace to access, and it is as easy to persist a > mistake as it is to persist something deliberate. > > My proposal: > > On "ip netns add NAME" > - create the network namespace and mount it at /run/netns/NAME > - mount the appropriate sysfs at /run/netns-mounts/NAME/sys > - mount the appropriate bpffs at /run/netns-mounts/NAME/sys/fs/bpf > > On "ip netns delete NAME" > - umount --recursive /run/netns-mounts/NAME > - unlink /run/netns-mounts/NAME > - cleanup /run/netns/NAME as we do today. > > On "ip netns exec NAME" > - Walk through /run/netns-mounts/NAME like we do /etc/netns/NAME/ > and perform bind mounts. If we setup the full /sys hierarchy in /run/netns-mounts/NAME this basically becomes a single recursive bind mount, doesn't it? What about if we also include bind mounts from the host namespace into that separate /sys instance? Will those be included into a recursive bind into /sys inside the mount-ns, or will we have to walk the tree and do separate bind mounts for each directory? Anyway, this scheme sounds like it'll solve the issue I was trying to address so I don't mind doing it this way. I'll try it out and respin the patch series. >>> Mount propagation is a way to configure a mount namespace (before >>> creating a new one) that will cause mounts created in the first mount >>> namespace to be created in it's children, and cause mounts created in >>> the children to be created in the parent (depending on how things are >>> configured). >>> >>> It is not my favorite feature (it makes locking of mount namespaces >>> terrible) and it is probably too clever by half, unfortunately systemd >>> started enabling mount propagation by default, so we are stuck with it. >> >> Right. AFAICT the current iproute2 code explicitly tries to avoid that >> when creating a mountns (it does a 'mount --make-rslave /'); so you're >> saying we should change that? > > If it makes sense. > > I believe I added the 'mount --make-rslave /' because otherwise all > mount activity was propagating back, and making a mess. Especially when > I was unmounting /sys. > > I am not a huge fan of mount propagation it has lots of surprising > little details that need to be set just right, to not cause problems. Ah, you were talking about propagation from inside the mountns to outside? Didn't catch that at first... > With my proposal above I think we could in some carefully chosen > places enable mount propagation without problem. One thing that comes to mind would be that if we create persistent /sys instances in /run/netns-mounts per the above, it would make sense for any modifications done inside the netns to be propagated back to the mount in /run; is this possible with a bind mount? Not sure I quite understand how propagation would work in this case (since it would be a separate (bind) mount point inside the namespace). > But I would really like to see an application that is performing > mounts inside of "ip netns exec" to see how it matters. Two examples come to mind: - I believe there are some applications that will mount a private bpffs instance for their own use case. Not sure if those applications switch in and out of namespaces, though, and if they do whether they are namespace-aware themselves - Interactive use ('ip netns exec $SHELL'), which I sometimes use for testing various things. I've mostly had issues with bpffs in this setting, though, so if we solve that as per the above, maybe that's not needed. > Code without concrete real world test use cases tends to get things > wrong. Heh, amen to that :) -Toke