From: Bobby Eshleman <bobbyeshleman@gmail.com>
To: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"K. Y. Srinivasan" <kys@microsoft.com>,
"Haiyang Zhang" <haiyangz@microsoft.com>,
"Wei Liu" <wei.liu@kernel.org>,
"Dexuan Cui" <decui@microsoft.com>,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
"Jason Wang" <jasowang@redhat.com>,
"Xuan Zhuo" <xuanzhuo@linux.alibaba.com>,
"Eugenio Pérez" <eperezma@redhat.com>,
"Bryan Tan" <bryan-bt.tan@broadcom.com>,
"Vishnu Dasa" <vishnu.dasa@broadcom.com>,
"Broadcom internal kernel review list"
<bcm-kernel-feedback-list@broadcom.com>,
"David S. Miller" <davem@davemloft.net>,
virtualization@lists.linux.dev, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
kvm@vger.kernel.org
Subject: Re: [PATCH v2 0/3] vsock: add namespace support to vhost-vsock
Date: Thu, 3 Apr 2025 12:42:46 -0700
Message-ID: <Z+7ktkvIeNbf39D3@devvm6277.cco0.facebook.com>
In-Reply-To: <4c2xz3xhpdjvb6jmdw7ctsebpza5lcs4gevr5wlwwyt64usr2i@o5qt2msfyvvw>
On Thu, Apr 03, 2025 at 11:33:14AM +0200, Stefano Garzarella wrote:
> On Wed, Apr 02, 2025 at 03:28:19PM -0700, Bobby Eshleman wrote:
> > On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote:
> > > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
> > > > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> > > > > >
> > > > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> > > > > > since it offers the best of both worlds, and still leans conservative in
> > > > > > protecting existing applications... but I agree, the non-strict mode
> > > > > > vsock would be unique WRT the usual concept of namespaces.
> > > > >
> > > > > Maybe we could do the opposite, enable strict mode by default (I think
> > > > > it was similar to what I had tried to do with the kernel module in v1, I
> > > > > was young I know xD)
> > > > > And provide a way to disable it for those use cases where the user wants
> > > > > backward compatibility, while paying the cost of less isolation.
> > > >
> > > > I think backwards compatibility has to be the default behaviour, otherwise
> > > > the change has too high a risk of breaking existing deployments that are
> > > > already using netns and relying on VSOCK being global. Breakage has to
> > > > be opt in.
> > > >
> > > > > I was thinking two options (not sure if the second one can be done):
> > > > >
> > > > > 1. provide a global sysfs/sysctl that disables strict mode, but this
> > > > > then applies to all namespaces
> > > > >
> > > > > 2. provide something that allows disabling strict mode by namespace.
> > > > > Maybe when it is created there are options, or something that can be
> > > > > set later.
> > > > >
> > > > > 2 would be ideal, but that might be too much, so 1 might be enough. In
> > > > > any case, 2 could also be a next step.
> > > > >
> > > > > WDYT?
> > > >
> > > > It occurred to me that the problem we face with the CID space usage is
> > > > somewhat similar to the UID/GID space usage for user namespaces.
> > > >
> > > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
> > > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
> > > >
> > > > At the risk of being overkill, is it worth trying a similar kind of
> > > > approach for the vsock CID space ?
> > > >
> > > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set
> > > > of CIDs which are exclusively referencing /dev/vhost-vsock associations
> > > > created outside the namespace. Anything not listed would be exclusively
> > > > referencing associations created inside the namespace.
> > > >
> > > > A more complex variant would be to allow a full remapping of CIDs as is
> > > > done with userns, via a /proc/net/vsock_cid_map, with the same three
> > > > parameters, so that a CID=15 association outside the namespace could be
> > > > remapped to CID=9015 inside the namespace, allowing the inside namespace
> > > > to define its own association for CID=15 without clashing.
> > > >
> > > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> > > > associations created outside namespace, while unmapped CIDs would be
> > > > exclusively referencing /dev/vhost-vsock associations inside the
> > > > namespace.
> > > >
> > > > A likely benefit of relying on a kernel defined mapping/partition of
> > > > the CID space is that apps like QEMU don't need changing, as there's
> > > > no need to invent a new /dev/vhost-vsock-netns device node.
> > > >
> > > > Both approaches give the desirable security protection whereby the
> > > > inside namespace can be prevented from accessing certain CIDs that
> > > > were associated outside the namespace.
> > > >
> > > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map
> > > > file as it is the security control mechanism. If it is write-once then
> > > > if the container mgmt app initializes it, nothing later could change
> > > > it.
> > > >
> > > > A key question is do we need the "first come, first served" behaviour
> > > > for CIDs where a CID can be arbitrarily used by outside or inside namespace
> > > > according to whatever tries to associate a CID first ?
> > >
> > > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID
> > > from being used, this could be solved by disallowing remapping the CID
> > > while in use?
> > >
> > > The thing I like about this is that users can check
> > > /proc/net/vsock_cid_outside to figure out what might be going on,
> > > instead of trying to check lsof or ps to figure out if the VMM processes
> > > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.
>
> Yes, although the user in theory should not care about this information,
> right?
> I mean I don't even know if it makes sense to expose the contents of
> /proc/net/vsock_cid_outside in the namespace.
>
> > >
> > > Just to check I am following... I suppose we would have a few typical
> > > configurations for /proc/net/vsock_cid_outside. Following uid_map file
> > > format of:
> > > "<local cid start> <global cid start> <range size>"
>
> This seems to relate more to /proc/net/vsock_cid_map, for
> /proc/net/vsock_cid_outside I think 2 parameters are enough
> (CID, range), right?
>
True, yes, I meant vsock_cid_map.
> > >
> > > 1. Identity mapping, current namespace CID is global CID (default
> > > setting for new namespaces):
> > >
> > > # empty file
> > >
> > > OR
> > >
> > > 0 0 4294967295
> > >
> > > 2. Complete isolation from global space (initialized, but no mappings):
> > >
> > > 0 0 0
> > >
> > > 3. Mapping in ranges of global CIDs
> > >
> > > For example, global CID space starts at 7000, up to 32-bit max:
> > >
> > > 7000 0 4294960295
> > >
> > > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to
> > > 8000-8100) :
> > >
> > > 7000 0 100
> > > 8000 1000 100
> > >
> > >
> > > One thing I don't love is that option 3 seems to not be addressing a
> > > known use case. It doesn't necessarily hurt to have, but it will add
> > > complexity to CID handling that might never get used?
>
> Yes, as I also mentioned in the previous email, we could also do a
> step-by-step thing.
>
> IMHO we can define /proc/net/vsock_cid_map (with the structure you just
> defined), but for now only support 1-1 mapping (with the ranges of
> course, I mean the first two parameters should always be the same) and
> then add option 3 in the future.
>
makes sense, sgtm!
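
To check that we mean the same thing, here is a rough userspace sketch
(Python, not kernel code) of the parsing rule as I understand it: three
fields per line in the order quoted above, with only 1-1 mappings
accepted for now. The names parse_cid_map, CidRange and identity_only
are made up for illustration, and treating an empty file differently
from "0 0 0" is my assumption.

```python
from typing import List, NamedTuple

CID_MAX = 0xFFFFFFFF  # CIDs are 32-bit values


class CidRange(NamedTuple):
    local_start: int   # first CID as seen inside the namespace
    global_start: int  # first CID as seen in the global space
    size: int          # number of CIDs in the range


def parse_cid_map(text: str, identity_only: bool = True) -> List[CidRange]:
    """Parse '<local cid start> <global cid start> <range size>' lines.

    An empty file yields an empty list (assumed to mean "default,
    nothing changes"); a '0 0 0' line yields one zero-sized range
    (assumed to mean "initialized, but no mappings").
    """
    ranges = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields")
        local_start, global_start, size = (int(f) for f in fields)
        if min(local_start, global_start, size) < 0:
            raise ValueError(f"line {lineno}: negative value")
        if size and max(local_start, global_start) + size - 1 > CID_MAX:
            raise ValueError(f"line {lineno}: range exceeds 32-bit CID space")
        # First step: only 1-1 mappings, so both starts must match.
        if identity_only and local_start != global_start:
            raise ValueError(f"line {lineno}: only 1-1 mappings supported")
        ranges.append(CidRange(local_start, global_start, size))
    return ranges
```

With identity_only=True, "0 0 4294967295" and "0 0 0" are accepted and
"7000 0 100" is rejected; flipping the flag later would be where option 3
comes in.
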
> > >
> > > Since options 1/2 could also be represented by a boolean (yes/no
> > > "current ns shares CID with global"), I wonder if we could either A)
> > > only support the first two options at first, or B) add just
> > > /proc/net/vsock_ns_mode at first, which supports only "global" and
> > > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
> > > or the full mapping if the need arises?
>
> I think option A is the same as I meant above :-)
>
Indeed.
> > >
> > > This could also be how we support Option 2 from Stefano's last email of
> > > supporting per-namespace opt-in/opt-out.
>
> Hmm, how can we do it by namespace? Isn't that global?
>
I think the file path is global but the contents are per-namespace,
determined by the namespace of the process that called open() on it.
This way the container manager can write it once to lock it, and the
processes in the namespace can still read it?
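
As a toy model of what I have in mind (plain Python, purely
illustrative; NsCidMaps and the integer namespace IDs are hypothetical
and nothing here is an existing kernel interface): one map per
namespace behind the same path, readable at any time, writable once.

```python
class NsCidMaps:
    """Toy model: one vsock_cid_map per network namespace, write-once."""

    def __init__(self) -> None:
        # namespace id -> file contents; a missing key means "never written"
        self._maps = {}

    def read(self, netns: int) -> str:
        # A namespace that was never configured reads back as an empty file.
        return self._maps.get(netns, "")

    def write(self, netns: int, contents: str) -> None:
        # The first write locks the map for this namespace only.
        if netns in self._maps:
            raise PermissionError("vsock_cid_map already written")
        self._maps[netns] = contents


maps = NsCidMaps()
maps.write(1, "0 0 0")   # container manager isolates namespace 1
print(maps.read(1))      # processes in namespace 1 see their own map
print(repr(maps.read(2)))  # namespace 2 is untouched and still empty
```

A second write to namespace 1 raises PermissionError, while namespace 2
can still be configured independently.
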
> > >
> > > Any thoughts on this?
> > >
> >
> > Stefano,
> >
> > Would only supporting 1/2 still support the Kata use case?
>
> I think so, actually I was thinking something similar in the message I just
> sent.
>
> By default (if the file is empty), nothing should change, so that's fine
> IMO. As Paolo suggested, we absolutely have to have tests to verify these
> things.
>
Sounds like a plan! I'm working on the new vsock vmtest now and will
include the new tests in the next rev.
Also, I'm thinking we should protect vsock_cid_map behind a capability,
but I'm not sure which one is correct (CAP_NET_ADMIN?). WDYT?
Thanks!