From: Bobby Eshleman <bobbyeshleman@gmail.com>
To: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"K. Y. Srinivasan" <kys@microsoft.com>,
"Haiyang Zhang" <haiyangz@microsoft.com>,
"Wei Liu" <wei.liu@kernel.org>,
"Dexuan Cui" <decui@microsoft.com>,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
"Jason Wang" <jasowang@redhat.com>,
"Xuan Zhuo" <xuanzhuo@linux.alibaba.com>,
"Eugenio Pérez" <eperezma@redhat.com>,
"Bryan Tan" <bryan-bt.tan@broadcom.com>,
"Vishnu Dasa" <vishnu.dasa@broadcom.com>,
"Broadcom internal kernel review list"
<bcm-kernel-feedback-list@broadcom.com>,
"David S. Miller" <davem@davemloft.net>,
virtualization@lists.linux.dev, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
kvm@vger.kernel.org
Subject: Re: [PATCH v2 0/3] vsock: add namespace support to vhost-vsock
Date: Thu, 3 Apr 2025 12:42:46 -0700
Message-ID: <Z+7ktkvIeNbf39D3@devvm6277.cco0.facebook.com>
In-Reply-To: <4c2xz3xhpdjvb6jmdw7ctsebpza5lcs4gevr5wlwwyt64usr2i@o5qt2msfyvvw>
On Thu, Apr 03, 2025 at 11:33:14AM +0200, Stefano Garzarella wrote:
> On Wed, Apr 02, 2025 at 03:28:19PM -0700, Bobby Eshleman wrote:
> > On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote:
> > > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
> > > > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> > > > > >
> > > > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> > > > > > since it offers the best of both worlds, and still tends conservative in
> > > > > > protecting existing applications... but I agree, the non-strict mode
> > > > > > vsock would be unique WRT the usual concept of namespaces.
> > > > >
> > > > > Maybe we could do the opposite, enable strict mode by default (I think
> > > > > it was similar to what I had tried to do with the kernel module in v1, I
> > > > > was young I know xD)
> > > > > And provide a way to disable it for those use cases where the user wants
> > > > > backward compatibility, while paying the cost of less isolation.
> > > >
> > > > I think backwards compatible has to be the default behaviour, otherwise
> > > > the change has too high risk of breaking existing deployments that are
> > > > already using netns and relying on VSOCK being global. Breakage has to
> > > > be opt in.
> > > >
> > > > > I was thinking two options (not sure if the second one can be done):
> > > > >
> > > > > 1. provide a global sysfs/sysctl that disables strict mode, but this
> > > > > then applies to all namespaces
> > > > >
> > > > > 2. provide something that allows disabling strict mode by namespace.
> > > > > Maybe when it is created there are options, or something that can be
> > > > > set later.
> > > > >
> > > > > 2 would be ideal, but that might be too much, so 1 might be enough. In
> > > > > any case, 2 could also be a next step.
> > > > >
> > > > > WDYT?
> > > >
> > > > It occurred to me that the problem we face with the CID space usage is
> > > > somewhat similar to the UID/GID space usage for user namespaces.
> > > >
> > > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
> > > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
> > > >
> > > > At the risk of being overkill, is it worth trying a similar kind of
> > > > approach for the vsock CID space ?
> > > >
> > > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set
> > > > of CIDs which are exclusively referencing /dev/vhost-vsock associations
> > > > created outside the namespace. Anything not listed would be exclusively
> > > > referencing associations created inside the namespace.
> > > >
> > > > A more complex variant would be to allow a full remapping of CIDs as is
> > > > done with userns, via a /proc/net/vsock_cid_map, with the same three
> > > > parameters, so that CID=15 association outside the namespace could be
> > > > remapped to CID=9015 inside the namespace, allowing the inside namespace
> > > > to define its own association for CID=15 without clashing.
> > > >
> > > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> > > > associations created outside namespace, while unmapped CIDs would be
> > > > exclusively referencing /dev/vhost-vsock associations inside the
> > > > namespace.
> > > >
> > > > A likely benefit of relying on a kernel defined mapping/partition of
> > > > the CID space is that apps like QEMU don't need changing, as there's
> > > > no need to invent a new /dev/vhost-vsock-netns device node.
> > > >
> > > > Both approaches give the desirable security protection whereby the
> > > > inside namespace can be prevented from accessing certain CIDs that
> > > > were associated outside the namespace.
> > > >
> > > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map
> > > > file as it is the security control mechanism. If it is write-once then
> > > > if the container mgmt app initializes it, nothing later could change
> > > > it.
> > > >
> > > > A key question is do we need the "first come, first served" behaviour
> > > > for CIDs where a CID can be arbitrarily used by outside or inside namespace
> > > > according to whatever tries to associate a CID first ?
> > >
> > > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID
> > > from being used, this could be solved by disallowing remapping the CID
> > > while in use?
> > >
> > > The thing I like about this is that users can check
> > > /proc/net/vsock_cid_outside to figure out what might be going on,
> > > instead of trying to check lsof or ps to figure out if the VMM processes
> > > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.
>
> Yes, although the user in theory should not care about this information,
> right?
> I mean I don't even know if it makes sense to expose the contents of
> /proc/net/vsock_cid_outside in the namespace.
>
> > >
> > > Just to check I am following... I suppose we would have a few typical
> > > configurations for /proc/net/vsock_cid_outside. Following uid_map file
> > > format of:
> > > "<local cid start> <global cid start> <range size>"
>
> This seems to relate more to /proc/net/vsock_cid_map, for
> /proc/net/vsock_cid_outside I think 2 parameters are enough
> (CID, range), right?
>
True, yes, I meant vsock_cid_map there.
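
For comparison, the two-parameter vsock_cid_outside form would reduce to a
simple range membership test. A rough userspace sketch in Python (illustration
only, not kernel code; the function name and the (start, count) tuple layout
are my own assumptions):

```python
from typing import Iterable, Tuple


def cid_is_outside(cid: int, outside: Iterable[Tuple[int, int]]) -> bool:
    """Return True if cid falls inside any (start, count) range.

    Each tuple models one hypothetical "CID range" line of
    /proc/net/vsock_cid_outside; count is the number of CIDs covered.
    """
    return any(start <= cid < start + count for start, count in outside)
```

Anything for which this returns False would refer to associations created
inside the namespace.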
> > >
> > > 1. Identity mapping, current namespace CID is global CID (default
> > > setting for new namespaces):
> > >
> > > # empty file
> > >
> > > OR
> > >
> > > 0 0 4294967295
> > >
> > > 2. Complete isolation from global space (initialized, but no mappings):
> > >
> > > 0 0 0
> > >
> > > 3. Mapping in ranges of global CIDs
> > >
> > > For example, global CID space starts at 7000, up to 32-bit max:
> > >
> > > 7000 0 4294960295
> > >
> > > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to
> > > 8000-8100) :
> > >
> > > 7000 0 100
> > > 8000 1000 100
> > >
> > >
> > > One thing I don't love is that option 3 seems to not be addressing a
> > > known use case. It doesn't necessarily hurt to have, but it will add
> > > complexity to CID handling that might never get used?
>
> Yes, as I also mentioned in the previous email, we could also do a
> step-by-step thing.
>
> IMHO we can define /proc/net/vsock_cid_map (with the structure you just
> defined), but for now only support 1-1 mapping (with the ranges of
> course, I mean the first two parameters should always be the same) and
> then add option 3 in the future.
>
Makes sense, sgtm!
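
Just to check we mean the same thing, here is a rough userspace sketch in
Python (not kernel code) of the first-step rule: parse
"<local cid start> <global cid start> <range size>" lines and reject any line
whose first two fields differ. The function name, the error messages, and the
choice to ignore blank lines are assumptions made for illustration:

```python
from typing import List, Tuple

U32_LIMIT = 1 << 32  # CIDs are 32-bit values


def parse_identity_cid_map(text: str) -> List[Tuple[int, int, int]]:
    """Parse "<local start> <global start> <range size>" lines.

    Hypothetical first-step rule: only 1-1 ranges are accepted, so the
    first two fields of every line must be equal.
    """
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # blank lines are ignored
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields")
        local, global_, size = (int(f) for f in fields)
        if (min(local, global_, size) < 0
                or local + size > U32_LIMIT
                or global_ + size > U32_LIMIT):
            raise ValueError(f"line {lineno}: range outside 32-bit CID space")
        if local != global_:
            raise ValueError(f"line {lineno}: only 1-1 mappings are supported")
        entries.append((local, global_, size))
    return entries
```

An empty file parses to an empty list, which would keep today's behaviour.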
> > >
> > > Since options 1/2 could also be represented by a boolean (yes/no
> > > "current ns shares CID with global"), I wonder if we could either A)
> > > only support the first two options at first, or B) add just
> > > /proc/net/vsock_ns_mode at first, which supports only "global" and
> > > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
> > > or the full mapping if the need arises?
>
> I think option A is the same as I meant above :-)
>
Indeed.
> > >
> > > This could also be how we support Option 2 from Stefano's last email of
> > > supporting per-namespace opt-in/opt-out.
>
> Hmm, how can we do it by namespace? Isn't that global?
>
I think the file path is the same everywhere, but the contents are
per-namespace, selected by the netns of the process that called open() on it.
This way the container manager can write it once to lock it, and processes
inside the namespace can still read it?
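
A toy Python model of the semantics I have in mind (userspace illustration
only; the class, its methods, and the use of a string as the namespace key are
all made up, and the real thing would key off the opener's struct net):

```python
from typing import Dict


class PerNsCidMap:
    """One path, contents keyed by the namespace of the opener."""

    def __init__(self) -> None:
        self._contents: Dict[str, str] = {}

    def write(self, netns: str, text: str) -> None:
        # Write-once: the first writer (e.g. the container manager) locks it.
        if netns in self._contents:
            raise PermissionError("map already written for this namespace")
        self._contents[netns] = text

    def read(self, netns: str) -> str:
        # An unwritten map reads back empty, i.e. the default behaviour.
        return self._contents.get(netns, "")
```

Two namespaces reading the same path would then see different contents, and a
second write from inside the namespace would fail.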
> > >
> > > Any thoughts on this?
> > >
> >
> > Stefano,
> >
> > Would only supporting 1/2 still support the Kata use case?
>
> I think so, actually I was thinking something similar in the message I just
> sent.
>
> By default (if the file is empty), nothing should change, so that's fine
> IMO. As Paolo suggested, we absolutely have to have tests to verify these
> things.
>
Sounds like a plan! I'm working on the new vsock vmtest now and will
include the new tests in the next rev.
Also, I'm thinking we should protect vsock_cid_map behind a capability,
but I'm not sure which one is correct (CAP_NET_ADMIN?). WDYT?
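
For the vmtest side, a userspace check could look roughly like the Python
sketch below. It only decodes the CapEff mask from /proc/self/status; the
helper names are made up, and CAP_NET_ADMIN here is just the candidate from
above, not a decision. The kernel-side check itself would be a separate
matter (presumably something along the lines of ns_capable()).

```python
CAP_NET_ADMIN = 12  # bit number from <linux/capability.h>


def has_capability(cap_eff_hex: str, cap: int) -> bool:
    """Return True if bit `cap` is set in a hex CapEff mask."""
    return bool(int(cap_eff_hex, 16) & (1 << cap))


def current_cap_eff(status_path: str = "/proc/self/status") -> str:
    """Return the CapEff hex string of the calling process."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff not found in " + status_path)
```

A test could then skip or expect a write failure when
has_capability(current_cap_eff(), CAP_NET_ADMIN) is False.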
Thanks!