netdev.vger.kernel.org archive mirror
From: Bobby Eshleman <bobbyeshleman@gmail.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Stefano Garzarella" <sgarzare@redhat.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"K. Y. Srinivasan" <kys@microsoft.com>,
	"Haiyang Zhang" <haiyangz@microsoft.com>,
	"Wei Liu" <wei.liu@kernel.org>,
	"Dexuan Cui" <decui@microsoft.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Jason Wang" <jasowang@redhat.com>,
	"Xuan Zhuo" <xuanzhuo@linux.alibaba.com>,
	"Eugenio Pérez" <eperezma@redhat.com>,
	"Bryan Tan" <bryan-bt.tan@broadcom.com>,
	"Vishnu Dasa" <vishnu.dasa@broadcom.com>,
	"Broadcom internal kernel review list"
	<bcm-kernel-feedback-list@broadcom.com>,
	"David S. Miller" <davem@davemloft.net>,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	kvm@vger.kernel.org
Subject: Re: [PATCH v2 0/3] vsock: add namespace support to vhost-vsock
Date: Fri, 18 Apr 2025 10:57:52 -0700
Message-ID: <aAKSoHQuycz24J5l@devvm6277.cco0.facebook.com>
In-Reply-To: <Z-_ZHIqDsCtQ1zf6@redhat.com>

On Fri, Apr 04, 2025 at 02:05:32PM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote:
> > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> > > It occurred to me that the problem we face with the CID space usage is
> > > somewhat similar to the UID/GID space usage for user namespaces.
> > > 
> > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
> > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
> > > 
> > > At the risk of being overkill, is it worth trying a similar kind of
> > > approach for the vsock CID space ?
> > > 
> > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set
> > > of CIDs which are exclusively referencing /dev/vhost-vsock associations
> > > created outside the namespace. Anything not listed would be exclusively
> > > referencing associations created inside the namespace.
> > > 
> > > A more complex variant would be to allow a full remapping of CIDs as is
> > > done with userns, via a /proc/net/vsock_cid_map, with the same three
> > > parameters, so that CID=15 association outside the namespace could be
> > > remapped to CID=9015 inside the namespace, allowing the inside namespace
> > > to define its own association for CID=15 without clashing.
> > > 
> > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> > > associations created outside namespace, while unmapped CIDs would be
> > > exclusively referencing /dev/vhost-vsock associations inside the
> > > namespace. 
> > > 
> > > A likely benefit of relying on a kernel defined mapping/partition of
> > > the CID space is that apps like QEMU don't need changing, as there's
> > > no need to invent a new /dev/vhost-vsock-netns device node.
> > > 
> > > Both approaches give the desirable security protection whereby the
> > > inside namespace can be prevented from accessing certain CIDs that
> > > were associated outside the namespace.
> > > 
> > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map
> > > file as it is the security control mechanism. If it is write-once then
> > > if the container mgmt app initializes it, nothing later could change
> > > it.
> > > 
> > > A key question is do we need the "first come, first served" behaviour
> > > for CIDs where a CID can be arbitrarily used by outside or inside namespace
> > > according to whatever tries to associate a CID first ?
> > 
> > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID
> > from being used, this could be solved by disallowing remapping the CID
> > while in use?
> > 
> > The thing I like about this is that users can check
> > /proc/net/vsock_cid_outside to figure out what might be going on,
> > instead of trying to check lsof or ps to figure out if the VMM processes
> > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.
> > 
> > Just to check I am following... I suppose we would have a few typical
> > configurations for /proc/net/vsock_cid_outside. Following uid_map file
> > format of:
> > 	"<local cid start>		<global cid start>		<range size>"
> > 
> > 	1. Identity mapping, current namespace CID is global CID (default
> > 	setting for new namespaces):
> > 
> > 		# empty file
> > 
> > 				OR
> > 
> > 		0    0    4294967295
> > 
> > 	2. Complete isolation from global space (initialized, but no mappings):
> > 
> > 		0    0    0
> > 
> > 	3. Mapping in ranges of global CIDs
> > 
> > 	For example, global CID space starts at 7000, up to 32-bit max:
> > 
> > 		7000    0    4294960295
> > 	
> > 	Or for multiple mappings (0-99 map to 7000-7099, 1000-1099 map to
> > 	8000-8099):
> > 
> > 		7000    0       100
> > 		8000    1000    100
> > 
> > 
> > One thing I don't love is that option 3 seems to not be addressing a
> > known use case. It doesn't necessarily hurt to have, but it will add
> > complexity to CID handling that might never get used?
> 
> Yeah, I have the same feeling that full remapping of CIDs is probably
> adding complexity without clear benefit, unless it somehow helps us
> with the nested-virt scenario to disambiguate L0/L1/L2 CID ranges ?
> I've not thought the latter through to any great level of detail
> though
> 
> > Since options 1/2 could also be represented by a boolean (yes/no
> > "current ns shares CID with global"), I wonder if we could either A)
> > only support the first two options at first, or B) add just
> > /proc/net/vsock_ns_mode at first, which supports only "global" and
> > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
> > or the full mapping if the need arises?
> 
> Two options is sufficient if you want to control AF_VSOCK usage
> and /dev/vhost-vsock usage as a pair. If you want to separately
> control them though, it would push for three options - global,
> local, and mixed. By mixed I mean AF_VSOCK in the NS can access
> the global CID from the NS, but the NS can't associate the global
> CID with a guest.
> 
> IOW, this breaks down like:
> 
>  * CID=N local - aka fully private
> 
>      Outside NS: Can associate outside CID=N with a guest.
>                  AF_VSOCK permitted to access outside CID=N
> 
>      Inside NS: Can NOT associate outside CID=N with a guest
>                 Can associate inside CID=N with a guest
>                 AF_VSOCK forbidden to access outside CID=N
>                 AF_VSOCK permitted to access inside CID=N
> 
> 
>  * CID=N mixed - aka partially shared
> 
>      Outside NS: Can associate outside CID=N with a guest.
>                  AF_VSOCK permitted to access outside CID=N
> 
>      Inside NS: Can NOT associate outside CID=N with a guest
>                 AF_VSOCK permitted to access outside CID=N
>                 No inside CID=N concept
> 
> 
>  * CID=N global - aka current historic behaviour
> 
>      Outside NS: Can associate outside CID=N with a guest.
>                  AF_VSOCK permitted to access outside CID=N
> 
>      Inside NS: Can associate outside CID=N with a guest
>                 AF_VSOCK permitted to access outside CID=N
>                 No inside CID=N concept
> 
> 
> I was thinking the 'mixed' mode might be useful if the outside NS wants
> to retain control over setting up the association, but delegate to
> processes in the inside NS for providing individual services to that
> guest.  This means if the outside NS needs to restart the VM, there is
> no race window in which the inside NS can grab the association with the
> CID.
> 
> As for whether we need to control this per-CID, or with a single setting
> applying to all CIDs:
> 
> Consider that the host OS can be running one or more "service VMs" on
> well known CIDs that can be leveraged from other NS, while those other
> NS also run some  "end user VMs" that should be private to the NS.
> 
> IOW, the CIDs for the service VMs would need to be using "mixed"
> policy, while the CIDs for the end user VMs would be "local".
> 

I think this sounds pretty flexible, and IMO adding the third mode
doesn't add much additional complexity.

Going this route, we have:
- three modes: local, global, mixed
- at first, no vsock_cid_map (local has no outside CIDs, global and
  mixed have no inside CIDs, so no cross-mapping is needed)
- only later, add a full mapped mode and vsock_cid_map if necessary
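
To make the "no cross-mapping needed" point concrete, here is a rough
sketch of the three modes as a lookup table, taken from Daniel's
breakdown quoted above. It is plain Python for illustration only, not
kernel code, and the names CidPolicy, POLICIES and needs_cross_mapping
are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CidPolicy:
    """What a process inside the namespace may do with CID=N."""
    assoc_outside: bool   # associate the outside CID=N with a guest
    assoc_inside: bool    # associate an inside CID=N with a guest
    access_outside: bool  # AF_VSOCK may access the outside CID=N
    access_inside: bool   # AF_VSOCK may access the inside CID=N


# Values transcribed from the "Inside NS" rows of the quoted breakdown.
# "No inside CID=N concept" is modelled as both inside flags being False.
POLICIES = {
    "local":  CidPolicy(assoc_outside=False, assoc_inside=True,
                        access_outside=False, access_inside=True),
    "mixed":  CidPolicy(assoc_outside=False, assoc_inside=False,
                        access_outside=True, access_inside=False),
    "global": CidPolicy(assoc_outside=True, assoc_inside=False,
                        access_outside=True, access_inside=False),
}


def needs_cross_mapping(mode: str) -> bool:
    """A mode needs a CID map only if it touches both CID spaces."""
    p = POLICIES[mode]
    uses_outside = p.assoc_outside or p.access_outside
    uses_inside = p.assoc_inside or p.access_inside
    return uses_outside and uses_inside


for mode in POLICIES:
    print(mode, needs_cross_mapping(mode))
```

None of the three modes uses both spaces at once, which is why a
vsock_cid_map can be deferred until a "mapped" mode is actually wanted.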

Stefano, any preferences on this vs starting with the restricted
vsock_cid_map (only supporting "0 0 0" and "0 0 <size>")?
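
For comparison, a rough sketch of what handling the uid_map-style
format quoted above could look like, including a check for the
restricted forms. This is illustrative Python, not a proposed
implementation. The helper names (Extent, parse_cid_map, is_restricted,
local_to_global) are made up, and treating an empty file as the
identity mapping follows the "empty file" default from the earlier
message.

```python
from typing import List, NamedTuple, Optional

CID_MAX = 0xFFFFFFFF  # CIDs are 32-bit


class Extent(NamedTuple):
    local_start: int
    global_start: int
    size: int


def parse_cid_map(text: str) -> List[Extent]:
    """Parse lines of "<local cid start> <global cid start> <range size>"."""
    extents = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # allow "# empty file" comments
        if not line:
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {raw!r}")
        local_start, global_start, size = (int(f) for f in fields)
        if min(local_start, global_start, size) < 0:
            raise ValueError(f"negative value in {raw!r}")
        if max(local_start, global_start) + size > CID_MAX + 1:
            raise ValueError(f"range exceeds 32-bit CID space: {raw!r}")
        extents.append(Extent(local_start, global_start, size))
    return extents


def is_restricted(extents: List[Extent]) -> bool:
    """True for the initial subset: a single "0 0 0" or "0 0 <size>" line."""
    return (len(extents) == 1
            and extents[0].local_start == 0
            and extents[0].global_start == 0)


def local_to_global(extents: List[Extent], cid: int) -> Optional[int]:
    """Translate a namespace-local CID to a global CID, or None if unmapped."""
    if not extents:
        return cid  # assumption: empty file means identity mapping
    for e in extents:
        if e.local_start <= cid < e.local_start + e.size:
            return e.global_start + (cid - e.local_start)
    return None


m = parse_cid_map("7000    0       100\n8000    1000    100\n")
print(local_to_global(m, 7005), local_to_global(m, 7100))
```

With only "0 0 0" and "0 0 <size>" accepted, the translation step
collapses to a bounds check, which is roughly what the mode approach
gives without a map file.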

I'm leaning towards the modes because they cover more use cases and seem
like a clearer user interface.

To clarify another aspect... child namespaces must inherit the parent's
local CID space. So if namespace P sets the mode to local, and then
creates a child process that then creates namespace C... then C's global
and mixed modes are implicitly restricted to P's local space?
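
One possible reading of that inheritance rule, sketched in Python
purely to pin down the question: the "outside" CID space that a
namespace sees is owned by its nearest ancestor in local mode, falling
back to the init namespace. The Ns class and outside_space helper are
hypothetical, and nothing here is settled semantics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ns:
    name: str
    mode: str                      # "global", "mixed" or "local"
    parent: Optional["Ns"] = None  # None for the init namespace


def outside_space(ns: Ns) -> Ns:
    """Return the namespace whose CID space `ns` treats as "outside".

    Walk up the parents and stop at the first ancestor in local mode,
    or at the root (init) namespace if there is none.
    """
    cur = ns.parent
    while cur is not None and cur.mode != "local" and cur.parent is not None:
        cur = cur.parent
    return cur if cur is not None else ns


init = Ns("init", "global")
p = Ns("P", "local", parent=init)
c = Ns("C", "mixed", parent=p)
print(outside_space(c).name)
```

Under this reading, C in global or mixed mode would resolve CIDs in P's
local space and would never see the init namespace's CIDs.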

Thanks,
Bobby
