From: Alexander Graf <graf@amazon.com>
To: Stefano Garzarella <sgarzare@redhat.com>
Cc: Bryan Tan <bryan-bt.tan@broadcom.com>,
Vishnu Dasa <vishnu.dasa@broadcom.com>,
Broadcom internal kernel review list
<bcm-kernel-feedback-list@broadcom.com>,
<virtualization@lists.linux.dev>, <linux-kernel@vger.kernel.org>,
<netdev@vger.kernel.org>, <kvm@vger.kernel.org>,
<eperezma@redhat.com>, Jason Wang <jasowang@redhat.com>,
<mst@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>,
<nh-open-source@amazon.com>
Subject: Re: [PATCH] vsock: Enable H2G override
Date: Mon, 2 Mar 2026 20:04:22 +0100
Message-ID: <27dcad4e-d658-4b6b-93b2-44c64fcbeb11@amazon.com>
In-Reply-To: <aaW2FgoaXIJEymyR@sgarzare-redhat>

On 02.03.26 17:25, Stefano Garzarella wrote:
> On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
>>
>> On 02.03.26 13:06, Stefano Garzarella wrote:
>>> CCing Bryan, Vishnu, and Broadcom list.
>>>
>>> On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>>>>
>>>> Please target net-next tree for this new feature.
>>>>
>>>> On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>>>> Vsock maintains a single CID number space which can be used to
>>>>> communicate to the host (G2H) or to a child-VM (H2G). The current logic
>>>>> trivially assumes that G2H is only relevant for CID <= 2 because these
>>>>> target the hypervisor. However, in environments like Nitro Enclaves, an
>>>>> instance that hosts vhost_vsock powered VMs may still want to communicate
>>>>> to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
>>>>>
>>>>> That means that for CID > 2, we really want an overlay. By default, all
>>>>> CIDs are owned by the hypervisor. But if vhost registers a CID, it takes
>>>>> precedence. Implement that logic. Vhost already knows which CIDs it
>>>>> supports anyway.
>>>>>
>>>>> With this logic, I can run a Nitro Enclave as well as a nested VM with
>>>>> vhost-vsock support in parallel, with the parent instance able to
>>>>> communicate to both simultaneously.
>>>>
>>>> I honestly don't understand why VMADDR_FLAG_TO_HOST (added
>>>> specifically for Nitro IIRC) isn't enough for this scenario and we
>>>> have to add this change. Can you elaborate a bit more about the
>>>> relationship between this change and VMADDR_FLAG_TO_HOST we added?
>>
>>
>> The main problem I have with VMADDR_FLAG_TO_HOST for connect() is
>> that it punts the complexity to the user. Instead of a single CID
>> address space, you now effectively create 2 spaces: One for TO_HOST
>> (needs a flag) and one for TO_GUEST (no flag). But every user space
>> tool needs to learn about this flag. That may work for super
>> special-case applications. But propagating that all the way into
>> socat, iperf, etc etc? It's just creating friction.
>
> Okay, I would like to have this (or part of it) in the commit message
> to better explain why we want this change.
>
>>
>> IMHO the most natural experience is to have a single CID space,
>> potentially manually segmented by launching VMs of one kind within a
>> certain range.
>
> I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST in
> the remote address if that path is taken "automagically" ?
>
> So in that way the user space can have a way to understand if it's
> talking with a nested guest or a sibling guest.
>
>
> That said, I'm concerned about the scenario where an application does
> not even consider communicating with a sibling VM.

If that's really a realistic concern, then we should add a
VMADDR_FLAG_TO_GUEST that the application can set. Default behavior of
an application that provides no flags is "route to whatever you can
find": If vhost is loaded, it routes to vhost. If a vsock backend driver
is loaded, it routes there. But the application has no say in where it
goes: It's purely a system configuration thing.
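
The lookup order described above, combined with the overlay from the
patch description, can be sketched as a small toy model. This is only an
illustration under stated assumptions, not the kernel implementation:
pick_route(), Route and Flag are hypothetical names, TO_GUEST models the
flag merely proposed in this mail, and returning None for "no usable
transport" is a simplification.

```python
from enum import Enum, auto
from typing import Optional, Set


class Route(Enum):
    H2G = auto()  # host-to-guest transport, e.g. vhost-vsock
    G2H = auto()  # guest-to-host transport, e.g. virtio-vsock-pci


class Flag(Enum):
    TO_HOST = auto()   # stands in for the existing VMADDR_FLAG_TO_HOST
    TO_GUEST = auto()  # stands in for the VMADDR_FLAG_TO_GUEST proposed here


def pick_route(cid: int,
               flag: Optional[Flag],
               vhost_cids: Optional[Set[int]],
               g2h_loaded: bool) -> Optional[Route]:
    """Toy model of the routing decision (hypothetical helper).

    vhost_cids is the set of CIDs registered with vhost, or None if
    vhost is not loaded. g2h_loaded says whether a guest-to-host
    backend driver is present.
    """
    vhost_owns_cid = vhost_cids is not None and cid in vhost_cids

    # An explicit flag overrides the system-configuration default.
    if flag is Flag.TO_HOST:
        return Route.G2H if g2h_loaded else None
    if flag is Flag.TO_GUEST:
        return Route.H2G if vhost_owns_cid else None

    # CIDs <= 2 target the hypervisor, as in the current logic.
    if cid <= 2:
        return Route.G2H if g2h_loaded else None

    # Overlay: a CID registered by vhost takes precedence ...
    if vhost_owns_cid:
        return Route.H2G
    # ... and every other CID is owned by the hypervisor.
    return Route.G2H if g2h_loaded else None
```

With vhost owning CID 5 and a G2H driver loaded, pick_route(5, None, {5},
True) yields Route.H2G while pick_route(7, None, {5}, True) yields
Route.G2H, which is the "single CID space" experience argued for here.
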
> Until now, it knew that by not setting that flag, it could only talk
> to nested VMs, so if there was no VM with that CID, the connection
> simply failed. Whereas from this patch onwards, if the device in the
> host supports sibling VMs and there is a VM with that CID, the
> application finds itself talking to a sibling VM instead of a nested
> one, without having any idea.

I'd say an application that attempts to talk to a CID for which it does
not know whether it's vhost routed or not is running into "undefined"
territory. If you rmmod the vhost driver, it would also talk to the
hypervisor provided vsock.

> Should we make this feature opt-in in some way, such as sockopt or
> sysctl? (I understand that there is the previous problem, but
> honestly, it seems like a significant change to the behavior of
> AF_VSOCK).

We can create a sysctl to enable this behavior, with default=on. But I'm
against making the cumbersome does-not-work-out-of-the-box experience
the default. Will include it in v2.

>
>>
>> At the end of the day, the host vs guest problem is super similar to
>> a routing table.
>
> Yeah, but the point of AF_VSOCK is precisely to avoid complexities
> such as routing tables as much as possible; otherwise, AF_INET is
> already there and ready to be used. In theory, we only want
> communication between host and guest.

Yes, but nesting is a thing and nobody thought about it :). In
retrospect, it would have been better to annotate the CID with the
direction: H5 goes to CID5 on host and G5 goes to CID5 on guest. But I
see no chance to change that by now.

Alex

Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597