public inbox for virtualization@lists.linux-foundation.org
* [PATCH] vsock: Enable H2G override
@ 2026-03-02 10:41 Alexander Graf
  2026-03-02 11:47 ` Stefano Garzarella
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2026-03-02 10:41 UTC (permalink / raw)
  To: virtualization
  Cc: linux-kernel, netdev, kvm, eperezma, Jason Wang, mst,
	Stefano Garzarella, Stefan Hajnoczi, nh-open-source

Vsock maintains a single CID number space which can be used to
communicate to the host (G2H) or to a child-VM (H2G). The current logic
trivially assumes that G2H is only relevant for CID <= 2 because these
target the hypervisor.  However, in environments like Nitro Enclaves, an
instance that hosts vhost_vsock powered VMs may still want to communicate
to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.

That means that for CID > 2, we really want an overlay. By default, all
CIDs are owned by the hypervisor. But if vhost registers a CID, it takes
precedence.  Implement that logic. Vhost already knows which CIDs it
supports anyway.

With this logic, I can run a Nitro Enclave as well as a nested VM with
vhost-vsock support in parallel, with the parent instance able to
communicate to both simultaneously.

Signed-off-by: Alexander Graf <graf@amazon.com>
---
 drivers/vhost/vsock.c    | 11 +++++++++++
 include/net/af_vsock.h   |  3 +++
 net/vmw_vsock/af_vsock.c |  3 +++
 3 files changed, 17 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 054f7a718f50..223da817e305 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -91,6 +91,16 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, struct net *net)
 	return NULL;
 }
 
+static bool vhost_transport_has_cid(u32 cid)
+{
+	bool found;
+
+	rcu_read_lock();
+	found = vhost_vsock_get(cid) != NULL;
+	rcu_read_unlock();
+	return found;
+}
+
 static void
 vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 			    struct vhost_virtqueue *vq)
@@ -424,6 +434,7 @@ static struct virtio_transport vhost_transport = {
 		.module                   = THIS_MODULE,
 
 		.get_local_cid            = vhost_transport_get_local_cid,
+		.has_cid                  = vhost_transport_has_cid,
 
 		.init                     = virtio_transport_do_socket_init,
 		.destruct                 = virtio_transport_destruct,
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 533d8e75f7bb..4cdcb72f9765 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -179,6 +179,9 @@ struct vsock_transport {
 	/* Addressing. */
 	u32 (*get_local_cid)(void);
 
+	/* Check if this transport serves a specific remote CID. */
+	bool (*has_cid)(u32 cid);
+
 	/* Read a single skb */
 	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
 
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 2f7d94d682cb..8b34b264b246 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -584,6 +584,9 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g ||
 			 (remote_flags & VMADDR_FLAG_TO_HOST))
 			new_transport = transport_g2h;
+		else if (transport_h2g->has_cid &&
+			 !transport_h2g->has_cid(remote_cid))
+			new_transport = transport_g2h;
 		else
 			new_transport = transport_h2g;
 		break;
-- 
2.47.1
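For readers skimming the diff: the transport-selection order that vsock_assign_transport() ends up with after this patch can be modeled as below. This is a simplified standalone sketch, not the literal kernel code; the struct and function names here are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>

#define VMADDR_CID_HOST      2
#define VMADDR_FLAG_TO_HOST  0x01

/* Minimal stand-in for a vsock transport; has_cid may be NULL when a
 * transport cannot answer the question (pre-patch behavior). */
struct transport {
    const char *name;
    bool (*has_cid)(unsigned int cid);
};

/* Selection order after the patch:
 * 1. CID <= 2, no H2G transport, or TO_HOST flag set  -> G2H
 * 2. H2G transport present but doesn't serve this CID -> G2H (new rule)
 * 3. otherwise                                        -> H2G
 */
static const struct transport *
pick_transport(unsigned int remote_cid, unsigned int remote_flags,
               const struct transport *g2h, const struct transport *h2g)
{
    if (remote_cid <= VMADDR_CID_HOST || !h2g ||
        (remote_flags & VMADDR_FLAG_TO_HOST))
        return g2h;
    if (h2g->has_cid && !h2g->has_cid(remote_cid))
        return g2h; /* overlay: vhost doesn't own this CID, fall back */
    return h2g;
}
```

With this model, a CID registered by vhost routes H2G; any other CID > 2 falls through to the hypervisor transport.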




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-02 10:41 [PATCH] vsock: Enable H2G override Alexander Graf
@ 2026-03-02 11:47 ` Stefano Garzarella
  2026-03-02 12:06   ` Stefano Garzarella
  0 siblings, 1 reply; 15+ messages in thread
From: Stefano Garzarella @ 2026-03-02 11:47 UTC (permalink / raw)
  To: Alexander Graf
  Cc: virtualization, linux-kernel, netdev, kvm, eperezma, Jason Wang,
	mst, Stefan Hajnoczi, nh-open-source


Please target net-next tree for this new feature.

On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>Vsock maintains a single CID number space which can be used to
>communicate to the host (G2H) or to a child-VM (H2G). The current logic
>trivially assumes that G2H is only relevant for CID <= 2 because these
>target the hypervisor.  However, in environments like Nitro Enclaves, an
>instance that hosts vhost_vsock powered VMs may still want to communicate
>to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
>
>That means that for CID > 2, we really want an overlay. By default, all
>CIDs are owned by the hypervisor. But if vhost registers a CID, it takes
>precedence.  Implement that logic. Vhost already knows which CIDs it
>supports anyway.
>
>With this logic, I can run a Nitro Enclave as well as a nested VM with
>vhost-vsock support in parallel, with the parent instance able to
>communicate to both simultaneously.

I honestly don't understand why VMADDR_FLAG_TO_HOST (added specifically 
for Nitro IIRC) isn't enough for this scenario and we have to add this 
change.  Can you elaborate a bit more about the relationship between 
this change and VMADDR_FLAG_TO_HOST we added?

>
>Signed-off-by: Alexander Graf <graf@amazon.com>
>---
> drivers/vhost/vsock.c    | 11 +++++++++++
> include/net/af_vsock.h   |  3 +++
> net/vmw_vsock/af_vsock.c |  3 +++
> 3 files changed, 17 insertions(+)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index 054f7a718f50..223da817e305 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -91,6 +91,16 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, struct net *net)
> 	return NULL;
> }
>
>+static bool vhost_transport_has_cid(u32 cid)
>+{
>+	bool found;
>+
>+	rcu_read_lock();
>+	found = vhost_vsock_get(cid) != NULL;

We recently added namespaces support that changed vhost_vsock_get() 
params. This is also in net tree now and in Linus' tree, so not sure 
where this patch is based, but this needs to be rebased since it is not 
building:

../drivers/vhost/vsock.c: In function ‘vhost_transport_has_cid’:
../drivers/vhost/vsock.c:99:17: error: too few arguments to function ‘vhost_vsock_get’; expected 2, have 1
    99 |         found = vhost_vsock_get(cid) != NULL;
       |                 ^~~~~~~~~~~~~~~
../drivers/vhost/vsock.c:74:28: note: declared here
    74 | static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, struct net *net)
       |

>+	rcu_read_unlock();
>+	return found;
>+}
>+
> static void
> vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> 			    struct vhost_virtqueue *vq)
>@@ -424,6 +434,7 @@ static struct virtio_transport vhost_transport = {
> 		.module                   = THIS_MODULE,
>
> 		.get_local_cid            = vhost_transport_get_local_cid,
>+		.has_cid                  = vhost_transport_has_cid,
>
> 		.init                     = virtio_transport_do_socket_init,
> 		.destruct                 = virtio_transport_destruct,
>diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>index 533d8e75f7bb..4cdcb72f9765 100644
>--- a/include/net/af_vsock.h
>+++ b/include/net/af_vsock.h
>@@ -179,6 +179,9 @@ struct vsock_transport {
> 	/* Addressing. */
> 	u32 (*get_local_cid)(void);
>
>+	/* Check if this transport serves a specific remote CID. */
>+	bool (*has_cid)(u32 cid);

What about "has_remote_cid" ?

>+
> 	/* Read a single skb */
> 	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index 2f7d94d682cb..8b34b264b246 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -584,6 +584,9 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
> 		else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g ||
> 			 (remote_flags & VMADDR_FLAG_TO_HOST))
> 			new_transport = transport_g2h;
>+		else if (transport_h2g->has_cid &&
>+			 !transport_h2g->has_cid(remote_cid))
>+			new_transport = transport_g2h;

We should update the comment on top of this function, and maybe also try 
to support the other H2G transport (i.e. VMCI).

@Bryan @Vishnu can the new has_cid()/has_remote_cid() be supported by 
VMCI too?



I have a question: until now, transport assignment was based simply on 
analyzing local socket information (vsk->remote_addr), but now we are 
also adding the status of other components (e.g., VMs that have started 
and registered the CID in vhost-vsock).

Could this produce strange behavior?
For example, two sockets with the same remote_addr communicate with the 
host or with the guest depending on whether or not the VM existed when 
they were created.

Thanks,
Stefano

> 		else
> 			new_transport = transport_h2g;
> 		break;
>-- 
>2.47.1



* Re: [PATCH] vsock: Enable H2G override
  2026-03-02 11:47 ` Stefano Garzarella
@ 2026-03-02 12:06   ` Stefano Garzarella
  2026-03-02 15:48     ` Alexander Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Stefano Garzarella @ 2026-03-02 12:06 UTC (permalink / raw)
  To: Alexander Graf, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list
  Cc: virtualization, linux-kernel, netdev, kvm, eperezma, Jason Wang,
	mst, Stefan Hajnoczi, nh-open-source

CCing Bryan, Vishnu, and Broadcom list.

On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>
>Please target net-next tree for this new feature.
>
>On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>Vsock maintains a single CID number space which can be used to
>>communicate to the host (G2H) or to a child-VM (H2G). The current logic
>>trivially assumes that G2H is only relevant for CID <= 2 because these
>>target the hypervisor.  However, in environments like Nitro Enclaves, an
>>instance that hosts vhost_vsock powered VMs may still want to communicate
>>to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
>>
>>That means that for CID > 2, we really want an overlay. By default, all
>>CIDs are owned by the hypervisor. But if vhost registers a CID, it takes
>>precedence.  Implement that logic. Vhost already knows which CIDs it
>>supports anyway.
>>
>>With this logic, I can run a Nitro Enclave as well as a nested VM with
>>vhost-vsock support in parallel, with the parent instance able to
>>communicate to both simultaneously.
>
>I honestly don't understand why VMADDR_FLAG_TO_HOST (added 
>specifically for Nitro IIRC) isn't enough for this scenario and we 
>have to add this change.  Can you elaborate a bit more about the 
>relationship between this change and VMADDR_FLAG_TO_HOST we added?
>
>>
>>Signed-off-by: Alexander Graf <graf@amazon.com>
>>---
>>drivers/vhost/vsock.c    | 11 +++++++++++
>>include/net/af_vsock.h   |  3 +++
>>net/vmw_vsock/af_vsock.c |  3 +++
>>3 files changed, 17 insertions(+)
>>
>>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>index 054f7a718f50..223da817e305 100644
>>--- a/drivers/vhost/vsock.c
>>+++ b/drivers/vhost/vsock.c
>>@@ -91,6 +91,16 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, struct net *net)
>>	return NULL;
>>}
>>
>>+static bool vhost_transport_has_cid(u32 cid)
>>+{
>>+	bool found;
>>+
>>+	rcu_read_lock();
>>+	found = vhost_vsock_get(cid) != NULL;
>
>We recently added namespaces support that changed vhost_vsock_get() 
>params. This is also in net tree now and in Linus' tree, so not sure 
>where this patch is based, but this needs to be rebased since it is 
>not building:
>
>../drivers/vhost/vsock.c: In function ‘vhost_transport_has_cid’:
>../drivers/vhost/vsock.c:99:17: error: too few arguments to function ‘vhost_vsock_get’; expected 2, have 1
>   99 |         found = vhost_vsock_get(cid) != NULL;
>      |                 ^~~~~~~~~~~~~~~
>../drivers/vhost/vsock.c:74:28: note: declared here
>   74 | static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, struct net *net)
>      |
>
>>+	rcu_read_unlock();
>>+	return found;
>>+}
>>+
>>static void
>>vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>>			    struct vhost_virtqueue *vq)
>>@@ -424,6 +434,7 @@ static struct virtio_transport vhost_transport = {
>>		.module                   = THIS_MODULE,
>>
>>		.get_local_cid            = vhost_transport_get_local_cid,
>>+		.has_cid                  = vhost_transport_has_cid,
>>
>>		.init                     = virtio_transport_do_socket_init,
>>		.destruct                 = virtio_transport_destruct,
>>diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>index 533d8e75f7bb..4cdcb72f9765 100644
>>--- a/include/net/af_vsock.h
>>+++ b/include/net/af_vsock.h
>>@@ -179,6 +179,9 @@ struct vsock_transport {
>>	/* Addressing. */
>>	u32 (*get_local_cid)(void);
>>
>>+	/* Check if this transport serves a specific remote CID. */
>>+	bool (*has_cid)(u32 cid);
>
>What about "has_remote_cid" ?
>
>>+
>>	/* Read a single skb */
>>	int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>
>>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>>index 2f7d94d682cb..8b34b264b246 100644
>>--- a/net/vmw_vsock/af_vsock.c
>>+++ b/net/vmw_vsock/af_vsock.c
>>@@ -584,6 +584,9 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
>>		else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g ||
>>			 (remote_flags & VMADDR_FLAG_TO_HOST))
>>			new_transport = transport_g2h;
>>+		else if (transport_h2g->has_cid &&
>>+			 !transport_h2g->has_cid(remote_cid))
>>+			new_transport = transport_g2h;
>
>We should update the comment on top of this function, and maybe also 
>try to support the other H2G transport (i.e. VMCI).
>
>@Bryan @Vishnu can the new has_cid()/has_remote_cid() be supported by 
>VMCI too?

Oops, I forgot to CC them, now they should be in copy.

Stefano


>
>
>
>I have a question: until now, transport assignment was based simply on 
>analyzing local socket information (vsk->remote_addr), but now we are 
>also adding the status of other components (e.g., VMs that have 
>started and registered the CID in vhost-vsock).
>
>Could this produce strange behavior?
>For example, two sockets with the same remote_addr communicate with 
>the host or with the guest depending on whether or not the VM existed 
>when they were created.
>
>Thanks,
>Stefano
>
>>		else
>>			new_transport = transport_h2g;
>>		break;
>>-- 
>>2.47.1



* Re: [PATCH] vsock: Enable H2G override
  2026-03-02 12:06   ` Stefano Garzarella
@ 2026-03-02 15:48     ` Alexander Graf
  2026-03-02 16:25       ` Stefano Garzarella
  2026-03-02 19:52       ` Michael S. Tsirkin
  0 siblings, 2 replies; 15+ messages in thread
From: Alexander Graf @ 2026-03-02 15:48 UTC (permalink / raw)
  To: Stefano Garzarella, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list
  Cc: virtualization, linux-kernel, netdev, kvm, eperezma, Jason Wang,
	mst, Stefan Hajnoczi, nh-open-source


On 02.03.26 13:06, Stefano Garzarella wrote:
> CCing Bryan, Vishnu, and Broadcom list.
>
> On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>>
>> Please target net-next tree for this new feature.
>>
>> On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>> Vsock maintains a single CID number space which can be used to
>>> communicate to the host (G2H) or to a child-VM (H2G). The current logic
>>> trivially assumes that G2H is only relevant for CID <= 2 because these
>>> target the hypervisor.  However, in environments like Nitro 
>>> Enclaves, an
>>> instance that hosts vhost_vsock powered VMs may still want to 
>>> communicate
>>> to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
>>>
>>> That means that for CID > 2, we really want an overlay. By default, all
>>> CIDs are owned by the hypervisor. But if vhost registers a CID, it 
>>> takes
>>> precedence.  Implement that logic. Vhost already knows which CIDs it
>>> supports anyway.
>>>
>>> With this logic, I can run a Nitro Enclave as well as a nested VM with
>>> vhost-vsock support in parallel, with the parent instance able to
>>> communicate to both simultaneously.
>>
>> I honestly don't understand why VMADDR_FLAG_TO_HOST (added 
>> specifically for Nitro IIRC) isn't enough for this scenario and we 
>> have to add this change.  Can you elaborate a bit more about the 
>> relationship between this change and VMADDR_FLAG_TO_HOST we added? 


The main problem I have with VMADDR_FLAG_TO_HOST for connect() is that 
it punts the complexity to the user. Instead of a single CID address 
space, you now effectively create 2 spaces: One for TO_HOST (needs a 
flag) and one for TO_GUEST (no flag). But every user space tool needs to 
learn about this flag. That may work for super special-case 
applications. But propagating that all the way into socat, iperf, etc 
etc? It's just creating friction.

IMHO the most natural experience is to have a single CID space, 
potentially manually segmented by launching VMs of one kind within a 
certain range.

At the end of the day, the host vs guest problem is super similar to a 
routing table.
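To make the friction concrete: with the flag-based scheme, every client has to build its destination address with an explicit direction bit. A hypothetical userspace sketch (the struct mirrors the sockaddr_vm layout from <linux/vm_sockets.h>, redefined here only so the example is self-contained):

```c
#include <string.h>

/* Normally provided by <sys/socket.h> / <linux/vm_sockets.h>;
 * defined locally so this sketch compiles on its own. */
#ifndef AF_VSOCK
#define AF_VSOCK 40
#endif
#define VMADDR_FLAG_TO_HOST 0x01

struct sockaddr_vm_sketch {
    unsigned short svm_family;
    unsigned short svm_reserved1;
    unsigned int   svm_port;
    unsigned int   svm_cid;
    unsigned char  svm_flags;   /* where VMADDR_FLAG_TO_HOST lives */
    unsigned char  svm_zero[3];
};

/* Every tool (socat, iperf, ...) would need a variant of this helper:
 * the flag, not the CID alone, decides whether e.g. CID 5 means
 * "sibling enclave via the host" or "nested VM via vhost". */
static struct sockaddr_vm_sketch
make_vsock_addr(unsigned int cid, unsigned int port, int to_host)
{
    struct sockaddr_vm_sketch sa;

    memset(&sa, 0, sizeof(sa));
    sa.svm_family = AF_VSOCK;
    sa.svm_cid    = cid;
    sa.svm_port   = port;
    if (to_host)
        sa.svm_flags |= VMADDR_FLAG_TO_HOST;
    return sa;
}
```

With the overlay approach in the patch, the `to_host` parameter disappears and the same (cid, port) pair works unmodified in every tool.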


>>
>>>
>>> Signed-off-by: Alexander Graf <graf@amazon.com>
>>> ---
>>> drivers/vhost/vsock.c    | 11 +++++++++++
>>> include/net/af_vsock.h   |  3 +++
>>> net/vmw_vsock/af_vsock.c |  3 +++
>>> 3 files changed, 17 insertions(+)
>>>
>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>> index 054f7a718f50..223da817e305 100644
>>> --- a/drivers/vhost/vsock.c
>>> +++ b/drivers/vhost/vsock.c
>>> @@ -91,6 +91,16 @@ static struct vhost_vsock *vhost_vsock_get(u32 
>>> guest_cid, struct net *net)
>>>     return NULL;
>>> }
>>>
>>> +static bool vhost_transport_has_cid(u32 cid)
>>> +{
>>> +    bool found;
>>> +
>>> +    rcu_read_lock();
>>> +    found = vhost_vsock_get(cid) != NULL;
>>
>> We recently added namespaces support that changed vhost_vsock_get() 
>> params. This is also in net tree now and in Linus' tree, so not sure 
>> where this patch is based, but this needs to be rebased since it is 
>> not building:
>>
>> ../drivers/vhost/vsock.c: In function ‘vhost_transport_has_cid’:
>> ../drivers/vhost/vsock.c:99:17: error: too few arguments to function 
>> ‘vhost_vsock_get’; expected 2, have 1
>>   99 |         found = vhost_vsock_get(cid) != NULL;
>>      |                 ^~~~~~~~~~~~~~~
>> ../drivers/vhost/vsock.c:74:28: note: declared here
>>   74 | static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, 
>> struct net *net)
>>      |


D'oh. Sorry, I built this on 6.19 and only realized after the send that 
namespace support got in. Will fix up for v2.


>>
>>> +    rcu_read_unlock();
>>> +    return found;
>>> +}
>>> +
>>> static void
>>> vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>>>                 struct vhost_virtqueue *vq)
>>> @@ -424,6 +434,7 @@ static struct virtio_transport vhost_transport = {
>>>         .module                   = THIS_MODULE,
>>>
>>>         .get_local_cid            = vhost_transport_get_local_cid,
>>> +        .has_cid                  = vhost_transport_has_cid,
>>>
>>>         .init                     = virtio_transport_do_socket_init,
>>>         .destruct                 = virtio_transport_destruct,
>>> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>> index 533d8e75f7bb..4cdcb72f9765 100644
>>> --- a/include/net/af_vsock.h
>>> +++ b/include/net/af_vsock.h
>>> @@ -179,6 +179,9 @@ struct vsock_transport {
>>>     /* Addressing. */
>>>     u32 (*get_local_cid)(void);
>>>
>>> +    /* Check if this transport serves a specific remote CID. */
>>> +    bool (*has_cid)(u32 cid);
>>
>> What about "has_remote_cid" ?
>>
>>> +
>>>     /* Read a single skb */
>>>     int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>
>>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>>> index 2f7d94d682cb..8b34b264b246 100644
>>> --- a/net/vmw_vsock/af_vsock.c
>>> +++ b/net/vmw_vsock/af_vsock.c
>>> @@ -584,6 +584,9 @@ int vsock_assign_transport(struct vsock_sock 
>>> *vsk, struct vsock_sock *psk)
>>>         else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g ||
>>>              (remote_flags & VMADDR_FLAG_TO_HOST))
>>>             new_transport = transport_g2h;
>>> +        else if (transport_h2g->has_cid &&
>>> +             !transport_h2g->has_cid(remote_cid))
>>> +            new_transport = transport_g2h;
>>
>> We should update the comment on top of this fuction, and maybe also 
>> try to support the other H2G transport (i.e. VMCI).
>>
>> @Bryan @Vishnu can the new has_cid()/has_remote_cid() be supported by 
>> VMCI too?
>
> Oops, I forgot to CC them, now they should be in copy.


Ack. I can also take a quick look if it's trivial to add.


Alex



* Re: [PATCH] vsock: Enable H2G override
  2026-03-02 15:48     ` Alexander Graf
@ 2026-03-02 16:25       ` Stefano Garzarella
  2026-03-02 19:04         ` Alexander Graf
  2026-03-02 19:52       ` Michael S. Tsirkin
  1 sibling, 1 reply; 15+ messages in thread
From: Stefano Garzarella @ 2026-03-02 16:25 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Bryan Tan, Vishnu Dasa, Broadcom internal kernel review list,
	virtualization, linux-kernel, netdev, kvm, eperezma, Jason Wang,
	mst, Stefan Hajnoczi, nh-open-source

On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
>
>On 02.03.26 13:06, Stefano Garzarella wrote:
>>CCing Bryan, Vishnu, and Broadcom list.
>>
>>On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>>>
>>>Please target net-next tree for this new feature.
>>>
>>>On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>>>Vsock maintains a single CID number space which can be used to
>>>>communicate to the host (G2H) or to a child-VM (H2G). The current logic
>>>>trivially assumes that G2H is only relevant for CID <= 2 because these
>>>>target the hypervisor.  However, in environments like Nitro 
>>>>Enclaves, an
>>>>instance that hosts vhost_vsock powered VMs may still want to 
>>>>communicate
>>>>to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
>>>>
>>>>That means that for CID > 2, we really want an overlay. By default, all
>>>>CIDs are owned by the hypervisor. But if vhost registers a CID, 
>>>>it takes
>>>>precedence.  Implement that logic. Vhost already knows which CIDs it
>>>>supports anyway.
>>>>
>>>>With this logic, I can run a Nitro Enclave as well as a nested VM with
>>>>vhost-vsock support in parallel, with the parent instance able to
>>>>communicate to both simultaneously.
>>>
>>>I honestly don't understand why VMADDR_FLAG_TO_HOST (added 
>>>specifically for Nitro IIRC) isn't enough for this scenario and we 
>>>have to add this change.  Can you elaborate a bit more about the 
>>>relationship between this change and VMADDR_FLAG_TO_HOST we added?
>
>
>The main problem I have with VMADDR_FLAG_TO_HOST for connect() is that 
>it punts the complexity to the user. Instead of a single CID address 
>space, you now effectively create 2 spaces: One for TO_HOST (needs a 
>flag) and one for TO_GUEST (no flag). But every user space tool needs 
>to learn about this flag. That may work for super special-case 
>applications. But propagating that all the way into socat, iperf, etc 
>etc? It's just creating friction.

Okay, I would like to have this (or part of it) in the commit message to 
better explain why we want this change.

>
>IMHO the most natural experience is to have a single CID space, 
>potentially manually segmented by launching VMs of one kind within a 
>certain range.

I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST in 
the remote address if that path is taken "automagically" ?

So in that way the user space can have a way to understand if it's 
talking with a nested guest or a sibling guest.


That said, I'm concerned about the scenario where an application does 
not even consider communicating with a sibling VM.

Until now, it knew that by not setting that flag, it could only talk to 
nested VMs, so if there was no VM with that CID, the connection simply 
failed. Whereas from this patch onwards, if the device in the host 
supports sibling VMs and there is a VM with that CID, the application 
finds itself talking to a sibling VM instead of a nested one, without 
having any idea.

Should we make this feature opt-in in some way, such as sockopt or 
sysctl? (I understand that there is the previous problem, but honestly, 
it seems like a significant change to the behavior of AF_VSOCK).

>
>At the end of the day, the host vs guest problem is super similar to a 
>routing table.

Yeah, but the point of AF_VSOCK is precisely to avoid complexities such 
as routing tables as much as possible; otherwise, AF_INET is already 
there and ready to be used. In theory, we only want communication 
between host and guest.

>
>
>>>
>>>>
>>>>Signed-off-by: Alexander Graf <graf@amazon.com>
>>>>---
>>>>drivers/vhost/vsock.c    | 11 +++++++++++
>>>>include/net/af_vsock.h   |  3 +++
>>>>net/vmw_vsock/af_vsock.c |  3 +++
>>>>3 files changed, 17 insertions(+)
>>>>
>>>>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>>>index 054f7a718f50..223da817e305 100644
>>>>--- a/drivers/vhost/vsock.c
>>>>+++ b/drivers/vhost/vsock.c
>>>>@@ -91,6 +91,16 @@ static struct vhost_vsock 
>>>>*vhost_vsock_get(u32 guest_cid, struct net *net)
>>>>    return NULL;
>>>>}
>>>>
>>>>+static bool vhost_transport_has_cid(u32 cid)
>>>>+{
>>>>+    bool found;
>>>>+
>>>>+    rcu_read_lock();
>>>>+    found = vhost_vsock_get(cid) != NULL;
>>>
>>>We recently added namespaces support that changed 
>>>vhost_vsock_get() params. This is also in net tree now and in 
>>>Linus' tree, so not sure where this patch is based, but this needs 
>>>to be rebased since it is not building:
>>>
>>>../drivers/vhost/vsock.c: In function ‘vhost_transport_has_cid’:
>>>../drivers/vhost/vsock.c:99:17: error: too few arguments to 
>>>function ‘vhost_vsock_get’; expected 2, have 1
>>>  99 |         found = vhost_vsock_get(cid) != NULL;
>>>     |                 ^~~~~~~~~~~~~~~
>>>../drivers/vhost/vsock.c:74:28: note: declared here
>>>  74 | static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, 
>>>struct net *net)
>>>     |
>
>
>D'oh. Sorry, I built this on 6.19 and only realized after the send 
>that namespace support got in. Will fix up for v2.

Thanks.

>
>
>>>
>>>>+    rcu_read_unlock();
>>>>+    return found;
>>>>+}
>>>>+
>>>>static void
>>>>vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>>>>                struct vhost_virtqueue *vq)
>>>>@@ -424,6 +434,7 @@ static struct virtio_transport vhost_transport = {
>>>>        .module                   = THIS_MODULE,
>>>>
>>>>        .get_local_cid            = vhost_transport_get_local_cid,
>>>>+        .has_cid                  = vhost_transport_has_cid,
>>>>
>>>>        .init                     = virtio_transport_do_socket_init,
>>>>        .destruct                 = virtio_transport_destruct,
>>>>diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>>>>index 533d8e75f7bb..4cdcb72f9765 100644
>>>>--- a/include/net/af_vsock.h
>>>>+++ b/include/net/af_vsock.h
>>>>@@ -179,6 +179,9 @@ struct vsock_transport {
>>>>    /* Addressing. */
>>>>    u32 (*get_local_cid)(void);
>>>>
>>>>+    /* Check if this transport serves a specific remote CID. */
>>>>+    bool (*has_cid)(u32 cid);
>>>
>>>What about "has_remote_cid" ?
>>>
>>>>+
>>>>    /* Read a single skb */
>>>>    int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>>>>
>>>>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>>>>index 2f7d94d682cb..8b34b264b246 100644
>>>>--- a/net/vmw_vsock/af_vsock.c
>>>>+++ b/net/vmw_vsock/af_vsock.c
>>>>@@ -584,6 +584,9 @@ int vsock_assign_transport(struct vsock_sock 
>>>>*vsk, struct vsock_sock *psk)
>>>>        else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g ||
>>>>             (remote_flags & VMADDR_FLAG_TO_HOST))
>>>>            new_transport = transport_g2h;
>>>>+        else if (transport_h2g->has_cid &&
>>>>+             !transport_h2g->has_cid(remote_cid))
>>>>+            new_transport = transport_g2h;
>>>
>>>We should update the comment on top of this function, and maybe 
>>>also try to support the other H2G transport (i.e. VMCI).
>>>
>>>@Bryan @Vishnu can the new has_cid()/has_remote_cid() be supported 
>>>by VMCI too?
>>
>>Oops, I forgot to CC them, now they should be in copy.
>
>
>Ack. I can also take a quick look if it's trivial to add.

Great, thanks for that!

Stefano



* Re: [PATCH] vsock: Enable H2G override
  2026-03-02 16:25       ` Stefano Garzarella
@ 2026-03-02 19:04         ` Alexander Graf
  2026-03-03  9:49           ` Stefano Garzarella
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2026-03-02 19:04 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Bryan Tan, Vishnu Dasa, Broadcom internal kernel review list,
	virtualization, linux-kernel, netdev, kvm, eperezma, Jason Wang,
	mst, Stefan Hajnoczi, nh-open-source


On 02.03.26 17:25, Stefano Garzarella wrote:
> On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
>>
>> On 02.03.26 13:06, Stefano Garzarella wrote:
>>> CCing Bryan, Vishnu, and Broadcom list.
>>>
>>> On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>>>>
>>>> Please target net-next tree for this new feature.
>>>>
>>>> On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>>>> Vsock maintains a single CID number space which can be used to
>>>>> communicate to the host (G2H) or to a child-VM (H2G). The current 
>>>>> logic
>>>>> trivially assumes that G2H is only relevant for CID <= 2 because 
>>>>> these
>>>>> target the hypervisor.  However, in environments like Nitro 
>>>>> Enclaves, an
>>>>> instance that hosts vhost_vsock powered VMs may still want to 
>>>>> communicate
>>>>> to Enclaves that are reachable at higher CIDs through 
>>>>> virtio-vsock-pci.
>>>>>
>>>>> That means that for CID > 2, we really want an overlay. By 
>>>>> default, all
>>>>> CIDs are owned by the hypervisor. But if vhost registers a CID, it 
>>>>> takes
>>>>> precedence.  Implement that logic. Vhost already knows which CIDs it
>>>>> supports anyway.
>>>>>
>>>>> With this logic, I can run a Nitro Enclave as well as a nested VM 
>>>>> with
>>>>> vhost-vsock support in parallel, with the parent instance able to
>>>>> communicate to both simultaneously.
>>>>
>>>> I honestly don't understand why VMADDR_FLAG_TO_HOST (added 
>>>> specifically for Nitro IIRC) isn't enough for this scenario and we 
>>>> have to add this change.  Can you elaborate a bit more about the 
>>>> relationship between this change and VMADDR_FLAG_TO_HOST we added?
>>
>>
>> The main problem I have with VMADDR_FLAG_TO_HOST for connect() is 
>> that it punts the complexity to the user. Instead of a single CID 
>> address space, you now effectively create 2 spaces: One for TO_HOST 
>> (needs a flag) and one for TO_GUEST (no flag). But every user space 
>> tool needs to learn about this flag. That may work for super 
>> special-case applications. But propagating that all the way into 
>> socat, iperf, etc etc? It's just creating friction.
>
> Okay, I would like to have this (or part of it) in the commit message 
> to better explain why we want this change.
>
>>
>> IMHO the most natural experience is to have a single CID space, 
>> potentially manually segmented by launching VMs of one kind within a 
>> certain range.
>
> I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST in 
> the remote address if that path is taken "automagically"?
>
> So in that way the user space can have a way to understand if it's 
> talking with a nested guest or a sibling guest.
>
>
> That said, I'm concerned about the scenario where an application does 
> not even consider communicating with a sibling VM.


If that's really a realistic concern, then we should add a 
VMADDR_FLAG_TO_GUEST that the application can set. Default behavior of 
an application that provides no flags is "route to whatever you can 
find": If vhost is loaded, it routes to vhost. If a vsock backend driver 
is loaded, it routes there. But the application has no say in where it 
goes: It's purely a system configuration thing.


> Until now, it knew that by not setting that flag, it could only talk 
> to nested VMs, so if there was no VM with that CID, the connection 
> simply failed. Whereas from this patch onwards, if the device in the 
> host supports sibling VMs and there is a VM with that CID, the 
> application finds itself talking to a sibling VM instead of a nested 
> one, without having any idea.


I'd say an application that attempts to talk to a CID without knowing 
whether it's vhost routed or not is running into "undefined" territory. 
If you rmmod the vhost driver, it would also talk to the hypervisor 
provided vsock.


> Should we make this feature opt-in in some way, such as sockopt or 
> sysctl? (I understand that there is the previous problem, but 
> honestly, it seems like a significant change to the behavior of 
> AF_VSOCK).


We can create a sysctl to enable the behavior, with default=on. But I'm 
against making the cumbersome does-not-work-out-of-the-box experience 
the default. Will include it in v2.


>
>>
>> At the end of the day, the host vs guest problem is super similar to 
>> a routing table.
>
> Yeah, but the point of AF_VSOCK is precisely to avoid complexities 
> such as routing tables as much as possible; otherwise, AF_INET is 
> already there and ready to be used. In theory, we only want 
> communication between host and guest.


Yes, but nesting is a thing and nobody thought about it :). In 
retrospect, it would have been better to annotate the CID with the 
direction: H5 goes to CID5 on host and G5 goes to CID5 on guest. But I 
see no chance to change that now.


Alex




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-02 15:48     ` Alexander Graf
  2026-03-02 16:25       ` Stefano Garzarella
@ 2026-03-02 19:52       ` Michael S. Tsirkin
  2026-03-03  6:51         ` Alexander Graf
  1 sibling, 1 reply; 15+ messages in thread
From: Michael S. Tsirkin @ 2026-03-02 19:52 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Stefano Garzarella, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, virtualization,
	linux-kernel, netdev, kvm, eperezma, Jason Wang, Stefan Hajnoczi,
	nh-open-source

On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
> 
> On 02.03.26 13:06, Stefano Garzarella wrote:
> > CCing Bryan, Vishnu, and Broadcom list.
> > 
> > On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
> > > 
> > > Please target net-next tree for this new feature.
> > > 
> > > On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
> > > > Vsock maintains a single CID number space which can be used to
> > > > communicate to the host (G2H) or to a child-VM (H2G). The current logic
> > > > trivially assumes that G2H is only relevant for CID <= 2 because these
> > > > target the hypervisor.  However, in environments like Nitro
> > > > Enclaves, an
> > > > instance that hosts vhost_vsock powered VMs may still want to
> > > > communicate
> > > > to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
> > > > 
> > > > That means that for CID > 2, we really want an overlay. By default, all
> > > > CIDs are owned by the hypervisor. But if vhost registers a CID,
> > > > it takes
> > > > precedence.  Implement that logic. Vhost already knows which CIDs it
> > > > supports anyway.
> > > > 
> > > > With this logic, I can run a Nitro Enclave as well as a nested VM with
> > > > vhost-vsock support in parallel, with the parent instance able to
> > > > communicate to both simultaneously.
> > > 
> > > I honestly don't understand why VMADDR_FLAG_TO_HOST (added
> > > specifically for Nitro IIRC) isn't enough for this scenario and we
> > > have to add this change.  Can you elaborate a bit more about the
> > > relationship between this change and VMADDR_FLAG_TO_HOST we added?
> 
> 
> The main problem I have with VMADDR_FLAG_TO_HOST for connect() is that it
> punts the complexity to the user. Instead of a single CID address space, you
> now effectively create 2 spaces: One for TO_HOST (needs a flag) and one for
> TO_GUEST (no flag). But every user space tool needs to learn about this
> flag. That may work for super special-case applications. But propagating
> that all the way into socat, iperf, etc etc? It's just creating friction.
> 
> IMHO the most natural experience is to have a single CID space, potentially
> manually segmented by launching VMs of one kind within a certain range.
> 
> At the end of the day, the host vs guest problem is super similar to a
> routing table.

If this is what's desired, some bits could be stolen from the CID
to specify the destination type. Would that address the issue?
Just a thought.



> 
> > > 
> > > > 
> > > > Signed-off-by: Alexander Graf <graf@amazon.com>
> > > > ---
> > > > drivers/vhost/vsock.c    | 11 +++++++++++
> > > > include/net/af_vsock.h   |  3 +++
> > > > net/vmw_vsock/af_vsock.c |  3 +++
> > > > 3 files changed, 17 insertions(+)
> > > > 
> > > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > > > index 054f7a718f50..223da817e305 100644
> > > > --- a/drivers/vhost/vsock.c
> > > > +++ b/drivers/vhost/vsock.c
> > > > @@ -91,6 +91,16 @@ static struct vhost_vsock
> > > > *vhost_vsock_get(u32 guest_cid, struct net *net)
> > > >     return NULL;
> > > > }
> > > > 
> > > > +static bool vhost_transport_has_cid(u32 cid)
> > > > +{
> > > > +    bool found;
> > > > +
> > > > +    rcu_read_lock();
> > > > +    found = vhost_vsock_get(cid) != NULL;
> > > 
> > > We recently added namespaces support that changed vhost_vsock_get()
> > > params. This is also in net tree now and in Linus' tree, so not sure
> > > where this patch is based, but this needs to be rebased since it is
> > > not building:
> > > 
> > > ../drivers/vhost/vsock.c: In function ‘vhost_transport_has_cid’:
> > > ../drivers/vhost/vsock.c:99:17: error: too few arguments to function
> > > ‘vhost_vsock_get’; expected 2, have 1
> > >   99 |         found = vhost_vsock_get(cid) != NULL;
> > >      |                 ^~~~~~~~~~~~~~~
> > > ../drivers/vhost/vsock.c:74:28: note: declared here
> > >   74 | static struct vhost_vsock *vhost_vsock_get(u32 guest_cid,
> > > struct net *net)
> > >      |
> 
> 
> D'oh. Sorry, I built this on 6.19 and only realized after the send that
> namespace support got in. Will fix up for v2.
> 
> 
> > > 
> > > > +    rcu_read_unlock();
> > > > +    return found;
> > > > +}
> > > > +
> > > > static void
> > > > vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> > > >                 struct vhost_virtqueue *vq)
> > > > @@ -424,6 +434,7 @@ static struct virtio_transport vhost_transport = {
> > > >         .module                   = THIS_MODULE,
> > > > 
> > > >         .get_local_cid            = vhost_transport_get_local_cid,
> > > > +        .has_cid                  = vhost_transport_has_cid,
> > > > 
> > > >         .init                     = virtio_transport_do_socket_init,
> > > >         .destruct                 = virtio_transport_destruct,
> > > > diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> > > > index 533d8e75f7bb..4cdcb72f9765 100644
> > > > --- a/include/net/af_vsock.h
> > > > +++ b/include/net/af_vsock.h
> > > > @@ -179,6 +179,9 @@ struct vsock_transport {
> > > >     /* Addressing. */
> > > >     u32 (*get_local_cid)(void);
> > > > 
> > > > +    /* Check if this transport serves a specific remote CID. */
> > > > +    bool (*has_cid)(u32 cid);
> > > 
> > > What about "has_remote_cid" ?
> > > 
> > > > +
> > > >     /* Read a single skb */
> > > >     int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
> > > > 
> > > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > > > index 2f7d94d682cb..8b34b264b246 100644
> > > > --- a/net/vmw_vsock/af_vsock.c
> > > > +++ b/net/vmw_vsock/af_vsock.c
> > > > @@ -584,6 +584,9 @@ int vsock_assign_transport(struct vsock_sock
> > > > *vsk, struct vsock_sock *psk)
> > > >         else if (remote_cid <= VMADDR_CID_HOST || !transport_h2g ||
> > > >              (remote_flags & VMADDR_FLAG_TO_HOST))
> > > >             new_transport = transport_g2h;
> > > > +        else if (transport_h2g->has_cid &&
> > > > +             !transport_h2g->has_cid(remote_cid))
> > > > +            new_transport = transport_g2h;
> > > 
> > > We should update the comment on top of this function, and maybe also
> > > try to support the other H2G transport (i.e. VMCI).
> > > 
> > > @Bryan @Vishnu can the new has_cid()/has_remote_cid() be supported
> > > by VMCI too?
> > 
> > Oops, I forgot to CC them, now they should be in copy.
> 
> 
> Ack. I can also take a quick look if it's trivial to add.
> 
> 
> Alex
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-02 19:52       ` Michael S. Tsirkin
@ 2026-03-03  6:51         ` Alexander Graf
  2026-03-03  7:19           ` Michael S. Tsirkin
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2026-03-03  6:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefano Garzarella, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, virtualization,
	linux-kernel, netdev, kvm, eperezma, Jason Wang, Stefan Hajnoczi,
	nh-open-source


On 02.03.26 20:52, Michael S. Tsirkin wrote:
> On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
>> On 02.03.26 13:06, Stefano Garzarella wrote:
>>> CCing Bryan, Vishnu, and Broadcom list.
>>>
>>> On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>>>> Please target net-next tree for this new feature.
>>>>
>>>> On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>>>> Vsock maintains a single CID number space which can be used to
>>>>> communicate to the host (G2H) or to a child-VM (H2G). The current logic
>>>>> trivially assumes that G2H is only relevant for CID <= 2 because these
>>>>> target the hypervisor.  However, in environments like Nitro
>>>>> Enclaves, an
>>>>> instance that hosts vhost_vsock powered VMs may still want to
>>>>> communicate
>>>>> to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
>>>>>
>>>>> That means that for CID > 2, we really want an overlay. By default, all
>>>>> CIDs are owned by the hypervisor. But if vhost registers a CID,
>>>>> it takes
>>>>> precedence.  Implement that logic. Vhost already knows which CIDs it
>>>>> supports anyway.
>>>>>
>>>>> With this logic, I can run a Nitro Enclave as well as a nested VM with
>>>>> vhost-vsock support in parallel, with the parent instance able to
>>>>> communicate to both simultaneously.
>>>> I honestly don't understand why VMADDR_FLAG_TO_HOST (added
>>>> specifically for Nitro IIRC) isn't enough for this scenario and we
>>>> have to add this change.  Can you elaborate a bit more about the
>>>> relationship between this change and VMADDR_FLAG_TO_HOST we added?
>>
>> The main problem I have with VMADDR_FLAG_TO_HOST for connect() is that it
>> punts the complexity to the user. Instead of a single CID address space, you
>> now effectively create 2 spaces: One for TO_HOST (needs a flag) and one for
>> TO_GUEST (no flag). But every user space tool needs to learn about this
>> flag. That may work for super special-case applications. But propagating
>> that all the way into socat, iperf, etc etc? It's just creating friction.
>>
>> IMHO the most natural experience is to have a single CID space, potentially
>> manually segmented by launching VMs of one kind within a certain range.
>>
>> At the end of the day, the host vs guest problem is super similar to a
>> routing table.
> If this is what's desired, some bits could be stolen from the CID
> to specify the destination type. Would that address the issue?
> Just a thought.


If we had thought of this from the beginning, yes. But now that everyone 
thinks CID (guest) == CID (host), I believe this is no longer feasible.


Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-03  6:51         ` Alexander Graf
@ 2026-03-03  7:19           ` Michael S. Tsirkin
  2026-03-03  9:57             ` Stefano Garzarella
  0 siblings, 1 reply; 15+ messages in thread
From: Michael S. Tsirkin @ 2026-03-03  7:19 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Stefano Garzarella, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, virtualization,
	linux-kernel, netdev, kvm, eperezma, Jason Wang, Stefan Hajnoczi,
	nh-open-source

On Tue, Mar 03, 2026 at 07:51:32AM +0100, Alexander Graf wrote:
> 
> On 02.03.26 20:52, Michael S. Tsirkin wrote:
> > On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
> > > On 02.03.26 13:06, Stefano Garzarella wrote:
> > > > CCing Bryan, Vishnu, and Broadcom list.
> > > > 
> > > > On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
> > > > > Please target net-next tree for this new feature.
> > > > > 
> > > > > On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
> > > > > > Vsock maintains a single CID number space which can be used to
> > > > > > communicate to the host (G2H) or to a child-VM (H2G). The current logic
> > > > > > trivially assumes that G2H is only relevant for CID <= 2 because these
> > > > > > target the hypervisor.  However, in environments like Nitro
> > > > > > Enclaves, an
> > > > > > instance that hosts vhost_vsock powered VMs may still want to
> > > > > > communicate
> > > > > > to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
> > > > > > 
> > > > > > That means that for CID > 2, we really want an overlay. By default, all
> > > > > > CIDs are owned by the hypervisor. But if vhost registers a CID,
> > > > > > it takes
> > > > > > precedence.  Implement that logic. Vhost already knows which CIDs it
> > > > > > supports anyway.
> > > > > > 
> > > > > > With this logic, I can run a Nitro Enclave as well as a nested VM with
> > > > > > vhost-vsock support in parallel, with the parent instance able to
> > > > > > communicate to both simultaneously.
> > > > > I honestly don't understand why VMADDR_FLAG_TO_HOST (added
> > > > > specifically for Nitro IIRC) isn't enough for this scenario and we
> > > > > have to add this change.  Can you elaborate a bit more about the
> > > > > relationship between this change and VMADDR_FLAG_TO_HOST we added?
> > > 
> > > The main problem I have with VMADDR_FLAG_TO_HOST for connect() is that it
> > > punts the complexity to the user. Instead of a single CID address space, you
> > > now effectively create 2 spaces: One for TO_HOST (needs a flag) and one for
> > > TO_GUEST (no flag). But every user space tool needs to learn about this
> > > flag. That may work for super special-case applications. But propagating
> > > that all the way into socat, iperf, etc etc? It's just creating friction.
> > > 
> > > IMHO the most natural experience is to have a single CID space, potentially
> > > manually segmented by launching VMs of one kind within a certain range.
> > > 
> > > At the end of the day, the host vs guest problem is super similar to a
> > > routing table.
> > If this is what's desired, some bits could be stolen from the CID
> > to specify the destination type. Would that address the issue?
> > Just a thought.
> 
> 
> If we had thought of this from the beginning, yes. But now that everyone
> thinks CID (guest) == CID (host), I believe this is no longer feasible.
> 
> 
> Alex


I don't really insist, but just to point out that if we wanted to, we
could map multiple CIDs to host. Anyway.


> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-02 19:04         ` Alexander Graf
@ 2026-03-03  9:49           ` Stefano Garzarella
  2026-03-03 14:17             ` Bryan Tan
  0 siblings, 1 reply; 15+ messages in thread
From: Stefano Garzarella @ 2026-03-03  9:49 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Bryan Tan, Vishnu Dasa, Broadcom internal kernel review list,
	virtualization, linux-kernel, netdev, kvm, eperezma, Jason Wang,
	mst, Stefan Hajnoczi, nh-open-source

On Mon, Mar 02, 2026 at 08:04:22PM +0100, Alexander Graf wrote:
>
>On 02.03.26 17:25, Stefano Garzarella wrote:
>>On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
>>>
>>>On 02.03.26 13:06, Stefano Garzarella wrote:
>>>>CCing Bryan, Vishnu, and Broadcom list.
>>>>
>>>>On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>>>>>
>>>>>Please target net-next tree for this new feature.
>>>>>
>>>>>On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>>>>>Vsock maintains a single CID number space which can be used to
>>>>>>communicate to the host (G2H) or to a child-VM (H2G). The 
>>>>>>current logic
>>>>>>trivially assumes that G2H is only relevant for CID <= 2 
>>>>>>because these
>>>>>>target the hypervisor.  However, in environments like Nitro 
>>>>>>Enclaves, an
>>>>>>instance that hosts vhost_vsock powered VMs may still want 
>>>>>>to communicate
>>>>>>to Enclaves that are reachable at higher CIDs through 
>>>>>>virtio-vsock-pci.
>>>>>>
>>>>>>That means that for CID > 2, we really want an overlay. By 
>>>>>>default, all
>>>>>>CIDs are owned by the hypervisor. But if vhost registers a 
>>>>>>CID, it takes
>>>>>>precedence.  Implement that logic. Vhost already knows which CIDs it
>>>>>>supports anyway.
>>>>>>
>>>>>>With this logic, I can run a Nitro Enclave as well as a 
>>>>>>nested VM with
>>>>>>vhost-vsock support in parallel, with the parent instance able to
>>>>>>communicate to both simultaneously.
>>>>>
>>>>>I honestly don't understand why VMADDR_FLAG_TO_HOST (added 
>>>>>specifically for Nitro IIRC) isn't enough for this scenario 
>>>>>and we have to add this change.  Can you elaborate a bit more 
>>>>>about the relationship between this change and 
>>>>>VMADDR_FLAG_TO_HOST we added?
>>>
>>>
>>>The main problem I have with VMADDR_FLAG_TO_HOST for connect() is 
>>>that it punts the complexity to the user. Instead of a single CID 
>>>address space, you now effectively create 2 spaces: One for 
>>>TO_HOST (needs a flag) and one for TO_GUEST (no flag). But every 
>>>user space tool needs to learn about this flag. That may work for 
>>>super special-case applications. But propagating that all the way 
>>>into socat, iperf, etc etc? It's just creating friction.
>>
>>Okay, I would like to have this (or part of it) in the commit 
>>message to better explain why we want this change.
>>
>>>
>>>IMHO the most natural experience is to have a single CID space, 
>>>potentially manually segmented by launching VMs of one kind within 
>>>a certain range.
>>
>>I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST 
>>in the remote address if that path is taken "automagically" ?
>>
>>So in that way the user space can have a way to understand if it's 
>>talking with a nested guest or a sibling guest.
>>
>>
>>That said, I'm concerned about the scenario where an application 
>>does not even consider communicating with a sibling VM.
>
>
>If that's really a realistic concern, then we should add a 
>VMADDR_FLAG_TO_GUEST that the application can set. Default behavior of 
>an application that provides no flags is "route to whatever you can 
>find": If vhost is loaded, it routes to vhost. If a vsock backend 

mmm, we have always documented this simple behavior:
- CID = 2 talks to the host
- CID >= 3 talks to the guest

Now we are changing this by adding a fallback. I don't think we should 
change the default behavior, but rather provide new ways to enable this 
new behavior.

I find it strange that an application running on Linux 7.0 has a default 
behavior where using CID=42 always talks to a nested VM, but starting 
with Linux 7.1, it also starts talking to a sibling VM.

>driver is loaded, it routes there. But the application has no say in 
>where it goes: It's purely a system configuration thing.

This is true for complex things like IP, but for VSOCK we have always 
wanted to keep the default behavior very simple (as written above). 
Everything else must be explicitly enabled IMHO.

>
>
>>Until now, it knew that by not setting that flag, it could only talk 
>>to nested VMs, so if there was no VM with that CID, the connection 
>>simply failed. Whereas from this patch onwards, if the device in the 
>>host supports sibling VMs and there is a VM with that CID, the 
>>application finds itself talking to a sibling VM instead of a nested 
>>one, without having any idea.
>
>
>I'd say an application that attempts to talk to a CID without knowing 
>whether it's vhost routed or not is running into "undefined" 
>territory. If you rmmod the vhost driver, it would also talk to the 
>hypervisor provided vsock.

Oh, I missed that. And I also fixed that behaviour with commit 
65b422d9b61b ("vsock: forward all packets to the host when no H2G is 
registered") after I implemented the multi-transport support.

mmm, this could change my position ;-) (although, to be honest, I don't 
understand why it was like that in the first place, but that's how it is 
now).

Please also document this in the new commit message; it's a good point.
Although when H2G is loaded we behave differently, it is true that the 
sysctl would help us standardize this behavior.

I don't know whether to see it as a regression or not.

>
>
>>Should we make this feature opt-in in some way, such as sockopt or 
>>sysctl? (I understand that there is the previous problem, but 
>>honestly, it seems like a significant change to the behavior of 
>>AF_VSOCK).
>
>
>We can create a sysctl to enable behavior with default=on. But I'm 
>against making the cumbersome does-not-work-out-of-the-box experience 
>the default. Will include it in v2.

The opposite point of view is that we would not want to have different 
default behavior between 7.0 and 7.1 when H2G is loaded.

>
>
>>
>>>
>>>At the end of the day, the host vs guest problem is super similar 
>>>to a routing table.
>>
>>Yeah, but the point of AF_VSOCK is precisely to avoid complexities 
>>such as routing tables as much as possible; otherwise, AF_INET is 
>>already there and ready to be used. In theory, we only want 
>>communication between host and guest.
>
>
>Yes, but nesting is a thing and nobody thought about it :). In 
>retrospect, it would have been better to annotate the CID with the 
>direction: H5 goes to CID5 on host and G5 goes to CID5 on guest. But I 
>see no chance to change that now.

Yep, this is why we added the VMADDR_FLAG_TO_HOST.

Stefano


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-03  7:19           ` Michael S. Tsirkin
@ 2026-03-03  9:57             ` Stefano Garzarella
  0 siblings, 0 replies; 15+ messages in thread
From: Stefano Garzarella @ 2026-03-03  9:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Graf, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, virtualization,
	linux-kernel, netdev, kvm, eperezma, Jason Wang, Stefan Hajnoczi,
	nh-open-source

On Tue, Mar 03, 2026 at 02:19:13AM -0500, Michael S. Tsirkin wrote:
>On Tue, Mar 03, 2026 at 07:51:32AM +0100, Alexander Graf wrote:
>>
>> On 02.03.26 20:52, Michael S. Tsirkin wrote:
>> > On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
>> > > On 02.03.26 13:06, Stefano Garzarella wrote:
>> > > > CCing Bryan, Vishnu, and Broadcom list.
>> > > >
>> > > > On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>> > > > > Please target net-next tree for this new feature.
>> > > > >
>> > > > > On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>> > > > > > Vsock maintains a single CID number space which can be used to
>> > > > > > communicate to the host (G2H) or to a child-VM (H2G). The current logic
>> > > > > > trivially assumes that G2H is only relevant for CID <= 2 because these
>> > > > > > target the hypervisor.  However, in environments like Nitro
>> > > > > > Enclaves, an
>> > > > > > instance that hosts vhost_vsock powered VMs may still want to
>> > > > > > communicate
>> > > > > > to Enclaves that are reachable at higher CIDs through virtio-vsock-pci.
>> > > > > >
>> > > > > > That means that for CID > 2, we really want an overlay. By default, all
>> > > > > > CIDs are owned by the hypervisor. But if vhost registers a CID,
>> > > > > > it takes
>> > > > > > precedence.  Implement that logic. Vhost already knows which CIDs it
>> > > > > > supports anyway.
>> > > > > >
>> > > > > > With this logic, I can run a Nitro Enclave as well as a nested VM with
>> > > > > > vhost-vsock support in parallel, with the parent instance able to
>> > > > > > communicate to both simultaneously.
>> > > > > I honestly don't understand why VMADDR_FLAG_TO_HOST (added
>> > > > > specifically for Nitro IIRC) isn't enough for this scenario and we
>> > > > > have to add this change.  Can you elaborate a bit more about the
>> > > > > relationship between this change and VMADDR_FLAG_TO_HOST we added?
>> > >
>> > > The main problem I have with VMADDR_FLAG_TO_HOST for connect() is that it
>> > > punts the complexity to the user. Instead of a single CID address space, you
>> > > now effectively create 2 spaces: One for TO_HOST (needs a flag) and one for
>> > > TO_GUEST (no flag). But every user space tool needs to learn about this
>> > > flag. That may work for super special-case applications. But propagating
>> > > that all the way into socat, iperf, etc etc? It's just creating friction.
>> > >
>> > > IMHO the most natural experience is to have a single CID space, potentially
>> > > manually segmented by launching VMs of one kind within a certain range.
>> > >
>> > > At the end of the day, the host vs guest problem is super similar to a
>> > > routing table.
>> > If this is what's desired, some bits could be stolen from the CID
>> > to specify the destination type. Would that address the issue?
>> > Just a thought.

Nope :-( VMMs sometimes use a random u32 as the CID (avoiding reserved 
ones like 0, 1, 2, 3, U32_MAX). We also documented them in the virtio spec:
https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4780004

>>
>>
>> If we had thought of this from the beginning, yes. But now that everyone
>> thinks CID (guest) == CID (host), I believe this is no longer feasible.

We added a new flag (VMADDR_FLAG_TO_HOST) in struct sockaddr_vm exactly 
for that use case around 6 years ago [1], but not much work was done to 
propagate that change to userspace tools.

IMO that should be improved, and if this is useful for Nitro, you should 
try to help on that effort.

Stefano

[1] 
https://lore.kernel.org/netdev/20201214161122.37717-1-andraprs@amazon.com/

>>
>>
>> Alex
>
>
>I don't really insist, but just to point out that if we wanted to, we
>could map multiple CIDs to host. Anyway.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-03  9:49           ` Stefano Garzarella
@ 2026-03-03 14:17             ` Bryan Tan
  2026-03-03 20:47               ` Alexander Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Bryan Tan @ 2026-03-03 14:17 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Alexander Graf, Vishnu Dasa, Broadcom internal kernel review list,
	virtualization, linux-kernel, netdev, kvm, eperezma, Jason Wang,
	mst, Stefan Hajnoczi, nh-open-source

On Tue, Mar 3, 2026 at 9:49 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Mon, Mar 02, 2026 at 08:04:22PM +0100, Alexander Graf wrote:
> >
> >On 02.03.26 17:25, Stefano Garzarella wrote:
> >>On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
> >>>
> >>>On 02.03.26 13:06, Stefano Garzarella wrote:
> >>>>CCing Bryan, Vishnu, and Broadcom list.
> >>>>
> >>>>On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
> >>>>>
> >>>>>Please target net-next tree for this new feature.
> >>>>>
> >>>>>On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
> >>>>>>Vsock maintains a single CID number space which can be used to
> >>>>>>communicate to the host (G2H) or to a child-VM (H2G). The
> >>>>>>current logic
> >>>>>>trivially assumes that G2H is only relevant for CID <= 2
> >>>>>>because these
> >>>>>>target the hypervisor.  However, in environments like Nitro
> >>>>>>Enclaves, an
> >>>>>>instance that hosts vhost_vsock powered VMs may still want
> >>>>>>to communicate
> >>>>>>to Enclaves that are reachable at higher CIDs through
> >>>>>>virtio-vsock-pci.
> >>>>>>
> >>>>>>That means that for CID > 2, we really want an overlay. By
> >>>>>>default, all
> >>>>>>CIDs are owned by the hypervisor. But if vhost registers a
> >>>>>>CID, it takes
> >>>>>>precedence.  Implement that logic. Vhost already knows which CIDs it
> >>>>>>supports anyway.
> >>>>>>
> >>>>>>With this logic, I can run a Nitro Enclave as well as a
> >>>>>>nested VM with
> >>>>>>vhost-vsock support in parallel, with the parent instance able to
> >>>>>>communicate to both simultaneously.
> >>>>>
> >>>>>I honestly don't understand why VMADDR_FLAG_TO_HOST (added
> >>>>>specifically for Nitro IIRC) isn't enough for this scenario
> >>>>>and we have to add this change.  Can you elaborate a bit more
> >>>>>about the relationship between this change and
> >>>>>VMADDR_FLAG_TO_HOST we added?
> >>>
> >>>
> >>>The main problem I have with VMADDR_FLAG_TO_HOST for connect() is
> >>>that it punts the complexity to the user. Instead of a single CID
> >>>address space, you now effectively create 2 spaces: One for
> >>>TO_HOST (needs a flag) and one for TO_GUEST (no flag). But every
> >>>user space tool needs to learn about this flag. That may work for
> >>>super special-case applications. But propagating that all the way
> >>>into socat, iperf, etc etc? It's just creating friction.
> >>
> >>Okay, I would like to have this (or part of it) in the commit
> >>message to better explain why we want this change.
> >>
> >>>
> >>>IMHO the most natural experience is to have a single CID space,
> >>>potentially manually segmented by launching VMs of one kind within
> >>>a certain range.
> >>
> >>I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST
> >>in the remote address if that path is taken "automagically" ?
> >>
> >>So in that way the user space can have a way to understand if it's
> >>talking with a nested guest or a sibling guest.
> >>
> >>
> >>That said, I'm concerned about the scenario where an application
> >>does not even consider communicating with a sibling VM.
> >
> >
> >If that's really a realistic concern, then we should add a
> >VMADDR_FLAG_TO_GUEST that the application can set. Default behavior of
> >an application that provides no flags is "route to whatever you can
> >find": If vhost is loaded, it routes to vhost. If a vsock backend
>
> mmm, we have always documented this simple behavior:
> - CID = 2 talks to the host
> - CID >= 3 talks to the guest
>
> Now we are changing this by adding fallback. I don't think we should
> change the default behavior, but rather provide new ways to enable this
> new behavior.
>
> I find it strange that an application running on Linux 7.0 has a default
> behavior where using CID=42 always talks to a nested VM, but starting
> with Linux 7.1, it also starts talking to a sibling VM.
>
> >driver is loaded, it routes there. But the application has no say in
> >where it goes: It's purely a system configuration thing.
>
> This is true for complex things like IP, but for VSOCK we have always
> wanted to keep the default behavior very simple (as written above).
> Everything else must be explicitly enabled IMHO.
>
> >
> >
> >>Until now, it knew that by not setting that flag, it could only talk
> >>to nested VMs, so if there was no VM with that CID, the connection
> >>simply failed. Whereas from this patch onwards, if the device in the
> >>host supports sibling VMs and there is a VM with that CID, the
> >>application finds itself talking to a sibling VM instead of a nested
> >>one, without having any idea.
> >
> >
> >I'd say an application that attempts to talk to a CID that it does not
> >know whether it's vhost routed or not is running into "undefined"
> >territory. If you rmmod the vhost driver, it would also talk to the
> >hypervisor provided vsock.
>
> Oh, I missed that. And I also fixed that behaviour with commit
> 65b422d9b61b ("vsock: forward all packets to the host when no H2G is
> registered") after I implemented the multi-transport support.
>
> mmm, this could change my position ;-) (although, to be honest, I don't
> understand why it was like that in the first place, but that's how it is
> now).
>
> Please also document this in the new commit message; it's a good point.
> Although when H2G is loaded, we behave differently. However, it is true
> that sysctl helps us standardize this behavior.
>
> I don't know whether to see it as a regression or not.
>
> >
> >
> >>Should we make this feature opt-in in some way, such as sockopt or
> >>sysctl? (I understand that there is the previous problem, but
> >>honestly, it seems like a significant change to the behavior of
> >>AF_VSOCK).
> >
> >
> >We can create a sysctl to enable behavior with default=on. But I'm
> >against making the cumbersome does-not-work-out-of-the-box experience
> >the default. Will include it in v2.
>
> The opposite point of view is that we would not want to have different
> default behavior between 7.0 and 7.1 when H2G is loaded.

From a VMCI perspective, we only allow communication from guest to
host CIDs 0 and 2. With has_remote_cid implemented for VMCI, we end
up attempting guest-to-guest communication. As mentioned, this does
already happen when there is no H2G transport registered, so we
should be handling this anyway. But I'm not too fond of the change
in behaviour for when H2G is present, so at the very least I'd
prefer that has_remote_cid not be implemented for VMCI. Or perhaps
there could be a way for a G2H transport to explicitly note that it
supports CIDs greater than 2?  With this, it would be easier to see
this patch as preserving the default behaviour for some transports
and fixing a bug for others.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-03 14:17             ` Bryan Tan
@ 2026-03-03 20:47               ` Alexander Graf
  2026-03-03 20:52                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2026-03-03 20:47 UTC (permalink / raw)
  To: Bryan Tan, Stefano Garzarella
  Cc: Vishnu Dasa, Broadcom internal kernel review list, virtualization,
	linux-kernel, netdev, kvm, eperezma, Jason Wang, mst,
	Stefan Hajnoczi, nh-open-source


On 03.03.26 15:17, Bryan Tan wrote:
> On Tue, Mar 3, 2026 at 9:49 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>> On Mon, Mar 02, 2026 at 08:04:22PM +0100, Alexander Graf wrote:
>>> On 02.03.26 17:25, Stefano Garzarella wrote:
>>>> On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
>>>>> On 02.03.26 13:06, Stefano Garzarella wrote:
>>>>>> CCing Bryan, Vishnu, and Broadcom list.
>>>>>>
>>>>>> On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>>>>>>> Please target net-next tree for this new feature.
>>>>>>>
>>>>>>> On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>>>>>>> Vsock maintains a single CID number space which can be used to
>>>>>>>> communicate to the host (G2H) or to a child-VM (H2G). The
>>>>>>>> current logic
>>>>>>>> trivially assumes that G2H is only relevant for CID <= 2
>>>>>>>> because these
>>>>>>>> target the hypervisor.  However, in environments like Nitro
>>>>>>>> Enclaves, an
>>>>>>>> instance that hosts vhost_vsock powered VMs may still want
>>>>>>>> to communicate
>>>>>>>> to Enclaves that are reachable at higher CIDs through
>>>>>>>> virtio-vsock-pci.
>>>>>>>>
>>>>>>>> That means that for CID > 2, we really want an overlay. By
>>>>>>>> default, all
>>>>>>>> CIDs are owned by the hypervisor. But if vhost registers a
>>>>>>>> CID, it takes
>>>>>>>> precedence.  Implement that logic. Vhost already knows which CIDs it
>>>>>>>> supports anyway.
>>>>>>>>
>>>>>>>> With this logic, I can run a Nitro Enclave as well as a
>>>>>>>> nested VM with
>>>>>>>> vhost-vsock support in parallel, with the parent instance able to
>>>>>>>> communicate to both simultaneously.
>>>>>>> I honestly don't understand why VMADDR_FLAG_TO_HOST (added
>>>>>>> specifically for Nitro IIRC) isn't enough for this scenario
>>>>>>> and we have to add this change.  Can you elaborate a bit more
>>>>>>> about the relationship between this change and
>>>>>>> VMADDR_FLAG_TO_HOST we added?
>>>>>
>>>>> The main problem I have with VMADDR_FLAG_TO_HOST for connect() is
>>>>> that it punts the complexity to the user. Instead of a single CID
>>>>> address space, you now effectively create 2 spaces: One for
>>>>> TO_HOST (needs a flag) and one for TO_GUEST (no flag). But every
>>>>> user space tool needs to learn about this flag. That may work for
>>>>> super special-case applications. But propagating that all the way
>>>>> into socat, iperf, etc etc? It's just creating friction.
>>>> Okay, I would like to have this (or part of it) in the commit
>>>> message to better explain why we want this change.
>>>>
>>>>> IMHO the most natural experience is to have a single CID space,
>>>>> potentially manually segmented by launching VMs of one kind within
>>>>> a certain range.
>>>> I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST
>>>> in the remote address if that path is taken "automagically" ?
>>>>
>>>> So in that way the user space can have a way to understand if it's
>>>> talking with a nested guest or a sibling guest.
>>>>
>>>>
>>>> That said, I'm concerned about the scenario where an application
>>>> does not even consider communicating with a sibling VM.
>>>
>>> If that's really a realistic concern, then we should add a
>>> VMADDR_FLAG_TO_GUEST that the application can set. Default behavior of
>>> an application that provides no flags is "route to whatever you can
>>> find": If vhost is loaded, it routes to vhost. If a vsock backend
>> mmm, we have always documented this simple behavior:
>> - CID = 2 talks to the host
>> - CID >= 3 talks to the guest
>>
>> Now we are changing this by adding fallback. I don't think we should
>> change the default behavior, but rather provide new ways to enable this
>> new behavior.
>>
>> I find it strange that an application running on Linux 7.0 has a default
>> behavior where using CID=42 always talks to a nested VM, but starting
>> with Linux 7.1, it also starts talking to a sibling VM.
>>
>>> driver is loaded, it routes there. But the application has no say in
>>> where it goes: It's purely a system configuration thing.
>> This is true for complex things like IP, but for VSOCK we have always
>> wanted to keep the default behavior very simple (as written above).
>> Everything else must be explicitly enabled IMHO.
>>
>>>
>>>> Until now, it knew that by not setting that flag, it could only talk
>>>> to nested VMs, so if there was no VM with that CID, the connection
>>>> simply failed. Whereas from this patch onwards, if the device in the
>>>> host supports sibling VMs and there is a VM with that CID, the
>>>> application finds itself talking to a sibling VM instead of a nested
>>>> one, without having any idea.
>>>
> >>> I'd say an application that attempts to talk to a CID that it does not
>>> know whether it's vhost routed or not is running into "undefined"
>>> territory. If you rmmod the vhost driver, it would also talk to the
>>> hypervisor provided vsock.
>> Oh, I missed that. And I also fixed that behaviour with commit
>> 65b422d9b61b ("vsock: forward all packets to the host when no H2G is
>> registered") after I implemented the multi-transport support.
>>
>> mmm, this could change my position ;-) (although, to be honest, I don't
>> understand why it was like that in the first place, but that's how it is
>> now).
>>
> >> Please also document this in the new commit message; it's a good point.
>> Although when H2G is loaded, we behave differently. However, it is true
>> that sysctl helps us standardize this behavior.
>>
>> I don't know whether to see it as a regression or not.
>>
>>>
>>>> Should we make this feature opt-in in some way, such as sockopt or
>>>> sysctl? (I understand that there is the previous problem, but
>>>> honestly, it seems like a significant change to the behavior of
>>>> AF_VSOCK).
>>>
>>> We can create a sysctl to enable behavior with default=on. But I'm
>>> against making the cumbersome does-not-work-out-of-the-box experience
>>> the default. Will include it in v2.
>> The opposite point of view is that we would not want to have different
>> default behavior between 7.0 and 7.1 when H2G is loaded.
>  From a VMCI perspective, we only allow communication from guest to
> host CIDs 0 and 2. With has_remote_cid implemented for VMCI, we end
> up attempting guest to guest communication. As mentioned this does
> already happen if there isn't an H2G transport registered, so we
> should be handling this anyways. But I'm not too fond of the change
> in behaviour for when H2G is present, so in the very least I'd
> prefer if has_remote_cid is not implemented for VMCI. Or perhaps
> if there was a way for G2H transport to explicitly note that it
> supports CIDs that are greater than 2?  With this, it would be
> easier to see this patch as preserving the default behaviour for
> some transports and fixing a bug for others.


I understand what you want, but beware that it's actually a change in 
behavior. Today, whether Linux sends vsock connects to VMCI depends on 
whether the vhost kernel module is loaded: if it's loaded, you don't 
see the connect attempt; if it's not, the connect comes through to VMCI.

I agree that it makes sense to consistently limit VMCI to only ever see 
connects to CIDs <= 2. But as I said above, that too is a change in behavior.


Alex




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-03 20:47               ` Alexander Graf
@ 2026-03-03 20:52                 ` Michael S. Tsirkin
  2026-03-03 21:05                   ` Alexander Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Michael S. Tsirkin @ 2026-03-03 20:52 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Bryan Tan, Stefano Garzarella, Vishnu Dasa,
	Broadcom internal kernel review list, virtualization,
	linux-kernel, netdev, kvm, eperezma, Jason Wang, Stefan Hajnoczi,
	nh-open-source

On Tue, Mar 03, 2026 at 09:47:26PM +0100, Alexander Graf wrote:
> 
> On 03.03.26 15:17, Bryan Tan wrote:
> > On Tue, Mar 3, 2026 at 9:49 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > > On Mon, Mar 02, 2026 at 08:04:22PM +0100, Alexander Graf wrote:
> > > > On 02.03.26 17:25, Stefano Garzarella wrote:
> > > > > On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
> > > > > > On 02.03.26 13:06, Stefano Garzarella wrote:
> > > > > > > CCing Bryan, Vishnu, and Broadcom list.
> > > > > > > 
> > > > > > > On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
> > > > > > > > Please target net-next tree for this new feature.
> > > > > > > > 
> > > > > > > > On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
> > > > > > > > > Vsock maintains a single CID number space which can be used to
> > > > > > > > > communicate to the host (G2H) or to a child-VM (H2G). The
> > > > > > > > > current logic
> > > > > > > > > trivially assumes that G2H is only relevant for CID <= 2
> > > > > > > > > because these
> > > > > > > > > target the hypervisor.  However, in environments like Nitro
> > > > > > > > > Enclaves, an
> > > > > > > > > instance that hosts vhost_vsock powered VMs may still want
> > > > > > > > > to communicate
> > > > > > > > > to Enclaves that are reachable at higher CIDs through
> > > > > > > > > virtio-vsock-pci.
> > > > > > > > > 
> > > > > > > > > That means that for CID > 2, we really want an overlay. By
> > > > > > > > > default, all
> > > > > > > > > CIDs are owned by the hypervisor. But if vhost registers a
> > > > > > > > > CID, it takes
> > > > > > > > > precedence.  Implement that logic. Vhost already knows which CIDs it
> > > > > > > > > supports anyway.
> > > > > > > > > 
> > > > > > > > > With this logic, I can run a Nitro Enclave as well as a
> > > > > > > > > nested VM with
> > > > > > > > > vhost-vsock support in parallel, with the parent instance able to
> > > > > > > > > communicate to both simultaneously.
> > > > > > > > I honestly don't understand why VMADDR_FLAG_TO_HOST (added
> > > > > > > > specifically for Nitro IIRC) isn't enough for this scenario
> > > > > > > > and we have to add this change.  Can you elaborate a bit more
> > > > > > > > about the relationship between this change and
> > > > > > > > VMADDR_FLAG_TO_HOST we added?
> > > > > > 
> > > > > > The main problem I have with VMADDR_FLAG_TO_HOST for connect() is
> > > > > > that it punts the complexity to the user. Instead of a single CID
> > > > > > address space, you now effectively create 2 spaces: One for
> > > > > > TO_HOST (needs a flag) and one for TO_GUEST (no flag). But every
> > > > > > user space tool needs to learn about this flag. That may work for
> > > > > > super special-case applications. But propagating that all the way
> > > > > > into socat, iperf, etc etc? It's just creating friction.
> > > > > Okay, I would like to have this (or part of it) in the commit
> > > > > message to better explain why we want this change.
> > > > > 
> > > > > > IMHO the most natural experience is to have a single CID space,
> > > > > > potentially manually segmented by launching VMs of one kind within
> > > > > > a certain range.
> > > > > I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST
> > > > > in the remote address if that path is taken "automagically" ?
> > > > > 
> > > > > So in that way the user space can have a way to understand if it's
> > > > > talking with a nested guest or a sibling guest.
> > > > > 
> > > > > 
> > > > > That said, I'm concerned about the scenario where an application
> > > > > does not even consider communicating with a sibling VM.
> > > > 
> > > > If that's really a realistic concern, then we should add a
> > > > VMADDR_FLAG_TO_GUEST that the application can set. Default behavior of
> > > > an application that provides no flags is "route to whatever you can
> > > > find": If vhost is loaded, it routes to vhost. If a vsock backend
> > > mmm, we have always documented this simple behavior:
> > > - CID = 2 talks to the host
> > > - CID >= 3 talks to the guest
> > > 
> > > Now we are changing this by adding fallback. I don't think we should
> > > change the default behavior, but rather provide new ways to enable this
> > > new behavior.
> > > 
> > > I find it strange that an application running on Linux 7.0 has a default
> > > behavior where using CID=42 always talks to a nested VM, but starting
> > > with Linux 7.1, it also starts talking to a sibling VM.
> > > 
> > > > driver is loaded, it routes there. But the application has no say in
> > > > where it goes: It's purely a system configuration thing.
> > > This is true for complex things like IP, but for VSOCK we have always
> > > wanted to keep the default behavior very simple (as written above).
> > > Everything else must be explicitly enabled IMHO.
> > > 
> > > > 
> > > > > Until now, it knew that by not setting that flag, it could only talk
> > > > > to nested VMs, so if there was no VM with that CID, the connection
> > > > > simply failed. Whereas from this patch onwards, if the device in the
> > > > > host supports sibling VMs and there is a VM with that CID, the
> > > > > application finds itself talking to a sibling VM instead of a nested
> > > > > one, without having any idea.
> > > > 
> > > > I'd say an application that attempts to talk to a CID that it does not
> > > > know whether it's vhost routed or not is running into "undefined"
> > > > territory. If you rmmod the vhost driver, it would also talk to the
> > > > hypervisor provided vsock.
> > > Oh, I missed that. And I also fixed that behaviour with commit
> > > 65b422d9b61b ("vsock: forward all packets to the host when no H2G is
> > > registered") after I implemented the multi-transport support.
> > > 
> > > mmm, this could change my position ;-) (although, to be honest, I don't
> > > understand why it was like that in the first place, but that's how it is
> > > now).
> > > 
> > > Please also document this in the new commit message; it's a good point.
> > > Although when H2G is loaded, we behave differently. However, it is true
> > > that sysctl helps us standardize this behavior.
> > > 
> > > I don't know whether to see it as a regression or not.
> > > 
> > > > 
> > > > > Should we make this feature opt-in in some way, such as sockopt or
> > > > > sysctl? (I understand that there is the previous problem, but
> > > > > honestly, it seems like a significant change to the behavior of
> > > > > AF_VSOCK).
> > > > 
> > > > We can create a sysctl to enable behavior with default=on. But I'm
> > > > against making the cumbersome does-not-work-out-of-the-box experience
> > > > the default. Will include it in v2.
> > > The opposite point of view is that we would not want to have different
> > > default behavior between 7.0 and 7.1 when H2G is loaded.
> >  From a VMCI perspective, we only allow communication from guest to
> > host CIDs 0 and 2. With has_remote_cid implemented for VMCI, we end
> > up attempting guest to guest communication. As mentioned this does
> > already happen if there isn't an H2G transport registered, so we
> > should be handling this anyways. But I'm not too fond of the change
> > in behaviour for when H2G is present, so in the very least I'd
> > prefer if has_remote_cid is not implemented for VMCI. Or perhaps
> > if there was a way for G2H transport to explicitly note that it
> > supports CIDs that are greater than 2?  With this, it would be
> > easier to see this patch as preserving the default behaviour for
> > some transports and fixing a bug for others.
> 
> 
> I understand what you want, but beware that it's actually a change in
> behavior. Today, whether Linux will send vsock connects to VMCI depends on
> whether the vhost kernel module is loaded: If it's loaded, you don't see the
> connect attempt. If it's not loaded, the connect will come through to VMCI.
> 
> I agree that it makes sense to limit VMCI to only ever see connects to <= 2
> consistently. But as I said above, it's actually a change in behavior.
> 
> 
> Alex
> 

I think it was unintentional, but if you really think people want a
special module that changes the kernel's behaviour on load, we can
certainly do that. But any hack like this will not be namespace-safe.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] vsock: Enable H2G override
  2026-03-03 20:52                 ` Michael S. Tsirkin
@ 2026-03-03 21:05                   ` Alexander Graf
  0 siblings, 0 replies; 15+ messages in thread
From: Alexander Graf @ 2026-03-03 21:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Bryan Tan, Stefano Garzarella, Vishnu Dasa,
	Broadcom internal kernel review list, virtualization,
	linux-kernel, netdev, kvm, eperezma, Jason Wang, Stefan Hajnoczi,
	nh-open-source


On 03.03.26 21:52, Michael S. Tsirkin wrote:
> On Tue, Mar 03, 2026 at 09:47:26PM +0100, Alexander Graf wrote:
>> On 03.03.26 15:17, Bryan Tan wrote:
>>> On Tue, Mar 3, 2026 at 9:49 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>>>> On Mon, Mar 02, 2026 at 08:04:22PM +0100, Alexander Graf wrote:
>>>>> On 02.03.26 17:25, Stefano Garzarella wrote:
>>>>>> On Mon, Mar 02, 2026 at 04:48:33PM +0100, Alexander Graf wrote:
>>>>>>> On 02.03.26 13:06, Stefano Garzarella wrote:
>>>>>>>> CCing Bryan, Vishnu, and Broadcom list.
>>>>>>>>
>>>>>>>> On Mon, Mar 02, 2026 at 12:47:05PM +0100, Stefano Garzarella wrote:
>>>>>>>>> Please target net-next tree for this new feature.
>>>>>>>>>
>>>>>>>>> On Mon, Mar 02, 2026 at 10:41:38AM +0000, Alexander Graf wrote:
>>>>>>>>>> Vsock maintains a single CID number space which can be used to
>>>>>>>>>> communicate to the host (G2H) or to a child-VM (H2G). The
>>>>>>>>>> current logic
>>>>>>>>>> trivially assumes that G2H is only relevant for CID <= 2
>>>>>>>>>> because these
>>>>>>>>>> target the hypervisor.  However, in environments like Nitro
>>>>>>>>>> Enclaves, an
>>>>>>>>>> instance that hosts vhost_vsock powered VMs may still want
>>>>>>>>>> to communicate
>>>>>>>>>> to Enclaves that are reachable at higher CIDs through
>>>>>>>>>> virtio-vsock-pci.
>>>>>>>>>>
>>>>>>>>>> That means that for CID > 2, we really want an overlay. By
>>>>>>>>>> default, all
>>>>>>>>>> CIDs are owned by the hypervisor. But if vhost registers a
>>>>>>>>>> CID, it takes
>>>>>>>>>> precedence.  Implement that logic. Vhost already knows which CIDs it
>>>>>>>>>> supports anyway.
>>>>>>>>>>
>>>>>>>>>> With this logic, I can run a Nitro Enclave as well as a
>>>>>>>>>> nested VM with
>>>>>>>>>> vhost-vsock support in parallel, with the parent instance able to
>>>>>>>>>> communicate to both simultaneously.
>>>>>>>>> I honestly don't understand why VMADDR_FLAG_TO_HOST (added
>>>>>>>>> specifically for Nitro IIRC) isn't enough for this scenario
>>>>>>>>> and we have to add this change.  Can you elaborate a bit more
>>>>>>>>> about the relationship between this change and
>>>>>>>>> VMADDR_FLAG_TO_HOST we added?
>>>>>>> The main problem I have with VMADDR_FLAG_TO_HOST for connect() is
>>>>>>> that it punts the complexity to the user. Instead of a single CID
>>>>>>> address space, you now effectively create 2 spaces: One for
>>>>>>> TO_HOST (needs a flag) and one for TO_GUEST (no flag). But every
>>>>>>> user space tool needs to learn about this flag. That may work for
>>>>>>> super special-case applications. But propagating that all the way
>>>>>>> into socat, iperf, etc etc? It's just creating friction.
>>>>>> Okay, I would like to have this (or part of it) in the commit
>>>>>> message to better explain why we want this change.
>>>>>>
>>>>>>> IMHO the most natural experience is to have a single CID space,
>>>>>>> potentially manually segmented by launching VMs of one kind within
>>>>>>> a certain range.
>>>>>> I see, but at this point, should the kernel set VMADDR_FLAG_TO_HOST
>>>>>> in the remote address if that path is taken "automagically" ?
>>>>>>
>>>>>> So in that way the user space can have a way to understand if it's
>>>>>> talking with a nested guest or a sibling guest.
>>>>>>
>>>>>>
>>>>>> That said, I'm concerned about the scenario where an application
>>>>>> does not even consider communicating with a sibling VM.
>>>>> If that's really a realistic concern, then we should add a
>>>>> VMADDR_FLAG_TO_GUEST that the application can set. Default behavior of
>>>>> an application that provides no flags is "route to whatever you can
>>>>> find": If vhost is loaded, it routes to vhost. If a vsock backend
>>>> mmm, we have always documented this simple behavior:
>>>> - CID = 2 talks to the host
>>>> - CID >= 3 talks to the guest
>>>>
>>>> Now we are changing this by adding fallback. I don't think we should
>>>> change the default behavior, but rather provide new ways to enable this
>>>> new behavior.
>>>>
>>>> I find it strange that an application running on Linux 7.0 has a default
>>>> behavior where using CID=42 always talks to a nested VM, but starting
>>>> with Linux 7.1, it also starts talking to a sibling VM.
>>>>
>>>>> driver is loaded, it routes there. But the application has no say in
>>>>> where it goes: It's purely a system configuration thing.
>>>> This is true for complex things like IP, but for VSOCK we have always
>>>> wanted to keep the default behavior very simple (as written above).
>>>> Everything else must be explicitly enabled IMHO.
>>>>
>>>>>> Until now, it knew that by not setting that flag, it could only talk
>>>>>> to nested VMs, so if there was no VM with that CID, the connection
>>>>>> simply failed. Whereas from this patch onwards, if the device in the
>>>>>> host supports sibling VMs and there is a VM with that CID, the
>>>>>> application finds itself talking to a sibling VM instead of a nested
>>>>>> one, without having any idea.
> >>>>> I'd say an application that attempts to talk to a CID that it does not
>>>>> know whether it's vhost routed or not is running into "undefined"
>>>>> territory. If you rmmod the vhost driver, it would also talk to the
>>>>> hypervisor provided vsock.
>>>> Oh, I missed that. And I also fixed that behaviour with commit
>>>> 65b422d9b61b ("vsock: forward all packets to the host when no H2G is
>>>> registered") after I implemented the multi-transport support.
>>>>
>>>> mmm, this could change my position ;-) (although, to be honest, I don't
>>>> understand why it was like that in the first place, but that's how it is
>>>> now).
>>>>
> >>>> Please also document this in the new commit message; it's a good point.
>>>> Although when H2G is loaded, we behave differently. However, it is true
>>>> that sysctl helps us standardize this behavior.
>>>>
>>>> I don't know whether to see it as a regression or not.
>>>>
>>>>>> Should we make this feature opt-in in some way, such as sockopt or
>>>>>> sysctl? (I understand that there is the previous problem, but
>>>>>> honestly, it seems like a significant change to the behavior of
>>>>>> AF_VSOCK).
>>>>> We can create a sysctl to enable behavior with default=on. But I'm
>>>>> against making the cumbersome does-not-work-out-of-the-box experience
>>>>> the default. Will include it in v2.
>>>> The opposite point of view is that we would not want to have different
>>>> default behavior between 7.0 and 7.1 when H2G is loaded.
>>>   From a VMCI perspective, we only allow communication from guest to
>>> host CIDs 0 and 2. With has_remote_cid implemented for VMCI, we end
>>> up attempting guest to guest communication. As mentioned this does
>>> already happen if there isn't an H2G transport registered, so we
>>> should be handling this anyways. But I'm not too fond of the change
>>> in behaviour for when H2G is present, so in the very least I'd
>>> prefer if has_remote_cid is not implemented for VMCI. Or perhaps
>>> if there was a way for G2H transport to explicitly note that it
>>> supports CIDs that are greater than 2?  With this, it would be
>>> easier to see this patch as preserving the default behaviour for
>>> some transports and fixing a bug for others.
>>
>> I understand what you want, but beware that it's actually a change in
>> behavior. Today, whether Linux will send vsock connects to VMCI depends on
>> whether the vhost kernel module is loaded: If it's loaded, you don't see the
>> connect attempt. If it's not loaded, the connect will come through to VMCI.
>>
>> I agree that it makes sense to limit VMCI to only ever see connects to <= 2
>> consistently. But as I said above, it's actually a change in behavior.
>>
>>
>> Alex
>>
> I think it was unintentional, but if you really think people want a
> special module that changes kernel's behaviour on load, we can certainly
> do that. But any hack like this will not be namespace safe.


No, I think any behavior that changes based on whether a vhost kernel 
module is loaded is broken by design :). I'll make it consistent in v3. 
I just wanted to make sure we're on the same page that this isn't a case 
of "VMCI is not affected": VMCI has the same inconsistency today.


Alex





^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2026-03-03 21:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-02 10:41 [PATCH] vsock: Enable H2G override Alexander Graf
2026-03-02 11:47 ` Stefano Garzarella
2026-03-02 12:06   ` Stefano Garzarella
2026-03-02 15:48     ` Alexander Graf
2026-03-02 16:25       ` Stefano Garzarella
2026-03-02 19:04         ` Alexander Graf
2026-03-03  9:49           ` Stefano Garzarella
2026-03-03 14:17             ` Bryan Tan
2026-03-03 20:47               ` Alexander Graf
2026-03-03 20:52                 ` Michael S. Tsirkin
2026-03-03 21:05                   ` Alexander Graf
2026-03-02 19:52       ` Michael S. Tsirkin
2026-03-03  6:51         ` Alexander Graf
2026-03-03  7:19           ` Michael S. Tsirkin
2026-03-03  9:57             ` Stefano Garzarella

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox