Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH 4/4] vhost/vsock: re-scan TX virtqueue on device start
From: Andrey Drobyshev @ 2026-06-16 15:58 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <ajFbT6sDESh9FDOl@sgarzare-redhat>

On 6/16/26 5:23 PM, Stefano Garzarella wrote:
> On Fri, Jun 12, 2026 at 07:57:18PM +0300, Andrey Drobyshev wrote:
>> During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
>> keeps running while the host drops and later re-attaches vhost backends.
>> If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
>> while the backend is temporarily NULL (between vhost_vsock_drop_backends()
>> and the next vhost_vsock_start()), then the kick is delivered to the
>> vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
>> kick signal is consumed.  The buffer is then left in the ring.
>>
>> Then upon device start vhost_vsock_start() only re-kicks the RX send
>> worker, never the TX VQ, so the buffer is processed only if the guest
>> happens to kick again.  But if the guest itself is now waiting for data
>>from the host, it will never kick TX VQ again, and we end up in a
>> deadlock.
>>
>> The deadlock is reproduced during active host->guest socat data transfer
>> under multiple consecutive CPR live-update's.
>>
>> To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
>> queue the TX vq poll so any buffers the guest enqueued while we were paused
>> get scanned.
> 
> Again, it seems like we're fixing an issue that existed before this 
> series, but IIUC without support for VHOST_RESET_OWNER, this could never 
> have happened, so the wording should be changed to make it clear that 
> this is can happen only with the new VHOST_RESET_OWNER support.
> 
> In addition, this patch must also be applied before the 
> VHOST_RESET_OWNER support or merged into it.
>

Agreed.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>> drivers/vhost/vsock.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index bcaba36becd7..1fcfe71d18be 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -655,6 +655,12 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>> 	 */
>> 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
>>
>> +	/*
>> +	 * Some packets might've also been queued in TX VQ.  Re-scan it here,
>> +	 * mirroring the RX send-worker kick above.
>> +	 */
> 
> Can we also mention that this is related to VHOST_RESET_OWNER?
>

Agreed.
> Thanks,
> Stefano
> 
>> +	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
>> +
>> 	mutex_unlock(&vsock->dev.mutex);
>> 	return 0;
>>
>> -- 
>> 2.47.1
>>
> 


^ permalink raw reply

* Re: [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Andrey Drobyshev @ 2026-06-16 15:58 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <ajFUk7quPhbI7Te-@sgarzare-redhat>

On 6/16/26 5:18 PM, Stefano Garzarella wrote:
> On Fri, Jun 12, 2026 at 07:57:17PM +0300, Andrey Drobyshev wrote:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> Earlier commit ("ms/vhost/vsock: Refuse the connection immediately when
> 
> Please follow 
> https://docs.kernel.org/process/submitting-patches.html#describe-your-changes 
> on how to refer to a commit.
>

I omitted the hash on purpose as the commit is not yet in the mainline
tree, although our series is based and depends on it, as I mentioned:

https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?id=bb26ed5f3a8b

So it's a different (Michael's) repo and the commit is about to get
merged (but not yet there).  But maybe usual reference style + repo link
would be better.
>> guest isn't ready") added a fast-fail in vhost_transport_send_pkt().  It
>> rejects every host send with -EHOSTUNREACH until the destination calls
>> SET_RUNNING(1).  The fast-fail condition checks whether device's backends
>> are dropped, and if they're, the guest is considered to be not ready.
> 
> Okay, so it's not a regression, I mean without this series that patch is 
> not adding any regression, no?
> 
> If it's the case, I'll change the wording in the cover letter.
>

Agreed.

>>
>> However, there might be other reasons for backends to be nulled.  In
>> particular, when QEMU is performing CPR (checkpoint-restore) migration,
>> device ownership is being RESET and SET again, which leads to backends
>> drop and reattach.  If we end up connecting during this window, an
>> AF_VSOCK client gets -EHOSTUNREACH, which is wrong.
> 
> Please add this change before starting to support VHOST_RESET_OWNER 
> ioctl in vhost-vsock, otherwise we are breaking the bisectability.
>

Agreed.

>>
>> Add a cpr_paused flag set inside vhost_vsock_drop_backends() when the
>> backend was previously live, cleared by vhost_vsock_start(). When set,
>> vhost_transport_send_pkt() queues the skb instead of fast-failing; the
>> existing kick of send_pkt_work in vhost_vsock_start() drains it on
>> resume. A device that has never run keeps cpr_paused == false and the
>> boot-time fast-fail behaviour is preserved.
>>
>> Pair the cpr_paused store with the backend store using an
>> smp_wmb()/smp_rmb() pair so a concurrent sender on a weakly-ordered
>> architecture never observes (NULL backend, !paused):
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> ---
>> drivers/vhost/vsock.c | 22 +++++++++++++++++++---
>> 1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index e629886e5cf8..bcaba36becd7 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -61,6 +61,7 @@ struct vhost_vsock {
>>
>> 	u32 guest_cid;
>> 	bool seqpacket_allow;
>> +	bool cpr_paused;	/* between stop and next start */
>> };
>>
>> static u32 vhost_transport_get_local_cid(void)
>> @@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
>> 	 * the mutex would be too expensive in this hot path, and we already have
>> 	 * all the outcomes covered: if the backend becomes NULL right after the check,
>> 	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
>> +	 *
>> +	 * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
>> +	 * The kick in vhost_vsock_start() will drain them on resume.
>> 	 */
>> 	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
>> -		rcu_read_unlock();
>> -		kfree_skb(skb);
>> -		return -EHOSTUNREACH;
>> +		smp_rmb();	/* pairs with smp_wmb() in start/drop_backends */
>> +		if (!READ_ONCE(vsock->cpr_paused)) {
> 
> Can we avoid this which is not really readable and maybe add a single 
> variable to control the fast-fail at all?
> 
> I mean replacing both cpr_paused + backend-pointer with a single 
> `started` flag: set it to false at open, true on start via 
> smp_store_release(), back to false on normal stop, and leave it true 
> during CPR pause.
> 
> The reader in send_pkt can do just:
> 
>      if (!smp_load_acquire(&vsock->started))
>          return -EHOSTUNREACH;
> 
> WDYT?
>

I don't think it's gonna work as suggested.  As I understand, the order
during CPR migration is:

1) SET_RUNNING(0)
       -> vhost_vsock_stop()
           -> vhost_vsock_drop_backends()
2) RESET_OWNER
       -> vhost_vsock_drop_backends()
3) SET_OWNER
4) SET_RUNNING(1)
       -> vhost_vsock_start
           -> for (...) vhost_vq_set_backend()

(Btw I just noticed backends are already NULL at step 2), but that's
just our CPR case, for any potential RESET_OWNER users it might not be
the case).

So the race windows starts from 1) (not from 2)).  We have no way of
differentiating whether device is actually being stopped for good, or
we're in the middle of CPR.  If we set the flag to false on stop as you
suggested, we'll still hit the -EHOSTUNREACH case eventually, and
avoiding it is the whole purpose of this patch.

The fast-fail with -EHOSTUNREACH relies on the presence of backends.
IIUC the backend will only become set after initial SET_RUNNING(1),
which will only happen once the guest driver writes smth to virtio
config register, QEMU catches it and calls SET_RUNNING(1).  So we have
ordering with the guest's actions here, which is logical.  But for our
issue that means that the only true marker of paused/not paused is the
presence of backends - and that's why the flag is set in
vhost_vsock_drop_backends().

>> +			rcu_read_unlock();
>> +			kfree_skb(skb);
>> +			return -EHOSTUNREACH;
>> +		}
> 
> 
> That said claude here is reporting a potential issue that I think we 
> should consider:
>      After VHOST_RESET_OWNER, the guest CID stays in the hash, so 
>      vhost_transport_send_pkt() can still find the vsock, skip the 
>      fast-fail (cpr_paused=true), and call vhost_vq_work_queue() while 
>      vhost_workers_free() is freeing workers without a synchronize_rcu() 
>      — risking a use-after-free. Also, any send_pkt_work queued between 
>      the last flush and worker teardown gets its VHOST_WORK_QUEUED bit 
>      stuck (the vhost task exits without draining), deadlocking 
>      host→guest traffic after restart.
> 
>      A synchronize_rcu() in vhost_workers_free() between the 
>      rcu_assign_pointer(NULL) loop and the destroy loop would close the 
>      use-after-free, and reinitializing send_pkt_work via 
>      vhost_work_init() after vhost_dev_reset_owner() returns would clear 
>      the stuck QUEUED bit.
> 
> 

Yes, this looks real indeed.  Though I couldn't hit the UAF issue while
testing host->guest transfer under KASAN.

>> 	}
>>
>> 	if (virtio_vsock_skb_reply(skb))
>> @@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>> 		mutex_unlock(&vq->mutex);
>> 	}
>>
>> +	smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>> +	WRITE_ONCE(vsock->cpr_paused, false);
>> +
>> 	/* Some packets may have been queued before the device was started,
>> 	 * let's kick the send worker to send them.
>> 	 */
>> @@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
>>
>> 	lockdep_assert_held(&vsock->dev.mutex);
>>
>> +	if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
>> +		WRITE_ONCE(vsock->cpr_paused, true);
>> +		smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>> +	}
> 
> Why here and not in vhost_vsock_reset_owner()?
> 
> Also having this here will set it to true also with 
> VHOST_VSOCK_SET_RUNNING(0), is that right?
>

That was added here precisely to cover the vhost_vsock_stop() case (see
above).

> Thanks,
> Stefano
> 
>> +
>> 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
>> 		vq = &vsock->vqs[i];
>>
>> @@ -728,6 +743,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
>>
>> 	vsock->guest_cid = 0; /* no CID assigned yet */
>> 	vsock->seqpacket_allow = false;
>> +	vsock->cpr_paused = false;
>>
>> 	atomic_set(&vsock->queued_replies, 0);
>>
>> -- 
>> 2.47.1
>>
> 


^ permalink raw reply

* Re: [RFC PATCH 2/2] virtio-balloon: add stats push mode
From: Gregory Price @ 2026-06-16 15:58 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: virtualization, linux-kernel, kernel-team, mst, jasowang,
	xuanzhuo, eperezma, hannes, surenb, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak
In-Reply-To: <3e7d2d74-ceb2-4064-928f-921401fea75e@kernel.org>

On Tue, Jun 16, 2026 at 05:52:29PM +0200, David Hildenbrand (Arm) wrote:
> On 6/16/26 16:44, Gregory Price wrote:
> > That makes sense, although don't you just push the blocking operation
> > into yet another thread on the host?
> 
> I think timers are run from the QEMU main thread, so no separate thread just for
> the timer.
> 
> And IIRC, there will be no blocking. At least if I understand your concern
> correctly.
> 
> balloon_stats_poll_cb() will do a virtqueue_push()+virtio_notify(), which will
> notify the device. The main thread will continue afterwards doing what a main
> thread usually does.
> 
> A VCPU will process the request in the VM and send it back + notify the device.
>

Entirely possible I just bungled the interaction then and/or CH's
interfaces introduce a blocking op that shouldn't.

Thanks for the feedback, we can probably drop this patch.  Unless
there's any particular pushback for 1/2, should i leave as-is or
resubmit separately w/o RFC?

~Gregory

^ permalink raw reply

* Re: [RFC PATCH 2/2] virtio-balloon: add stats push mode
From: David Hildenbrand (Arm) @ 2026-06-16 15:52 UTC (permalink / raw)
  To: Gregory Price
  Cc: virtualization, linux-kernel, kernel-team, mst, jasowang,
	xuanzhuo, eperezma, hannes, surenb, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak
In-Reply-To: <ajFhOa5kEkPfqPVD@gourry-fedora-PF4VCD3F>

On 6/16/26 16:44, Gregory Price wrote:
> On Tue, Jun 16, 2026 at 04:32:46PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/16/26 15:57, Gregory Price wrote:
>>>
>>> Definitely an RFC here because I'm not sure if I was missing something
>>> that might help me solve the problem.
>>
>> Well, in QEMU we just run a timer internally that does the polling.
>>
>> Then, upper layers in the stack can ask QEMU for the latest stats.
>>
>> There, you just get the stats along with a "last-update" timestamp.
>>
> 
> That makes sense, although don't you just push the blocking operation
> into yet another thread on the host?

I think timers are run from the QEMU main thread, so no separate thread just for
the timer.

And IIRC, there will be no blocking. At least if I understand your concern
correctly.

balloon_stats_poll_cb() will do a virtqueue_push()+virtio_notify(), which will
notify the device. The main thread will continue afterwards doing what a main
thread usually does.

A VCPU will process the request in the VM and send it back + notify the device.

-- 
Cheers,

David

^ permalink raw reply

* [PATCH 7.0 370/378] vsock/virtio: fix potential unbounded skb queue
From: Greg Kroah-Hartman @ 2026-06-16 15:00 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Eric Dumazet, Arseniy Krasnov,
	Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, virtualization,
	Jakub Kicinski
In-Reply-To: <20260616145109.744539446@linuxfoundation.org>

7.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <edumazet@google.com>

commit 059b7dbd20a6f0c539a45ddff1573cb8946685b5 upstream.

virtio_transport_inc_rx_pkt() checks vvs->rx_bytes + len > vvs->buf_alloc.

virtio_transport_recv_enqueue() skips coalescing for packets
with VIRTIO_VSOCK_SEQ_EOM.

If fed with packets with len == 0 and VIRTIO_VSOCK_SEQ_EOM,
a very large number of packets can be queued
because vvs->rx_bytes stays at 0.

Fix this by estimating the skb metadata size:

	(Number of skbs in the queue) * SKB_TRUESIZE(0)

Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: virtualization@lists.linux.dev
Link: https://patch.msgid.link/20260430122653.554058-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/vmw_vsock/virtio_transport_common.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -425,7 +425,9 @@ static int virtio_transport_send_pkt_inf
 static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
 					u32 len)
 {
-	if (vvs->buf_used + len > vvs->buf_alloc)
+	u64 skb_overhead = (skb_queue_len(&vvs->rx_queue) + 1) * SKB_TRUESIZE(0);
+
+	if (skb_overhead + vvs->buf_used + len > vvs->buf_alloc)
 		return false;
 
 	vvs->rx_bytes += len;



^ permalink raw reply

* Re: [PATCH v2] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-16 15:02 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman,
	Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm, virtualization,
	netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <ajFFOJSdqpNWgohB@sgarzare-redhat>


On 16/06/2026 16:09, Stefano Garzarella wrote:
> On Sun, Jun 14, 2026 at 08:47:56PM +0300, Arseniy Krasnov wrote:
>> Logically it was based on TCP implementation, so make further support
>> easier, rewrite it in the TCP way.
>
> Hi Arseniy, and thank you so much for the patch!
>
> I’d like to ask you to expand on the message a bit, especially to explain why we’re making this change.
>
> In particular, I’d like to better understand whether this is just a cosmetic change or if we’re fixing any issues (and if so, which ones), so we can determine whether this patch should be backported to the stable branches.

This is cosmetic change. I'll update commit message in v3.

>
>>
>> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
>> ---
>> Changelog v1->v2:
>> * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
>>   already added.
>
> Ah, okay is net-next material, please use the net-next tag (ie. [PATCH net-next v2]). 

Sure!

>
>>
>> net/vmw_vsock/virtio_transport_common.c | 48 ++++++++++++-------------
>> 1 file changed, 23 insertions(+), 25 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 09475007165b..787524b8cb44 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -328,38 +328,36 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>     if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>         return pkt_len;
>>
>> -    if (info->msg) {
>> -        /* If zerocopy is not enabled by 'setsockopt()', we behave as
>> -         * there is no MSG_ZEROCOPY flag set.
>> +    if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
>> +        /* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
>> +         * 'MSG_ZEROCOPY' flag handling here is based on the same flag
>> +         * handling from 'tcp_sendmsg_locked()'.
>>          */
>> -        if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> -            info->msg->msg_flags &= ~MSG_ZEROCOPY;
>> +        if (info->msg->msg_ubuf) {
>> +            uarg = info->msg->msg_ubuf;
>> +            can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>> +        } else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>> +            uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
>> +                            NULL, false);
>> +            if (!uarg) {
>> +                virtio_transport_put_credit(vvs, pkt_len);
>> +                return -ENOMEM;
>> +            }
>>
>> -        if (info->msg->msg_flags & MSG_ZEROCOPY)
>>             can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>>
>
> nit: we can remove this extra blank line.
>
> For the rest I can't see anything wrong, but a bit more context in the commit would help me in the review.

Ack, I'll update commit message in v3. Also need to check some reports about pre-existing issues from sashiko, triggered by this patch.

Thanks!

>
> Thanks,
> Stefano
>

^ permalink raw reply

* Re: [PATCH v1] s390/virtio_ccw: Also suppress -EINVAL on device detach
From: Peter Oberparleiter @ 2026-06-16 14:53 UTC (permalink / raw)
  To: Cornelia Huck, Halil Pasic, William Bezenah, vneethv
  Cc: linux-s390, farman, hca, gor, agordeev, borntraeger, svens,
	mjrosato, virtualization, kvm, linux-kernel
In-Reply-To: <8733ymn8vd.fsf@redhat.com>

On 16.06.2026 11:16, Cornelia Huck wrote:
> On Tue, Jun 16 2026, Peter Oberparleiter <oberpar@linux.ibm.com> wrote:
> 
>> On 15.06.2026 23:42, Halil Pasic wrote:
>>> On Mon, 15 Jun 2026 16:01:55 -0400
>>> William Bezenah <wbezenah@linux.ibm.com> wrote:
>>>
>>>> On 6/15/2026 10:58 AM, Cornelia Huck wrote:
>>>>> On Mon, Jun 15 2026, Halil Pasic <pasic@linux.ibm.com> wrote:
>>>>>  
>>>>>> On Fri, 12 Jun 2026 17:54:07 +0200
>>>>>> William Bezenah <wbezenah@linux.ibm.com> wrote:
>>>>>>  
>>>>>>> Since commit 8c58a229688c ("s390/cio: Do not unregister the
>>>>>>> subchannel based on DNV"), subchannel behavior following a device
>>>>>>> detach has been updated and results in -EINVAL being propagated
>>>>>>> rather than -ENODEV, originating from ccw_device_start_timeout_key()
>>>>>>> in cio/device_ops. In the end, the virtio driver has no ability to
>>>>>>> react to the difference between device and subchannel states here,
>>>>>>> and during detach, both -ENODEV and -EINVAL indicate the device
>>>>>>> cannot be used and should not be treated as errors requiring
>>>>>>> attention. Update error handling in virtio_ccw_del_vq() and
>>>>>>> virtio_ccw_drop_indicator() to suppress -EINVAL in addition to
>>>>>>> -ENODEV.  
>>>>>> Hi William!
>>>>>>
>>>>>> Are you saying that ccw_device_start() started returning -EINVAL
>>>>>> since 8c58a229688c ("s390/cio: Do not unregister the subchannel based on
>>>>>> DNV")? Or did I somehow read the paragraph wrong?
>>>>>>
>>>>>> The funcition ccw_device_start is documented to return:
>>>>>>  * Returns:                                                                     
>>>>>>  *  %0, if the operation was successful;                                        
>>>>>>  *  -%EBUSY, if the device is busy, or status pending;                          
>>>>>>  *  -%EACCES, if no path specified in @lpm is operational;                      
>>>>>>  *  -%ENODEV, if the device is not operational. 
>>>>>> and the commit message does not say a thing about introducing -EINVAL to
>>>>>> the mix.  
>>>>> The function may return -EINVAL for non-enabled subchannels
>>>>> (i.e. pmcw.ena == 0), maybe we get an all-zeroes schib with dnv == 0?
>>>>> I'd expect it not to be enabled in that case anyway.  
>>>>
>>>> Yep, that's at least how I've come to understand what changed. The
>>>> function ccw_device_start_timeout_key() has always returned -EINVAL
>>>> for non-enabled subchannels (pmcw.ena == 0), though it's not
>>>> documented in the header.
>>>
>>> Wasn't his -EINVAL actually introduced by commit:
>>> 823d494ac111 ("[S390] pm: ccw bus power management callbacks")?
>>
>> In the context of virtio-ccw added in 2012, an EINVAL return code
>> introduced in 2009 might be considered "always" :)
> 
> :)
> 
> I'm wondering whether we should still expect to hit the "ssch with
> ena==0" situation, given that pm support has been removed again in the
> meanwhile. (Well, other than in situations like this, where it is a
> follow-up to other problems.) IOW, can callers expect not to see
> -EINVAL, unless they are doing something really stupid?

As Halil also pointed out, this would be a programming error, either on
the side of the driver that starts I/O without setting the device
properly online, or in the common I/O layer (hopefully not, but you
never know). Having a dedicated return code to identify this situation
is definitely useful, and we'll also consider documenting it accordingly
in the function comment.

>>>> What changed with commit 8c58a229688c is that cio_update_schib() now
>>>> updates the schib even when DNV=0, rather than returning early as it
>>>> did previously. Somehow this update results in pmcw.ena == 0 in
>>>> ccw_device_start_timeout_key(). Previously, it saw pmcw.ena == 1 and
>>>> moved to the condition (cdev->private->state == DEV_STATE_NOT_OPER)
>>>> where it returned -ENODEV.
>>>
>>> Sounds fishy to me. As far as I understand the DNV takes precedence over
>>> all other pieces of PMCW.
>>
>> And you're right about that! The Principles of Operation states (p. 15-4
>> in SA22-7832-14 [1]) that the contents of all other fields in the PMCW
>> are unpredictable when DNV is 0, therefore 8c58a229688c is in error.
>>
>> I'll work with Vineeth to determine how to fix this issue, potentially
>> via manually clearing some relevant SCHIB fields instead of copying the
>> unpredictable results of the STSCH instruction.
> 
> Can't you zero the whole SCHIB, or do you still need some of the
> measurement block things for cleanup?

I faintly remember that there WAS a reason to use the remainder of the
SCHIB contents because of some unwanted effect that occurred if we
didn't, but I don't recall the details. We'll need to dig up the
associated bug report to understand it and determine if we can simply
clear all of the SCHIB, or need to keep some of the information intact.

>>>> So the commit didn't introduce -EINVAL as a new return value, rather,
>>>> it changed the subchannel lifecycle such that existing paths now
>>>> propagate -EINVAL rather than -ENODEV during the device detach
>>>> scenario.
>>>
>>> I'm not convinced returning -EINVAL in the given situation is the
>>> right thing to do. Peter, would you mind to chime in?
>>
>> I tend to agree that an attempt to start I/O for a subchannel that has
>> DNV 0 should result in ENODEV rather than EINVAL, though the latter is
>> still valid when a driver tries to start I/O on a subchannel that is not
>> enabled for I/O.
>>
>> We'll make sure to design the fix for 8c58a229688c in away that ENODEV
>> will be returned when DNV is 0. Assuming that this is the only situation
>> where virtio-ccw's ccw_io_helper() receives -EINVAL from
>> ccw_device_start__timeout_key() during detach, the subject patch should
>> no longer be necessary.
> 
> I agree, I'd not expect to get -EINVAL in ccw_io_helper().
Yeah, this was definitely an unexpected side effect of the DNV commit.


-- 
Peter Oberparleiter
Linux on IBM Z Development - IBM Germany R&D

^ permalink raw reply

* Re: [RFC PATCH 2/2] virtio-balloon: add stats push mode
From: Gregory Price @ 2026-06-16 14:44 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: virtualization, linux-kernel, kernel-team, mst, jasowang,
	xuanzhuo, eperezma, hannes, surenb, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak
In-Reply-To: <4d8ddf50-e032-44a9-84d4-ddf2cd0c5bf6@kernel.org>

On Tue, Jun 16, 2026 at 04:32:46PM +0200, David Hildenbrand (Arm) wrote:
> On 6/16/26 15:57, Gregory Price wrote:
> > 
> > Definitely an RFC here because I'm not sure if I was missing something
> > that might help me solve the problem.
> 
> Well, in QEMU we just run a timer internally that does the polling.
> 
> Then, upper layers in the stack can ask QEMU for the latest stats.
> 
> There, you just get the stats along with a "last-update" timestamp.
>

That makes sense, although don't you just push the blocking operation
into yet another thread on the host?

Assuming it's not cancel-able, there's a blocked thread there you have
to reap.  Vs the guest being unresponsive and not sending updates, you
just reap the whole guest or take some other corrective action.

I suppose to the orchestrator reaping QEMU and reaping the guest looks
the same.  The difference is just where the thread lives, hmmmm

I'll make a note to inspect QEMU's solution to see if that's handled
or if it would be subject to the same issue.

Thanks!
~Gregory

^ permalink raw reply

* Re: [PATCH net v4] virtio-net: fix len check in receive_big()
From: Bui Quang Minh @ 2026-06-16 14:43 UTC (permalink / raw)
  To: Xiang Mei
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	virtualization, linux-kernel, bestswngs, mst, jasowang, xuanzhuo,
	eperezma
In-Reply-To: <20260616042837.2249468-1-xmei5@asu.edu>

On 6/16/26 11:28, Xiang Mei wrote:
> receive_big() bounds the device-announced length by
> (big_packets_num_skbfrags + 1) * PAGE_SIZE.  That is still too loose:
> add_recvbuf_big() sets sg[1] to start at offset
> sizeof(struct padded_vnet_hdr) into the first page, so the chain
> actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
> big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
> check allows for the common hdr_len == 12 case.
>
> A malicious virtio backend can announce a len in that gap.  page_to_skb()
> then walks one frag past the page chain, storing a NULL page->private
> into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
> write past the static frag array and a NULL frag handed up the rx path.
>
> Bound len by the size add_recvbuf_big() actually advertised.
>
> Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")

I think we should Cc: stable@vger.kernel.org

> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Signed-off-by: Xiang Mei <xmei5@asu.edu>
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
> v4: use easy to understand math to compute the max_len
> v3: revoke 2/2 and add Xuan Zhuo's Reviewed-by tag
> v2: add additiona check as 2/2
>
>   drivers/net/virtio_net.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..8f4562316aaa 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1999,15 +1999,16 @@ static struct sk_buff *receive_big(struct net_device *dev,
>   				   struct virtnet_rq_stats *stats)
>   {
>   	struct page *page = buf;
> +	unsigned long max_len = (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE -
> +				sizeof(struct padded_vnet_hdr) + vi->hdr_len;
>   	struct sk_buff *skb;
>   
>   	/* Make sure that len does not exceed the size allocated in
>   	 * add_recvbuf_big.
>   	 */
> -	if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
> +	if (unlikely(len > max_len)) {
>   		pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
> -			 dev->name, len,
> -			 (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
> +			 dev->name, len, max_len);
>   		goto err;
>   	}
>   

Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>

Thanks,
Quang Minh.


^ permalink raw reply

* Re: [RFC PATCH 2/2] virtio-balloon: add stats push mode
From: David Hildenbrand (Arm) @ 2026-06-16 14:32 UTC (permalink / raw)
  To: Gregory Price
  Cc: virtualization, linux-kernel, kernel-team, mst, jasowang,
	xuanzhuo, eperezma, hannes, surenb, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak
In-Reply-To: <ajFWRhSNsOBoUfb5@gourry-fedora-PF4VCD3F>

On 6/16/26 15:57, Gregory Price wrote:
> On Tue, Jun 16, 2026 at 02:33:43PM +0200, David Hildenbrand (Arm) wrote:
>> On 5/13/26 18:50, Gregory Price wrote:
>>>
>>> The pull model remains available and is the default.
>>
>> I don't quite see the big benefit here, really: either it's a timer in the
>> hypervisor or a timer in the VM. A slow VM will, in either model, delay the
>> update of stats.
>>
>> If you need some "liveness detection", is virtio-balloon stats updates really
>> the right mechanism?
>>
>> I don't quite understand the "Latency-sensitive consumers" problem. If the VM is
>> slow, it is slow and will mess with latency-sensitive consumers in either way?
>>
> 
> Latency sensitive here should probably be defined as "Does not like
> blocking operations".  This was prototyped in the context of
> cloud-hypervisor [1] and an orchestrator trying poll 1000 VMs on a
> single machine for stats. 
> 
> The poller couldn't determine the difference between "guest is slow" and
> "guest is hung" and so had to block on the operation (I didn't see how
> to solve this async).
> 
> Similarly, having a single thread just round-robin poll the VMs is
> bluntly inefficient and provides poor guarantees about the liveliness
> of the stats (a couple slow guests can cause other guests' stats to
> become stale for 10s of seconds).
> 
> Definitely an RFC here because I'm not sure if I was missing something
> that might help me solve the problem.

Well, in QEMU we just run a timer internally that does the polling.

Then, upper layers in the stack can ask QEMU for the latest stats.

There, you just get the stats along with a "last-update" timestamp.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Stefano Garzarella @ 2026-06-16 14:28 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <4fa88fa6-a188-4c63-876c-ed748809bf0b@virtuozzo.com>

On Tue, Jun 16, 2026 at 05:01:34PM +0300, Andrey Drobyshev wrote:
>Hello Stefano,
>
>On 6/16/26 4:35 PM, Stefano Garzarella wrote:
>> Hi Andrey,
>> thanks for the series!
>>
>> On Fri, Jun 12, 2026 at 07:57:14PM +0300, Andrey Drobyshev wrote:
>>> Host<-->guest connections via AF_VSOCK sockets aren't supposed to
>>> outlive VM migration, since VM is moving to another host.  However
>>> there's a special case, which is QEMU live-update, or CPR
>>> (checkpoint-restore) migration.  In this case, VM remains on the same
>>> host, and we'd like such connections to persist.
>>
>> In the spec we have VIRTIO_VSOCK_EVENT_TRANSPORT_RESET which is usually
>> sent by the device after a migration.
>>
>> IIUC the specs don't say this has to be done all the time, so we don't
>> need to change anything in the specs, right?
>>
>> We just need to avoid sending it (which I think is what we're doing
>> here... I still need to look at the patches).
>>
>
>Sending this exact ioctl is guarded by one of my patches in the QEMU
>counterpart series:
>
>https://lore.kernel.org/qemu-devel/20260612165110.431376-6-andrey.drobyshev@virtuozzo.com/
>
>So we indeed avoid sending it on migration target in case of CPR migration.

Great, so we are aligned :-)

>
>>>
>>> For this to work, we need to be able to transfer device ownership from
>>> source QEMU to dest QEMU.  Namely, source needs to reset ownership by
>>> issuing VHOST_RESET_OWNER ioctl, and then target has to claim it by
>>> calling VHOST_SET_OWNER.
>>>
>>> Since VHOST_RESET_OWNER isn't yet implemented for vhost-vsock, let's add
>>> such implementation (patches 1-2).  Also fix regression introduced by
>>> the earlier commit [1] (patch 3), and fix the deadlock bug (commit 4).
>>
>> If it's a regression, should we fix it separately?
>>
>> Or is it related to this series?
>>
>
>Probably my wording wasn't quite correct.  I posted this patch here
>because we found the problem during testing this particular
>functionality, i.e. vsock data transfer + CPR migration.  And the
>problem was introduced by a recent commit, which is fine on its own, but
>breaks the CPR case.

Yeah, I figured out while reviewing the patch.
I'd avoid "regression" here and use just "issue", because at the end is 
just affecting this work that is not yet merged, so it can be a 
regression.

Thanks,
Stefano


^ permalink raw reply

* Re: [PATCH 2/4] vhost/vsock: add VHOST_RESET_OWNER ioctl
From: Stefano Garzarella @ 2026-06-16 14:26 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <129f5833-3a7f-4b2d-a965-20903e4e2fb5@virtuozzo.com>

On Tue, Jun 16, 2026 at 05:10:38PM +0300, Andrey Drobyshev wrote:
>On 6/16/26 4:48 PM, Stefano Garzarella wrote:
>> On Fri, Jun 12, 2026 at 07:57:16PM +0300, Andrey Drobyshev wrote:
>>> From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>>>
>>> This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
>>> the guest with vhost-vsock device.  For this to work, we need to reset
>>> the device ownership on the source side by calling RESET_OWNER, and then
>>> claim it on the dest side by calling SET_OWNER.  We expect not to lose any
>>> AF_VSOCK connection while this happens.
>>>
>>> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>>> ---
>>> drivers/vhost/vsock.c | 28 ++++++++++++++++++++++++++++
>>> 1 file changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>> index b12221ce6faf..e629886e5cf8 100644
>>> --- a/drivers/vhost/vsock.c
>>> +++ b/drivers/vhost/vsock.c
>>> @@ -894,6 +894,32 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
>>> 	return -EFAULT;
>>> }
>>>
>>> +static int vhost_vsock_reset_owner(struct vhost_vsock *vsock)
>>> +{
>>> +	struct vhost_iotlb *umem;
>>> +	long err;
>>> +
>>> +	mutex_lock(&vsock->dev.mutex);
>>> +	err = vhost_dev_check_owner(&vsock->dev);
>>> +	if (err)
>>> +		goto done;
>>> +	umem = vhost_dev_reset_owner_prepare();
>>> +	if (!umem) {
>>> +		err = -ENOMEM;
>>> +		goto done;
>>> +	}
>>> +	/* Follows vhost_vsock_dev_release closely except for guest_cid drop */
>>> +	vsock_for_each_connected_socket(&vhost_transport.transport,
>>> +					vhost_vsock_reset_orphans);
>>
>> In vhost_vsock_reset_orphans() we have:
>>
>> 	rcu_read_lock();
>>
>> 	/* If the peer is still valid, no need to reset connection */
>> 	if (vhost_vsock_get(vsk->remote_addr.svm_cid, sock_net(sk))) {
>> 		rcu_read_unlock();
>> 		return;
>> 	}
>>
>> IIUC we are not removing the guest cid from the hash table, so this
>> check will be always true, and nothing is done.
>>
>> So, is this call really useful?
>>
>
>You're right, and it's most probably an artifact from mimicking the
>vhost_vsock_dev_release() implementation, as mentioned in the comment.
>In our case this whole iteration is a no-op, we better remove it.
>
>BTW earlier I received some feedback from Sashiko AI reviewer, which
>also spotted that same issue (and some more interesting races):
>
>https://sashiko.dev/#/patchset/20260612165718.433546-1-andrey.drobyshev@virtuozzo.com

Oh they seems similar to claude comments I included in my comment on 
patch 3.

Yeah, we should takes a look, they seems real issues.

>
>Apparently it only CC's its reviews to kvm@vger.kernel.org so you can't
>see them right away.  Just wanted to let you know to save your time
>here.  I'll send a v2 with respect to Sashiko remarks.  But of course
>would be great if you spot some more issues here.
>

Thanks for pointing that out, but in general I try to do my reviews 
before looking at AI reviews (both sashiko or claude locally) to avoid 
to be too much biased.

Thanks,
Stefano


^ permalink raw reply

* Re: [PATCH 4/4] vhost/vsock: re-scan TX virtqueue on device start
From: Stefano Garzarella @ 2026-06-16 14:23 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <20260612165718.433546-5-andrey.drobyshev@virtuozzo.com>

On Fri, Jun 12, 2026 at 07:57:18PM +0300, Andrey Drobyshev wrote:
>During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
>keeps running while the host drops and later re-attaches vhost backends.
>If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
>while the backend is temporarily NULL (between vhost_vsock_drop_backends()
>and the next vhost_vsock_start()), then the kick is delivered to the
>vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
>kick signal is consumed.  The buffer is then left in the ring.
>
>Then upon device start vhost_vsock_start() only re-kicks the RX send
>worker, never the TX VQ, so the buffer is processed only if the guest
>happens to kick again.  But if the guest itself is now waiting for data
>from the host, it will never kick TX VQ again, and we end up in a
>deadlock.
>
>The deadlock is reproduced during active host->guest socat data transfer
>under multiple consecutive CPR live-update's.
>
>To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
>queue the TX vq poll so any buffers the guest enqueued while we were paused
>get scanned.

Again, it seems like we're fixing an issue that existed before this 
series, but IIUC without support for VHOST_RESET_OWNER, this could never 
have happened, so the wording should be changed to make it clear that 
this is can happen only with the new VHOST_RESET_OWNER support.

In addition, this patch must also be applied before the 
VHOST_RESET_OWNER support or merged into it.

>
>Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>---
> drivers/vhost/vsock.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index bcaba36becd7..1fcfe71d18be 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -655,6 +655,12 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
> 	 */
> 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
>
>+	/*
>+	 * Some packets might've also been queued in TX VQ.  Re-scan it here,
>+	 * mirroring the RX send-worker kick above.
>+	 */

Can we also mention that this is related to VHOST_RESET_OWNER?

Thanks,
Stefano

>+	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
>+
> 	mutex_unlock(&vsock->dev.mutex);
> 	return 0;
>
>-- 
>2.47.1
>


^ permalink raw reply

* Re: [RFC PATCH 1/2] virtio-balloon: extend stats with memory composition and pressure data
From: David Hildenbrand (Arm) @ 2026-06-16 14:19 UTC (permalink / raw)
  To: Gregory Price
  Cc: virtualization, linux-kernel, kernel-team, mst, jasowang,
	xuanzhuo, eperezma, hannes, surenb, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak
In-Reply-To: <ajFUaDX8U40X2oLx@gourry-fedora-PF4VCD3F>

On 6/16/26 15:49, Gregory Price wrote:
> On Tue, Jun 16, 2026 at 02:30:34PM +0200, David Hildenbrand (Arm) wrote:
>>
>> Nothing too crazy here, however, the question is which of these values we
>> actually want to guarantee that we will provide them with unchanged semantics in
>> the future ... I guess anything we already expose to user space is alright
>> (because it effectively already must remain mostly unchanged I assume).
>>
> 
> I suppose a risk of them going away entirely?  I suppose that's
> fixable but unlikely.

Yeah, the first bunch is all exported through /proc/vmstat AFAIKS.

The other through /proc/pressure/

So I assume this is just fine.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Stefano Garzarella @ 2026-06-16 14:18 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <20260612165718.433546-4-andrey.drobyshev@virtuozzo.com>

On Fri, Jun 12, 2026 at 07:57:17PM +0300, Andrey Drobyshev wrote:
>From: "Denis V. Lunev" <den@openvz.org>
>
>Earlier commit ("ms/vhost/vsock: Refuse the connection immediately when

Please follow 
https://docs.kernel.org/process/submitting-patches.html#describe-your-changes 
on how to refer to a commit.

>guest isn't ready") added a fast-fail in vhost_transport_send_pkt().  It
>rejects every host send with -EHOSTUNREACH until the destination calls
>SET_RUNNING(1).  The fast-fail condition checks whether device's backends
>are dropped, and if they're, the guest is considered to be not ready.

Okay, so it's not a regression, I mean without this series that patch is 
not adding any regression, no?

If it's the case, I'll change the wording in the cover letter.

>
>However, there might be other reasons for backends to be nulled.  In
>particular, when QEMU is performing CPR (checkpoint-restore) migration,
>device ownership is being RESET and SET again, which leads to backends
>drop and reattach.  If we end up connecting during this window, an
>AF_VSOCK client gets -EHOSTUNREACH, which is wrong.

Please add this change before starting to support VHOST_RESET_OWNER 
ioctl in vhost-vsock, otherwise we are breaking the bisectability.

>
>Add a cpr_paused flag set inside vhost_vsock_drop_backends() when the
>backend was previously live, cleared by vhost_vsock_start(). When set,
>vhost_transport_send_pkt() queues the skb instead of fast-failing; the
>existing kick of send_pkt_work in vhost_vsock_start() drains it on
>resume. A device that has never run keeps cpr_paused == false and the
>boot-time fast-fail behaviour is preserved.
>
>Pair the cpr_paused store with the backend store using an
>smp_wmb()/smp_rmb() pair so a concurrent sender on a weakly-ordered
>architecture never observes (NULL backend, !paused):
>
>Signed-off-by: Denis V. Lunev <den@openvz.org>
>---
> drivers/vhost/vsock.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index e629886e5cf8..bcaba36becd7 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -61,6 +61,7 @@ struct vhost_vsock {
>
> 	u32 guest_cid;
> 	bool seqpacket_allow;
>+	bool cpr_paused;	/* between stop and next start */
> };
>
> static u32 vhost_transport_get_local_cid(void)
>@@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
> 	 * the mutex would be too expensive in this hot path, and we already have
> 	 * all the outcomes covered: if the backend becomes NULL right after the check,
> 	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
>+	 *
>+	 * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
>+	 * The kick in vhost_vsock_start() will drain them on resume.
> 	 */
> 	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
>-		rcu_read_unlock();
>-		kfree_skb(skb);
>-		return -EHOSTUNREACH;
>+		smp_rmb();	/* pairs with smp_wmb() in start/drop_backends */
>+		if (!READ_ONCE(vsock->cpr_paused)) {

Can we avoid this which is not really readable and maybe add a single 
variable to control the fast-fail at all?

I mean replacing both cpr_paused + backend-pointer with a single 
`started` flag: set it to false at open, true on start via 
smp_store_release(), back to false on normal stop, and leave it true 
during CPR pause.

The reader in send_pkt can do just:

     if (!smp_load_acquire(&vsock->started))
         return -EHOSTUNREACH;

WDYT?

>+			rcu_read_unlock();
>+			kfree_skb(skb);
>+			return -EHOSTUNREACH;
>+		}


That said claude here is reporting a potential issue that I think we 
should consider:
     After VHOST_RESET_OWNER, the guest CID stays in the hash, so 
     vhost_transport_send_pkt() can still find the vsock, skip the 
     fast-fail (cpr_paused=true), and call vhost_vq_work_queue() while 
     vhost_workers_free() is freeing workers without a synchronize_rcu() 
     — risking a use-after-free. Also, any send_pkt_work queued between 
     the last flush and worker teardown gets its VHOST_WORK_QUEUED bit 
     stuck (the vhost task exits without draining), deadlocking 
     host→guest traffic after restart.

     A synchronize_rcu() in vhost_workers_free() between the 
     rcu_assign_pointer(NULL) loop and the destroy loop would close the 
     use-after-free, and reinitializing send_pkt_work via 
     vhost_work_init() after vhost_dev_reset_owner() returns would clear 
     the stuck QUEUED bit.


> 	}
>
> 	if (virtio_vsock_skb_reply(skb))
>@@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
> 		mutex_unlock(&vq->mutex);
> 	}
>
>+	smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>+	WRITE_ONCE(vsock->cpr_paused, false);
>+
> 	/* Some packets may have been queued before the device was started,
> 	 * let's kick the send worker to send them.
> 	 */
>@@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
>
> 	lockdep_assert_held(&vsock->dev.mutex);
>
>+	if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
>+		WRITE_ONCE(vsock->cpr_paused, true);
>+		smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>+	}

Why here and not in vhost_vsock_reset_owner()?

Also having this here will set it to true also with 
VHOST_VSOCK_SET_RUNNING(0), is that right?

Thanks,
Stefano

>+
> 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
> 		vq = &vsock->vqs[i];
>
>@@ -728,6 +743,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
>
> 	vsock->guest_cid = 0; /* no CID assigned yet */
> 	vsock->seqpacket_allow = false;
>+	vsock->cpr_paused = false;
>
> 	atomic_set(&vsock->queued_replies, 0);
>
>-- 
>2.47.1
>


^ permalink raw reply

* Re: [PATCH 2/4] vhost/vsock: add VHOST_RESET_OWNER ioctl
From: Andrey Drobyshev @ 2026-06-16 14:10 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <ajFRRmA9req1muX6@sgarzare-redhat>

On 6/16/26 4:48 PM, Stefano Garzarella wrote:
> On Fri, Jun 12, 2026 at 07:57:16PM +0300, Andrey Drobyshev wrote:
>> From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>>
>> This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
>> the guest with vhost-vsock device.  For this to work, we need to reset
>> the device ownership on the source side by calling RESET_OWNER, and then
>> claim it on the dest side by calling SET_OWNER.  We expect not to lose any
>> AF_VSOCK connection while this happens.
>>
>> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>> ---
>> drivers/vhost/vsock.c | 28 ++++++++++++++++++++++++++++
>> 1 file changed, 28 insertions(+)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index b12221ce6faf..e629886e5cf8 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -894,6 +894,32 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
>> 	return -EFAULT;
>> }
>>
>> +static int vhost_vsock_reset_owner(struct vhost_vsock *vsock)
>> +{
>> +	struct vhost_iotlb *umem;
>> +	long err;
>> +
>> +	mutex_lock(&vsock->dev.mutex);
>> +	err = vhost_dev_check_owner(&vsock->dev);
>> +	if (err)
>> +		goto done;
>> +	umem = vhost_dev_reset_owner_prepare();
>> +	if (!umem) {
>> +		err = -ENOMEM;
>> +		goto done;
>> +	}
>> +	/* Follows vhost_vsock_dev_release closely except for guest_cid drop */
>> +	vsock_for_each_connected_socket(&vhost_transport.transport,
>> +					vhost_vsock_reset_orphans);
> 
> In vhost_vsock_reset_orphans() we have:
> 
> 	rcu_read_lock();
> 
> 	/* If the peer is still valid, no need to reset connection */
> 	if (vhost_vsock_get(vsk->remote_addr.svm_cid, sock_net(sk))) {
> 		rcu_read_unlock();
> 		return;
> 	}
> 
> IIUC we are not removing the guest cid from the hash table, so this 
> check will be always true, and nothing is done.
> 
> So, is this call really useful?
>

You're right, and it's most probably an artifact from mimicking the
vhost_vsock_dev_release() implementation, as mentioned in the comment.
In our case this whole iteration is a no-op, we better remove it.

BTW earlier I received some feedback from Sashiko AI reviewer, which
also spotted that same issue (and some more interesting races):

https://sashiko.dev/#/patchset/20260612165718.433546-1-andrey.drobyshev@virtuozzo.com

Apparently it only CC's its reviews to kvm@vger.kernel.org so you can't
see them right away.  Just wanted to let you know to save your time
here.  I'll send a v2 with respect to Sashiko remarks.  But of course
would be great if you spot some more issues here.


>> +	vhost_vsock_drop_backends(vsock);
>> +	vhost_vsock_flush(vsock);
>> +	vhost_dev_stop(&vsock->dev);
>> +	vhost_dev_reset_owner(&vsock->dev, umem);
>> +done:
>> +	mutex_unlock(&vsock->dev.mutex);
>> +	return err;
>> +}
>> +
>> static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
>> 				  unsigned long arg)
>> {
>> @@ -937,6 +963,8 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
>> 			return -EOPNOTSUPP;
>> 		vhost_set_backend_features(&vsock->dev, features);
>> 		return 0;
>> +	case VHOST_RESET_OWNER:
>> +		return vhost_vsock_reset_owner(vsock);
>> 	default:
>> 		mutex_lock(&vsock->dev.mutex);
>> 		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
>> -- 
>> 2.47.1
>>
> 


^ permalink raw reply

* Re: [PATCH 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Andrey Drobyshev @ 2026-06-16 14:01 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <ajFLvKcT-A0wvLYW@sgarzare-redhat>

Hello Stefano,

On 6/16/26 4:35 PM, Stefano Garzarella wrote:
> Hi Andrey,
> thanks for the series!
> 
> On Fri, Jun 12, 2026 at 07:57:14PM +0300, Andrey Drobyshev wrote:
>> Host<-->guest connections via AF_VSOCK sockets aren't supposed to
>> outlive VM migration, since VM is moving to another host.  However
>> there's a special case, which is QEMU live-update, or CPR
>> (checkpoint-restore) migration.  In this case, VM remains on the same
>> host, and we'd like such connections to persist.
> 
> In the spec we have VIRTIO_VSOCK_EVENT_TRANSPORT_RESET which is usually 
> sent by the device after a migration.
> 
> IIUC the specs don't say this has to be done all the time, so we don't 
> need to change anything in the specs, right?
> 
> We just need to avoid sending it (which I think is what we're doing 
> here... I still need to look at the patches).
>

Sending this exact ioctl is guarded by one of my patches in the QEMU
counterpart series:

https://lore.kernel.org/qemu-devel/20260612165110.431376-6-andrey.drobyshev@virtuozzo.com/

So we indeed avoid sending it on migration target in case of CPR migration.

>>
>> For this to work, we need to be able to transfer device ownership from
>> source QEMU to dest QEMU.  Namely, source needs to reset ownership by
>> issuing VHOST_RESET_OWNER ioctl, and then target has to claim it by
>> calling VHOST_SET_OWNER.
>>
>> Since VHOST_RESET_OWNER isn't yet implemented for vhost-vsock, let's add
>> such implementation (patches 1-2).  Also fix regression introduced by
>> the earlier commit [1] (patch 3), and fix the deadlock bug (commit 4).
> 
> If it's a regression, should we fix it separately?
> 
> Or is it related to this series?
>

Probably my wording wasn't quite correct.  I posted this patch here
because we found the problem during testing this particular
functionality, i.e. vsock data transfer + CPR migration.  And the
problem was introduced by a recent commit, which is fine on its own, but
breaks the CPR case.
>>
>> There's a complementary series for QEMU [0] adding support of vhost-vsock
>> devices during CPR migration.
>>
>> NOTE: this series needs to be applied on top of Michael's vhost/linux-next
>> tree as it contains relevant commit [1], not yet present in master branch.
>>
>> I've tested this (patched QEMU + patched kernel) approximately as follows:
>>
>>  * Run listener in the guest:
>>  socat -u VSOCK-LISTEN:9999 - >/tmp/recv.bin
>>
>>  * Run data transfer from host to guest:
>>  socat -u FILE:/root/bigfile.bin VSOCK-CONNECT:CID:9999
>>
>>  * Perform CPR migration during transfer (either cpr-exec or cpr-transfer)
>>  * Check that file hash sum matches
>>
>> [0] https://lore.kernel.org/qemu-devel/20260612165110.431376-1-andrey.drobyshev@virtuozzo.com
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?id=bb26ed5f3a8b
>>
>> Andrey Drobyshev (1):
>>  vhost/vsock: re-scan TX virtqueue on device start
>>
>> Denis V. Lunev (1):
>>  vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
>>
>> Pavel Tikhomirov (2):
>>  vhost/vsock: split out vhost_vsock_drop_backends helper
>>  vhost/vsock: add VHOST_RESET_OWNER ioctl
>>
>> drivers/vhost/vsock.c | 80 +++++++++++++++++++++++++++++++++++++------
>> 1 file changed, 69 insertions(+), 11 deletions(-)
>>
>> -- 
>> 2.47.1
>>
> 


^ permalink raw reply

* Re: [RFC PATCH 2/2] virtio-balloon: add stats push mode
From: Gregory Price @ 2026-06-16 13:57 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: virtualization, linux-kernel, kernel-team, mst, jasowang,
	xuanzhuo, eperezma, hannes, surenb, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak
In-Reply-To: <feee7831-aa82-4f02-b5e4-cffe7818dc65@kernel.org>

On Tue, Jun 16, 2026 at 02:33:43PM +0200, David Hildenbrand (Arm) wrote:
> On 5/13/26 18:50, Gregory Price wrote:
> > 
> > The pull model remains available and is the default.
> 
> I don't quite see the big benefit here, really: either it's a timer in the
> hypervisor or a timer in the VM. A slow VM will, in either model, delay the
> update of stats.
> 
> If you need some "liveness detection", is virtio-balloon stats updates really
> the right mechanism?
> 
> I don't quite understand the "Latency-sensitive consumers" problem. If the VM is
> slow, it is slow and will mess with latency-sensitive consumers in either way?
>

Latency sensitive here should probably be defined as "Does not like
blocking operations".  This was prototyped in the context of
cloud-hypervisor [1] and an orchestrator trying poll 1000 VMs on a
single machine for stats. 

The poller couldn't determine the difference between "guest is slow" and
"guest is hung" and so had to block on the operation (I didn't see how
to solve this async).

Similarly, having a single thread just round-robin poll the VMs is
bluntly inefficient and provides poor guarantees about the liveliness
of the stats (a couple slow guests can cause other guests' stats to
become stale for 10s of seconds).

Definitely an RFC here because I'm not sure if I was missing something
that might help me solve the problem.

~Gregory

[1] https://github.com/cloud-hypervisor/cloud-hypervisor

^ permalink raw reply

* Re: [RFC PATCH 1/2] virtio-balloon: extend stats with memory composition and pressure data
From: Gregory Price @ 2026-06-16 13:49 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: virtualization, linux-kernel, kernel-team, mst, jasowang,
	xuanzhuo, eperezma, hannes, surenb, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak
In-Reply-To: <8e766803-e68d-4efd-a69e-cec5b2492988@kernel.org>

On Tue, Jun 16, 2026 at 02:30:34PM +0200, David Hildenbrand (Arm) wrote:
> 
> Nothing too crazy here, however, the question is which of these values we
> actually want to guarantee that we will provide them with unchanged semantics in
> the future ... I guess anything we already expose to user space is alright
> (because it effectively already must remain mostly unchanged I assume).
>

I suppose a risk of them going away entirely?  I suppose that's
fixable but unlikely.

~Gregory

^ permalink raw reply

* Re: [PATCH 2/4] vhost/vsock: add VHOST_RESET_OWNER ioctl
From: Stefano Garzarella @ 2026-06-16 13:48 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <20260612165718.433546-3-andrey.drobyshev@virtuozzo.com>

On Fri, Jun 12, 2026 at 07:57:16PM +0300, Andrey Drobyshev wrote:
>From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>
>This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
>the guest with vhost-vsock device.  For this to work, we need to reset
>the device ownership on the source side by calling RESET_OWNER, and then
>claim it on the dest side by calling SET_OWNER.  We expect not to lose any
>AF_VSOCK connection while this happens.
>
>Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>---
> drivers/vhost/vsock.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index b12221ce6faf..e629886e5cf8 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -894,6 +894,32 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
> 	return -EFAULT;
> }
>
>+static int vhost_vsock_reset_owner(struct vhost_vsock *vsock)
>+{
>+	struct vhost_iotlb *umem;
>+	long err;
>+
>+	mutex_lock(&vsock->dev.mutex);
>+	err = vhost_dev_check_owner(&vsock->dev);
>+	if (err)
>+		goto done;
>+	umem = vhost_dev_reset_owner_prepare();
>+	if (!umem) {
>+		err = -ENOMEM;
>+		goto done;
>+	}
>+	/* Follows vhost_vsock_dev_release closely except for guest_cid drop */
>+	vsock_for_each_connected_socket(&vhost_transport.transport,
>+					vhost_vsock_reset_orphans);

In vhost_vsock_reset_orphans() we have:

	rcu_read_lock();

	/* If the peer is still valid, no need to reset connection */
	if (vhost_vsock_get(vsk->remote_addr.svm_cid, sock_net(sk))) {
		rcu_read_unlock();
		return;
	}

IIUC we are not removing the guest cid from the hash table, so this 
check will be always true, and nothing is done.

So, is this call really useful?

>+	vhost_vsock_drop_backends(vsock);
>+	vhost_vsock_flush(vsock);
>+	vhost_dev_stop(&vsock->dev);
>+	vhost_dev_reset_owner(&vsock->dev, umem);
>+done:
>+	mutex_unlock(&vsock->dev.mutex);
>+	return err;
>+}
>+
> static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
> 				  unsigned long arg)
> {
>@@ -937,6 +963,8 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
> 			return -EOPNOTSUPP;
> 		vhost_set_backend_features(&vsock->dev, features);
> 		return 0;
>+	case VHOST_RESET_OWNER:
>+		return vhost_vsock_reset_owner(vsock);
> 	default:
> 		mutex_lock(&vsock->dev.mutex);
> 		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
>-- 
>2.47.1
>


^ permalink raw reply

* Re: [PATCH 1/4] vhost/vsock: split out vhost_vsock_drop_backends helper
From: Stefano Garzarella @ 2026-06-16 13:42 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <20260612165718.433546-2-andrey.drobyshev@virtuozzo.com>

On Fri, Jun 12, 2026 at 07:57:15PM +0300, Andrey Drobyshev wrote:
>From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>
>Split the actual backend dropping part from vhost_vsock_stop.  We're
>going to need it for the VHOST_RESET_OWNER implementation in the
>following patch, when vsock->dev.mutex is already taken and owner is
>checked.
>
>Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>---
> drivers/vhost/vsock.c | 26 +++++++++++++++++---------
> 1 file changed, 17 insertions(+), 9 deletions(-)

LGTM!

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>

>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index 9aaab6bb8061..b12221ce6faf 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -664,9 +664,24 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
> 	return ret;
> }
>
>-static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
>+static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
> {
>+	struct vhost_virtqueue *vq;
> 	size_t i;
>+
>+	lockdep_assert_held(&vsock->dev.mutex);
>+
>+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
>+		vq = &vsock->vqs[i];
>+
>+		mutex_lock(&vq->mutex);
>+		vhost_vq_set_backend(vq, NULL);
>+		mutex_unlock(&vq->mutex);
>+	}
>+}
>+
>+static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
>+{
> 	int ret = 0;
>
> 	mutex_lock(&vsock->dev.mutex);
>@@ -677,14 +692,7 @@ static int vhost_vsock_stop(struct vhost_vsock *vsock, bool check_owner)
> 			goto err;
> 	}
>
>-	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
>-		struct vhost_virtqueue *vq = &vsock->vqs[i];
>-
>-		mutex_lock(&vq->mutex);
>-		vhost_vq_set_backend(vq, NULL);
>-		mutex_unlock(&vq->mutex);
>-	}
>-
>+	vhost_vsock_drop_backends(vsock);
> err:
> 	mutex_unlock(&vsock->dev.mutex);
> 	return ret;
>-- 
>2.47.1
>


^ permalink raw reply

* Re: [PATCH 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Stefano Garzarella @ 2026-06-16 13:35 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <20260612165718.433546-1-andrey.drobyshev@virtuozzo.com>

Hi Andrey,
thanks for the series!

On Fri, Jun 12, 2026 at 07:57:14PM +0300, Andrey Drobyshev wrote:
>Host<-->guest connections via AF_VSOCK sockets aren't supposed to
>outlive VM migration, since VM is moving to another host.  However
>there's a special case, which is QEMU live-update, or CPR
>(checkpoint-restore) migration.  In this case, VM remains on the same
>host, and we'd like such connections to persist.

In the spec we have VIRTIO_VSOCK_EVENT_TRANSPORT_RESET which is usually 
sent by the device after a migration.

IIUC the specs don't say this has to be done all the time, so we don't 
need to change anything in the specs, right?

We just need to avoid sending it (which I think is what we're doing 
here... I still need to look at the patches).

>
>For this to work, we need to be able to transfer device ownership from
>source QEMU to dest QEMU.  Namely, source needs to reset ownership by
>issuing VHOST_RESET_OWNER ioctl, and then target has to claim it by
>calling VHOST_SET_OWNER.
>
>Since VHOST_RESET_OWNER isn't yet implemented for vhost-vsock, let's add
>such implementation (patches 1-2).  Also fix regression introduced by
>the earlier commit [1] (patch 3), and fix the deadlock bug (commit 4).

If it's a regression, should we fix it separately?

Or is it related to this series?

>
>There's a complementary series for QEMU [0] adding support of vhost-vsock
>devices during CPR migration.
>
>NOTE: this series needs to be applied on top of Michael's vhost/linux-next
>tree as it contains relevant commit [1], not yet present in master branch.
>
>I've tested this (patched QEMU + patched kernel) approximately as follows:
>
>  * Run listener in the guest:
>  socat -u VSOCK-LISTEN:9999 - >/tmp/recv.bin
>
>  * Run data transfer from host to guest:
>  socat -u FILE:/root/bigfile.bin VSOCK-CONNECT:CID:9999
>
>  * Perform CPR migration during transfer (either cpr-exec or cpr-transfer)
>  * Check that file hash sum matches
>
>[0] https://lore.kernel.org/qemu-devel/20260612165110.431376-1-andrey.drobyshev@virtuozzo.com
>[1] https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?id=bb26ed5f3a8b
>
>Andrey Drobyshev (1):
>  vhost/vsock: re-scan TX virtqueue on device start
>
>Denis V. Lunev (1):
>  vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
>
>Pavel Tikhomirov (2):
>  vhost/vsock: split out vhost_vsock_drop_backends helper
>  vhost/vsock: add VHOST_RESET_OWNER ioctl
>
> drivers/vhost/vsock.c | 80 +++++++++++++++++++++++++++++++++++++------
> 1 file changed, 69 insertions(+), 11 deletions(-)
>
>-- 
>2.47.1
>


^ permalink raw reply

* Re: [PATCH v2] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Stefano Garzarella @ 2026-06-16 13:09 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman,
	Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm, virtualization,
	netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260614174756.170631-1-avkrasnov@rulkc.org>

On Sun, Jun 14, 2026 at 08:47:56PM +0300, Arseniy Krasnov wrote:
>Logically it was based on TCP implementation, so make further support
>easier, rewrite it in the TCP way.

Hi Arseniy, and thank you so much for the patch!

I’d like to ask you to expand on the message a bit, especially to 
explain why we’re making this change.

In particular, I’d like to better understand whether this is just a 
cosmetic change or if we’re fixing any issues (and if so, which ones), 
so we can determine whether this patch should be backported to the 
stable branches.

>
>Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
>---
> Changelog v1->v2:
> * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
>   already added.

Ah, okay is net-next material, please use the net-next tag (ie. [PATCH 
net-next v2]).

>
> net/vmw_vsock/virtio_transport_common.c | 48 ++++++++++++-------------
> 1 file changed, 23 insertions(+), 25 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 09475007165b..787524b8cb44 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -328,38 +328,36 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> 	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> 		return pkt_len;
>
>-	if (info->msg) {
>-		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>-		 * there is no MSG_ZEROCOPY flag set.
>+	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
>+		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
>+		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
>+		 * handling from 'tcp_sendmsg_locked()'.
> 		 */
>-		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>-			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>+		if (info->msg->msg_ubuf) {
>+			uarg = info->msg->msg_ubuf;
>+			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>+		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>+			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
>+						    NULL, false);
>+			if (!uarg) {
>+				virtio_transport_put_credit(vvs, pkt_len);
>+				return -ENOMEM;
>+			}
>
>-		if (info->msg->msg_flags & MSG_ZEROCOPY)
> 			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>

nit: we can remove this extra blank line.

For the rest I can't see anything wrong, but a bit more context in the 
commit would help me in the review.

Thanks,
Stefano


^ permalink raw reply

* Re: [PATCH net v2 2/2] vsock/virtio: restore msg_iter on transmission failure
From: Stefano Garzarella @ 2026-06-16 12:59 UTC (permalink / raw)
  To: Octavian Purdila, g
  Cc: netdev, Alexander Viro, Andrew Morton, Arseniy Krasnov,
	David S. Miller, Eric Dumazet, Eugenio Pérez, Jakub Kicinski,
	Jason Wang, kvm, linux-block, linux-fsdevel, linux-kernel,
	Michael S. Tsirkin, Paolo Abeni, Simon Horman, Stefan Hajnoczi,
	virtualization, Xuan Zhuo, syzbot+28e5f3d207b14bae122a
In-Reply-To: <20260613000953.467473-3-tavip@google.com>

On Sat, Jun 13, 2026 at 12:09:53AM +0000, Octavian Purdila wrote:
>When transmission fails in virtio_transport_send_pkt_info, the msg_iter
>might have been partially advanced. If we don't restore it, the next
>attempt to send data will use an incorrect iterator state, leading to
>desync and warnings like "send_pkt() returns 0, but X expected".
>
>Specifically, this can happen in the following scenario, triggered by
>the syzkaller repro:
>
>1. A write-only VMA (PROT_WRITE only) is partially populated by a
>   prior TUN write that failed with -EIO but still faulted in some
>   pages).
>2. A vsock sendmmsg call with MSG_ZEROCOPY requests transmission of a
>   buffer from this VMA.
>3. The first packet (64KB) is sent successfully because the pages are
>   populated.
>4. The second packet allocation fails because GUP fast pins the first page
>   but GUP slow fails on the next unpopulated page due to PROT_WRITE-only
>   permissions.
>5. The iterator is advanced by the partially successful GUP (68KB total
>   advanced: 64KB from first packet + 4KB from second), but the send loop
>   breaks and only reports 64KB sent. This creates a 4KB desync.
>6. The next retry starts with a non-zero iov_offset, disabling zerocopy
>   and falling back to copy mode.
>7. In copy mode, the transmission succeeds for the next packets but
>   exhausts the iterator early because of the desync.
>8. The final retry sees an empty iterator but zerocopy is re-enabled
>   (offset resets). It attempts to send the remaining bytes with zerocopy
>   but pins 0 pages, creating an empty packet.
>9. The transport sends the empty packet, triggering the warning because
>   the returned bytes (header only) do not match the expected payload size.
>10. The loop continues to spin, allocating ubuf_info each time, eventually
>    exhausting sysctl_optmem_max and returning -ENOMEM to userspace.
>
>Restore msg_iter to its original state before the packet allocation
>and transmission attempt if they fail.
>
>Fixes: e0718bd82e27 ("vsock: enable setting SO_ZEROCOPY")
>Reported-by: syzbot+28e5f3d207b14bae122a@syzkaller.appspotmail.com
>Closes: https://syzkaller.appspot.com/bug?extid=28e5f3d207b14bae122a
>Assisted-by: gemini:gemini-3.1-pro
>Signed-off-by: Octavian Purdila <tavip@google.com>
>---
> net/vmw_vsock/virtio_transport_common.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)

Thanks, looks much better to me now!

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>


^ permalink raw reply

* Re: [PATCH net v2 1/2] iov_iter: export iov_iter_restore
From: Stefano Garzarella @ 2026-06-16 12:35 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: netdev, Alexander Viro, Andrew Morton, Arseniy Krasnov,
	David S. Miller, Eric Dumazet, Eugenio Pérez, Jakub Kicinski,
	Jason Wang, kvm, linux-block, linux-fsdevel, linux-kernel,
	Michael S. Tsirkin, Paolo Abeni, Simon Horman, Stefan Hajnoczi,
	virtualization, Xuan Zhuo
In-Reply-To: <20260613000953.467473-2-tavip@google.com>

On Sat, Jun 13, 2026 at 12:09:52AM +0000, Octavian Purdila wrote:
>Export iov_iter_restore so that it can be used by modules.
>
>This is needed by the virtio vsock transport (which can be built as a
>module) to restore the msg_iter state when transmission fails.
>
>Signed-off-by: Octavian Purdila <tavip@google.com>
>---
> lib/iov_iter.c | 1 +
> 1 file changed, 1 insertion(+)

Acked-by: Stefano Garzarella <sgarzare@redhat.com>

>
>diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>index 243662af1af73..067e745f9ef53 100644
>--- a/lib/iov_iter.c
>+++ b/lib/iov_iter.c
>@@ -1491,6 +1491,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
> 		i->__iov -= state->nr_segs - i->nr_segs;
> 	i->nr_segs = state->nr_segs;
> }
>+EXPORT_SYMBOL(iov_iter_restore);
>
> /*
>  * Extract a list of contiguous pages from an ITER_FOLIOQ iterator.  This does
>-- 
>2.54.0.1136.gdb2ca164c4-goog
>


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox