From: Stefano Garzarella <sgarzare@redhat.com>
To: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	mst@redhat.com, stefanha@redhat.com, dongli.zhang@oracle.com,
	maciej.szmigiero@oracle.com, bchaney@akamai.com,
	mark.kanda@oracle.com, ptikhomirov@virtuozzo.com,
	den@openvz.org
Subject: Re: [PATCH v3 4/4] vhost/vsock: add VHOST_RESET_OWNER ioctl
Date: Tue, 30 Jun 2026 15:40:03 +0200
Message-ID: <akO6tps94iFxCAWv@sgarzare-redhat>
In-Reply-To: <20260625155416.480669-5-andrey.drobyshev@virtuozzo.com>

On Thu, Jun 25, 2026 at 06:54:16PM +0300, Andrey Drobyshev wrote:
>From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>
>This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
>the guest with vhost-vsock device.  For this to work, we need to reset
>the device ownership on the source side by calling RESET_OWNER, and then
>claim it on the dest side by calling SET_OWNER.  We expect not to lose any
>AF_VSOCK connection while this happens.
>
>RESET_OWNER keeps the guest CID hashed, so that connections survive. That
>leaves the device reachable by a lockless send/cancel path while the worker
>is being torn down: a concurrent vhost_transport_send_pkt() or
>vhost_transport_cancel_pkt() can call vhost_vq_work_queue() as
>vhost_workers_free() frees the worker.  That might cause a use-after-free
>of vq->worker.  In addition, any work queued onto the dying worker leaves
>VHOST_WORK_QUEUED stuck, stalling send_pkt_queue after resume.
>
>Fence the send/cancel paths around the teardown: send_pkt()/cancel_pkt()
>only kick the worker while the backend is alive.  And reset_owner() calls
>synchronize_rcu() after drop_backends() so in-flight send/cancel finish
>before the worker is freed.
>
>Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>---
> drivers/vhost/vsock.c | 51 +++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 49 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index 81d4f7209719..f0a0aa7d3200 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -318,7 +318,14 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
> 		atomic_inc(&vsock->queued_replies);
>
> 	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>-	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
>+
>+	/* Skip the kick once the backend is gone (stop/RESET_OWNER); the skb
>+	 * stays queued and vhost_vsock_start() drains it. Pairs with the
>+	 * synchronize_rcu() in vhost_vsock_reset_owner().
>+	 */

Please explain better (as done by commit bb26ed5f3a8b ("vhost/vsock:
Refuse the connection immediately when guest isn't ready") in the
comment removed by this series) why we can use vhost_vq_get_backend()
without vq->mutex held.

>+	if (data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))
>+		vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX],
>+				    &vsock->send_pkt_work);

BTW I'm now confused about what we are preventing here. Please add a
better explanation to both the commit message and the comment, because
as written it's hard to tell which race is being closed.

That said, if there is a problem, perhaps it should be fixed in vhost.c,
because it seems more like a generic issue.

vhost_vq_work_queue() has `worker = rcu_dereference(vq->worker);` so
it should already prevent the UAF, no?

Or maybe vhost_workers_free() is missing a synchronize_rcu()?
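
The ordering this question points at can be sketched in userspace as:
unpublish the worker pointer, wait for readers that may still hold it,
and only then free it. This is a toy model under stated assumptions, not
kernel code: every toy_* name is hypothetical, a reader counter stands
in for RCU read-side critical sections, and toy_synchronize() only
approximates what synchronize_rcu() guarantees. Whether
vhost_workers_free() actually needs such a wait is exactly the open
question above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins; none of these are kernel types or APIs. */
struct toy_worker {
	atomic_int queued;	/* number of work items accepted */
};

struct toy_vq {
	_Atomic(struct toy_worker *) worker;	/* models vq->worker */
	atomic_int readers;	/* models readers inside a read-side section */
};

static void toy_read_lock(struct toy_vq *vq)
{
	atomic_fetch_add(&vq->readers, 1);
}

static void toy_read_unlock(struct toy_vq *vq)
{
	atomic_fetch_sub(&vq->readers, 1);
}

/* Rough model of a grace period: wait until no reader is in a section. */
static void toy_synchronize(struct toy_vq *vq)
{
	while (atomic_load(&vq->readers) != 0)
		;
}

/* Reader side: only touch the worker while it is still published. */
static bool toy_work_queue(struct toy_vq *vq)
{
	struct toy_worker *w;
	bool queued = false;

	toy_read_lock(vq);
	w = atomic_load(&vq->worker);
	if (w) {
		atomic_fetch_add(&w->queued, 1);
		queued = true;
	}
	toy_read_unlock(vq);
	return queued;
}

/* Teardown side: unpublish first, wait for readers, then free. */
static void toy_worker_free(struct toy_vq *vq)
{
	struct toy_worker *w;

	w = atomic_exchange(&vq->worker, (struct toy_worker *)NULL);
	toy_synchronize(vq);
	free(w);
}

/* Returns 0 if queueing works before teardown and is skipped after it. */
int toy_demo(void)
{
	struct toy_vq vq;
	struct toy_worker *w = malloc(sizeof(*w));
	bool before, after;

	if (!w)
		return -1;
	atomic_init(&w->queued, 0);
	atomic_init(&vq.worker, w);
	atomic_init(&vq.readers, 0);

	before = toy_work_queue(&vq);	/* worker published: accepted */
	toy_worker_free(&vq);
	assert(atomic_load(&vq.worker) == NULL);
	after = toy_work_queue(&vq);	/* worker gone: skipped, not touched */

	return (before && !after) ? 0 : 1;
}
```

In this model the reader needs no extra backend check: a reader that
loaded the pointer before it was cleared is waited for, and a later
reader sees NULL and skips the kick.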

>
> 	rcu_read_unlock();
> 	return len;
>@@ -346,7 +353,15 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
> 		int new_cnt;
>
> 		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
>-		if (new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
>+
>+		/* Skip the kick once the backend is gone (stop/RESET_OWNER):
>+		 * vhost_poll_queue() would touch the worker which is being freed
>+		 * by teardown, e.g. on RESET_OWNER.  Pairs with the
>+		 * synchronize_rcu() in vhost_vsock_reset_owner().  The TX VQ is

Ditto about the comment.

>+		 * re-kicked by vhost_vsock_start().
>+		 */
>+		if (data_race(vhost_vq_get_backend(tx_vq)) &&
>+		    new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
> 			vhost_poll_queue(&tx_vq->poll);
> 	}
>
>@@ -903,6 +918,36 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
> 	return -EFAULT;
> }
>
>+static int vhost_vsock_reset_owner(struct vhost_vsock *vsock)

Why return int?

err is defined as long here, and the caller vhost_vsock_dev_ioctl()
also returns long, so it is not clear to me why we are not just
returning long here.

>+{
>+	struct vhost_iotlb *umem;
>+	long err;
>+
>+	mutex_lock(&vsock->dev.mutex);
>+	err = vhost_dev_check_owner(&vsock->dev);
>+	if (err)
>+		goto done;
>+	umem = vhost_dev_reset_owner_prepare();
>+	if (!umem) {
>+		err = -ENOMEM;
>+		goto done;
>+	}
>+	vhost_vsock_drop_backends(vsock);
>+
>+	/* Let in-flight send_pkt() callers stop touching the worker before the
>+	 * flush + free below. Pairs with the backend check in
>+	 * vhost_transport_send_pkt().

This is also paired with vhost_transport_cancel_pkt(), so please update
this comment.

>+	 */
>+	synchronize_rcu();
>+
>+	vhost_vsock_flush(vsock);
>+	vhost_dev_stop(&vsock->dev);
>+	vhost_dev_reset_owner(&vsock->dev, umem);
>+done:
>+	mutex_unlock(&vsock->dev.mutex);
>+	return err;
>+}
>+
> static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
> 				  unsigned long arg)
> {
>@@ -946,6 +991,8 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
> 			return -EOPNOTSUPP;
> 		vhost_set_backend_features(&vsock->dev, features);
> 		return 0;
>+	case VHOST_RESET_OWNER:
>+		return vhost_vsock_reset_owner(vsock);
> 	default:
> 		mutex_lock(&vsock->dev.mutex);
> 		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
>-- 
>2.47.1
>

