From: Stefano Garzarella <sgarzare@redhat.com>
To: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
virtualization@lists.linux.dev, netdev@vger.kernel.org,
mst@redhat.com, stefanha@redhat.com,
maciej.szmigiero@oracle.com, bchaney@akamai.com,
mark.kanda@oracle.com, ptikhomirov@virtuozzo.com,
den@openvz.org
Subject: Re: [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
Date: Tue, 16 Jun 2026 18:13:19 +0200 [thread overview]
Message-ID: <ajF0To9N3yc2NlIr@sgarzare-redhat> (raw)
In-Reply-To: <021a6604-289c-4dd8-a0be-33c7812c0105@virtuozzo.com>
On Tue, Jun 16, 2026 at 06:58:40PM +0300, Andrey Drobyshev wrote:
>On 6/16/26 5:18 PM, Stefano Garzarella wrote:
>> On Fri, Jun 12, 2026 at 07:57:17PM +0300, Andrey Drobyshev wrote:
[...]
>>> static u32 vhost_transport_get_local_cid(void)
>>> @@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
>>> * the mutex would be too expensive in this hot path, and we already have
>>> * all the outcomes covered: if the backend becomes NULL right after the check,
>>> * vhost_transport_do_send_pkt() will check it under the mutex anyway.
>>> + *
>>> + * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
>>> + * The kick in vhost_vsock_start() will drain them on resume.
>>> */
>>> if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
>>> - rcu_read_unlock();
>>> - kfree_skb(skb);
>>> ] return -EHOSTUNREACH;
>>> + smp_rmb(); /* pairs with smp_wmb() in start/drop_backends */
>>> + if (!READ_ONCE(vsock->cpr_paused)) {
>>
>> Can we avoid this which is not really readable and maybe add a single
>> variable to control the fast-fail at all?
>>
>> I mean replacing both cpr_paused + backend-pointer with a single
>> `started` flag: set it to false at open, true on start via
>> smp_store_release(), back to false on normal stop, and leave it true
>> during CPR pause.
>>
>> The reader in send_pkt can do just:
>>
>> if (!smp_load_acquire(&vsock->started))
>> return -EHOSTUNREACH;
>>
>> WDYT?
>>
>
>I don't think it's gonna work as suggested. As I understand, the order
>during CPR migration is:
>
>1) SET_RUNNING(0)
> -> vhost_vsock_stop()
> -> vhost_vsock_drop_backends()
>2) RESET_OWNER
> -> vhost_vsock_drop_backends()
>3) SET_OWNER
>4) SET_RUNNING(1)
> -> vhost_vsock_start
> -> for (...) vhost_vq_set_backend()
>
>(Btw I just noticed backends are already NULL at step 2), but that's
>just our CPR case, for any potential RESET_OWNER users it might not be
>the case).
>
>So the race windows starts from 1) (not from 2)). We have no way of
>differentiating whether device is actually being stopped for good, or
>we're in the middle of CPR. If we set the flag to false on stop as you
>suggested, we'll still hit the -EHOSTUNREACH case eventually, and
>avoiding it is the whole purpose of this patch.
>
>The fast-fail with -EHOSTUNREACH relies on the presence of backends.
>IIUC the backend will only become set after initial SET_RUNNING(1),
>which will only happen once the guest driver writes smth to virtio
>config register, QEMU catches it and calls SET_RUNNING(1). So we have
>ordering with the guest's actions here, which is logical. But for our
>issue that means that the only true marker of paused/not paused is the
>presence of backends - and that's why the flag is set in
>vhost_vsock_drop_backends().
Okay, so what about avoiding to set `started` to false in
SET_RUNNING(0)? I mean use it just to track the first SET_RUNNING(1).
(And maybe changing the name to that variable).
Apart from CPR, when can SET_RUNNING(0) occur?
At the end that was just an optimization, if we queue the packet is not
a big issue IMO.
>
>>> + rcu_read_unlock();
>>> + kfree_skb(skb);
>>> + return -EHOSTUNREACH;
>>> + }
>>
>>
>> That said claude here is reporting a potential issue that I think we
>> should consider:
>> After VHOST_RESET_OWNER, the guest CID stays in the hash, so
>> vhost_transport_send_pkt() can still find the vsock, skip the
>> fast-fail (cpr_paused=true), and call vhost_vq_work_queue() while
>> vhost_workers_free() is freeing workers without a synchronize_rcu()
>> — risking a use-after-free. Also, any send_pkt_work queued between
>> the last flush and worker teardown gets its VHOST_WORK_QUEUED
>> bit
>> stuck (the vhost task exits without draining), deadlocking
>> host→guest traffic after restart.
>>
>> A synchronize_rcu() in vhost_workers_free() between the
>> rcu_assign_pointer(NULL) loop and the destroy loop would close the
>> use-after-free, and reinitializing send_pkt_work via
>> vhost_work_init() after vhost_dev_reset_owner() returns would clear
>> the stuck QUEUED bit.
>>
>>
>
>Yes, this looks real indeed. Though I couldn't hit the UAF issue while
>testing host->guest transfer under KASAN.
>
>>> }
>>>
>>> if (virtio_vsock_skb_reply(skb))
>>> @@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>>> mutex_unlock(&vq->mutex);
>>> }
>>>
>>> + smp_wmb(); /* pairs with smp_rmb() in send_pkt */
>>> + WRITE_ONCE(vsock->cpr_paused, false);
>>> +
>>> /* Some packets may have been queued before the device was started,
>>> * let's kick the send worker to send them.
>>> */
>>> @@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
>>>
>>> lockdep_assert_held(&vsock->dev.mutex);
>>>
>>> + if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
>>> + WRITE_ONCE(vsock->cpr_paused, true);
>>> + smp_wmb(); /* pairs with smp_rmb() in send_pkt */
>>> + }
>>
>> Why here and not in vhost_vsock_reset_owner()?
>>
>> Also having this here will set it to true also with
>> VHOST_VSOCK_SET_RUNNING(0), is that right?
>>
>
>That was added here precisely to cover the vhost_vsock_stop() case (see
>above).
I see now, a comment or something in the commit would have helped.
Thanks,
Stefano
next prev parent reply other threads:[~2026-06-16 16:13 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-12 16:57 [PATCH 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration Andrey Drobyshev
2026-06-12 16:57 ` [PATCH 1/4] vhost/vsock: split out vhost_vsock_drop_backends helper Andrey Drobyshev
2026-06-16 13:42 ` Stefano Garzarella
2026-06-12 16:57 ` [PATCH 2/4] vhost/vsock: add VHOST_RESET_OWNER ioctl Andrey Drobyshev
2026-06-16 13:48 ` Stefano Garzarella
2026-06-16 14:10 ` Andrey Drobyshev
2026-06-16 14:26 ` Stefano Garzarella
2026-06-12 16:57 ` [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause Andrey Drobyshev
2026-06-16 14:18 ` Stefano Garzarella
2026-06-16 15:58 ` Andrey Drobyshev
2026-06-16 16:13 ` Stefano Garzarella [this message]
2026-06-12 16:57 ` [PATCH 4/4] vhost/vsock: re-scan TX virtqueue on device start Andrey Drobyshev
2026-06-16 14:23 ` Stefano Garzarella
2026-06-16 15:58 ` Andrey Drobyshev
2026-06-16 13:35 ` [PATCH 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration Stefano Garzarella
2026-06-16 14:01 ` Andrey Drobyshev
2026-06-16 14:28 ` Stefano Garzarella
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ajF0To9N3yc2NlIr@sgarzare-redhat \
--to=sgarzare@redhat.com \
--cc=andrey.drobyshev@virtuozzo.com \
--cc=bchaney@akamai.com \
--cc=den@openvz.org \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maciej.szmigiero@oracle.com \
--cc=mark.kanda@oracle.com \
--cc=mst@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=ptikhomirov@virtuozzo.com \
--cc=stefanha@redhat.com \
--cc=virtualization@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox