qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Laszlo Ersek <lersek@redhat.com>
To: Eugenio Perez Martin <eperezma@redhat.com>
Cc: qemu-devel@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>,
	German Maglione <gmaglione@redhat.com>,
	Liu Jiang <gerry@linux.alibaba.com>,
	Sergio Lopez Pascual <slp@redhat.com>,
	Stefano Garzarella <sgarzare@redhat.com>
Subject: Re: [PATCH v2 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
Date: Thu, 7 Sep 2023 22:27:12 +0200	[thread overview]
Message-ID: <4f1b737e-09c8-8776-c4d7-a6dcd56e5819@redhat.com> (raw)
In-Reply-To: <CAJaqyWe9X_qneUA-ToiSVNEZ23+cjbqVrheOhOQ8mCWF8dfrLA@mail.gmail.com>

On 9/7/23 17:59, Eugenio Perez Martin wrote:
> On Wed, Aug 30, 2023 at 3:41 PM Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> (1) The virtio-1.2 specification
>> <http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html> writes:
>>
>>> 3     General Initialization And Device Operation
>>> 3.1   Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for
>>>    the device, optional per-bus setup, reading and possibly writing the
>>>    device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4         Virtio Transport Options
>>> 4.1       Virtio Over PCI Bus
>>> 4.1.4     Virtio Structure PCI Capabilities
>>> 4.1.4.3   Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the
>>> virtqueue with queue_enable.
>>>
>>> [...]
>>
>> (The same statements are present in virtio-1.0 identically, at
>> <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>.)
>>
>> These together mean that the following sub-sequence of steps is valid for
>> a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of device
>> initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
>> special virtio feature is negotiated, then virtio rings start in disabled
>> state, according to
>> <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
>> In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
>> enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane*
>> operation, which travels from the guest through QEMU to the vhost-user
>> backend, using a unix domain socket.
>>
> 
> The code looks good to me, but this part of the message is not precise
> if I understood it correctly.
> 
> Guest PCI "queue_enable" writes remain in the qemu virtio device model
> until the guest writes DRIVER_OK to the status. I'm referring to
> hw/virtio/virtio-pci.c:virtio_pci_common_write, case
> VIRTIO_PCI_COMMON_Q_ENABLE. From there, virtio_queue_enable just saves
> the info in VirtIOPCIProxy.
> 
> After the needed queues are enabled, the guest writes DRIVER_OK status
> bit. Then, the vhost backend is started and qemu sends the
> VHOST_USER_SET_VRING_ENABLE through the unix socket. And that is the
> source of the message that is racing with the dataplane.

OK, so this means that 1.1 is "buffered" in QEMU until 1.2, but the race
between 1.2 and 1.3 is just the same.

I can reword the commit message to take this into account.

Thanks!
Laszlo

> 
> I didn't confirm it with virtiofs through tracing / debugging, so I
> may be missing something.
> 
> Even with the small nit, the code fixes the problem.
> 
> Acked-by: Eugenio Pérez <eperezma@redhat.com>
> 
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which
>> evades QEMU -- it travels from guest to the vhost-user backend via
>> eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels,
>> and their relative order can be reversed, as perceived by the vhost-user
>> backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
>> against the Rust-language virtiofsd version 1.7.2. (Which uses version
>> 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
>> crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
>> device initialization steps (i.e., control plane operations), and
>> immediately sends a FUSE_INIT request too (i.e., performs a data plane
>> operation). In the Rust-language virtiofsd, this creates a race between
>> two components that run *concurrently*, i.e., in different threads or
>> processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
>>   [crates/vhost-user-backend/src/handler.rs] handles
>>   VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
>>   flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>>   The "VringEpollHandler::handle_event" method
>>   [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
>>   virtio / FUSE request, consuming the virtio kick at the same time. If
>>   the vring's "enabled" flag is set, the virtio / FUSE request is
>>   processed genuinely. If the vring's "enabled" flag is clear, then the
>>   virtio / FUSE request is discarded.
>>
>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
>> However, if the data plane processor in virtiofsd wins the race, then it
>> sees the FUSE_INIT *before* the control plane processor took notice of
>> VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
>> processor. Therefore the latter drops FUSE_INIT on the floor, and goes
>> back to waiting for further virtio / FUSE requests with epoll_wait.
>> Meanwhile OVMF is stuck waiting for the FUSET_INIT response -- a deadlock.
>>
>> The deadlock is not deterministic. OVMF hangs infrequently during first
>> boot. However, OVMF hangs almost certainly during reboots from the UEFI
>> shell.
>>
>> The race can be "reliably masked" by inserting a very small delay -- a
>> single debug message -- at the top of "VringEpollHandler::handle_event",
>> i.e., just before the data plane processor checks the "enabled" field of
>> the vring. That delay suffices for the control plane processor to act upon
>> VHOST_USER_SET_VRING_ENABLE.
>>
>> We can deterministically prevent the race in QEMU, by blocking OVMF inside
>> step (1.1) -- i.e., in the write to the "queue_enable" register -- until
>> VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU
>> cannot advance to the FUSE_INIT submission before virtiofsd's control
>> plane processor takes notice of the queue being enabled.
>>
>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>
>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
>>   for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
>>   has been negotiated, or
>>
>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
>>   a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
>> Cc: Eugenio Perez Martin <eperezma@redhat.com>
>> Cc: German Maglione <gmaglione@redhat.com>
>> Cc: Liu Jiang <gerry@linux.alibaba.com>
>> Cc: Sergio Lopez Pascual <slp@redhat.com>
>> Cc: Stefano Garzarella <sgarzare@redhat.com>
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
>> ---
>>
>> Notes:
>>     v2:
>>
>>     - pick up R-b from Stefano
>>
>>     - update virtio spec reference from 1.0 to 1.2 (also keep the 1.0 ref)
>>       in the commit message; re-check the quotes / section headers [Stefano]
>>
>>     - summarize commit message in code comment [Stefano]
>>
>>  hw/virtio/vhost-user.c | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>> index 18e15a9bb359..41842eb023b5 100644
>> --- a/hw/virtio/vhost-user.c
>> +++ b/hw/virtio/vhost-user.c
>> @@ -1235,7 +1235,21 @@ static int vhost_user_set_vring_enable(struct vhost_dev *dev, int enable)
>>              .num   = enable,
>>          };
>>
>> -        ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, false);
>> +        /*
>> +         * SET_VRING_ENABLE travels from guest to QEMU to vhost-user backend /
>> +         * control plane thread via unix domain socket. Virtio requests travel
>> +         * from guest to vhost-user backend / data plane thread via eventfd.
>> +         * Even if the guest enables the ring first, and pushes its first virtio
>> +         * request second (conforming to the virtio spec), the data plane thread
>> +         * in the backend may see the virtio request before the control plane
>> +         * thread sees the queue enablement. This causes (in fact, requires) the
>> +         * data plane thread to discard the virtio request (it arrived on a
>> +         * seemingly disabled queue). To prevent this out-of-order delivery,
>> +         * don't let the guest proceed to pushing the virtio request until the
>> +         * backend control plane acknowledges enabling the queue -- IOW, pass
>> +         * wait_for_reply=true below.
>> +         */
>> +        ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, true);
>>          if (ret < 0) {
>>              /*
>>               * Restoring the previous state is likely infeasible, as well as
> 



  reply	other threads:[~2023-09-07 20:28 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-30 13:40 [PATCH v2 0/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-08-30 13:40 ` [PATCH v2 1/7] vhost-user: strip superfluous whitespace Laszlo Ersek
2023-08-31  7:13   ` Laszlo Ersek
2023-09-06  7:12     ` Albert Esteve
2023-09-06  9:09       ` Laszlo Ersek
2023-10-02 19:48       ` Laszlo Ersek
2023-08-30 13:40 ` [PATCH v2 2/7] vhost-user: tighten "reply_supported" scope in "set_vring_addr" Laszlo Ersek
2023-08-31  7:14   ` Laszlo Ersek
2023-08-30 13:40 ` [PATCH v2 3/7] vhost-user: factor out "vhost_user_write_sync" Laszlo Ersek
2023-08-30 13:40 ` [PATCH v2 4/7] vhost-user: flatten "enforce_reply" into "vhost_user_write_sync" Laszlo Ersek
2023-08-30 13:40 ` [PATCH v2 5/7] vhost-user: hoist "write_sync", "get_features", "get_u64" Laszlo Ersek
2023-08-31  7:20   ` Laszlo Ersek
2023-09-06  9:59     ` Philippe Mathieu-Daudé
2023-08-30 13:40 ` [PATCH v2 6/7] vhost-user: allow "vhost_set_vring" to wait for a reply Laszlo Ersek
2023-08-30 13:40 ` [PATCH v2 7/7] vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously Laszlo Ersek
2023-09-07 15:59   ` Eugenio Perez Martin
2023-09-07 20:27     ` Laszlo Ersek [this message]
2023-09-07 16:01 ` [PATCH v2 0/7] " Eugenio Perez Martin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4f1b737e-09c8-8776-c4d7-a6dcd56e5819@redhat.com \
    --to=lersek@redhat.com \
    --cc=eperezma@redhat.com \
    --cc=gerry@linux.alibaba.com \
    --cc=gmaglione@redhat.com \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=sgarzare@redhat.com \
    --cc=slp@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).