qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Hanna Czenczek <hreitz@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: qemu-devel@nongnu.org, "Michael S . Tsirkin" <mst@redhat.com>,
	"Eugenio Pérez" <eperezma@redhat.com>,
	"German Maglione" <gmaglione@redhat.com>
Subject: Re: [PATCH 1/6] vhost-user.rst: Add suspend/resume
Date: Tue, 25 Jul 2023 10:30:49 +0200	[thread overview]
Message-ID: <370fc01d-a090-e400-6ebe-a3e8df5b02b6@redhat.com> (raw)
In-Reply-To: <20230724175532.GA222590@fedora>

On 24.07.23 19:55, Stefan Hajnoczi wrote:
> On Wed, Jul 19, 2023 at 03:59:32PM +0200, Hanna Czenczek wrote:
>> On 18.07.23 16:25, Stefan Hajnoczi wrote:
>>> On Tue, Jul 11, 2023 at 05:52:23PM +0200, Hanna Czenczek wrote:
>>>> When stopping the VM, qemu wants all devices to fully cease any
>>>> operation, too.  Currently, we can only have vhost-user back-ends stop
>>>> processing vrings, but not background operations.  Add the SUSPEND and
>>>> RESUME commands from vDPA, which allow the front-end (qemu) to tell
>>>> back-ends to cease all operations, including those running in the
>>>> background.
>>>>
>>>> qemu's current work-around for this is to reset the back-end instead of
>>>> suspending it, which will not work for back-ends that have internal
>>>> state that must be preserved across e.g. stop/cont.
>>>>
>>>> Note that the given specification requires the back-end to delay
>>>> processing kicks (i.e. starting vrings) until the device is resumed,
>>>> instead of requiring the front-end not to send kicks while suspended.
>>>> qemu starts devices (and would just resume them) only when the VM is in
>>>> a running state, so it would be difficult to have qemu delay kicks until
>>>> the device is resumed, which is why this patch specifies handling of
>>>> kicks as it does.
>>>>
>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>> ---
>>>>    docs/interop/vhost-user.rst | 35 +++++++++++++++++++++++++++++++++--
>>>>    1 file changed, 33 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
>>>> index 5a070adbc1..ac6be34c4c 100644
>>>> --- a/docs/interop/vhost-user.rst
>>>> +++ b/docs/interop/vhost-user.rst
>>>> @@ -381,6 +381,10 @@ readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK``
>>>>    or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated,
>>>>    and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``.
>>>> +While the back-end is suspended (via ``VHOST_USER_SUSPEND``), it must
>>>> +never process rings, and thus also delay handling kicks until the
>>> If you respin this series, I suggest replacing "never" with "not" to
>>> emphasize that ring processing is only skipped while the device is
>>> suspended (rather than forever). "Never" feels too strong to use when
>>> describing a temporary state.
>> Sure.
>>
>>>> +back-end is resumed again.
>>>> +
>>>>    Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
>>>>    If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
>>>> @@ -479,8 +483,9 @@ supplied by ``VhostUserLog``.
>>>>    ancillary data, it may be used to inform the front-end that the log has
>>>>    been modified.
>>>> -Once the source has finished migration, rings will be stopped by the
>>>> -source. No further update must be done before rings are restarted.
>>>> +Once the source has finished migration, the device will be suspended and
>>>> +its rings will be stopped by the source. No further update must be done
>>>> +before the device and its rings are resumed.
>>> This paragraph is abstract and doesn't directly identify the mechanisms
>>> or who does what:
>>> - "the device will be suspended" via VHOST_USER_SUSPEND (or reset when
>>>     VHOST_USER_SUSPEND is not supported?) or automatically by the device
>>>     itself or some other mechanism?
>> OK, I’ll add VHOST_USER_SUSPEND.
>>
>> So far I hadn’t considered making a note of resetting as a fallback right in
>> the specification.  I don’t think I would want it in the specification, but
>> on the other hand, it is probably important for back-end authors to know
>> that if they do not implement SUSPEND, their device is going to be reset by
>> qemu.
>>
>> Can we make that a ”may”, i.e.:
>>
>> ```
>> Once the source has finished migration, the device will be suspended via
>> VHOST_USER_SUSPEND and its rings will be stopped by the source.
> I'm not sure what "its rings will be stopped by the source" means
> exactly. Is it summarizing the effect of VHOST_USER_SUSPEND or is there
> an additional action after VHOST_USER_SUSPEND that stops rings? And I'm
> not sure whether "by the source" means by the front-end or back-end on
> the source machine?

This is pre-existing text and I assumed it (with not doubt) to refer to 
GET_VRING_BASE, because that is how rings are stopped.

I can improve the existing documentation and add the reference to 
GET_VRING_BASE, and say that it clearly must come from the front-end.

>> No further
>> update must be done before the device and its rings are resumed.
> "Update" to what? Guest RAM? Device state? Rings?
>
> I feel like this text is too vague for a spec. People may interpret it
> differently. Can you make rephrase this to more concrete?

Honestly, no.  This is pre-existing, and I have the same questions as 
you do.

I cannot “rephrase” this to make it more concrete.  I can try to 
actually specify that was was left unspecified, but that would be a 
change in specification that would require its own patch, separate from 
this series.

Personally, I’ve generally taken this sentence to be fluff.  If the 
rings are stopped, clearly, they should not be accessed at all. Probably 
the back-end should also refrain from writing to guest memory, because 
that is a logical conclusion from having the rings stopped.  But now 
it’s even clearer: The back-end ideally is suspended, which directly 
means not to modify guest memory, and not to perform “background 
operations”.  Updating device state of course is possible through 
vhost-user commands, because those must always be executed.

So basically it’s just “Device and rings are stopped (RESUME and 
GET_VRING_BASE, resp.), so you know, adhere to that.”

Or maybe I’m completely wrong and “Once the source has finished 
migration, rings will be stopped by the source. No further update must 
be done before rings are restarted.” is to be taken together and the 
second sentence just refers to the rings, i.e. the front-end stops the 
rings, and the back-end must not update them.  Or it means that the 
front-end must not send any commands to the back-end until restarting 
the rings, but that feels practically impossible.

Again, because this sentence currently doesn’t specify anything, really, 
changing it to carry any meaning would be to add to the specification, 
not just clarify it.

>> If and only
>> if the back-end does not support VHOST_USER_SUSPEND, the front-end may reset
>> it instead (via VHOST_USER_SET_STATUS, VHOST_USER_RESET_DEVICE, or
>> VHOST_USER_RESET_OWNER).
>> ```
>>
>> I’m unsure about the “If and only if” – older qemu versions will break this,
>> but I feel like we must promise back-end writers that if they implement
>> SUSPEND, their back-end is not going to be reset; if it is, and something
>> breaks because of it, it’s the front-end that must be updated to match the
>> specification.
> I this the trick is to say "if and only if VHOST_USER_F_SUSPEND has not
> been negotiated". That way really only new front-ends that support
> VHOST_USER_SUSPEND are required to use suspend instead of reset and
> older versions of QEMU will not violate this statement.

Ah, right, thanks!

>>> - "before the device and its rings are resumed" via VHOST_USER_RESUME?
>>>     And is this referring to the source device?
>> Yes, via VHOST_USER_RESUME, and restarting the rings by starting them (i.e.
>> a kick).
>>
>> Whether this is referring to the source device…  Well, the text as it was
>> before begs the exact same question, so honestly, I don’t know for sure.
>> “Restarting” only makes sense if the rings were stopped before, so I assume
>> it’s referring to the source, e.g. for the case of a failed migration.
>> RESUME at least definitely will only happen after a prior SUSPEND, so this
>> one will definitely only apply on the source side.
>>
>>> Please rephrase the paragraph to identify the vhost-user messages
>>> involved.
>>>
>>>>    In postcopy migration the back-end is started before all the memory has
>>>>    been received from the source host, and care must be taken to avoid
>>>> @@ -885,6 +890,7 @@ Protocol features
>>>>      #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
>>>>      #define VHOST_USER_PROTOCOL_F_STATUS               16
>>>>      #define VHOST_USER_PROTOCOL_F_XEN_MMAP             17
>>>> +  #define VHOST_USER_PROTOCOL_F_SUSPEND              18
>>>>    Front-end message types
>>>>    -----------------------
>>>> @@ -1440,6 +1446,31 @@ Front-end message types
>>>>      query the back-end for its device status as defined in the Virtio
>>>>      specification.
>>>> +``VHOST_USER_SUSPEND``
>>>> +  :id: 41
>>>> +  :equivalent ioctl: VHOST_VDPA_SUSPEND
>>>> +  :request payload: N/A
>>>> +  :reply payload: N/A
>>>> +
>>>> +  When the ``VHOST_USER_PROTOCOL_F_SUSPEND`` protocol feature has been
>>>> +  successfully negotiated, this message is submitted by the front-end to
>>>> +  have the back-end cease all operations except for handling vhost-user
>>>> +  requests.  The back-end must stop processing all virt queues, and it
>>>> +  must not perform any background operations.  It may not resume until a
>>> "background operations" are not defined. What does it mean:
>>> - Anything that writes to memory slots
>>> - Anything that changes the visible state of the device
>>> - Anything that changes the non-visible internal state of the device
>>> - etc
>>> ?
>> My best answer (honestly) is: You tell me.  This series is introducing
>> SUSPEND/RESUME because qemu wants to reset devices to make them stop
>> “background operations”, and this would break virtiofsd if any form of reset
>> were actually implemented.  The implementation of SUSPEND/RESUME in
>> virtiofsd on the other hand is supposed to basically be a no-op (besides
>> delaying ring processing until a RESUME, but even if we processed them
>> before, i.e. really make SUSPEND/RESUME no-ops, it would most likely work
>> out fine), so I have no idea what kind of background operations we are even
>> talking about, or whether any such actually exist in practice.
>>
>> I don’t know what anyone had in mind when introducing the RESET.  It comes
>> from vDPA (c3716f260bf moved it from vdpa into the common code), and exists
>> since the code was originally added (108a64818e6), so there’s no comment
>> explaining why it exists.  I can’t explain what the back-end is supposed to
>> stop doing, because so far it isn’t explained anywhere in qemu, its git log,
>> or in any documentation (there basically is no vdpa documentation).
>>
>> I can only say what I just completely naïvely assumed it to mean so far:
>> That the back-end basically should stop doing anything besides handling and
>> replying to vhost-user messages.  If such a message requires changing any
>> state, visible or not, then this state change is permissible.
> Okay, I suggest the following instead of "background operations":
>
> - Changes to the device state produced by SET_DEVICE_STATE_FD.

This is definitely something that I would absolutely allow after 
SUSPEND.  If the front-end does not wish the back-end to do this, it can 
just not send this command while the back-end is suspended.

> - Writing to memory regions.
> - Signalling that buffers have been used.

This feels both too tight and not concrete enough.  The only buffers I 
can think of are buffers supplied through virt queues, which I intended 
to already be included in “stop processing all virt queues”.  (I took 
this to mean the used-buffer ring, too, but I can of course be more 
explicit about this, e.g. “stop processing all virt queues, including 
returning buffers to the driver”.) “Signalling” sounds like triggering 
the callfd, but that also seems clear; if you can’t process virt queues, 
including returning buffers, you can never trigger the callfd (or send 
VHOST_USER_BACKEND_VRING_CALL), because there can never be a new buffer 
returned to the driver.

So too tight because it feels like a subset of virt queue processing, 
but not concrete enough because “buffers” makes me feel like I’m 
overlooking something besides virt queues.

> - Signalling that the configuration space has changed.

Maybe this could be more general, i.e. the back-end must not send any 
vhost-user messages to the front-end?

> The goal is to ensure the device state and memory regions are stable and
> that back-end doesn't trigger activity in the front-end.

If the goal is to ensure that the device state is stable, I feel like we 
must then specify precisely this, and not just to say it must not be 
mutable through SET_DEVICE_STATE_FD: In general, the state is more 
likely to be changed by other factors after all.

On the other hand, I also don’t see why that’s a goal.  For migration, 
it might seem desirable, but I don’t actually think so: The back-end is 
required to send a consistent device state anyway, so it is up to the 
back-end to ensure the state is consistent, we don’t need to make it a 
requirement for SUSPEND.  It is implicitly clear from the migration 
model that if the device state were to change after the back-end has 
sent it to the front-end, those change will be lost on the destination 
side, so it is clear that the back-end must anticipate this and work 
around it.

Other than migration, qemu doesn’t see the device state at all.  I don’t 
see why internal state should not change between stop/cont. If a device 
experience some signal that (for some reason) it can’t pause to receive 
only after the subsequent RESUME, it might need to make note of that 
signal to act on it after RESUME.  I would consider that a change in 
internal state, and I don’t immediately see a problem with it.  (It may 
be problematic when migrating, because receiving such signals on the 
source side after transferring the state would mean they’re lost, but 
again, this is something the device clearly has to solve, e.g. by 
redirecting the signals to the destination starting with the 
SET_TRANSFER_STATE_FD call.)

Hanna



  reply	other threads:[~2023-07-25  8:31 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-11 15:52 [PATCH 0/6] vhost-user: Add suspend/resume Hanna Czenczek
2023-07-11 15:52 ` [PATCH 1/6] vhost-user.rst: " Hanna Czenczek
2023-07-18 14:25   ` Stefan Hajnoczi
2023-07-19 13:59     ` Hanna Czenczek
2023-07-24 17:55       ` Stefan Hajnoczi
2023-07-25  8:30         ` Hanna Czenczek [this message]
2023-07-27 21:12           ` Stefan Hajnoczi
2023-07-11 15:52 ` [PATCH 2/6] vhost-vdpa: Move vhost_vdpa_reset_status() up Hanna Czenczek
2023-07-18 14:29   ` Stefan Hajnoczi
2023-07-11 15:52 ` [PATCH 3/6] vhost: Do not reset suspended devices on stop Hanna Czenczek
2023-07-18 14:33   ` Stefan Hajnoczi
2023-07-21 15:25   ` Eugenio Perez Martin
2023-07-21 16:07     ` Hanna Czenczek
2023-07-24 15:48       ` Eugenio Perez Martin
2023-07-25  7:53         ` Hanna Czenczek
2023-07-25 10:03           ` Eugenio Perez Martin
2023-07-25 13:09             ` Hanna Czenczek
2023-07-25 18:53               ` Eugenio Perez Martin
2023-07-26  6:57                 ` Hanna Czenczek
2023-07-27 12:49                   ` Eugenio Perez Martin
2023-07-27 20:26                     ` Stefan Hajnoczi
2023-07-11 15:52 ` [PATCH 4/6] vhost-user: Implement suspend/resume Hanna Czenczek
2023-07-18 14:37   ` Stefan Hajnoczi
2023-07-11 15:52 ` [PATCH 5/6] vhost-vdpa: Match vhost-user's status reset Hanna Czenczek
2023-07-18 14:50   ` Stefan Hajnoczi
2023-07-19 14:09     ` Hanna Czenczek
2023-07-19 15:06       ` Stefan Hajnoczi
2023-07-21 15:47       ` Eugenio Perez Martin
2023-07-11 15:52 ` [PATCH 6/6] vhost-user: Have reset_status fall back to reset Hanna Czenczek
2023-07-18 15:10   ` Stefan Hajnoczi
2023-07-19 14:11     ` Hanna Czenczek
2023-07-19 14:27       ` Hanna Czenczek
2023-07-20 16:03         ` Stefan Hajnoczi
2023-07-21 14:16           ` Hanna Czenczek
2023-07-24 18:04             ` Stefan Hajnoczi
2023-07-25  8:39               ` Hanna Czenczek
2023-07-18 15:14 ` [PATCH 0/6] vhost-user: Add suspend/resume Stefan Hajnoczi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=370fc01d-a090-e400-6ebe-a3e8df5b02b6@redhat.com \
    --to=hreitz@redhat.com \
    --cc=eperezma@redhat.com \
    --cc=gmaglione@redhat.com \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).