* [PATCH v4 0/8] vhost-user: Back-end state migration
@ 2023-10-04 12:58 Hanna Czenczek
  2023-10-04 12:58 ` [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek
                   ` (8 more replies)
  0 siblings, 9 replies; 53+ messages in thread
From: Hanna Czenczek @ 2023-10-04 12:58 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi,
	German Maglione, Eugenio Pérez, Anton Kuchin

RFC: https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
v1: https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg01575.html
v2: https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02604.html
v3: https://lists.nongnu.org/archive/html/qemu-devel/2023-09/msg03750.html

Based-on: <20231004014532.1228637-1-stefanha@redhat.com>
          ([PATCH v2 0/3] vhost: clean up device reset)

Hi,

This v4 includes largely unchanged patches from v3.  The main
addition/change is what came out of the discussion between Stefan and me
around how to proceed without SUSPEND/RESUME, which is that this series
is now based on his reset fix, and it includes more documentation
changes.

Changes in detail:

- Patch 1: Fall-out from the reset fix: Currently, the status byte is
  effectively unused (qemu only uses it for resetting, which all
  back-ends ignore; DPDK uses it to announce potential feature
  negotiation failure, which qemu ignores).  It is also not defined what
  exactly front-end or back-end should do with this byte, except
  pointing at the virtio spec, which however naturally does not say how
  this integrates with vhost-user’s RESET_DEVICE or [GS]ET_FEATURES.
  Furthermore, there does not seem to be a use for this; we have
  RESET_DEVICE for resetting, and we have [GS]ET_FEATURES (and
  REPLY_ACK, which can be used on SET_FEATURES) for feature negotiation.
  Therefore, deprecate the status byte, pointing to those other commands
  instead.

- Patch 2: Patch 4 defines a suspended state for the whole back-end if
  all vrings are stopped.  I think this should be mentioned in
  GET_VRING_BASE, but upon trying to add it, I found that it does not
  even mention that it stops the vring (mentioned only in the Ring
  States section), and remembered that the whole description of both
  GET_VRING_BASE and SET_VRING_BASE really was not helpful when trying
  to implement a vhost-user back-end.  Took the opportunity to overhaul
  both.

- Patch 3: This one’s from v3, but quite heavily modified.  Stefan
  suggested consistently defining the started/stopped and
  enabled/disabled states to be independent, and indeed doing so
  simplifies a whole lot of stuff.  Specifically, it makes the magic
  “enabled/disabled when started” go away.  Basically, I found this
  change alone is enough to remove the confusion I had with the existing
  documentation.

- Patch 4: As suggested by Stefan, just define a suspended state without
  introducing SUSPEND.  vDPA needs SUSPEND because its GET_VRING_BASE
  does not stop the vring, but vhost-user’s does, so we can define the
  suspended state to be when all vrings are stopped.

- Patch 5: Reference the suspended state.

- Patches 6 through 8: Unmodified, except for being rebased on Stefan’s
  series.

Hanna Czenczek (8):
  vhost-user.rst: Deprecate [GS]ET_STATUS
  vhost-user.rst: Improve [GS]ET_VRING_BASE doc
  vhost-user.rst: Clarify enabling/disabling vrings
  vhost-user.rst: Introduce suspended state
  vhost-user.rst: Migrating back-end-internal state
  vhost-user: Interface for migration state transfer
  vhost: Add high-level state save/load functions
  vhost-user-fs: Implement internal migration

 docs/interop/vhost-user.rst       | 318 +++++++++++++++++++++++++++---
 include/hw/virtio/vhost-backend.h |  24 +++
 include/hw/virtio/vhost.h         | 113 +++++++++++
 hw/virtio/vhost-user-fs.c         | 101 +++++++++-
 hw/virtio/vhost-user.c            | 148 ++++++++++++++
 hw/virtio/vhost.c                 | 241 ++++++++++++++++++++++
 6 files changed, 917 insertions(+), 28 deletions(-)

-- 
2.41.0

^ permalink raw reply	[flat|nested] 53+ messages in thread
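The state model that patches 3 and 4 describe can be sketched roughly as follows. This is only an illustration in Python, not code from the series: the names `Vring`, `BackEnd`, `get_vring_base` and `is_suspended` are hypothetical, and how a vring gets started is deliberately left out. It shows the three points made above: started/stopped and enabled/disabled are tracked independently, GET_VRING_BASE stops the vring it queries, and the back-end counts as suspended once every vring is stopped.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vring:
    # Two independent flags: neither one implies or changes the other.
    started: bool = False
    enabled: bool = False
    # Index that GET_VRING_BASE reports back to the front-end.
    base: int = 0


@dataclass
class BackEnd:
    vrings: List[Vring] = field(default_factory=list)

    def get_vring_base(self, index: int) -> int:
        """Model of GET_VRING_BASE: stop the vring, then return its base."""
        ring = self.vrings[index]
        ring.started = False  # the enabled flag is left untouched
        return ring.base

    def is_suspended(self) -> bool:
        """The back-end is suspended when all of its vrings are stopped."""
        return all(not ring.started for ring in self.vrings)


backend = BackEnd([
    Vring(started=True, enabled=True, base=5),
    Vring(started=True, enabled=False, base=9),
])
saved = [backend.get_vring_base(i) for i in range(len(backend.vrings))]
print(saved, backend.is_suspended())
```

Because no separate SUSPEND message exists in this model, a front-end that wants a suspended back-end simply fetches the base of every vring.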
* [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS 2023-10-04 12:58 [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek @ 2023-10-04 12:58 ` Hanna Czenczek 2023-10-05 17:08 ` Stefan Hajnoczi 2023-10-04 12:58 ` [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc Hanna Czenczek ` (7 subsequent siblings) 8 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-04 12:58 UTC (permalink / raw) To: qemu-devel, virtio-fs Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin There is no clearly defined purpose for the virtio status byte in vhost-user: For resetting, we already have RESET_DEVICE; and for virtio feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK protocol extension, it is possible for SET_FEATURES to return errors (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). As for implementations, SET_STATUS is not widely implemented. dpdk does implement it, but only uses it to signal feature negotiation failure. While it does log reset requests (SET_STATUS 0) as such, it effectively ignores them, in contrast to RESET_OWNER (which is deprecated, and today means the same thing as RESET_DEVICE). While qemu superficially has support for [GS]ET_STATUS, it does not forward the guest-set status byte, but instead just makes it up internally, and actually completely ignores what the back-end returns, only using it as the template for a subsequent SET_STATUS to add single bits to it. Notably, after setting FEATURES_OK, it never reads it back to see whether the flag is still set, which is the only way in which dpdk uses the status byte. As-is, no front-end or back-end can rely on the other side handling this field in a useful manner, and it also provides no practical use over other mechanisms the vhost-user protocol has, which are more clearly defined. Deprecate it. 
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> --- docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst index 5a070adbc1..2f68e67a1a 100644 --- a/docs/interop/vhost-user.rst +++ b/docs/interop/vhost-user.rst @@ -1424,21 +1424,35 @@ Front-end message types :request payload: ``u64`` :reply payload: N/A - When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been - successfully negotiated, this message is submitted by the front-end to - notify the back-end with updated device status as defined in the Virtio +.. admonition:: Deprecated + + This is no longer used. Used to be sent by the front-end to notify the + back-end with updated device status as defined in the Virtio specification. + However, its purpose in vhost-user was never well-defined; for + example, how or if it would replace VHOST_USER_RESET_DEVICE, or how it + integrates with the feature negotiation phase. Therefore, + implementations in practice were less than strict in how the status + value was handled, which means there was actually no protocol between + front-end and back-end on the use of the status value. + + For resetting, use VHOST_USER_RESET_DEVICE instead. For feature + negotiation with acknowledgment from the device, use + VHOST_USER_SET_FEATURES with the :ref:`REPLY_ACK <reply_ack>` feature + instead. + ``VHOST_USER_GET_STATUS`` :id: 40 :equivalent ioctl: VHOST_VDPA_GET_STATUS :request payload: N/A :reply payload: ``u64`` - When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been - successfully negotiated, this message is submitted by the front-end to - query the back-end for its device status as defined in the Virtio - specification. +.. admonition:: Deprecated + + This is no longer used. 
Used to be sent by the front-end to query the + back-end for its device status as defined in the Virtio specification. + Deprecated together with VHOST_USER_SET_STATUS. Back-end message types -- 2.41.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
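For reference, the read-back that the commit message says qemu never performs could look roughly like the sketch below. It is an assumption-laden illustration, not qemu or DPDK code: `StatusBackEnd` is a hypothetical in-memory back-end that signals rejected features by not keeping FEATURES_OK set, and `set_features_ok` is a hypothetical front-end helper. Only the FEATURES_OK bit value (8) comes from the virtio specification.

```python
# Device status bit from the virtio specification.
VIRTIO_CONFIG_S_FEATURES_OK = 8


class StatusBackEnd:
    """Hypothetical back-end: drops FEATURES_OK if it rejects the features."""

    def __init__(self, features_acceptable: bool):
        self.features_acceptable = features_acceptable
        self.status = 0

    def set_status(self, value: int) -> None:
        # Refuse to latch FEATURES_OK when the negotiated features are bad.
        if value & VIRTIO_CONFIG_S_FEATURES_OK and not self.features_acceptable:
            value &= ~VIRTIO_CONFIG_S_FEATURES_OK
        self.status = value

    def get_status(self) -> int:
        return self.status


def set_features_ok(backend: StatusBackEnd) -> bool:
    """Add FEATURES_OK to the current status, then read it back."""
    current = backend.get_status()            # used as the template
    backend.set_status(current | VIRTIO_CONFIG_S_FEATURES_OK)
    # The read-back step: without it, a rejection goes unnoticed.
    return bool(backend.get_status() & VIRTIO_CONFIG_S_FEATURES_OK)


print(set_features_ok(StatusBackEnd(True)), set_features_ok(StatusBackEnd(False)))
```

Without the final `get_status()` call, the front-end has no way to learn that the back-end dropped the flag, which is the gap the commit message points out.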
* Re: [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS 2023-10-04 12:58 ` [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek @ 2023-10-05 17:08 ` Stefan Hajnoczi 2023-10-05 17:15 ` Michael S. Tsirkin 0 siblings, 1 reply; 53+ messages in thread From: Stefan Hajnoczi @ 2023-10-05 17:08 UTC (permalink / raw) To: Hanna Czenczek Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin [-- Attachment #1: Type: text/plain, Size: 1769 bytes --] On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: > There is no clearly defined purpose for the virtio status byte in > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio > feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK > protocol extension, it is possible for SET_FEATURES to return errors > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). > > As for implementations, SET_STATUS is not widely implemented. dpdk does > implement it, but only uses it to signal feature negotiation failure. > While it does log reset requests (SET_STATUS 0) as such, it effectively > ignores them, in contrast to RESET_OWNER (which is deprecated, and today > means the same thing as RESET_DEVICE). > > While qemu superficially has support for [GS]ET_STATUS, it does not > forward the guest-set status byte, but instead just makes it up > internally, and actually completely ignores what the back-end returns, > only using it as the template for a subsequent SET_STATUS to add single > bits to it. Notably, after setting FEATURES_OK, it never reads it back > to see whether the flag is still set, which is the only way in which > dpdk uses the status byte. > > As-is, no front-end or back-end can rely on the other side handling this > field in a useful manner, and it also provides no practical use over > other mechanisms the vhost-user protocol has, which are more clearly > defined. Deprecate it. 
> > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- > 1 file changed, 21 insertions(+), 7 deletions(-) Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 53+ messages in thread
* (no subject) 2023-10-05 17:08 ` Stefan Hajnoczi @ 2023-10-05 17:15 ` Michael S. Tsirkin 2023-10-06 7:48 ` [Virtio-fs] (no subject) Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Michael S. Tsirkin @ 2023-10-05 17:15 UTC (permalink / raw) Cc: Hanna Czenczek, qemu-devel, virtio-fs, German Maglione, Eugenio Pérez, Anton Kuchin On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: > > There is no clearly defined purpose for the virtio status byte in > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio > > feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK > > protocol extension, it is possible for SET_FEATURES to return errors > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). > > > > As for implementations, SET_STATUS is not widely implemented. dpdk does > > implement it, but only uses it to signal feature negotiation failure. > > While it does log reset requests (SET_STATUS 0) as such, it effectively > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today > > means the same thing as RESET_DEVICE). > > > > While qemu superficially has support for [GS]ET_STATUS, it does not > > forward the guest-set status byte, but instead just makes it up > > internally, and actually completely ignores what the back-end returns, > > only using it as the template for a subsequent SET_STATUS to add single > > bits to it. Notably, after setting FEATURES_OK, it never reads it back > > to see whether the flag is still set, which is the only way in which > > dpdk uses the status byte. > > > > As-is, no front-end or back-end can rely on the other side handling this > > field in a useful manner, and it also provides no practical use over > > other mechanisms the vhost-user protocol has, which are more clearly > > defined. Deprecate it. 
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > --- > > docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- > > 1 file changed, 21 insertions(+), 7 deletions(-) > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. The fact current backends never check errors does not mean they never will. So no, not applying this. -- MST ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-05 17:15 ` Michael S. Tsirkin @ 2023-10-06 7:48 ` Hanna Czenczek 2023-10-06 8:45 ` Michael S. Tsirkin 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-06 7:48 UTC (permalink / raw) To: Michael S. Tsirkin Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin On 05.10.23 19:15, Michael S. Tsirkin wrote: > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >>> There is no clearly defined purpose for the virtio status byte in >>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio >>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK >>> protocol extension, it is possible for SET_FEATURES to return errors >>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). >>> >>> As for implementations, SET_STATUS is not widely implemented. dpdk does >>> implement it, but only uses it to signal feature negotiation failure. >>> While it does log reset requests (SET_STATUS 0) as such, it effectively >>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today >>> means the same thing as RESET_DEVICE). >>> >>> While qemu superficially has support for [GS]ET_STATUS, it does not >>> forward the guest-set status byte, but instead just makes it up >>> internally, and actually completely ignores what the back-end returns, >>> only using it as the template for a subsequent SET_STATUS to add single >>> bits to it. Notably, after setting FEATURES_OK, it never reads it back >>> to see whether the flag is still set, which is the only way in which >>> dpdk uses the status byte. >>> >>> As-is, no front-end or back-end can rely on the other side handling this >>> field in a useful manner, and it also provides no practical use over >>> other mechanisms the vhost-user protocol has, which are more clearly >>> defined. Deprecate it. 
>>> >>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>> --- >>> docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- >>> 1 file changed, 21 insertions(+), 7 deletions(-) >> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. > The fact current backends never check errors does not mean they never > will. So no, not applying this. Can this not be done with REPLY_ACK? I.e., with the following message order: 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is present 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK 4. SET_FEATURES with need_reply If not, the problem is that qemu has sent SET_STATUS 0 for a while when the vCPUs are stopped, which generally seems to request a device reset. If we don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will implement SET_STATUS later may break with at least these qemu versions. But documenting that a particular use of the status byte is to be ignored would be really strange. Hanna ^ permalink raw reply [flat|nested] 53+ messages in thread
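The four-step message order proposed above can be sketched as follows. This is a minimal illustration under stated assumptions, not qemu code: `FakeBackEnd`, `encode` and `negotiate` are hypothetical names, the back-end lives in memory instead of behind a socket, and it returns plain integers rather than encoded reply messages. The request IDs, feature bit numbers and header flag bits are the values as I understand them from the vhost-user specification and should be checked against it.

```python
import struct

# Values as understood from the vhost-user specification (verify before use).
VHOST_USER_GET_FEATURES = 1
VHOST_USER_SET_FEATURES = 2
VHOST_USER_GET_PROTOCOL_FEATURES = 15
VHOST_USER_SET_PROTOCOL_FEATURES = 16
VHOST_USER_F_PROTOCOL_FEATURES = 30      # virtio feature bit
VHOST_USER_PROTOCOL_F_REPLY_ACK = 3      # protocol feature bit
FLAG_VERSION = 0x1
FLAG_NEED_REPLY = 0x8                    # bit 3 of the header flags

REPLY_ACK_MASK = 1 << VHOST_USER_PROTOCOL_F_REPLY_ACK


def encode(request, payload=None, need_reply=False):
    """Header (request, flags, size) followed by an optional u64 payload."""
    flags = FLAG_VERSION | (FLAG_NEED_REPLY if need_reply else 0)
    body = b"" if payload is None else struct.pack("=Q", payload)
    return struct.pack("=III", request, flags, len(body)) + body


class FakeBackEnd:
    """Hypothetical back-end; `supported` is the feature mask it accepts."""

    def __init__(self, supported):
        self.supported = supported | (1 << VHOST_USER_F_PROTOCOL_FEATURES)
        self.reply_ack = False

    def handle(self, message):
        request, flags, size = struct.unpack_from("=III", message)
        payload = struct.unpack_from("=Q", message, 12)[0] if size else None
        if request == VHOST_USER_GET_FEATURES:
            return self.supported
        if request == VHOST_USER_GET_PROTOCOL_FEATURES:
            return REPLY_ACK_MASK
        if request == VHOST_USER_SET_PROTOCOL_FEATURES:
            self.reply_ack = bool(payload & REPLY_ACK_MASK)
            return None
        if request == VHOST_USER_SET_FEATURES:
            accepted = (payload & ~self.supported) == 0
            if self.reply_ack and flags & FLAG_NEED_REPLY:
                return 0 if accepted else 1   # zero means success
            return None
        raise ValueError(f"unhandled request {request}")


def negotiate(backend, wanted):
    """Return True/False for accept/reject, or None if REPLY_ACK is missing."""
    offered = backend.handle(encode(VHOST_USER_GET_FEATURES))              # 1
    if not offered & (1 << VHOST_USER_F_PROTOCOL_FEATURES):
        return None
    proto = backend.handle(encode(VHOST_USER_GET_PROTOCOL_FEATURES))       # 2
    if not proto & REPLY_ACK_MASK:
        return None
    backend.handle(encode(VHOST_USER_SET_PROTOCOL_FEATURES, REPLY_ACK_MASK))  # 3
    reply = backend.handle(
        encode(VHOST_USER_SET_FEATURES, wanted, need_reply=True))          # 4
    return reply == 0


print(negotiate(FakeBackEnd(0b111), 0b101), negotiate(FakeBackEnd(0b001), 0b010))
```

The point of the sketch is that the front-end learns the outcome of SET_FEATURES from the reply itself, so no status byte is involved.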
* Re: [Virtio-fs] (no subject) 2023-10-06 7:48 ` [Virtio-fs] (no subject) Hanna Czenczek @ 2023-10-06 8:45 ` Michael S. Tsirkin 2023-10-06 9:15 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Michael S. Tsirkin @ 2023-10-06 8:45 UTC (permalink / raw) To: Hanna Czenczek Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: > On 05.10.23 19:15, Michael S. Tsirkin wrote: > > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: > > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: > > > > There is no clearly defined purpose for the virtio status byte in > > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio > > > > feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK > > > > protocol extension, it is possible for SET_FEATURES to return errors > > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). > > > > > > > > As for implementations, SET_STATUS is not widely implemented. dpdk does > > > > implement it, but only uses it to signal feature negotiation failure. > > > > While it does log reset requests (SET_STATUS 0) as such, it effectively > > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today > > > > means the same thing as RESET_DEVICE). > > > > > > > > While qemu superficially has support for [GS]ET_STATUS, it does not > > > > forward the guest-set status byte, but instead just makes it up > > > > internally, and actually completely ignores what the back-end returns, > > > > only using it as the template for a subsequent SET_STATUS to add single > > > > bits to it. Notably, after setting FEATURES_OK, it never reads it back > > > > to see whether the flag is still set, which is the only way in which > > > > dpdk uses the status byte. 
> > > > > > > > As-is, no front-end or back-end can rely on the other side handling this > > > > field in a useful manner, and it also provides no practical use over > > > > other mechanisms the vhost-user protocol has, which are more clearly > > > > defined. Deprecate it. > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > --- > > > > docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- > > > > 1 file changed, 21 insertions(+), 7 deletions(-) > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. > > The fact current backends never check errors does not mean they never > > will. So no, not applying this. > > Can this not be done with REPLY_ACK? I.e., with the following message > order: > > 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is > present > 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK > 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK > 4. SET_FEATURES with need_reply > > If not, the problem is that qemu has sent SET_STATUS 0 for a while when the > vCPUs are stopped, which generally seems to request a device reset. If we > don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will > implement SET_STATUS later may break with at least these qemu versions. But > documenting that a particular use of the status byte is to be ignored would > be really strange. > > Hanna Hmm I guess. Though just following virtio spec seems cleaner to me... vhost-user reconfigures the state fully on start. I guess symmetry was the point. So I don't see why SET_STATUS 0 has to be ignored. SET_STATUS was introduced by: commit 923b8921d210763359e96246a58658ac0db6c645 Author: Yajun Wu <yajunw@nvidia.com> Date: Mon Oct 17 14:44:52 2022 +0800 vhost-user: Support vhost_dev_start CC the author. 
-- MST ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-06 8:45 ` Michael S. Tsirkin @ 2023-10-06 9:15 ` Hanna Czenczek 2023-10-06 9:26 ` Michael S. Tsirkin 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-06 9:15 UTC (permalink / raw) To: Michael S. Tsirkin Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu On 06.10.23 10:45, Michael S. Tsirkin wrote: > On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >>>>> There is no clearly defined purpose for the virtio status byte in >>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio >>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK >>>>> protocol extension, it is possible for SET_FEATURES to return errors >>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). >>>>> >>>>> As for implementations, SET_STATUS is not widely implemented. dpdk does >>>>> implement it, but only uses it to signal feature negotiation failure. >>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively >>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today >>>>> means the same thing as RESET_DEVICE). >>>>> >>>>> While qemu superficially has support for [GS]ET_STATUS, it does not >>>>> forward the guest-set status byte, but instead just makes it up >>>>> internally, and actually completely ignores what the back-end returns, >>>>> only using it as the template for a subsequent SET_STATUS to add single >>>>> bits to it. Notably, after setting FEATURES_OK, it never reads it back >>>>> to see whether the flag is still set, which is the only way in which >>>>> dpdk uses the status byte. 
>>>>> >>>>> As-is, no front-end or back-end can rely on the other side handling this >>>>> field in a useful manner, and it also provides no practical use over >>>>> other mechanisms the vhost-user protocol has, which are more clearly >>>>> defined. Deprecate it. >>>>> >>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>> --- >>>>> docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- >>>>> 1 file changed, 21 insertions(+), 7 deletions(-) >>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> >>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. >>> The fact current backends never check errors does not mean they never >>> will. So no, not applying this. >> Can this not be done with REPLY_ACK? I.e., with the following message >> order: >> >> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is >> present >> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK >> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK >> 4. SET_FEATURES with need_reply >> >> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the >> vCPUs are stopped, which generally seems to request a device reset. If we >> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will >> implement SET_STATUS later may break with at least these qemu versions. But >> documenting that a particular use of the status byte is to be ignored would >> be really strange. >> >> Hanna > Hmm I guess. Though just following virtio spec seems cleaner to me... > vhost-user reconfigures the state fully on start. Not the internal device state, though. virtiofsd has internal state, and other devices like vhost-gpu back-ends would probably, too. Stefan has recently sent a series (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to put the reset (RESET_DEVICE) into virtio_reset() (when we really need a reset). 
I really don’t like our current approach with the status byte. Following the virtio specification to me would mean that the guest directly controls this byte, which it does not. qemu makes up values as it deems appropriate, and this includes sending a SET_STATUS 0 when the guest is just paused, i.e. when the guest really doesn’t want a device reset. That means that qemu does not treat this as a virtio device field (because that would mean exposing it to the guest driver), but instead treats it as part of the vhost(-user) protocol. It doesn’t feel right to me that we use a virtio-defined feature for communication on the vhost level, i.e. between front-end and back-end, and not between guest driver and device. I think all vhost-level protocol features should be fully defined in the vhost-user specification, which REPLY_ACK is. Now, we could hand full control of the status byte to the guest, and that would make me content. But I feel like that doesn’t really work, because qemu needs to intercept the status byte anyway (it needs to know when there is a reset, probably wants to know when the device is configured, etc.), so I don’t think having the status byte in vhost-user really gains us much when qemu could translate status byte changes to/from other vhost-user commands. Hanna > I guess symmetry was the > point. So I don't see why SET_STATUS 0 has to be ignored. > > > SET_STATUS was introduced by: > > commit 923b8921d210763359e96246a58658ac0db6c645 > Author: Yajun Wu <yajunw@nvidia.com> > Date: Mon Oct 17 14:44:52 2022 +0800 > > vhost-user: Support vhost_dev_start > > CC the author. > ^ permalink raw reply [flat|nested] 53+ messages in thread
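The translation idea from the last paragraph above could be sketched like this. It is a hypothetical illustration only: `translate_status_write` and `status_after_set_features` are invented names, the command strings are labels rather than real message encodings, and nothing here reflects how qemu is actually structured. The status bit values are those of the virtio specification. The guest's status byte stays inside the front-end, which turns the relevant transitions into RESET_DEVICE and SET_FEATURES (with need_reply) and feeds the SET_FEATURES reply back into the status the guest sees.

```python
# Device status bits from the virtio specification.
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8


def translate_status_write(old: int, new: int) -> list:
    """Map a guest write to the status byte onto vhost-user commands."""
    if new == 0:
        # Writing 0 is a device reset in virtio terms.
        return ["VHOST_USER_RESET_DEVICE"]
    commands = []
    if new & FEATURES_OK and not old & FEATURES_OK:
        # Ask the back-end to confirm the features via REPLY_ACK.
        commands.append("VHOST_USER_SET_FEATURES (need_reply)")
    return commands


def status_after_set_features(requested: int, reply: int) -> int:
    """Status the guest reads back, given the SET_FEATURES reply payload."""
    if reply != 0:
        # Non-zero reply: the back-end rejected the features.
        return requested & ~FEATURES_OK
    return requested


print(translate_status_write(ACKNOWLEDGE | DRIVER, ACKNOWLEDGE | DRIVER | FEATURES_OK))
print(status_after_set_features(ACKNOWLEDGE | DRIVER | FEATURES_OK, 1))
```

In this model, pausing the guest produces no status write at all, so no reset reaches the back-end.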
* Re: [Virtio-fs] (no subject) 2023-10-06 9:15 ` Hanna Czenczek @ 2023-10-06 9:26 ` Michael S. Tsirkin 2023-10-06 9:47 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Michael S. Tsirkin @ 2023-10-06 9:26 UTC (permalink / raw) To: Hanna Czenczek Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: > On 06.10.23 10:45, Michael S. Tsirkin wrote: > > On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: > > > On 05.10.23 19:15, Michael S. Tsirkin wrote: > > > > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: > > > > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: > > > > > > There is no clearly defined purpose for the virtio status byte in > > > > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio > > > > > > feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK > > > > > > protocol extension, it is possible for SET_FEATURES to return errors > > > > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). > > > > > > > > > > > > As for implementations, SET_STATUS is not widely implemented. dpdk does > > > > > > implement it, but only uses it to signal feature negotiation failure. > > > > > > While it does log reset requests (SET_STATUS 0) as such, it effectively > > > > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today > > > > > > means the same thing as RESET_DEVICE). > > > > > > > > > > > > While qemu superficially has support for [GS]ET_STATUS, it does not > > > > > > forward the guest-set status byte, but instead just makes it up > > > > > > internally, and actually completely ignores what the back-end returns, > > > > > > only using it as the template for a subsequent SET_STATUS to add single > > > > > > bits to it. 
Notably, after setting FEATURES_OK, it never reads it back > > > > > > to see whether the flag is still set, which is the only way in which > > > > > > dpdk uses the status byte. > > > > > > > > > > > > As-is, no front-end or back-end can rely on the other side handling this > > > > > > field in a useful manner, and it also provides no practical use over > > > > > > other mechanisms the vhost-user protocol has, which are more clearly > > > > > > defined. Deprecate it. > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > --- > > > > > > docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- > > > > > > 1 file changed, 21 insertions(+), 7 deletions(-) > > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. > > > > The fact current backends never check errors does not mean they never > > > > will. So no, not applying this. > > > Can this not be done with REPLY_ACK? I.e., with the following message > > > order: > > > > > > 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is > > > present > > > 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK > > > 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK > > > 4. SET_FEATURES with need_reply > > > > > > If not, the problem is that qemu has sent SET_STATUS 0 for a while when the > > > vCPUs are stopped, which generally seems to request a device reset. If we > > > don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will > > > implement SET_STATUS later may break with at least these qemu versions. But > > > documenting that a particular use of the status byte is to be ignored would > > > be really strange. > > > > > > Hanna > > Hmm I guess. Though just following virtio spec seems cleaner to me... > > vhost-user reconfigures the state fully on start. 
> > Not the internal device state, though. virtiofsd has internal state, and > other devices like vhost-gpu back-ends would probably, too. > > Stefan has recently sent a series > (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to > put the reset (RESET_DEVICE) into virtio_reset() (when we really need a > reset). > > I really don’t like our current approach with the status byte. Following the > virtio specification to me would mean that the guest directly controls this > byte, which it does not. qemu makes up values as it deems appropriate, and > this includes sending a SET_STATUS 0 when the guest is just paused, i.e. > when the guest really doesn’t want a device reset. > > That means that qemu does not treat this as a virtio device field (because > that would mean exposing it to the guest driver), but instead treats it as > part of the vhost(-user) protocol. It doesn’t feel right to me that we use > a virtio-defined feature for communication on the vhost level, i.e. between > front-end and back-end, and not between guest driver and device. I think > all vhost-level protocol features should be fully defined in the vhost-user > specification, which REPLY_ACK is. Hmm that makes sense. Maybe we should have done what stefan's patch is doing. Do look at the original commit that introduced it to understand why it was added. > Now, we could hand full control of the status byte to the guest, and that > would make me content. But I feel like that doesn’t really work, because > qemu needs to intercept the status byte anyway (it needs to know when there > is a reset, probably wants to know when the device is configured, etc.), so > I don’t think having the status byte in vhost-user really gains us much when > qemu could translate status byte changes to/from other vhost-user commands. > > Hanna well it intercepts it but I think it could pass it on unchanged. > > I guess symmetry was the > > point. So I don't see why SET_STATUS 0 has to be ignored. 
> > > > > > SET_STATUS was introduced by: > > > > commit 923b8921d210763359e96246a58658ac0db6c645 > > Author: Yajun Wu <yajunw@nvidia.com> > > Date: Mon Oct 17 14:44:52 2022 +0800 > > > > vhost-user: Support vhost_dev_start > > > > CC the author. > > ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-06 9:26 ` Michael S. Tsirkin @ 2023-10-06 9:47 ` Hanna Czenczek 2023-10-06 10:34 ` Michael S. Tsirkin 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-06 9:47 UTC (permalink / raw) To: Michael S. Tsirkin Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu On 06.10.23 11:26, Michael S. Tsirkin wrote: > On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >>>>>>> There is no clearly defined purpose for the virtio status byte in >>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio >>>>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK >>>>>>> protocol extension, it is possible for SET_FEATURES to return errors >>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). >>>>>>> >>>>>>> As for implementations, SET_STATUS is not widely implemented. dpdk does >>>>>>> implement it, but only uses it to signal feature negotiation failure. >>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively >>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today >>>>>>> means the same thing as RESET_DEVICE). >>>>>>> >>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not >>>>>>> forward the guest-set status byte, but instead just makes it up >>>>>>> internally, and actually completely ignores what the back-end returns, >>>>>>> only using it as the template for a subsequent SET_STATUS to add single >>>>>>> bits to it. 
Notably, after setting FEATURES_OK, it never reads it back >>>>>>> to see whether the flag is still set, which is the only way in which >>>>>>> dpdk uses the status byte. >>>>>>> >>>>>>> As-is, no front-end or back-end can rely on the other side handling this >>>>>>> field in a useful manner, and it also provides no practical use over >>>>>>> other mechanisms the vhost-user protocol has, which are more clearly >>>>>>> defined. Deprecate it. >>>>>>> >>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>>>> --- >>>>>>> docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- >>>>>>> 1 file changed, 21 insertions(+), 7 deletions(-) >>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. >>>>> The fact current backends never check errors does not mean they never >>>>> will. So no, not applying this. >>>> Can this not be done with REPLY_ACK? I.e., with the following message >>>> order: >>>> >>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is >>>> present >>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK >>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK >>>> 4. SET_FEATURES with need_reply >>>> >>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the >>>> vCPUs are stopped, which generally seems to request a device reset. If we >>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will >>>> implement SET_STATUS later may break with at least these qemu versions. But >>>> documenting that a particular use of the status byte is to be ignored would >>>> be really strange. >>>> >>>> Hanna >>> Hmm I guess. Though just following virtio spec seems cleaner to me... >>> vhost-user reconfigures the state fully on start. >> Not the internal device state, though. 
virtiofsd has internal state, and >> other devices like vhost-gpu back-ends would probably, too. >> >> Stefan has recently sent a series >> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to >> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a >> reset). >> >> I really don’t like our current approach with the status byte. Following the >> virtio specification to me would mean that the guest directly controls this >> byte, which it does not. qemu makes up values as it deems appropriate, and >> this includes sending a SET_STATUS 0 when the guest is just paused, i.e. >> when the guest really doesn’t want a device reset. >> >> That means that qemu does not treat this as a virtio device field (because >> that would mean exposing it to the guest driver), but instead treats it as >> part of the vhost(-user) protocol. It doesn’t feel right to me that we use >> a virtio-defined feature for communication on the vhost level, i.e. between >> front-end and back-end, and not between guest driver and device. I think >> all vhost-level protocol features should be fully defined in the vhost-user >> specification, which REPLY_ACK is. > Hmm that makes sense. Maybe we should have done what stefan's patch > is doing. > > Do look at the original commit that introduced it to understand why > it was added. I don’t understand why this was added to the stop/cont code, though. If it is time consuming to make these changes, why are they done every time the VM is paused and resumed? It makes sense that this would be done for the initial configuration (where a reset also wouldn’t hurt), but here it seems wrong. (To be clear, a reset in the stop/cont code is wrong, because it breaks stateful devices.) Also, note the newer commits 6f8be29ec17 and c3716f260bf. The reset as originally introduced was wrong even for non-stateful devices, because it occurred before we fetched the state (vring indices) so we could restore it later. 
I don’t know how 923b8921d21 was tested, but if the back-end used for testing implemented SET_STATUS 0 as a reset, it could not have survived either migration or a stop/cont in general, because the vring indices would have been reset to 0. What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all devices that would implement them as per virtio spec, and even today it’s broken for stateful devices. The mentioned performance issue is likely real, but we can’t address it by making up SET_STATUS calls that are wrong. I concede that I didn’t think about DRIVER_OK. Personally, I would do all final configuration that would happen upon a DRIVER_OK once the first vring is started (i.e. receives a kick). That has the added benefit of being asynchronous because it doesn’t block any vhost-user messages (which are synchronous, and thus block downtime). Hanna >> Now, we could hand full control of the status byte to the guest, and that >> would make me content. But I feel like that doesn’t really work, because >> qemu needs to intercept the status byte anyway (it needs to know when there >> is a reset, probably wants to know when the device is configured, etc.), so >> I don’t think having the status byte in vhost-user really gains us much when >> qemu could translate status byte changes to/from other vhost-user commands. >> >> Hanna > well it intercepts it but I think it could pass it on unchanged. > > >>> I guess symmetry was the >>> point. So I don't see why SET_STATUS 0 has to be ignored. >>> >>> >>> SET_STATUS was introduced by: >>> >>> commit 923b8921d210763359e96246a58658ac0db6c645 >>> Author: Yajun Wu <yajunw@nvidia.com> >>> Date: Mon Oct 17 14:44:52 2022 +0800 >>> >>> vhost-user: Support vhost_dev_start >>> >>> CC the author. >>> ^ permalink raw reply [flat|nested] 53+ messages in thread
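The suggestion in the message above, to treat the first vring start as the signal for device-wide configuration, can be sketched roughly as follows. This is a minimal Python model, not QEMU or virtiofsd code: `BackEnd`, `Vring`, `on_kick`, and `_configure_device` are hypothetical names, and the sketch assumes that a kick is what starts a vring.

```python
from dataclasses import dataclass, field


@dataclass
class Vring:
    started: bool = False


@dataclass
class BackEnd:
    """Toy back-end model; every name here is hypothetical."""
    vrings: list = field(default_factory=lambda: [Vring(), Vring()])
    globally_configured: bool = False
    global_config_runs: int = 0

    def _configure_device(self):
        # Stand-in for the expensive setup a back-end might otherwise
        # perform when it sees DRIVER_OK.
        self.global_config_runs += 1
        self.globally_configured = True

    def on_kick(self, index):
        # A kick starts only the vring it arrives on.
        self.vrings[index].started = True
        # The first vring to start doubles as the device-wide signal,
        # so the global setup runs once, not once per vring.
        if not self.globally_configured:
            self._configure_device()
```

Because the setup runs from the kick handler rather than from a vhost-user message handler, it would not hold up the synchronous message exchange, which is the benefit claimed above. Whether the first kick alone is an acceptable trigger for a given device is left open in the thread.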
* Re: [Virtio-fs] (no subject) 2023-10-06 9:47 ` Hanna Czenczek @ 2023-10-06 10:34 ` Michael S. Tsirkin 2023-10-06 11:42 ` Hanna Czenczek 2023-10-07 2:22 ` Yajun Wu 0 siblings, 2 replies; 53+ messages in thread From: Michael S. Tsirkin @ 2023-10-06 10:34 UTC (permalink / raw) To: Hanna Czenczek Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: > On 06.10.23 11:26, Michael S. Tsirkin wrote: > > On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: > > > On 06.10.23 10:45, Michael S. Tsirkin wrote: > > > > On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: > > > > > On 05.10.23 19:15, Michael S. Tsirkin wrote: > > > > > > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: > > > > > > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: > > > > > > > > There is no clearly defined purpose for the virtio status byte in > > > > > > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio > > > > > > > > feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK > > > > > > > > protocol extension, it is possible for SET_FEATURES to return errors > > > > > > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). > > > > > > > > > > > > > > > > As for implementations, SET_STATUS is not widely implemented. dpdk does > > > > > > > > implement it, but only uses it to signal feature negotiation failure. > > > > > > > > While it does log reset requests (SET_STATUS 0) as such, it effectively > > > > > > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today > > > > > > > > means the same thing as RESET_DEVICE). 
> > > > > > > > > > > > > > > > While qemu superficially has support for [GS]ET_STATUS, it does not > > > > > > > > forward the guest-set status byte, but instead just makes it up > > > > > > > > internally, and actually completely ignores what the back-end returns, > > > > > > > > only using it as the template for a subsequent SET_STATUS to add single > > > > > > > > bits to it. Notably, after setting FEATURES_OK, it never reads it back > > > > > > > > to see whether the flag is still set, which is the only way in which > > > > > > > > dpdk uses the status byte. > > > > > > > > > > > > > > > > As-is, no front-end or back-end can rely on the other side handling this > > > > > > > > field in a useful manner, and it also provides no practical use over > > > > > > > > other mechanisms the vhost-user protocol has, which are more clearly > > > > > > > > defined. Deprecate it. > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > --- > > > > > > > > docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- > > > > > > > > 1 file changed, 21 insertions(+), 7 deletions(-) > > > > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. > > > > > > The fact current backends never check errors does not mean they never > > > > > > will. So no, not applying this. > > > > > Can this not be done with REPLY_ACK? I.e., with the following message > > > > > order: > > > > > > > > > > 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is > > > > > present > > > > > 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK > > > > > 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK > > > > > 4. 
SET_FEATURES with need_reply > > > > > > > > > > If not, the problem is that qemu has sent SET_STATUS 0 for a while when the > > > > > vCPUs are stopped, which generally seems to request a device reset. If we > > > > > don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will > > > > > implement SET_STATUS later may break with at least these qemu versions. But > > > > > documenting that a particular use of the status byte is to be ignored would > > > > > be really strange. > > > > > > > > > > Hanna > > > > Hmm I guess. Though just following virtio spec seems cleaner to me... > > > > vhost-user reconfigures the state fully on start. > > > Not the internal device state, though. virtiofsd has internal state, and > > > other devices like vhost-gpu back-ends would probably, too. > > > > > > Stefan has recently sent a series > > > (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to > > > put the reset (RESET_DEVICE) into virtio_reset() (when we really need a > > > reset). > > > > > > I really don’t like our current approach with the status byte. Following the > > > virtio specification to me would mean that the guest directly controls this > > > byte, which it does not. qemu makes up values as it deems appropriate, and > > > this includes sending a SET_STATUS 0 when the guest is just paused, i.e. > > > when the guest really doesn’t want a device reset. > > > > > > That means that qemu does not treat this as a virtio device field (because > > > that would mean exposing it to the guest driver), but instead treats it as > > > part of the vhost(-user) protocol. It doesn’t feel right to me that we use > > > a virtio-defined feature for communication on the vhost level, i.e. between > > > front-end and back-end, and not between guest driver and device. I think > > > all vhost-level protocol features should be fully defined in the vhost-user > > > specification, which REPLY_ACK is. > > Hmm that makes sense. 
Maybe we should have done what stefan's patch > > is doing. > > > > Do look at the original commit that introduced it to understand why > > it was added. > > I don’t understand why this was added to the stop/cont code, though. If it > is time consuming to make these changes, why are they done every time the VM > is paused > and resumed? It makes sense that this would be done for the initial > configuration (where a reset also wouldn’t hurt), but here it seems wrong. > > (To be clear, a reset in the stop/cont code is wrong, because it breaks > stateful devices.) > > Also, note the newer commits 6f8be29ec17 and c3716f260bf. The reset as > originally introduced was wrong even for non-stateful devices, because it > occurred before we fetched the state (vring indices) so we could restore it > later. I don’t know how 923b8921d21 was tested, but if the back-end used > for testing implemented SET_STATUS 0 as a reset, it could not have survived > either migration or a stop/cont in general, because the vring indices would > have been reset to 0. > > What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all > devices that would implement them as per virtio spec, and even today it’s > broken for stateful devices. The mentioned performance issue is likely > real, but we can’t address it by making up SET_STATUS calls that are wrong. > > I concede that I didn’t think about DRIVER_OK. Personally, I would do all > final configuration that would happen upon a DRIVER_OK once the first vring > is started (i.e. receives a kick). That has the added benefit of being > asynchronous because it doesn’t block any vhost-user messages (which are > synchronous, and thus block downtime). > > Hanna For better or worse kick is per ring. It's out of spec to start rings that were not kicked but I guess you could do configuration ... Seems somewhat asymmetrical though. Let's wait until next week, hopefully Yajun Wu will answer. 
> > > Now, we could hand full control of the status byte to the guest, and that > > > would make me content. But I feel like that doesn’t really work, because > > > qemu needs to intercept the status byte anyway (it needs to know when there > > > is a reset, probably wants to know when the device is configured, etc.), so > > > I don’t think having the status byte in vhost-user really gains us much when > > > qemu could translate status byte changes to/from other vhost-user commands. > > > > > > Hanna > > well it intercepts it but I think it could pass it on unchanged. > > > > > > > > I guess symmetry was the > > > > point. So I don't see why SET_STATUS 0 has to be ignored. > > > > > > > > > > > > SET_STATUS was introduced by: > > > > > > > > commit 923b8921d210763359e96246a58658ac0db6c645 > > > > Author: Yajun Wu <yajunw@nvidia.com> > > > > Date: Mon Oct 17 14:44:52 2022 +0800 > > > > > > > > vhost-user: Support vhost_dev_start > > > > > > > > CC the author. > > > > ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-06 10:34 ` Michael S. Tsirkin @ 2023-10-06 11:42 ` Hanna Czenczek 2023-10-06 15:17 ` Alex Bennée 2023-10-07 2:22 ` Yajun Wu 1 sibling, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-06 11:42 UTC (permalink / raw) To: Michael S. Tsirkin Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu On 06.10.23 12:34, Michael S. Tsirkin wrote: > On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: >> On 06.10.23 11:26, Michael S. Tsirkin wrote: >>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >>>>>>>>> There is no clearly defined purpose for the virtio status byte in >>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio >>>>>>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK >>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors >>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). >>>>>>>>> >>>>>>>>> As for implementations, SET_STATUS is not widely implemented. dpdk does >>>>>>>>> implement it, but only uses it to signal feature negotiation failure. >>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively >>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today >>>>>>>>> means the same thing as RESET_DEVICE). 
>>>>>>>>> >>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not >>>>>>>>> forward the guest-set status byte, but instead just makes it up >>>>>>>>> internally, and actually completely ignores what the back-end returns, >>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single >>>>>>>>> bits to it. Notably, after setting FEATURES_OK, it never reads it back >>>>>>>>> to see whether the flag is still set, which is the only way in which >>>>>>>>> dpdk uses the status byte. >>>>>>>>> >>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this >>>>>>>>> field in a useful manner, and it also provides no practical use over >>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly >>>>>>>>> defined. Deprecate it. >>>>>>>>> >>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>>>>>> --- >>>>>>>>> docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- >>>>>>>>> 1 file changed, 21 insertions(+), 7 deletions(-) >>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. >>>>>>> The fact current backends never check errors does not mean they never >>>>>>> will. So no, not applying this. >>>>>> Can this not be done with REPLY_ACK? I.e., with the following message >>>>>> order: >>>>>> >>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is >>>>>> present >>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK >>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK >>>>>> 4. SET_FEATURES with need_reply >>>>>> >>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the >>>>>> vCPUs are stopped, which generally seems to request a device reset. 
If we >>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will >>>>>> implement SET_STATUS later may break with at least these qemu versions. But >>>>>> documenting that a particular use of the status byte is to be ignored would >>>>>> be really strange. >>>>>> >>>>>> Hanna >>>>> Hmm I guess. Though just following virtio spec seems cleaner to me... >>>>> vhost-user reconfigures the state fully on start. >>>> Not the internal device state, though. virtiofsd has internal state, and >>>> other devices like vhost-gpu back-ends would probably, too. >>>> >>>> Stefan has recently sent a series >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to >>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a >>>> reset). >>>> >>>> I really don’t like our current approach with the status byte. Following the >>>> virtio specification to me would mean that the guest directly controls this >>>> byte, which it does not. qemu makes up values as it deems appropriate, and >>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e. >>>> when the guest really doesn’t want a device reset. >>>> >>>> That means that qemu does not treat this as a virtio device field (because >>>> that would mean exposing it to the guest driver), but instead treats it as >>>> part of the vhost(-user) protocol. It doesn’t feel right to me that we use >>>> a virtio-defined feature for communication on the vhost level, i.e. between >>>> front-end and back-end, and not between guest driver and device. I think >>>> all vhost-level protocol features should be fully defined in the vhost-user >>>> specification, which REPLY_ACK is. >>> Hmm that makes sense. Maybe we should have done what stefan's patch >>> is doing. >>> >>> Do look at the original commit that introduced it to understand why >>> it was added. >> I don’t understand why this was added to the stop/cont code, though. 
If it >> is time consuming to make these changes, why are they done every time the VM >> is paused >> and resumed? It makes sense that this would be done for the initial >> configuration (where a reset also wouldn’t hurt), but here it seems wrong. >> >> (To be clear, a reset in the stop/cont code is wrong, because it breaks >> stateful devices.) >> >> Also, note the newer commits 6f8be29ec17 and c3716f260bf. The reset as >> originally introduced was wrong even for non-stateful devices, because it >> occurred before we fetched the state (vring indices) so we could restore it >> later. I don’t know how 923b8921d21 was tested, but if the back-end used >> for testing implemented SET_STATUS 0 as a reset, it could not have survived >> either migration or a stop/cont in general, because the vring indices would >> have been reset to 0. >> >> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all >> devices that would implement them as per virtio spec, and even today it’s >> broken for stateful devices. The mentioned performance issue is likely >> real, but we can’t address it by making up SET_STATUS calls that are wrong. >> >> I concede that I didn’t think about DRIVER_OK. Personally, I would do all >> final configuration that would happen upon a DRIVER_OK once the first vring >> is started (i.e. receives a kick). That has the added benefit of being >> asynchronous because it doesn’t block any vhost-user messages (which are >> synchronous, and thus block downtime). >> >> Hanna > > For better or worse kick is per ring. It's out of spec to start rings > that were not kicked but I guess you could do configuration ... > Seems somewhat asymmetrical though. I meant to take the first ring being started as the signal to do the global configuration, i.e. not do this once per vring, but once globally. > Let's wait until next week, hopefully Yajun Wu will answer. I mean, personally I don’t really care about the whole SET_STATUS thing. 
It’s clear that it’s broken for stateful devices. The fact that it took until 6f8be29ec17d to fix it for just any device that would implement it according to spec to me is a strong indication that nobody does implement it according to spec, and is currently only used to signal to some specific back-end that all rings have been set up and should be configured in a single block. (By the way, our SET_STATUS call that adds ACKNOWLEDGE | DRIVER | DRIVER_OK is also completely against the spec, and any well-behaving device should reject it. These flags must be set one after another, and specifically, features must be read and set after setting DRIVER, but before setting FEATURES_OK, and FEATURES_OK must be set before setting DRIVER_OK. Any well-behaving device should error out when DRIVER_OK is set without FEATURES_OK set, or when FEATURES_OK is set without ACKNOWLEDGE | DRIVER set.) I can just drop this patch from the migration series, because in my opinion it doesn’t affect it whatsoever (although I understood Stefan disagrees). But honestly, I think any vhost-user back-end developer is well-advised to completely ignore the status byte. Not ignoring it means relying on qemu’s implementation-defined behavior. Hanna ^ permalink raw reply [flat|nested] 53+ messages in thread
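The ordering rules described in the message above can be sketched as a small check that a strict device might apply to each status write. This is an illustrative Python model under stated assumptions, not code from QEMU, DPDK, or any back-end: the bit values are those of the virtio device status field, `status_write_ok` and `PREREQUISITES` are hypothetical names, and the sketch ignores FAILED, DEVICE_NEEDS_RESET, and legacy devices that lack FEATURES_OK.

```python
# Virtio device status bits.
ACKNOWLEDGE = 0x01
DRIVER = 0x02
DRIVER_OK = 0x04
FEATURES_OK = 0x08

# Bits that must already be set before each bit may be added.
PREREQUISITES = {
    ACKNOWLEDGE: 0,
    DRIVER: ACKNOWLEDGE,
    FEATURES_OK: ACKNOWLEDGE | DRIVER,
    DRIVER_OK: ACKNOWLEDGE | DRIVER | FEATURES_OK,
}


def status_write_ok(old, new):
    """Return True if writing `new` over `old` follows the strict ordering."""
    if new == 0:
        return True   # writing 0 is a reset request
    if old & ~new:
        return False  # bits may only be cleared through a reset
    added = new & ~old
    if added == 0:
        return True   # rewriting the current value changes nothing
    if added not in PREREQUISITES:
        return False  # several bits at once, or an unmodelled bit
    needed = PREREQUISITES[added]
    return (old & needed) == needed
```

Under this model, a single write that adds ACKNOWLEDGE | DRIVER | DRIVER_OK to an empty status byte is rejected, both because it sets several bits at once and because FEATURES_OK is missing.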
* Re: [Virtio-fs] (no subject) 2023-10-06 11:42 ` Hanna Czenczek @ 2023-10-06 15:17 ` Alex Bennée 2023-10-06 15:47 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Alex Bennée @ 2023-10-06 15:17 UTC (permalink / raw) To: Hanna Czenczek Cc: Michael S. Tsirkin, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu, qemu-devel Hanna Czenczek <hreitz@redhat.com> writes: > On 06.10.23 12:34, Michael S. Tsirkin wrote: >> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: >>> On 06.10.23 11:26, Michael S. Tsirkin wrote: >>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: <snip> >>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all >>> devices that would implement them as per virtio spec, and even today it’s >>> broken for stateful devices. The mentioned performance issue is likely >>> real, but we can’t address it by making up SET_STATUS calls that are wrong. >>> >>> I concede that I didn’t think about DRIVER_OK. Personally, I would do all >>> final configuration that would happen upon a DRIVER_OK once the first vring >>> is started (i.e. receives a kick). That has the added benefit of being >>> asynchronous because it doesn’t block any vhost-user messages (which are >>> synchronous, and thus block downtime). >>> >>> Hanna >> >> For better or worse kick is per ring. It's out of spec to start rings >> that were not kicked but I guess you could do configuration ... >> Seems somewhat asymmetrical though. > > I meant to take the first ring being started as the signal to do the > global configuration, i.e. not do this once per vring, but once > globally. 
> >> Let's wait until next week, hopefully Yajun Wu will answer. > > I mean, personally I don’t really care about the whole SET_STATUS > thing. It’s clear that it’s broken for stateful devices. The fact > that it took until 6f8be29ec17d to fix it for just any device that > would implement it according to spec to me is a strong indication that > nobody does implement it according to spec, and is currently only used > to signal to some specific back-end that all rings have been set up > and should be configured in a single block. I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT extensions where everything is off-loaded to the vhost-user backend. -- Alex Bennée Virtualisation Tech Lead @ Linaro ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-06 15:17 ` Alex Bennée @ 2023-10-06 15:47 ` Hanna Czenczek 2023-10-06 20:49 ` Alex Bennée 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-06 15:47 UTC (permalink / raw) To: Alex Bennée Cc: Michael S. Tsirkin, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu, qemu-devel On 06.10.23 17:17, Alex Bennée wrote: > Hanna Czenczek <hreitz@redhat.com> writes: > >> On 06.10.23 12:34, Michael S. Tsirkin wrote: >>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: >>>> On 06.10.23 11:26, Michael S. Tsirkin wrote: >>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: > <snip> >>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all >>>> devices that would implement them as per virtio spec, and even today it’s >>>> broken for stateful devices. The mentioned performance issue is likely >>>> real, but we can’t address it by making up SET_STATUS calls that are wrong. >>>> >>>> I concede that I didn’t think about DRIVER_OK. Personally, I would do all >>>> final configuration that would happen upon a DRIVER_OK once the first vring >>>> is started (i.e. receives a kick). That has the added benefit of being >>>> asynchronous because it doesn’t block any vhost-user messages (which are >>>> synchronous, and thus block downtime). >>>> >>>> Hanna >>> For better or worse kick is per ring. It's out of spec to start rings >>> that were not kicked but I guess you could do configuration ... >>> Seems somewhat asymmetrical though. >> I meant to take the first ring being started as the signal to do the >> global configuration, i.e. 
not do this once per vring, but once >> globally. >> >>> Let's wait until next week, hopefully Yajun Wu will answer. >> I mean, personally I don’t really care about the whole SET_STATUS >> thing. It’s clear that it’s broken for stateful devices. The fact >> that it took until 6f8be29ec17d to fix it for just any device that >> would implement it according to spec to me is a strong indication that >> nobody does implement it according to spec, and is currently only used >> to signal to some specific back-end that all rings have been set up >> and should be configured in a single block. > I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT > extensions where everything is off-loaded to the vhost-user backend. How do these back-ends work with the fact that qemu uses SET_STATUS incorrectly when not offloading? Do you plan on fixing that? (I.e. that we send SET_STATUS 0 when the VM is paused, potentially resetting state that is not recoverable, and that we set DRIVER and DRIVER_OK simultaneously.) Hanna ^ permalink raw reply [flat|nested] 53+ messages in thread
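The pause problem raised above can be illustrated with a toy back-end that does treat SET_STATUS 0 as a reset, as the virtio spec would suggest. This is a hedged Python sketch, not real vhost-user code: `SpecCompliantBackEnd` and the two `pause_*` helpers are hypothetical, the model has a single vring, and it assumes the back-end keeps internal state that the front-end has no way to fetch.

```python
class SpecCompliantBackEnd:
    """Toy single-vring back-end; every name here is hypothetical."""

    def __init__(self):
        self.last_avail_idx = 0   # vring index, fetchable by the front-end
        self.internal_state = {}  # e.g. open handles; not fetchable here

    def handle_requests(self, count):
        # Pretend the guest submitted `count` requests that each left
        # some back-end-internal state behind.
        for _ in range(count):
            self.internal_state[self.last_avail_idx] = "open handle"
            self.last_avail_idx += 1

    def set_status(self, status):
        # Per the virtio spec, writing 0 requests a device reset.
        if status == 0:
            self.last_avail_idx = 0
            self.internal_state = {}

    def get_vring_base(self):
        return self.last_avail_idx

    def set_vring_base(self, index):
        self.last_avail_idx = index


def pause_reset_first(back_end):
    # Reset before fetching the vring index: the index is already gone.
    back_end.set_status(0)
    return back_end.get_vring_base()


def pause_fetch_first(back_end):
    # Fetch first, then reset, then restore: the index survives,
    # but the internal state does not.
    saved = back_end.get_vring_base()
    back_end.set_status(0)
    back_end.set_vring_base(saved)
    return saved
```

In the first ordering even a stateless device loses its vring position. In the second the vring index can be restored, yet the internal state is still gone, which is the case described above as not recoverable.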
* Re: [Virtio-fs] (no subject) 2023-10-06 15:47 ` Hanna Czenczek @ 2023-10-06 20:49 ` Alex Bennée 2023-10-09 8:07 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Alex Bennée @ 2023-10-06 20:49 UTC (permalink / raw) To: Hanna Czenczek Cc: Michael S. Tsirkin, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu, qemu-devel Hanna Czenczek <hreitz@redhat.com> writes: > On 06.10.23 17:17, Alex Bennée wrote: >> Hanna Czenczek <hreitz@redhat.com> writes: >> >>> On 06.10.23 12:34, Michael S. Tsirkin wrote: >>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: >>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote: >>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >> <snip> >>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all >>>>> devices that would implement them as per virtio spec, and even today it’s >>>>> broken for stateful devices. The mentioned performance issue is likely >>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong. >>>>> >>>>> I concede that I didn’t think about DRIVER_OK. Personally, I would do all >>>>> final configuration that would happen upon a DRIVER_OK once the first vring >>>>> is started (i.e. receives a kick). That has the added benefit of being >>>>> asynchronous because it doesn’t block any vhost-user messages (which are >>>>> synchronous, and thus block downtime). >>>>> >>>>> Hanna >>>> For better or worse kick is per ring. It's out of spec to start rings >>>> that were not kicked but I guess you could do configuration ... >>>> Seems somewhat asymmetrical though. 
>>> I meant to take the first ring being started as the signal to do the >>> global configuration, i.e. not do this once per vring, but once >>> globally. >>> >>>> Let's wait until next week, hopefully Yajun Wu will answer. >>> I mean, personally I don’t really care about the whole SET_STATUS >>> thing. It’s clear that it’s broken for stateful devices. The fact >>> that it took until 6f8be29ec17d to fix it for just any device that >>> would implement it according to spec to me is a strong indication that >>> nobody does implement it according to spec, and is currently only used >>> to signal to some specific back-end that all rings have been set up >>> and should be configured in a single block. >> I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT >> extensions where everything is off-loaded to the vhost-user backend. > > How do these back-ends work with the fact that qemu uses SET_STATUS > incorrectly when not offloading? Do you plan on fixing that? Mainly having a common base implementation which does it right and having very lightweight derivations for legacy stubs using it. The aim is to eliminate the need for QEMU stubs entirely by fully specifying the device from the vhost-user API. > (I.e. that we send SET_STATUS 0 when the VM is paused, potentially > resetting state that is not recoverable, and that we set DRIVER and > DRIVER_OK simultaneously.) This is QEMU simulating a SET_STATUS rather than the guest triggering it? > > Hanna -- Alex Bennée Virtualisation Tech Lead @ Linaro ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-06 20:49 ` Alex Bennée @ 2023-10-09 8:07 ` Hanna Czenczek 0 siblings, 0 replies; 53+ messages in thread From: Hanna Czenczek @ 2023-10-09 8:07 UTC (permalink / raw) To: Alex Bennée Cc: Michael S. Tsirkin, qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, Yajun Wu On 06.10.23 22:49, Alex Bennée wrote: > Hanna Czenczek <hreitz@redhat.com> writes: > >> On 06.10.23 17:17, Alex Bennée wrote: >>> Hanna Czenczek <hreitz@redhat.com> writes: >>> >>>> On 06.10.23 12:34, Michael S. Tsirkin wrote: >>>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: >>>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote: >>>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >>>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >>> <snip> >>>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all >>>>>> devices that would implement them as per virtio spec, and even today it’s >>>>>> broken for stateful devices. The mentioned performance issue is likely >>>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong. >>>>>> >>>>>> I concede that I didn’t think about DRIVER_OK. Personally, I would do all >>>>>> final configuration that would happen upon a DRIVER_OK once the first vring >>>>>> is started (i.e. receives a kick). That has the added benefit of being >>>>>> asynchronous because it doesn’t block any vhost-user messages (which are >>>>>> synchronous, and thus block downtime). >>>>>> >>>>>> Hanna >>>>> For better or worse kick is per ring. It's out of spec to start rings >>>>> that were not kicked but I guess you could do configuration ... >>>>> Seems somewhat asymmetrical though. 
>>>> I meant to take the first ring being started as the signal to do the >>>> global configuration, i.e. not do this once per vring, but once >>>> globally. >>>> >>>>> Let's wait until next week, hopefully Yajun Wu will answer. >>>> I mean, personally I don’t really care about the whole SET_STATUS >>>> thing. It’s clear that it’s broken for stateful devices. The fact >>>> that it took until 6f8be29ec17d to fix it for just any device that >>>> would implement it according to spec to me is a strong indication that >>>> nobody does implement it according to spec, and is currently only used >>>> to signal to some specific back-end that all rings have been set up >>>> and should be configured in a single block. >>> I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT >>> extensions where everything is off-loaded to the vhost-user backend. >> How do these back-ends work with the fact that qemu uses SET_STATUS >> incorrectly when not offloading? Do you plan on fixing that? > Mainly having a common base implementation which does it right and > having very lightweight derivations for legacy stubs using it. The > aim is to eliminate the need for QEMU stubs entirely by fully specifying > the device from the vhost-user API. If the current SET_STATUS use is overhauled, too, that would be good. I wonder why you need the status byte, though. >> (I.e. that we send SET_STATUS 0 when the VM is paused, potentially >> resetting state that is not recoverable, and that we set DRIVER and >> DRIVER_OK simultaneously.) > This is QEMU simulating a SET_STATUS rather than the guest triggering > it? Yes, and the fact that we simulate it when the guest will not have triggered it, i.e. we reset the device (SET_STATUS 0) when the VM is paused. Effectively, qemu injects virtio commands that the guest has never requested, which generally feels like a bad idea, because qemu will need to get the device back to its previous state before the guest is resumed, which may or may not work. 
Specifically, it won’t work for devices that have internal state. Furthermore, we use SET_STATUS to set ACKNOWLEDGE | DRIVER | DRIVER_OK simultaneously, which is wrong. ACKNOWLEDGE | DRIVER may perhaps be set simultaneously, but then comes feature negotiation (setting and checking FEATURES_OK), and then DRIVER_OK. Finally, how the status byte is to be used is not specified in the vhost-user specification, which instead points to the virtio specification. I think if we keep SET_STATUS, it must be documented how it interacts with other vhost-user commands. For example, how the FEATURES_OK protocol described in the virtio specification interacts with GET_FEATURES/SET_FEATURES, or whether SET_STATUS 0 and RESET_DEVICE are equivalent. Currently, the only implementation of SET_STATUS I know (DPDK) ignores SET_STATUS 0, i.e. doesn’t do a reset. To me that indicates that the spec must be clear on what these status values mean with regard to the vhost-user protocol as a whole. So every software implementation with STATUS support that I know implements SET_STATUS incorrectly right now, and that’s a problem, because it prevents implementations like virtiofsd from implementing it correctly. Hanna
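The ordering problem described in the message above can be illustrated with a small check. This is only a sketch: the status bit values are the ones from the virtio specification, but `status_transition_ok()` is a hypothetical helper invented for illustration and is not part of qemu, DPDK, or virtiofsd. It also assumes a non-legacy device, where FEATURES_OK is part of the handshake.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device status bits as defined by the virtio specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE 0x01
#define VIRTIO_STATUS_DRIVER      0x02
#define VIRTIO_STATUS_DRIVER_OK   0x04
#define VIRTIO_STATUS_FEATURES_OK 0x08

/*
 * Hypothetical helper (not an existing API): returns true if writing
 * `next` over the current status `old` respects the order
 * ACKNOWLEDGE -> DRIVER -> FEATURES_OK -> DRIVER_OK.
 * Writing 0 requests a reset and is always accepted here.
 */
static bool status_transition_ok(uint8_t old, uint8_t next)
{
    if (next == 0) {
        return true;
    }
    /* Bits may only be added; clearing bits requires a reset. */
    if ((old & next) != old) {
        return false;
    }
    if ((next & VIRTIO_STATUS_DRIVER) &&
        !(next & VIRTIO_STATUS_ACKNOWLEDGE)) {
        return false;
    }
    if ((next & VIRTIO_STATUS_FEATURES_OK) &&
        !(next & VIRTIO_STATUS_DRIVER)) {
        return false;
    }
    /*
     * DRIVER_OK is only valid if FEATURES_OK was already set by an
     * earlier write, so that it could be read back and checked first.
     */
    if ((next & VIRTIO_STATUS_DRIVER_OK) &&
        !(old & VIRTIO_STATUS_FEATURES_OK)) {
        return false;
    }
    return true;
}
```

Under these assumptions, a single write of ACKNOWLEDGE | DRIVER | DRIVER_OK from a reset device is rejected, whereas ACKNOWLEDGE | DRIVER, then FEATURES_OK, then DRIVER_OK in separate writes is accepted.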
* Re: [Virtio-fs] (no subject) 2023-10-06 10:34 ` Michael S. Tsirkin 2023-10-06 11:42 ` Hanna Czenczek @ 2023-10-07 2:22 ` Yajun Wu 2023-10-09 8:21 ` Hanna Czenczek 2023-10-09 10:28 ` German Maglione 1 sibling, 2 replies; 53+ messages in thread From: Yajun Wu @ 2023-10-07 2:22 UTC (permalink / raw) To: Michael S. Tsirkin, Hanna Czenczek Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, parav, maxime.coquelin On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote: > External email: Use caution opening links or attachments > > > On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: >> On 06.10.23 11:26, Michael S. Tsirkin wrote: >>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >>>>>>>>> There is no clearly defined purpose for the virtio status byte in >>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio >>>>>>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK >>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors >>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). >>>>>>>>> >>>>>>>>> As for implementations, SET_STATUS is not widely implemented. dpdk does >>>>>>>>> implement it, but only uses it to signal feature negotiation failure. >>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively >>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today >>>>>>>>> means the same thing as RESET_DEVICE). 
>>>>>>>>> >>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not >>>>>>>>> forward the guest-set status byte, but instead just makes it up >>>>>>>>> internally, and actually completely ignores what the back-end returns, >>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single >>>>>>>>> bits to it. Notably, after setting FEATURES_OK, it never reads it back >>>>>>>>> to see whether the flag is still set, which is the only way in which >>>>>>>>> dpdk uses the status byte. >>>>>>>>> >>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this >>>>>>>>> field in a useful manner, and it also provides no practical use over >>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly >>>>>>>>> defined. Deprecate it. >>>>>>>>> >>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>>>>>> --- >>>>>>>>> docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- >>>>>>>>> 1 file changed, 21 insertions(+), 7 deletions(-) >>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. >>>>>>> The fact current backends never check errors does not mean they never >>>>>>> will. So no, not applying this. >>>>>> Can this not be done with REPLY_ACK? I.e., with the following message >>>>>> order: >>>>>> >>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is >>>>>> present >>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK >>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK >>>>>> 4. SET_FEATURES with need_reply >>>>>> >>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the >>>>>> vCPUs are stopped, which generally seems to request a device reset. 
If we >>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will >>>>>> implement SET_STATUS later may break with at least these qemu versions. But >>>>>> documenting that a particular use of the status byte is to be ignored would >>>>>> be really strange. >>>>>> >>>>>> Hanna >>>>> Hmm I guess. Though just following virtio spec seems cleaner to me... >>>>> vhost-user reconfigures the state fully on start. >>>> Not the internal device state, though. virtiofsd has internal state, and >>>> other devices like vhost-gpu back-ends would probably, too. >>>> >>>> Stefan has recently sent a series >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to >>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a >>>> reset). >>>> >>>> I really don’t like our current approach with the status byte. Following the >>>> virtio specification to me would mean that the guest directly controls this >>>> byte, which it does not. qemu makes up values as it deems appropriate, and >>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e. >>>> when the guest really doesn’t want a device reset. >>>> >>>> That means that qemu does not treat this as a virtio device field (because >>>> that would mean exposing it to the guest driver), but instead treats it as >>>> part of the vhost(-user) protocol. It doesn’t feel right to me that we use >>>> a virtio-defined feature for communication on the vhost level, i.e. between >>>> front-end and back-end, and not between guest driver and device. I think >>>> all vhost-level protocol features should be fully defined in the vhost-user >>>> specification, which REPLY_ACK is. >>> Hmm that makes sense. Maybe we should have done what stefan's patch >>> is doing. >>> >>> Do look at the original commit that introduced it to understand why >>> it was added. >> I don’t understand why this was added to the stop/cont code, though. 
If it >> is time consuming to make these changes, why are they done every time the VM >> is paused >> and resumed? It makes sense that this would be done for the initial >> configuration (where a reset also wouldn’t hurt), but here it seems wrong. >> >> (To be clear, a reset in the stop/cont code is wrong, because it breaks >> stateful devices.) >> >> Also, note the newer commits 6f8be29ec17 and c3716f260bf. The reset as >> originally introduced was wrong even for non-stateful devices, because it >> occurred before we fetched the state (vring indices) so we could restore it >> later. I don’t know how 923b8921d21 was tested, but if the back-end used >> for testing implemented SET_STATUS 0 as a reset, it could not have survived >> either migration or a stop/cont in general, because the vring indices would >> have been reset to 0. >> >> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all >> devices that would implement them as per virtio spec, and even today it’s >> broken for stateful devices. The mentioned performance issue is likely >> real, but we can’t address it by making up SET_STATUS calls that are wrong. >> >> I concede that I didn’t think about DRIVER_OK. Personally, I would do all >> final configuration that would happen upon a DRIVER_OK once the first vring >> is started (i.e. receives a kick). That has the added benefit of being >> asynchronous because it doesn’t block any vhost-user messages (which are >> synchronous, and thus block downtime). >> >> Hanna > > For better or worse kick is per ring. It's out of spec to start rings > that were not kicked but I guess you could do configuration ... > Seems somewhat asymmetrical though. > > Let's wait until next week, hopefully Yajun Wu will answer. The main motivation of adding VHOST_USER_SET_STATUS is to let backend DPDK know when DRIVER_OK bit is valid. 
It's an indication of all VQ configuration has sent, otherwise DPDK has to rely on first queue pair is ready, then receiving/applying VQ configuration one by one. During live migration, configuring VQ one by one is very time consuming. For VIRTIO net vDPA, HW needs to know how many VQs are enabled to set RSS(Receive-Side Scaling). If you don’t want SET_STATUS message, backend can remove protocol feature bit VHOST_USER_PROTOCOL_F_STATUS. DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device close/reset. I'm not involved in discussion about adding SET_STATUS in Vhost protocol. This feature is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS). Thanks, Yajun > >>>> Now, we could hand full control of the status byte to the guest, and that >>>> would make me content. But I feel like that doesn’t really work, because >>>> qemu needs to intercept the status byte anyway (it needs to know when there >>>> is a reset, probably wants to know when the device is configured, etc.), so >>>> I don’t think having the status byte in vhost-user really gains us much when >>>> qemu could translate status byte changes to/from other vhost-user commands. >>>> >>>> Hanna >>> well it intercepts it but I think it could pass it on unchanged. >>> >>> >>>>> I guess symmetry was the >>>>> point. So I don't see why SET_STATUS 0 has to be ignored. >>>>> >>>>> >>>>> SET_STATUS was introduced by: >>>>> >>>>> commit 923b8921d210763359e96246a58658ac0db6c645 >>>>> Author: Yajun Wu <yajunw@nvidia.com> >>>>> Date: Mon Oct 17 14:44:52 2022 +0800 >>>>> >>>>> vhost-user: Support vhost_dev_start >>>>> >>>>> CC the author. >>>>> ^ permalink raw reply [flat|nested] 53+ messages in thread
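The motivation given above (apply all VQ configuration in one block once DRIVER_OK arrives, instead of one VQ at a time) might look roughly like the following on the back-end side. This is a sketch under stated assumptions, not DPDK's actual code: `struct backend`, `backend_vq_configured()` and `backend_set_status()` are hypothetical names, and "programming the hardware" is reduced to counting.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_VQS 8
#define VIRTIO_STATUS_DRIVER_OK 0x04 /* value from the virtio specification */

/* Hypothetical back-end state, for illustration only. */
struct backend {
    bool vq_configured[MAX_VQS]; /* configuration received, not yet applied */
    unsigned applied_batches;    /* how many times the device was programmed */
    unsigned applied_vqs;        /* number of VQs in the most recent batch */
    uint8_t status;              /* last status byte received */
};

/* Called when the per-VQ setup messages for VQ `idx` have arrived. */
static void backend_vq_configured(struct backend *b, unsigned idx)
{
    if (idx < MAX_VQS) {
        b->vq_configured[idx] = true;
    }
}

/* Called on SET_STATUS: apply everything once when DRIVER_OK first appears. */
static void backend_set_status(struct backend *b, uint8_t status)
{
    bool was_ok = (b->status & VIRTIO_STATUS_DRIVER_OK) != 0;

    b->status = status;
    if (!was_ok && (status & VIRTIO_STATUS_DRIVER_OK)) {
        unsigned count = 0;

        for (unsigned i = 0; i < MAX_VQS; i++) {
            if (b->vq_configured[i]) {
                count++;
            }
        }
        /* The total VQ count is known here, e.g. for RSS setup. */
        b->applied_vqs = count;
        b->applied_batches++;
    }
}
```

The point of the sketch is only that the back-end learns the full set of enabled VQs at a single, well-defined moment; it says nothing about how the other status bits or SET_STATUS 0 should be handled.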
* Re: [Virtio-fs] (no subject) 2023-10-07 2:22 ` Yajun Wu @ 2023-10-09 8:21 ` Hanna Czenczek 2023-10-09 9:07 ` Hanna Czenczek 2023-10-09 10:28 ` German Maglione 1 sibling, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-09 8:21 UTC (permalink / raw) To: Yajun Wu, Michael S. Tsirkin Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, parav, maxime.coquelin On 07.10.23 04:22, Yajun Wu wrote: > > On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote: >> External email: Use caution opening links or attachments >> >> >> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: >>> On 06.10.23 11:26, Michael S. Tsirkin wrote: >>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >>>>>>>>>> There is no clearly defined purpose for the virtio status >>>>>>>>>> byte in >>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and >>>>>>>>>> for virtio >>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK >>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return >>>>>>>>>> errors >>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). >>>>>>>>>> >>>>>>>>>> As for implementations, SET_STATUS is not widely >>>>>>>>>> implemented. dpdk does >>>>>>>>>> implement it, but only uses it to signal feature negotiation >>>>>>>>>> failure. >>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it >>>>>>>>>> effectively >>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is >>>>>>>>>> deprecated, and today >>>>>>>>>> means the same thing as RESET_DEVICE). 
>>>>>>>>>> >>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it >>>>>>>>>> does not >>>>>>>>>> forward the guest-set status byte, but instead just makes it up >>>>>>>>>> internally, and actually completely ignores what the back-end >>>>>>>>>> returns, >>>>>>>>>> only using it as the template for a subsequent SET_STATUS to >>>>>>>>>> add single >>>>>>>>>> bits to it. Notably, after setting FEATURES_OK, it never >>>>>>>>>> reads it back >>>>>>>>>> to see whether the flag is still set, which is the only way >>>>>>>>>> in which >>>>>>>>>> dpdk uses the status byte. >>>>>>>>>> >>>>>>>>>> As-is, no front-end or back-end can rely on the other side >>>>>>>>>> handling this >>>>>>>>>> field in a useful manner, and it also provides no practical >>>>>>>>>> use over >>>>>>>>>> other mechanisms the vhost-user protocol has, which are more >>>>>>>>>> clearly >>>>>>>>>> defined. Deprecate it. >>>>>>>>>> >>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>>>>>>> --- >>>>>>>>>> docs/interop/vhost-user.rst | 28 >>>>>>>>>> +++++++++++++++++++++------- >>>>>>>>>> 1 file changed, 21 insertions(+), 7 deletions(-) >>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>> SET_STATUS is the only way to signal failure to acknowledge >>>>>>>> FEATURES_OK. >>>>>>>> The fact current backends never check errors does not mean they >>>>>>>> never >>>>>>>> will. So no, not applying this. >>>>>>> Can this not be done with REPLY_ACK? I.e., with the following >>>>>>> message >>>>>>> order: >>>>>>> >>>>>>> 1. GET_FEATURES to find out whether >>>>>>> VHOST_USER_F_PROTOCOL_FEATURES is >>>>>>> present >>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get >>>>>>> VHOST_USER_PROTOCOL_F_REPLY_ACK >>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK >>>>>>> 4. 
SET_FEATURES with need_reply >>>>>>> >>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a >>>>>>> while when the >>>>>>> vCPUs are stopped, which generally seems to request a device >>>>>>> reset. If we >>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, >>>>>>> back-ends that will >>>>>>> implement SET_STATUS later may break with at least these qemu >>>>>>> versions. But >>>>>>> documenting that a particular use of the status byte is to be >>>>>>> ignored would >>>>>>> be really strange. >>>>>>> >>>>>>> Hanna >>>>>> Hmm I guess. Though just following virtio spec seems cleaner to >>>>>> me... >>>>>> vhost-user reconfigures the state fully on start. >>>>> Not the internal device state, though. virtiofsd has internal >>>>> state, and >>>>> other devices like vhost-gpu back-ends would probably, too. >>>>> >>>>> Stefan has recently sent a series >>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) >>>>> to >>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really >>>>> need a >>>>> reset). >>>>> >>>>> I really don’t like our current approach with the status byte. >>>>> Following the >>>>> virtio specification to me would mean that the guest directly >>>>> controls this >>>>> byte, which it does not. qemu makes up values as it deems >>>>> appropriate, and >>>>> this includes sending a SET_STATUS 0 when the guest is just >>>>> paused, i.e. >>>>> when the guest really doesn’t want a device reset. >>>>> >>>>> That means that qemu does not treat this as a virtio device field >>>>> (because >>>>> that would mean exposing it to the guest driver), but instead >>>>> treats it as >>>>> part of the vhost(-user) protocol. It doesn’t feel right to me >>>>> that we use >>>>> a virtio-defined feature for communication on the vhost level, >>>>> i.e. between >>>>> front-end and back-end, and not between guest driver and device. 
>>>>> I think >>>>> all vhost-level protocol features should be fully defined in the >>>>> vhost-user >>>>> specification, which REPLY_ACK is. >>>> Hmm that makes sense. Maybe we should have done what stefan's patch >>>> is doing. >>>> >>>> Do look at the original commit that introduced it to understand why >>>> it was added. >>> I don’t understand why this was added to the stop/cont code, >>> though. If it >>> is time consuming to make these changes, why are they done every >>> time the VM >>> is paused >>> and resumed? It makes sense that this would be done for the initial >>> configuration (where a reset also wouldn’t hurt), but here it seems >>> wrong. >>> >>> (To be clear, a reset in the stop/cont code is wrong, because it breaks >>> stateful devices.) >>> >>> Also, note the newer commits 6f8be29ec17 and c3716f260bf. The reset as >>> originally introduced was wrong even for non-stateful devices, >>> because it >>> occurred before we fetched the state (vring indices) so we could >>> restore it >>> later. I don’t know how 923b8921d21 was tested, but if the back-end >>> used >>> for testing implemented SET_STATUS 0 as a reset, it could not have >>> survived >>> either migration or a stop/cont in general, because the vring >>> indices would >>> have been reset to 0. >>> >>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that >>> broke all >>> devices that would implement them as per virtio spec, and even today >>> it’s >>> broken for stateful devices. The mentioned performance issue is likely >>> real, but we can’t address it by making up SET_STATUS calls that are >>> wrong. >>> >>> I concede that I didn’t think about DRIVER_OK. Personally, I would >>> do all >>> final configuration that would happen upon a DRIVER_OK once the >>> first vring >>> is started (i.e. receives a kick). That has the added benefit of being >>> asynchronous because it doesn’t block any vhost-user messages (which >>> are >>> synchronous, and thus block downtime). 
>>> >>> Hanna >> >> For better or worse kick is per ring. It's out of spec to start rings >> that were not kicked but I guess you could do configuration ... >> Seems somewhat asymmetrical though. >> >> Let's wait until next week, hopefully Yajun Wu will answer. > The main motivation of adding VHOST_USER_SET_STATUS is to let backend > DPDK know > when DRIVER_OK bit is valid. It's an indication of all VQ > configuration has sent, > otherwise DPDK has to rely on first queue pair is ready, then > receiving/applying > VQ configuration one by one. > > During live migration, configuring VQ one by one is very time consuming. One question I have here is why it wasn’t then introduced in the live migration code, but in the general VM stop/cont code instead. It does seem time-consuming to do this every time the VM is paused and resumed. > For VIRTIO > net vDPA, HW needs to know how many VQs are enabled to set > RSS(Receive-Side Scaling). > > If you don’t want SET_STATUS message, backend can remove protocol > feature bit > VHOST_USER_PROTOCOL_F_STATUS. The problem isn’t back-ends that don’t want the message, the problem is that qemu uses the message wrongly, which prevents well-behaving back-ends from implementing the message. > DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device > close/reset. So the right thing to do for back-ends is to announce STATUS support and then not implement it correctly? GET_VRING_BASE should not reset the close or reset the device, by the way. It should stop that one vring, not more. We have a RESET_DEVICE command for resetting. > I'm not involved in discussion about adding SET_STATUS in Vhost > protocol. This feature > is essential for vDPA(same as vhost-vdpa implements > VHOST_VDPA_SET_STATUS). So from what I gather from your response is that there is only a single use for SET_STATUS, which is the DRIVER_OK bit. If so, documenting that all other bits are to be ignored by both back-end and front-end would be fine by me. 
I’m not fully serious about that suggestion, but I hear the strong implication that nothing but DRIVER_OK was of any concern, and this is really important to note when we talk about the status of the STATUS feature in vhost today. It seems to me now that it was not intended to be the virtio-level status byte, but just a DRIVER_OK signalling path from front-end to back-end. That makes it a vhost-level protocol feature to me. Hanna > > Thanks, > Yajun >> >>>>> Now, we could hand full control of the status byte to the guest, >>>>> and that >>>>> would make me content. But I feel like that doesn’t really work, >>>>> because >>>>> qemu needs to intercept the status byte anyway (it needs to know >>>>> when there >>>>> is a reset, probably wants to know when the device is configured, >>>>> etc.), so >>>>> I don’t think having the status byte in vhost-user really gains us >>>>> much when >>>>> qemu could translate status byte changes to/from other vhost-user >>>>> commands. >>>>> >>>>> Hanna >>>> well it intercepts it but I think it could pass it on unchanged. >>>> >>>> >>>>>> I guess symmetry was the >>>>>> point. So I don't see why SET_STATUS 0 has to be ignored. >>>>>> >>>>>> >>>>>> SET_STATUS was introduced by: >>>>>> >>>>>> commit 923b8921d210763359e96246a58658ac0db6c645 >>>>>> Author: Yajun Wu <yajunw@nvidia.com> >>>>>> Date: Mon Oct 17 14:44:52 2022 +0800 >>>>>> >>>>>> vhost-user: Support vhost_dev_start >>>>>> >>>>>> CC the author. >>>>>> > ^ permalink raw reply [flat|nested] 53+ messages in thread
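For contrast with the DPDK behaviour discussed above, the GET_VRING_BASE semantics argued for in this message (stop that one vring, nothing more) combined with the cover letter's definition of the suspended state (all vrings stopped) can be sketched as follows. The types and function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_VRINGS 2

/* Hypothetical per-vring and device state, for illustration only. */
struct vring_state {
    bool started;
    uint16_t last_avail_idx;
};

struct device {
    struct vring_state vrings[NUM_VRINGS];
};

/*
 * GET_VRING_BASE as argued for above: stop only vring `idx` and report
 * its index. No other vring is touched and nothing is reset.
 * Out-of-range indices return 0 in this sketch.
 */
static uint16_t get_vring_base(struct device *d, unsigned idx)
{
    if (idx >= NUM_VRINGS) {
        return 0;
    }
    d->vrings[idx].started = false;
    return d->vrings[idx].last_avail_idx;
}

/* The back-end counts as suspended once every vring is stopped. */
static bool device_suspended(const struct device *d)
{
    for (unsigned i = 0; i < NUM_VRINGS; i++) {
        if (d->vrings[i].started) {
            return false;
        }
    }
    return true;
}
```

With this model, the suspended state is reached by issuing GET_VRING_BASE for each vring in turn, and internal device state stays intact throughout, which is what a stateful back-end such as virtiofsd would need.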
* Re: [Virtio-fs] (no subject) 2023-10-09 8:21 ` Hanna Czenczek @ 2023-10-09 9:07 ` Hanna Czenczek 2023-10-09 9:13 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-09 9:07 UTC (permalink / raw) To: Yajun Wu, Michael S. Tsirkin Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, parav, maxime.coquelin, Alex Bennée On 09.10.23 10:21, Hanna Czenczek wrote: > On 07.10.23 04:22, Yajun Wu wrote: [...] >> The main motivation of adding VHOST_USER_SET_STATUS is to let backend >> DPDK know >> when DRIVER_OK bit is valid. It's an indication of all VQ >> configuration has sent, >> otherwise DPDK has to rely on first queue pair is ready, then >> receiving/applying >> VQ configuration one by one. >> >> During live migration, configuring VQ one by one is very time consuming. > > One question I have here is why it wasn’t then introduced in the live > migration code, but in the general VM stop/cont code instead. It does > seem time-consuming to do this every time the VM is paused and resumed. > >> For VIRTIO >> net vDPA, HW needs to know how many VQs are enabled to set >> RSS(Receive-Side Scaling). >> >> If you don’t want SET_STATUS message, backend can remove protocol >> feature bit >> VHOST_USER_PROTOCOL_F_STATUS. > > The problem isn’t back-ends that don’t want the message, the problem > is that qemu uses the message wrongly, which prevents well-behaving > back-ends from implementing the message. > >> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device >> close/reset. > > So the right thing to do for back-ends is to announce STATUS support > and then not implement it correctly? > > GET_VRING_BASE should not reset the close or reset the device, by the > way. It should stop that one vring, not more. We have a RESET_DEVICE > command for resetting. > >> I'm not involved in discussion about adding SET_STATUS in Vhost >> protocol. 
This feature >> is essential for vDPA(same as vhost-vdpa implements >> VHOST_VDPA_SET_STATUS). > > So from what I gather from your response is that there is only a > single use for SET_STATUS, which is the DRIVER_OK bit. If so, > documenting that all other bits are to be ignored by both back-end and > front-end would be fine by me. > > I’m not fully serious about that suggestion, but I hear the strong > implication that nothing but DRIVER_OK was of any concern, and this is > really important to note when we talk about the status of the STATUS > feature in vhost today. It seems to me now that it was not intended > to be the virtio-level status byte, but just a DRIVER_OK signalling > path from front-end to back-end. That makes it a vhost-level protocol > feature to me. On second thought, it just is a pure vhost-level protocol feature, and has nothing to do with the virtio status byte as-is. The only stated purpose is for the front-end to send DRIVER_OK after migration, but migration is transparent to the guest, so the guest would never change the status byte during migration. Therefore, if this feature is essential, we will never be able to have a status byte that is transparently shared between guest and back-end device, i.e. the virtio status byte. Cc-ing Alex on this mail, because to me, this seems like an important detail when he plans on using the byte in the future. If we need a virtio status byte, I can’t see how we could use the existing F_STATUS for it. Hanna ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-09 9:07 ` Hanna Czenczek @ 2023-10-09 9:13 ` Hanna Czenczek 2023-10-10 4:00 ` Yajun Wu 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-09 9:13 UTC (permalink / raw) To: Yajun Wu, Michael S. Tsirkin Cc: qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin, parav, maxime.coquelin, Alex Bennée On 09.10.23 11:07, Hanna Czenczek wrote: > On 09.10.23 10:21, Hanna Czenczek wrote: >> On 07.10.23 04:22, Yajun Wu wrote: > > [...] > >>> The main motivation of adding VHOST_USER_SET_STATUS is to let >>> backend DPDK know >>> when DRIVER_OK bit is valid. It's an indication of all VQ >>> configuration has sent, >>> otherwise DPDK has to rely on first queue pair is ready, then >>> receiving/applying >>> VQ configuration one by one. >>> >>> During live migration, configuring VQ one by one is very time >>> consuming. >> >> One question I have here is why it wasn’t then introduced in the live >> migration code, but in the general VM stop/cont code instead. It does >> seem time-consuming to do this every time the VM is paused and resumed. >> >>> For VIRTIO >>> net vDPA, HW needs to know how many VQs are enabled to set >>> RSS(Receive-Side Scaling). >>> >>> If you don’t want SET_STATUS message, backend can remove protocol >>> feature bit >>> VHOST_USER_PROTOCOL_F_STATUS. >> >> The problem isn’t back-ends that don’t want the message, the problem >> is that qemu uses the message wrongly, which prevents well-behaving >> back-ends from implementing the message. >> >>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device >>> close/reset. >> >> So the right thing to do for back-ends is to announce STATUS support >> and then not implement it correctly? >> >> GET_VRING_BASE should not reset the close or reset the device, by the >> way. It should stop that one vring, not more. We have a >> RESET_DEVICE command for resetting. >> >>> I'm not involved in discussion about adding SET_STATUS in Vhost >>> protocol. 
This feature >>> is essential for vDPA(same as vhost-vdpa implements >>> VHOST_VDPA_SET_STATUS). >> >> So from what I gather from your response is that there is only a >> single use for SET_STATUS, which is the DRIVER_OK bit. If so, >> documenting that all other bits are to be ignored by both back-end >> and front-end would be fine by me. >> >> I’m not fully serious about that suggestion, but I hear the strong >> implication that nothing but DRIVER_OK was of any concern, and this >> is really important to note when we talk about the status of the >> STATUS feature in vhost today. It seems to me now that it was not >> intended to be the virtio-level status byte, but just a DRIVER_OK >> signalling path from front-end to back-end. That makes it a >> vhost-level protocol feature to me. > > On second thought, it just is a pure vhost-level protocol feature, and > has nothing to do with the virtio status byte as-is. The only stated > purpose is for the front-end to send DRIVER_OK after migration, but > migration is transparent to the guest, so the guest would never change > the status byte during migration. Therefore, if this feature is > essential, we will never be able to have a status byte that is > transparently shared between guest and back-end device, i.e. the > virtio status byte. On third thought, scratch that. The guest wouldn’t set it, but naturally, after migration, the front-end will need to restore the status byte from the source, so the front-end will always need to set it, even if it were otherwise used controlled only by the guest and the back-end device. So technically, this doesn’t prevent such a use case. (In practice, it isn’t controlled by the guest right now, but that could be fixed.) > Cc-ing Alex on this mail, because to me, this seems like an important > detail when he plans on using the byte in the future. If we need a > virtio status byte, I can’t see how we could use the existing F_STATUS > for it. 
> > Hanna ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-09 9:13 ` Hanna Czenczek @ 2023-10-10 4:00 ` Yajun Wu 2023-10-10 8:18 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Yajun Wu @ 2023-10-10 4:00 UTC (permalink / raw) To: Hanna Czenczek, Michael S. Tsirkin Cc: qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, Anton Kuchin, Parav Pandit, maxime.coquelin@redhat.com, Alex Bennée On 10/9/2023 5:13 PM, Hanna Czenczek wrote: > External email: Use caution opening links or attachments > > > On 09.10.23 11:07, Hanna Czenczek wrote: >> On 09.10.23 10:21, Hanna Czenczek wrote: >>> On 07.10.23 04:22, Yajun Wu wrote: >> [...] >> >>>> The main motivation of adding VHOST_USER_SET_STATUS is to let >>>> backend DPDK know >>>> when DRIVER_OK bit is valid. It's an indication of all VQ >>>> configuration has sent, >>>> otherwise DPDK has to rely on first queue pair is ready, then >>>> receiving/applying >>>> VQ configuration one by one. >>>> >>>> During live migration, configuring VQ one by one is very time >>>> consuming. >>> One question I have here is why it wasn’t then introduced in the live >>> migration code, but in the general VM stop/cont code instead. It does >>> seem time-consuming to do this every time the VM is paused and resumed. Yes, VM stop/cont will call vhost_net_stop/vhost_net_start. Maybe because there's no device level stop/cont vhost message? >>> >>>> For VIRTIO >>>> net vDPA, HW needs to know how many VQs are enabled to set >>>> RSS(Receive-Side Scaling). >>>> >>>> If you don’t want SET_STATUS message, backend can remove protocol >>>> feature bit >>>> VHOST_USER_PROTOCOL_F_STATUS. >>> The problem isn’t back-ends that don’t want the message, the problem >>> is that qemu uses the message wrongly, which prevents well-behaving >>> back-ends from implementing the message. >>> >>>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device >>>> close/reset. 
>>> So the right thing to do for back-ends is to announce STATUS support
>>> and then not implement it correctly?
>>>
>>> GET_VRING_BASE should not reset the close or reset the device, by the
>>> way. It should stop that one vring, not more. We have a
>>> RESET_DEVICE command for resetting.

I believe DPDK used GET_VRING_BASE long before QEMU had RESET_DEVICE?
It's a compatibility issue. For new back-end implementations, we can
have a better solution, right?

>>>> I'm not involved in discussion about adding SET_STATUS in Vhost
>>>> protocol. This feature
>>>> is essential for vDPA(same as vhost-vdpa implements
>>>> VHOST_VDPA_SET_STATUS).
>>> So from what I gather from your response is that there is only a
>>> single use for SET_STATUS, which is the DRIVER_OK bit. If so,
>>> documenting that all other bits are to be ignored by both back-end
>>> and front-end would be fine by me.
>>>
>>> I’m not fully serious about that suggestion, but I hear the strong
>>> implication that nothing but DRIVER_OK was of any concern, and this
>>> is really important to note when we talk about the status of the
>>> STATUS feature in vhost today. It seems to me now that it was not
>>> intended to be the virtio-level status byte, but just a DRIVER_OK
>>> signalling path from front-end to back-end. That makes it a
>>> vhost-level protocol feature to me.
>> On second thought, it just is a pure vhost-level protocol feature, and
>> has nothing to do with the virtio status byte as-is. The only stated
>> purpose is for the front-end to send DRIVER_OK after migration, but
>> migration is transparent to the guest, so the guest would never change
>> the status byte during migration. Therefore, if this feature is
>> essential, we will never be able to have a status byte that is
>> transparently shared between guest and back-end device, i.e. the
>> virtio status byte.
> On third thought, scratch that.
The guest wouldn’t set it, but
> naturally, after migration, the front-end will need to restore the
> status byte from the source, so the front-end will always need to set
> it, even if it were otherwise used controlled only by the guest and the
> back-end device. So technically, this doesn’t prevent such a use case.
> (In practice, it isn’t controlled by the guest right now, but that could
> be fixed.)

I only tested the feature with DPDK (the only back-end using it today?).
Max defined the protocol and added the corresponding code in DPDK before
I added QEMU support. If other back-ends or different device types want
to use this, we can have further discussion?

>> Cc-ing Alex on this mail, because to me, this seems like an important
>> detail when he plans on using the byte in the future. If we need a
>> virtio status byte, I can’t see how we could use the existing F_STATUS
>> for it.
>>
>> Hanna
* Re: [Virtio-fs] (no subject) 2023-10-10 4:00 ` Yajun Wu @ 2023-10-10 8:18 ` Hanna Czenczek 2023-10-10 10:36 ` Alex Bennée 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-10 8:18 UTC (permalink / raw) To: Yajun Wu, Michael S. Tsirkin Cc: qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, Anton Kuchin, Parav Pandit, maxime.coquelin@redhat.com, Alex Bennée On 10.10.23 06:00, Yajun Wu wrote: > > On 10/9/2023 5:13 PM, Hanna Czenczek wrote: >> External email: Use caution opening links or attachments >> >> >> On 09.10.23 11:07, Hanna Czenczek wrote: >>> On 09.10.23 10:21, Hanna Czenczek wrote: >>>> On 07.10.23 04:22, Yajun Wu wrote: >>> [...] >>> >>>>> The main motivation of adding VHOST_USER_SET_STATUS is to let >>>>> backend DPDK know >>>>> when DRIVER_OK bit is valid. It's an indication of all VQ >>>>> configuration has sent, >>>>> otherwise DPDK has to rely on first queue pair is ready, then >>>>> receiving/applying >>>>> VQ configuration one by one. >>>>> >>>>> During live migration, configuring VQ one by one is very time >>>>> consuming. >>>> One question I have here is why it wasn’t then introduced in the live >>>> migration code, but in the general VM stop/cont code instead. It does >>>> seem time-consuming to do this every time the VM is paused and >>>> resumed. > > Yes, VM stop/cont will call vhost_net_stop/vhost_net_start. Maybe > because there's no device level stop/cont vhost message? No, it is because qemu will reset the status in stop/cont*, which it should not do. Aside from guest-initiated resets, the only thing where a reset comes into play is when the back-end is changed, e.g. during migration. 
In that case, the source back-end will see a disconnect on the vhost-user socket and can then do whatever uninitialization it needs to do, and the destination front-end will need to be reconfigured by qemu anyway, because it’s just a case of the destination qemu initiating a fresh connection to a new back-end (except that it will need to restore the state from the source). *Yes, technically, dpdk will ignore that reset, but it still stops the device on a different message (when it should just pause processing vrings), so the outcome is the same. >>>> >>>>> For VIRTIO >>>>> net vDPA, HW needs to know how many VQs are enabled to set >>>>> RSS(Receive-Side Scaling). >>>>> >>>>> If you don’t want SET_STATUS message, backend can remove protocol >>>>> feature bit >>>>> VHOST_USER_PROTOCOL_F_STATUS. >>>> The problem isn’t back-ends that don’t want the message, the problem >>>> is that qemu uses the message wrongly, which prevents well-behaving >>>> back-ends from implementing the message. >>>> >>>>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device >>>>> close/reset. >>>> So the right thing to do for back-ends is to announce STATUS support >>>> and then not implement it correctly? >>>> >>>> GET_VRING_BASE should not reset the close or reset the device, by the >>>> way. It should stop that one vring, not more. We have a >>>> RESET_DEVICE command for resetting. > I believe dpdk uses GET_VRING_BASE long before qemu has RESET_DEVICE? I don’t think it matters who came first. What matters is the specification, and that dpdk decided to rely on implementation-specific behavior without having all involved parties agree by matters of putting that in the specification. And now dpdk clearly deviates from the specification as a result of that action, which can result in problems if the front-end doesn’t do what qemu always used to do. (E.g. 
the front-end might just send GET_VRING_BASE for all vrings when suspending the guest, and then only send kicks on resume to re-start the vrings. dpdk would most likely be left in a state where the whole device is stopped, expecting DRIVER_OK. Same thing in general for front-ends that don’t support F_STATUS.) > It's a compatible issue. For new backend implements, we can have > better solution, right? The fact that dpdk and qemu deviate from the specification is a problem as-is. >>>>> I'm not involved in discussion about adding SET_STATUS in Vhost >>>>> protocol. This feature >>>>> is essential for vDPA(same as vhost-vdpa implements >>>>> VHOST_VDPA_SET_STATUS). >>>> So from what I gather from your response is that there is only a >>>> single use for SET_STATUS, which is the DRIVER_OK bit. If so, >>>> documenting that all other bits are to be ignored by both back-end >>>> and front-end would be fine by me. >>>> >>>> I’m not fully serious about that suggestion, but I hear the strong >>>> implication that nothing but DRIVER_OK was of any concern, and this >>>> is really important to note when we talk about the status of the >>>> STATUS feature in vhost today. It seems to me now that it was not >>>> intended to be the virtio-level status byte, but just a DRIVER_OK >>>> signalling path from front-end to back-end. That makes it a >>>> vhost-level protocol feature to me. >>> On second thought, it just is a pure vhost-level protocol feature, and >>> has nothing to do with the virtio status byte as-is. The only stated >>> purpose is for the front-end to send DRIVER_OK after migration, but >>> migration is transparent to the guest, so the guest would never change >>> the status byte during migration. Therefore, if this feature is >>> essential, we will never be able to have a status byte that is >>> transparently shared between guest and back-end device, i.e. the >>> virtio status byte. >> On third thought, scratch that. 
The guest wouldn’t set it, but >> naturally, after migration, the front-end will need to restore the >> status byte from the source, so the front-end will always need to set >> it, even if it were otherwise used controlled only by the guest and the >> back-end device. So technically, this doesn’t prevent such a use case. >> (In practice, it isn’t controlled by the guest right now, but that could >> be fixed.) > I only tested the feature with DPDK(the only backend use it today?). > Max defined the protocol and added the corresponding code in DPDK > before I added QEMU support. If other backend or different device type > want to use this, we can have further discussion? So as far as I understand, the feature is supposed to rely on implementation-specific behavior between specifically qemu as a front-end and dpdk as a back-end, nothing else. Honestly, that to me is a very good reason to deprecate it. That would make it clear that any implementation that implements it does so because it relies on implementation-specific behavior from other implementations. Option 2 is to fix it. It is not right to use this broadly defined feature with its clear protocol as given in the virtio specification just to set and clear a single bit (DRIVER_OK). The vhost-user specification points to that virtio protocol. We must adhere to the protocol. And note that we must not reset devices just because the VM is paused/resumed. (That is why I wanted to deprecate SET_STATUS, so that Stefan’s series would introduce RESET_DEVICE where we need it, and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().) Option 3 would be to just be honest in the specification, and limit the scope of F_STATUS to say the only bit that matters is DRIVER_OK. I would say this is not really different from deprecating, though it wouldn’t affect your case. However, I understand Alex relies on a full status byte. I’m still interested to know why that is. 
Option 4 is of course not to do anything, and leave everything as-is, waiting for the next person to stir the hornet’s nest. >>> Cc-ing Alex on this mail, because to me, this seems like an important >>> detail when he plans on using the byte in the future. If we need a >>> virtio status byte, I can’t see how we could use the existing F_STATUS >>> for it. >>> >>> Hanna > ^ permalink raw reply [flat|nested] 53+ messages in thread
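The ring-state argument in the message above (GET_VRING_BASE stops one vring and nothing more, a kick starts it again, and the back-end counts as suspended once every vring is stopped) can be sketched as a toy model. This is only an illustration in Python, not code from QEMU or DPDK; the class `BackEndModel` and all of its method names are hypothetical.

```python
class BackEndModel:
    """Toy model of per-vring started/stopped state (hypothetical names)."""

    def __init__(self, num_vrings):
        # Every vring begins in the stopped state.
        self.started = [False] * num_vrings
        # Placeholder for the index a real back-end would report.
        self.last_avail_idx = [0] * num_vrings

    def kick(self, index):
        # A kick (re-)starts exactly one vring.
        self.started[index] = True

    def get_vring_base(self, index):
        # Stops exactly this vring; the rest of the device is untouched.
        self.started[index] = False
        return self.last_avail_idx[index]

    def is_suspended(self):
        # Suspended means: no vring is currently started.
        return not any(self.started)


dev = BackEndModel(num_vrings=2)
dev.kick(0)
dev.kick(1)

# Front-end pauses the guest: fetch the base of every vring.
for i in range(2):
    dev.get_vring_base(i)
suspended_after_stop = dev.is_suspended()

# Front-end resumes the guest: a kick alone restarts a vring.
dev.kick(0)
suspended_after_kick = dev.is_suspended()
```

In this model no status byte or DRIVER_OK is involved in pausing and resuming, which is the behavior a front-end without F_STATUS would rely on.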
* Re: [Virtio-fs] (no subject) 2023-10-10 8:18 ` Hanna Czenczek @ 2023-10-10 10:36 ` Alex Bennée 2023-10-10 13:18 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Alex Bennée @ 2023-10-10 10:36 UTC (permalink / raw) To: Hanna Czenczek Cc: Yajun Wu, Michael S. Tsirkin, qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, Anton Kuchin, Parav Pandit, maxime.coquelin@redhat.com Hanna Czenczek <hreitz@redhat.com> writes: > On 10.10.23 06:00, Yajun Wu wrote: >> >> On 10/9/2023 5:13 PM, Hanna Czenczek wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> On 09.10.23 11:07, Hanna Czenczek wrote: >>>> On 09.10.23 10:21, Hanna Czenczek wrote: >>>>> On 07.10.23 04:22, Yajun Wu wrote: >>>> [...] >>>> <snip> > So as far as I understand, the feature is supposed to rely on > implementation-specific behavior between specifically qemu as a > front-end and dpdk as a back-end, nothing else. Honestly, that to me > is a very good reason to deprecate it. That would make it clear that > any implementation that implements it does so because it relies on > implementation-specific behavior from other implementations. > > Option 2 is to fix it. It is not right to use this broadly defined > feature with its clear protocol as given in the virtio specification > just to set and clear a single bit (DRIVER_OK). The vhost-user > specification points to that virtio protocol. We must adhere to the > protocol. And note that we must not reset devices just because the VM > is paused/resumed. (That is why I wanted to deprecate SET_STATUS, so > that Stefan’s series would introduce RESET_DEVICE where we need it, > and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().) > > Option 3 would be to just be honest in the specification, and limit > the scope of F_STATUS to say the only bit that matters is DRIVER_OK. > I would say this is not really different from deprecating, though it > wouldn’t affect your case. 
However, I understand Alex relies on a
> full status byte. I’m still interested to know why that is.

For an F_TRANSPORT backend (or whatever the final name ends up being) we
need the backend to have full control of the status byte because all the
handling of VirtIO is deferred to it. Therefore it has to handle all the
feature negotiation and indicate when the device needs resetting.

(side note: feature negotiation is another slippery area when QEMU gets
involved in gating which feature bits may or may not be exposed to the
backend. The only one it should ever mask is F_UNUSED which is used
(sic) to trigger the vhost protocol negotiation)

> Option 4 is of course not to do anything, and leave everything as-is,
> waiting for the next person to stir the hornet’s nest.
>
>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>> detail when he plans on using the byte in the future. If we need a
>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>> for it.

What would we use instead of F_STATUS to query the Device Status field?

>>>>
>>>> Hanna
>>

--
Alex Bennée
Virtualisation Tech Lead @ Linaro
* Re: [Virtio-fs] (no subject) 2023-10-10 10:36 ` Alex Bennée @ 2023-10-10 13:18 ` Hanna Czenczek 2023-10-10 14:35 ` Alex Bennée 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-10 13:18 UTC (permalink / raw) To: Alex Bennée Cc: Michael S. Tsirkin, qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, maxime.coquelin@redhat.com, Parav Pandit, Anton Kuchin, Yajun Wu On 10.10.23 12:36, Alex Bennée wrote: > Hanna Czenczek <hreitz@redhat.com> writes: > >> On 10.10.23 06:00, Yajun Wu wrote: >>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote: >>>> External email: Use caution opening links or attachments >>>> >>>> >>>> On 09.10.23 11:07, Hanna Czenczek wrote: >>>>> On 09.10.23 10:21, Hanna Czenczek wrote: >>>>>> On 07.10.23 04:22, Yajun Wu wrote: >>>>> [...] >>>>> > <snip> >> So as far as I understand, the feature is supposed to rely on >> implementation-specific behavior between specifically qemu as a >> front-end and dpdk as a back-end, nothing else. Honestly, that to me >> is a very good reason to deprecate it. That would make it clear that >> any implementation that implements it does so because it relies on >> implementation-specific behavior from other implementations. >> >> Option 2 is to fix it. It is not right to use this broadly defined >> feature with its clear protocol as given in the virtio specification >> just to set and clear a single bit (DRIVER_OK). The vhost-user >> specification points to that virtio protocol. We must adhere to the >> protocol. And note that we must not reset devices just because the VM >> is paused/resumed. (That is why I wanted to deprecate SET_STATUS, so >> that Stefan’s series would introduce RESET_DEVICE where we need it, >> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().) >> >> Option 3 would be to just be honest in the specification, and limit >> the scope of F_STATUS to say the only bit that matters is DRIVER_OK. 
>> I would say this is not really different from deprecating, though it >> wouldn’t affect your case. However, I understand Alex relies on a >> full status byte. I’m still interested to know why that is. > For an F_TRANSPORT backend (or whatever the final name ends up being) we > need the backend to have full control of the status byte because all the > handling of VirtIO is deferred to it. Therefor it has to handle all the > feature negotiation and indicate when the device needs resetting. > > (side note: feature negotiation is another slippery area when QEMU gets > involved in gating which feature bits may or may not be exposed to the > backend. The only one it should ever mask is F_UNUSED which is used > (sic) to trigger the vhost protocol negotiation) That’s the thing, feature negotiation is done with GET_FEATURES and SET_FEATURES. Configuring F_REPLY_ACK lets SET_FEATURES return errors. Indicating that the device needs reset is a good point, there is no other feature to do that. (And something qemu currently ignores, just like any value the device returns through GET_STATUS, but that’s besides the point.) >> Option 4 is of course not to do anything, and leave everything as-is, >> waiting for the next person to stir the hornet’s nest. >> >>>>> Cc-ing Alex on this mail, because to me, this seems like an important >>>>> detail when he plans on using the byte in the future. If we need a >>>>> virtio status byte, I can’t see how we could use the existing F_STATUS >>>>> for it. > What would we use instead of F_STATUS to query the Device Status field? We would emulate it in the front-end, just like we need to do for back-ends without F_STATUS. We can’t emulate the DEVICE_NEEDS_RESET bit, though, that’s correct. Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 % convinced that your use case has a hard dependency on F_STATUS. However, this still does make a fair point in general that it would be useful to keep it. 
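As a rough illustration of what "emulating it in the front-end" could look like, here is a minimal Python sketch. It is not QEMU code; `ShadowStatus` and its methods are hypothetical names, and the only fact taken from the virtio specification is the value of the DEVICE_NEEDS_RESET bit (0x40).

```python
DEVICE_NEEDS_RESET = 0x40  # virtio device status bit


class ShadowStatus:
    """Front-end copy of the status byte (hypothetical helper)."""

    def __init__(self):
        self.value = 0

    def guest_write(self, value):
        # The front-end simply remembers what the driver wrote.
        self.value = value & 0xFF

    def guest_read(self, back_end_status=None):
        # With F_STATUS, the caller passes in what GET_STATUS returned.
        if back_end_status is not None:
            return back_end_status
        # Without F_STATUS, only the remembered value is available, so a
        # bit that only the device can raise never shows up here.
        return self.value


shadow = ShadowStatus()
shadow.guest_write(0x0F)
emulated = shadow.guest_read()
forwarded = shadow.guest_read(back_end_status=0x0F | DEVICE_NEEDS_RESET)
```

The emulated read can only echo what the driver wrote, which is why DEVICE_NEEDS_RESET is the one bit that needs a real path from the back-end.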
That still leaves us with the situation that currently, the only implementations with F_STATUS support are qemu and dpdk, which both handle it incorrectly. Furthermore, the specification leaves much to be desired, specifically in how F_STATUS interacts with other vhost-user commands (which is something I cited as a reason for my original patch), i.e. whether RESET_DEVICE and SET_STATUS 0 are equivalent, and whether failures in feature negotiation must result in both SET_FEATURES returning an error (with F_REPLY_ACK), and FEATURES_OK being reset in the status byte, or whether either is sufficient. What happens when DEVICE_NEEDS_RESET is set, i.e. do we just need RESET_DEVICE / SET_STATUS 0, or do we also need to reset some protocol state? (This is also connected to the fact that what happens on RESET_DEVICE is largely undefined, which I said on Stefan’s series.) In general, because we have our own transport, we should make a note how it interacts with the status negotiation phases, i.e. that GET_FEATURES must not be called before S_ACKNOWLEDGE | S_DRIVER are set, that FEATURES_OK must be set after the SET_FEATURES call, and that DRIVER_OK must not be set without FEATURES_OK set / SET_FEATURES having returned success. Here we would also answer the question about the interaction of F_REPLY_ACK+SET_FEATURES with F_STATUS, specifically whether an implementation with F_REPLY_ACK even needs to read back the status byte after setting FEATURES_OK because it could have got the feature negotiation result already as a result of the SET_FEATURES call. After migration, can you just set all flags immediately or do we need to follow this step-by-step protocol? I think we do need to do it step-by-step, mostly for simplicity in the back-end, i.e. that it just sees a normal device start-up. We should also clarify whether SET_STATUS can fail, i.e. whether setting an invalid status (is setting FEATURES_OK when the device doesn’t think so invalid?) 
has SET_STATUS fail (with F_REPLY_ACK) and/or immediately gets the device into DEVICE_NEEDS_RESET. We should clarify whether SET_STATUS can block. The current use of DRIVER_OK seems to indicate to me that dpdk does do time-consuming operations when it sees DRIVER_OK (code looks like it, too) and only returns when that’s done, but naïvely, I would expect SET_STATUS to be just setting some value and doing whatever needs to be done in the background, not actually launching and blocking on an operation. I think it is dangerous to just push ahead with using F_STATUS without acknowledging that its implementation is broken right now, and that it is so *on purpose* because the DRIVER_OK bit is the only thing that it was supposed to be used for. Using it for its purported original use (actually the virtio status byte) is contradictory to that. It’s probably fixable, but I think it requires taking a step back and seeing what needs to be done to remedy the conflict. If you rip out all the existing STATUS code and replace it such that qemu will let the guest have full control over the status byte (except for migration, where we restore it on the destination, which will result in DRIVER_OK being set at the end, fulfilling that requirement), that will fix the implementation in qemu. I think. But the specification should be amended to think about all these corner cases, not least because I think they will also affect your implementation. (The answers to many of the questions I raise for documentation may be obvious to you, based on “in virtio, it’s just an MMIO byte that’s written and read, so the rest follows from there”. But evidently the implementation we have kind of ignores that e.g. SET_STATUS 0 is a reset (6f8be29ec17d44496b9ed67599bceaaba72d1864 is a work-around, not much more) or that there is actually a protocol to setting the status flags and you can’t just set them all at once, so I don’t think the answers are immediately obvious, and should be documented.) 
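The ordering rules proposed above for the specification can be written down as a small checker. This is a sketch in Python under stated assumptions, not an implementation from QEMU or any back-end: the class and exception names are hypothetical, the bit values are the ones from the virtio specification, and treating a status write of 0 as a reset is an assumption, since whether SET_STATUS 0 and RESET_DEVICE are equivalent is one of the open questions raised here.

```python
ACKNOWLEDGE = 0x01
DRIVER = 0x02
DRIVER_OK = 0x04
FEATURES_OK = 0x08


class OrderingError(Exception):
    """Raised when a message arrives out of the proposed order."""


class StatusOrderChecker:
    def __init__(self):
        self.status = 0
        self.features_set = False

    def get_features(self):
        # Proposed rule: not before ACKNOWLEDGE | DRIVER are set.
        need = ACKNOWLEDGE | DRIVER
        if (self.status & need) != need:
            raise OrderingError("GET_FEATURES before ACKNOWLEDGE | DRIVER")

    def set_features(self):
        # Models a SET_FEATURES call that returned success.
        self.features_set = True

    def set_status(self, value):
        if value == 0:
            # Assumption: writing 0 behaves like a reset.
            self.status = 0
            self.features_set = False
            return
        if (value & FEATURES_OK) and not self.features_set:
            raise OrderingError("FEATURES_OK before SET_FEATURES")
        if (value & DRIVER_OK) and not (value & FEATURES_OK):
            raise OrderingError("DRIVER_OK without FEATURES_OK")
        self.status = value


# Step-by-step start-up, as a back-end would see on a normal device start.
good = StatusOrderChecker()
good.set_status(ACKNOWLEDGE)
good.set_status(ACKNOWLEDGE | DRIVER)
good.get_features()
good.set_features()
good.set_status(ACKNOWLEDGE | DRIVER | FEATURES_OK)
good.set_status(ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)

# Setting every flag at once, with no SET_FEATURES in between.
bad = StatusOrderChecker()
try:
    bad.set_status(0x0F)
    rejected = False
except OrderingError:
    rejected = True
```

Under these rules, a destination front-end restoring state after migration would replay the same sequence rather than writing the final status value in one go.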
As for me and the original patch: I claimed nobody really needs F_STATUS, you say you do, so plainly, I assumed wrong and will naturally take my hands off of F_STATUS and just ensure not to implement it in any back-end until you’ve fixed it, as Yajun has advised. I’d still prefer mentioning this advice in the documentation until it’s fixed, but, you know, I wouldn’t be the first one to say “I now know about the quirk, so I can work around it, no need to tell anyone else as long as my stuff works”. Hanna ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] (no subject) 2023-10-10 13:18 ` Hanna Czenczek @ 2023-10-10 14:35 ` Alex Bennée 2023-10-13 18:02 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Alex Bennée @ 2023-10-10 14:35 UTC (permalink / raw) To: Hanna Czenczek Cc: Michael S. Tsirkin, qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, maxime.coquelin@redhat.com, Parav Pandit, Anton Kuchin, Yajun Wu, Viresh Kumar Hanna Czenczek <hreitz@redhat.com> writes: (adding Viresh to CC for Xen Vhost questions) > On 10.10.23 12:36, Alex Bennée wrote: >> Hanna Czenczek <hreitz@redhat.com> writes: >> >>> On 10.10.23 06:00, Yajun Wu wrote: >>>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote: >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>> On 09.10.23 11:07, Hanna Czenczek wrote: >>>>>> On 09.10.23 10:21, Hanna Czenczek wrote: >>>>>>> On 07.10.23 04:22, Yajun Wu wrote: >>>>>> [...] >>>>>> >> <snip> >>> So as far as I understand, the feature is supposed to rely on >>> implementation-specific behavior between specifically qemu as a >>> front-end and dpdk as a back-end, nothing else. Honestly, that to me >>> is a very good reason to deprecate it. That would make it clear that >>> any implementation that implements it does so because it relies on >>> implementation-specific behavior from other implementations. >>> >>> Option 2 is to fix it. It is not right to use this broadly defined >>> feature with its clear protocol as given in the virtio specification >>> just to set and clear a single bit (DRIVER_OK). The vhost-user >>> specification points to that virtio protocol. We must adhere to the >>> protocol. And note that we must not reset devices just because the VM >>> is paused/resumed. (That is why I wanted to deprecate SET_STATUS, so >>> that Stefan’s series would introduce RESET_DEVICE where we need it, >>> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().) 
>>>
>>> Option 3 would be to just be honest in the specification, and limit
>>> the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
>>> I would say this is not really different from deprecating, though it
>>> wouldn’t affect your case. However, I understand Alex relies on a
>>> full status byte. I’m still interested to know why that is.
>> For an F_TRANSPORT backend (or whatever the final name ends up being) we
>> need the backend to have full control of the status byte because all the
>> handling of VirtIO is deferred to it. Therefor it has to handle all the
>> feature negotiation and indicate when the device needs resetting.
>>
>> (side note: feature negotiation is another slippery area when QEMU gets
>> involved in gating which feature bits may or may not be exposed to the
>> backend. The only one it should ever mask is F_UNUSED which is used
>> (sic) to trigger the vhost protocol negotiation)
>
> That’s the thing, feature negotiation is done with GET_FEATURES and
> SET_FEATURES. Configuring F_REPLY_ACK lets SET_FEATURES return
> errors.

OK, but then what? QEMU fakes up FEATURES_OK in the Device Status field
on behalf of the backend?

I should point out QEMU doesn't exist in some of these use cases. When
using the rust-vmm backends with Xen, for example, there is no VMM to
talk to, so we have a Xen Vhost Frontend which is entirely concerned
with setup and then, once connected up, leaves the backend to do its
thing. I'd rather leave the frontend as dumb as possible than split
logic between the two.

> Indicating that the device needs reset is a good point, there is no
> other feature to do that. (And something qemu currently ignores, just
> like any value the device returns through GET_STATUS, but that’s
> besides the point.)
>
>>> Option 4 is of course not to do anything, and leave everything as-is,
>>> waiting for the next person to stir the hornet’s nest.
>>>
>>>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>>>> detail when he plans on using the byte in the future. If we need a
>>>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>>>> for it.
>> What would we use instead of F_STATUS to query the Device Status field?
>
> We would emulate it in the front-end, just like we need to do for
> back-ends without F_STATUS. We can’t emulate the DEVICE_NEEDS_RESET
> bit, though, that’s correct.
>
> Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 %
> convinced that your use case has a hard dependency on F_STATUS.
> However, this still does make a fair point in general that it would be
> useful to keep it.

OK.

> That still leaves us with the situation that currently, the only
> implementations with F_STATUS support are qemu and dpdk, which both
> handle it incorrectly.

I was going to say there are also the rust-vmm vhost-user-master crates,
which we've imported:

  https://github.com/vireshk/vhost

for the Xen Vhost Frontend:

  https://github.com/vireshk/xen-vhost-frontend

but I can't actually see any handling for GET/SET_STATUS at all, which
makes me wonder how we actually work. Viresh?

> Furthermore, the specification leaves much to
> be desired, specifically in how F_STATUS interacts with other
> vhost-user commands (which is something I cited as a reason for my
> original patch), i.e. whether RESET_DEVICE and SET_STATUS 0 are
> equivalent, and whether failures in feature negotiation must result in
> both SET_FEATURES returning an error (with F_REPLY_ACK), and
> FEATURES_OK being reset in the status byte, or whether either is
> sufficient. What happens when DEVICE_NEEDS_RESET is set, i.e. do we
> just need RESET_DEVICE / SET_STATUS 0, or do we also need to reset
> some protocol state? (This is also connected to the fact that what
> happens on RESET_DEVICE is largely undefined, which I said on Stefan’s
> series.)
I'm all for strengthening the vhost-user protocol definitions. I'm just wary of encoding QEMU<->backend implementation details. > In general, because we have our own transport, we should make a note > how it interacts with the status negotiation phases, i.e. that > GET_FEATURES must not be called before S_ACKNOWLEDGE | S_DRIVER are > set, that FEATURES_OK must be set after the SET_FEATURES call, and > that DRIVER_OK must not be set without FEATURES_OK set / SET_FEATURES > having returned success. Here we would also answer the question about > the interaction of F_REPLY_ACK+SET_FEATURES with F_STATUS, > specifically whether an implementation with F_REPLY_ACK even needs to > read back the status byte after setting FEATURES_OK because it could > have got the feature negotiation result already as a result of the > SET_FEATURES call. Some sequence diagrams would remove a lot of the ambiguity from parsing the words. I wonder if there is a pretty way to do that to render nicely in our published docs? > After migration, can you just set all flags immediately or do we need > to follow this step-by-step protocol? I think we do need to do it > step-by-step, mostly for simplicity in the back-end, i.e. that it just > sees a normal device start-up. Makes sense. > We should also clarify whether SET_STATUS can fail, i.e. whether > setting an invalid status (is setting FEATURES_OK when the device > doesn’t think so invalid?) has SET_STATUS fail (with F_REPLY_ACK) > and/or immediately gets the device into DEVICE_NEEDS_RESET. > > We should clarify whether SET_STATUS can block. The current use of > DRIVER_OK seems to indicate to me that dpdk does do time-consuming > operations when it sees DRIVER_OK (code looks like it, too) and only > returns when that’s done, but naïvely, I would expect SET_STATUS to be > just setting some value and doing whatever needs to be done in the > background, not actually launching and blocking on an operation. 
Shouldn't the guest driver be reading the status bit until it flips? So potentially there could be multiple GET_STATUS calls. > I think it is dangerous to just push ahead with using F_STATUS without > acknowledging that its implementation is broken right now, and that it > is so *on purpose* because the DRIVER_OK bit is the only thing that it > was supposed to be used for. Using it for its purported original use > (actually the virtio status byte) is contradictory to that. It’s > probably fixable, but I think it requires taking a step back and > seeing what needs to be done to remedy the conflict. If you rip out > all the existing STATUS code and replace it such that qemu will let > the guest have full control over the status byte (except for > migration, where we restore it on the destination, which will result > in DRIVER_OK being set at the end, fulfilling that requirement), that > will fix the implementation in qemu. I think. But the specification > should be amended to think about all these corner cases, not least > because I think they will also affect your implementation. > > (The answers to many of the questions I raise for documentation may be > obvious to you, based on “in virtio, it’s just an MMIO byte that’s > written and read, so the rest follows from there”. But evidently the > implementation we have kind of ignores that e.g. SET_STATUS 0 is a > reset (6f8be29ec17d44496b9ed67599bceaaba72d1864 is a work-around, not > much more) or that there is actually a protocol to setting the status > flags and you can’t just set them all at once, so I don’t think the > answers are immediately obvious, and should be documented.) > > As for me and the original patch: I claimed nobody really needs > F_STATUS, you say you do, so plainly, I assumed wrong and will > naturally take my hands off of F_STATUS and just ensure not to > implement it in any back-end until you’ve fixed it, as Yajun has > advised. 
> I’d still prefer mentioning this advice in the documentation
> until it’s fixed, but, you know, I wouldn’t be the first one to say “I
> now know about the quirk, so I can work around it, no need to tell
> anyone else as long as my stuff works”.
>
> Hanna

--
Alex Bennée
Virtualisation Tech Lead @ Linaro
* Re: [Virtio-fs] (no subject) 2023-10-10 14:35 ` Alex Bennée @ 2023-10-13 18:02 ` Hanna Czenczek 2023-10-17 7:49 ` Viresh Kumar 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-13 18:02 UTC (permalink / raw) To: Alex Bennée Cc: Michael S. Tsirkin, Viresh Kumar, qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, maxime.coquelin@redhat.com, Parav Pandit, Anton Kuchin, Yajun Wu On 10.10.23 16:35, Alex Bennée wrote: > Hanna Czenczek <hreitz@redhat.com> writes: > > (adding Viresh to CC for Xen Vhost questions) > >> On 10.10.23 12:36, Alex Bennée wrote: >>> Hanna Czenczek <hreitz@redhat.com> writes: >>> >>>> On 10.10.23 06:00, Yajun Wu wrote: >>>>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote: >>>>>> External email: Use caution opening links or attachments >>>>>> >>>>>> >>>>>> On 09.10.23 11:07, Hanna Czenczek wrote: >>>>>>> On 09.10.23 10:21, Hanna Czenczek wrote: >>>>>>>> On 07.10.23 04:22, Yajun Wu wrote: >>>>>>> [...] >>>>>>> >>> <snip> >>>> So as far as I understand, the feature is supposed to rely on >>>> implementation-specific behavior between specifically qemu as a >>>> front-end and dpdk as a back-end, nothing else. Honestly, that to me >>>> is a very good reason to deprecate it. That would make it clear that >>>> any implementation that implements it does so because it relies on >>>> implementation-specific behavior from other implementations. >>>> >>>> Option 2 is to fix it. It is not right to use this broadly defined >>>> feature with its clear protocol as given in the virtio specification >>>> just to set and clear a single bit (DRIVER_OK). The vhost-user >>>> specification points to that virtio protocol. We must adhere to the >>>> protocol. And note that we must not reset devices just because the VM >>>> is paused/resumed. (That is why I wanted to deprecate SET_STATUS, so >>>> that Stefan’s series would introduce RESET_DEVICE where we need it, >>>> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().) 
>>>> >>>> Option 3 would be to just be honest in the specification, and limit >>>> the scope of F_STATUS to say the only bit that matters is DRIVER_OK. >>>> I would say this is not really different from deprecating, though it >>>> wouldn’t affect your case. However, I understand Alex relies on a >>>> full status byte. I’m still interested to know why that is. >>> For an F_TRANSPORT backend (or whatever the final name ends up being) we >>> need the backend to have full control of the status byte because all the >>> handling of VirtIO is deferred to it. Therefor it has to handle all the >>> feature negotiation and indicate when the device needs resetting. >>> >>> (side note: feature negotiation is another slippery area when QEMU gets >>> involved in gating which feature bits may or may not be exposed to the >>> backend. The only one it should ever mask is F_UNUSED which is used >>> (sic) to trigger the vhost protocol negotiation) >> That’s the thing, feature negotiation is done with GET_FEATURES and >> SET_FEATURES. Configuring F_REPLY_ACK lets SET_FEATURES return >> errors. > OK but then what - QEMU fakes up FEATURES_OK in the Device Status field > on the behalf of the backend? It does that right now. When using qemu, vhost-user status byte is not exposed to the guest at all. qemu makes it up completely, and effectively ignores the response from GET_STATUS completely. (The only use of GET_STATUS is (right now): There is a function to set a flag in the status byte, and it calls GET_STATUS, ORs the flag in, and calls SET_STATUS with the result.) > I should point out QEMU doesn't exist in some of these use case. When > using the rust-vmm backends with Xen for example there is no VMM to talk > to so we have a Xen Vhost Frontend which is entirely concerned with > setup and then once connected up leaves the backend to do its thing. I'd > rather leave the frontend as dumb as possible rather than splitting > logic between the two. 
> >> Indicating that the device needs reset is a good point, there is no >> other feature to do that. (And something qemu currently ignores, just >> like any value the device returns through GET_STATUS, but that’s >> besides the point.) >> >>>> Option 4 is of course not to do anything, and leave everything as-is, >>>> waiting for the next person to stir the hornet’s nest. >>>> >>>>>>> Cc-ing Alex on this mail, because to me, this seems like an important >>>>>>> detail when he plans on using the byte in the future. If we need a >>>>>>> virtio status byte, I can’t see how we could use the existing F_STATUS >>>>>>> for it. >>> What would we use instead of F_STATUS to query the Device Status field? >> We would emulate it in the front-end, just like we need to do for >> back-ends without F_STATUS. We can’t emulate the DEVICE_NEEDS_RESET >> bit, though, that’s correct. >> >> Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 % >> convinced that your use case has a hard dependency on F_STATUS. >> However, this still does make a fair point in general that it would be >> useful to keep it. > OK/ > >> That still leaves us with the situation that currently, the only >> implementations with F_STATUS support are qemu and dpdk, which both >> handle it incorrectly. > I was going to say there is also the rust-vmm vhost-user-master crates > which we've imported: > > https://github.com/vireshk/vhost > > for the Xen Vhost Frontend: > > https://github.com/vireshk/xen-vhost-frontend > > but I can't actually see any handling for GET/SET_STATUS at all which > makes me wonder how we actually work. Viresh? As far as I know the only back-end implementation of F_STATUS is in DPDK. As I said, if anyone else implemented it right now, that would be dangerous, because qemu doesn’t adhere to the virtio protocol when it comes to the status byte. 
>> Furthermore, the specification leaves much to >> be desired, specifically in how F_STATUS interacts with other >> vhost-user commands (which is something I cited as a reason for my >> original patch), i.e. whether RESET_DEVICE and SET_STATUS 0 are >> equivalent, and whether failures in feature negotiation must result in >> both SET_FEATURES returning an error (with F_REPLY_ACK), and >> FEATURES_OK being reset in the status byte, or whether either is >> sufficient. What happens when DEVICE_NEEDS_RESET is set, i.e. do we >> just need RESET_DEVICE / SET_STATUS 0, or do we also need to reset >> some protocol state? (This is also connected to the fact that what >> happens on RESET_DEVICE is largely undefined, which I said on Stefan’s >> series.) > I'm all for strengthening the vhost-user protocol definitions. I'm just > wary of encoding QEMU<->backend implementation details. > >> In general, because we have our own transport, we should make a note >> how it interacts with the status negotiation phases, i.e. that >> GET_FEATURES must not be called before S_ACKNOWLEDGE | S_DRIVER are >> set, that FEATURES_OK must be set after the SET_FEATURES call, and >> that DRIVER_OK must not be set without FEATURES_OK set / SET_FEATURES >> having returned success. Here we would also answer the question about >> the interaction of F_REPLY_ACK+SET_FEATURES with F_STATUS, >> specifically whether an implementation with F_REPLY_ACK even needs to >> read back the status byte after setting FEATURES_OK because it could >> have got the feature negotiation result already as a result of the >> SET_FEATURES call. > Some sequence diagrams would remove a lot of the ambiguity from parsing > the words. I wonder if there is a pretty way to do that to render nicely > in our published docs? I’m sure some form of SVG will work. Somehow. If not, it should. :) >> After migration, can you just set all flags immediately or do we need >> to follow this step-by-step protocol? 
>> I think we do need to do it
>> step-by-step, mostly for simplicity in the back-end, i.e. that it just
>> sees a normal device start-up.
> Makes sense.
>
>> We should also clarify whether SET_STATUS can fail, i.e. whether
>> setting an invalid status (is setting FEATURES_OK when the device
>> doesn’t think so invalid?) has SET_STATUS fail (with F_REPLY_ACK)
>> and/or immediately gets the device into DEVICE_NEEDS_RESET.
>>
>> We should clarify whether SET_STATUS can block. The current use of
>> DRIVER_OK seems to indicate to me that dpdk does do time-consuming
>> operations when it sees DRIVER_OK (code looks like it, too) and only
>> returns when that’s done, but naïvely, I would expect SET_STATUS to be
>> just setting some value and doing whatever needs to be done in the
>> background, not actually launching and blocking on an operation.
> Shouldn't the guest driver be reading the status bit until it flips? So
> potentially there could be multiple GET_STATUS calls.

Ah, the device will only show DRIVER_OK set once the device is ready to
serve the driver?

Hanna
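Editorial sketch: the ordering rules proposed in this message (GET_FEATURES only after ACKNOWLEDGE | DRIVER, FEATURES_OK only after SET_FEATURES, DRIVER_OK only together with FEATURES_OK) could be modelled roughly as below. The bit values are the virtio Device Status bits. The class and function names are hypothetical, and the rules encode what is being proposed in the thread, not something the vhost-user specification currently mandates.

```python
# Virtio Device Status bits (values from the virtio specification).
ACKNOWLEDGE = 0x01
DRIVER = 0x02
DRIVER_OK = 0x04
FEATURES_OK = 0x08


class StatusOrderChecker:
    """Hypothetical back-end-side checker for the proposed ordering rules."""

    def __init__(self):
        self.status = 0
        self.features_set = False

    def get_features(self):
        needed = ACKNOWLEDGE | DRIVER
        if (self.status & needed) != needed:
            raise RuntimeError("GET_FEATURES before ACKNOWLEDGE | DRIVER")

    def set_features(self):
        self.features_set = True

    def set_status(self, new):
        if new == 0:
            # Writing 0 is a reset in virtio: forget everything.
            self.status = 0
            self.features_set = False
            return
        added = new & ~self.status
        if (added & FEATURES_OK) and not self.features_set:
            raise RuntimeError("FEATURES_OK before SET_FEATURES")
        if (added & DRIVER_OK) and not (new & FEATURES_OK):
            raise RuntimeError("DRIVER_OK without FEATURES_OK")
        self.status = new


def start_device(dev):
    """Step-by-step start-up, as a front-end might replay it after migration."""
    dev.set_status(ACKNOWLEDGE)
    dev.set_status(ACKNOWLEDGE | DRIVER)
    dev.get_features()
    dev.set_features()
    dev.set_status(ACKNOWLEDGE | DRIVER | FEATURES_OK)
    dev.set_status(ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
    return dev.status
```

With such a checker, replaying the full sequence after migration looks like a normal device start-up to the back-end, whereas setting all flags in a single SET_STATUS would be rejected.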
* Re: [Virtio-fs] (no subject)
From: Viresh Kumar @ 2023-10-17 7:49 UTC
To: Hanna Czenczek
Cc: Alex Bennée, Michael S. Tsirkin, qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, maxime.coquelin@redhat.com, Parav Pandit, Anton Kuchin, Yajun Wu

On 13-10-23, 20:02, Hanna Czenczek wrote:
> On 10.10.23 16:35, Alex Bennée wrote:
> > I was going to say there is also the rust-vmm vhost-user-master crates
> > which we've imported:
> >
> > https://github.com/vireshk/vhost
> >
> > for the Xen Vhost Frontend:
> >
> > https://github.com/vireshk/xen-vhost-frontend
> >
> > but I can't actually see any handling for GET/SET_STATUS at all which
> > makes me wonder how we actually work. Viresh?
>
> As far as I know the only back-end implementation of F_STATUS is in DPDK.
> As I said, if anyone else implemented it right now, that would be dangerous,
> because qemu doesn’t adhere to the virtio protocol when it comes to the
> status byte.

Yeah, none of the Rust-based Virtio backends enable `STATUS` in
`VhostUserProtocolFeatures`, and so these messages are never exchanged.

The generic Rust code for the backends doesn't even implement them.
Not sure if they should or not.

--
viresh
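Editorial sketch: why the messages are never exchanged when a back-end does not offer the STATUS protocol feature. The bit positions below are the ones I understand the vhost-user specification to assign (treat them as assumptions), and `may_send_status` is a hypothetical helper, not part of any existing crate or of qemu.

```python
# Bit positions as assumed from the vhost-user specification.
VHOST_USER_F_PROTOCOL_FEATURES = 30   # virtio feature bit
VHOST_USER_PROTOCOL_F_STATUS = 16     # protocol feature bit


def may_send_status(device_features, negotiated_protocol_features):
    """Return True if a front-end may send GET_STATUS/SET_STATUS.

    Hypothetical helper: both arguments are plain integer bit masks.
    """
    if not device_features & (1 << VHOST_USER_F_PROTOCOL_FEATURES):
        # Without protocol features there is nothing to negotiate.
        return False
    return bool(negotiated_protocol_features & (1 << VHOST_USER_PROTOCOL_F_STATUS))
```

A back-end that never sets the STATUS bit in its protocol features therefore never sees either message, which matches what is described for the Rust back-ends above.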
* Re: [Virtio-fs] (no subject)
From: Hanna Czenczek @ 2023-10-17 8:13 UTC
To: Viresh Kumar
Cc: Alex Bennée, Michael S. Tsirkin, qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, maxime.coquelin@redhat.com, Parav Pandit, Anton Kuchin, Yajun Wu

On 17.10.23 09:49, Viresh Kumar wrote:
> On 13-10-23, 20:02, Hanna Czenczek wrote:
>> On 10.10.23 16:35, Alex Bennée wrote:
>>> I was going to say there is also the rust-vmm vhost-user-master crates
>>> which we've imported:
>>>
>>> https://github.com/vireshk/vhost
>>>
>>> for the Xen Vhost Frontend:
>>>
>>> https://github.com/vireshk/xen-vhost-frontend
>>>
>>> but I can't actually see any handling for GET/SET_STATUS at all which
>>> makes me wonder how we actually work. Viresh?
>> As far as I know the only back-end implementation of F_STATUS is in DPDK.
>> As I said, if anyone else implemented it right now, that would be dangerous,
>> because qemu doesn’t adhere to the virtio protocol when it comes to the
>> status byte.
> Yeah, none of the Rust based Virtio backends enable `STATUS` in
> `VhostUserProtocolFeatures` and so these messages are never exchanged.
>
> The generic Rust code for the backends, doesn't even implement them.
> Not sure if they should or not.

It absolutely should not; for evidence, see this whole thread. qemu sends
a SET_STATUS 0, which amounts to a reset, when the VM is merely
paused[1], and when it sets status bits, it does not set them according
to the virtio specification. Implementing it right now means relying on
and working around qemu’s implementation-defined, spec-breaking behavior.

Also, note that qemu ignores both the feature negotiation response
conveyed through FEATURES_OK and DEVICE_NEEDS_RESET, so unless it’s worth
working around the problems just to get some form of DRIVER_OK
information (note this information does not come from the driver; qemu
makes it up), I absolutely would not implement it.

[1] Notably, it does restore the virtio state to the best of its
abilities when the VM is resumed, but this is all still wrong (there is
no point in doing so much on a pause/resume; it needlessly costs time),
and any implementation that does a reset then will rely on the
implementation-defined behavior that qemu is actually able to restore all
the state that the back-end would lose during a reset. Notably, reset is
not even well-defined in the vhost-user specification. It was argued, in
this thread, that DPDK works just fine with this, precisely because it
ignores SET_STATUS 0. Finally, if virtiofsd in particular, as a user of
the Rust crates, is reset, it would lose its internal state, which qemu
cannot restore short of using the upcoming migration facilities.
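Editorial sketch: a toy model of the problem described above. It assumes the front-end behaviour stated in this thread (a status flag is added by reading GET_STATUS, OR-ing the flag in and writing the result back with SET_STATUS, and a pause sends SET_STATUS 0). `ToyBackend`, `add_status` and `pause_vm` are hypothetical names, not qemu or virtiofsd code.

```python
DRIVER_OK = 0x04  # virtio Device Status bit


class ToyBackend:
    """Hypothetical stateful back-end (think of a file-system daemon)."""

    def __init__(self, reset_on_status_zero):
        self.reset_on_status_zero = reset_on_status_zero
        self.status = 0
        self.internal_state = {}

    def get_status(self):
        return self.status

    def set_status(self, value):
        if value == 0 and self.reset_on_status_zero:
            # Spec-conforming reading: writing 0 resets the device,
            # which discards back-end-internal state.
            self.internal_state.clear()
        self.status = value


def add_status(backend, flag):
    """Read-modify-write of the status byte, as described for the front-end."""
    backend.set_status(backend.get_status() | flag)


def pause_vm(backend):
    """Model of the front-end sending SET_STATUS 0 on a mere pause."""
    backend.set_status(0)
```

A back-end that follows the virtio meaning of a zero status loses its internal state on every pause, while one that ignores the reset semantics keeps it. That is the reason given above for not implementing F_STATUS in a stateful back-end today.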
* Re: [Virtio-fs] (no subject) 2023-10-07 2:22 ` Yajun Wu 2023-10-09 8:21 ` Hanna Czenczek @ 2023-10-09 10:28 ` German Maglione 2023-10-10 2:56 ` Yajun Wu 1 sibling, 1 reply; 53+ messages in thread From: German Maglione @ 2023-10-09 10:28 UTC (permalink / raw) To: Yajun Wu Cc: Michael S. Tsirkin, Hanna Czenczek, qemu-devel, virtio-fs, Eugenio Pérez, maxime.coquelin, parav, Anton Kuchin On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote: > > > On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote: > > External email: Use caution opening links or attachments > > > > > > On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: > >> On 06.10.23 11:26, Michael S. Tsirkin wrote: > >>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: > >>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: > >>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: > >>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: > >>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: > >>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: > >>>>>>>>> There is no clearly defined purpose for the virtio status byte in > >>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio > >>>>>>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK > >>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors > >>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). > >>>>>>>>> > >>>>>>>>> As for implementations, SET_STATUS is not widely implemented. dpdk does > >>>>>>>>> implement it, but only uses it to signal feature negotiation failure. > >>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively > >>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today > >>>>>>>>> means the same thing as RESET_DEVICE). 
> >>>>>>>>> > >>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not > >>>>>>>>> forward the guest-set status byte, but instead just makes it up > >>>>>>>>> internally, and actually completely ignores what the back-end returns, > >>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single > >>>>>>>>> bits to it. Notably, after setting FEATURES_OK, it never reads it back > >>>>>>>>> to see whether the flag is still set, which is the only way in which > >>>>>>>>> dpdk uses the status byte. > >>>>>>>>> > >>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this > >>>>>>>>> field in a useful manner, and it also provides no practical use over > >>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly > >>>>>>>>> defined. Deprecate it. > >>>>>>>>> > >>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >>>>>>>>> --- > >>>>>>>>> docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- > >>>>>>>>> 1 file changed, 21 insertions(+), 7 deletions(-) > >>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. > >>>>>>> The fact current backends never check errors does not mean they never > >>>>>>> will. So no, not applying this. > >>>>>> Can this not be done with REPLY_ACK? I.e., with the following message > >>>>>> order: > >>>>>> > >>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is > >>>>>> present > >>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK > >>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK > >>>>>> 4. SET_FEATURES with need_reply > >>>>>> > >>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the > >>>>>> vCPUs are stopped, which generally seems to request a device reset. 
If we > >>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will > >>>>>> implement SET_STATUS later may break with at least these qemu versions. But > >>>>>> documenting that a particular use of the status byte is to be ignored would > >>>>>> be really strange. > >>>>>> > >>>>>> Hanna > >>>>> Hmm I guess. Though just following virtio spec seems cleaner to me... > >>>>> vhost-user reconfigures the state fully on start. > >>>> Not the internal device state, though. virtiofsd has internal state, and > >>>> other devices like vhost-gpu back-ends would probably, too. > >>>> > >>>> Stefan has recently sent a series > >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to > >>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a > >>>> reset). > >>>> > >>>> I really don’t like our current approach with the status byte. Following the > >>>> virtio specification to me would mean that the guest directly controls this > >>>> byte, which it does not. qemu makes up values as it deems appropriate, and > >>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e. > >>>> when the guest really doesn’t want a device reset. > >>>> > >>>> That means that qemu does not treat this as a virtio device field (because > >>>> that would mean exposing it to the guest driver), but instead treats it as > >>>> part of the vhost(-user) protocol. It doesn’t feel right to me that we use > >>>> a virtio-defined feature for communication on the vhost level, i.e. between > >>>> front-end and back-end, and not between guest driver and device. I think > >>>> all vhost-level protocol features should be fully defined in the vhost-user > >>>> specification, which REPLY_ACK is. > >>> Hmm that makes sense. Maybe we should have done what stefan's patch > >>> is doing. > >>> > >>> Do look at the original commit that introduced it to understand why > >>> it was added. 
> >> I don’t understand why this was added to the stop/cont code, though. If it > >> is time consuming to make these changes, why are they done every time the VM > >> is paused > >> and resumed? It makes sense that this would be done for the initial > >> configuration (where a reset also wouldn’t hurt), but here it seems wrong. > >> > >> (To be clear, a reset in the stop/cont code is wrong, because it breaks > >> stateful devices.) > >> > >> Also, note the newer commits 6f8be29ec17 and c3716f260bf. The reset as > >> originally introduced was wrong even for non-stateful devices, because it > >> occurred before we fetched the state (vring indices) so we could restore it > >> later. I don’t know how 923b8921d21 was tested, but if the back-end used > >> for testing implemented SET_STATUS 0 as a reset, it could not have survived > >> either migration or a stop/cont in general, because the vring indices would > >> have been reset to 0. > >> > >> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all > >> devices that would implement them as per virtio spec, and even today it’s > >> broken for stateful devices. The mentioned performance issue is likely > >> real, but we can’t address it by making up SET_STATUS calls that are wrong. > >> > >> I concede that I didn’t think about DRIVER_OK. Personally, I would do all > >> final configuration that would happen upon a DRIVER_OK once the first vring > >> is started (i.e. receives a kick). That has the added benefit of being > >> asynchronous because it doesn’t block any vhost-user messages (which are > >> synchronous, and thus block downtime). > >> > >> Hanna > > > > For better or worse kick is per ring. It's out of spec to start rings > > that were not kicked but I guess you could do configuration ... > > Seems somewhat asymmetrical though. > > > > Let's wait until next week, hopefully Yajun Wu will answer. 
> The main motivation of adding VHOST_USER_SET_STATUS is to let backend DPDK
> know when DRIVER_OK bit is valid. It's an indication of all VQ configuration
> has sent, otherwise DPDK has to rely on first queue pair is ready, then
> receiving/applying VQ configuration one by one.
>
> During live migration, configuring VQ one by one is very time consuming.
> For VIRTIO net vDPA, HW needs to know how many VQs are enabled to set
> RSS(Receive-Side Scaling).
>
> If you don’t want SET_STATUS message, backend can remove protocol feature
> bit VHOST_USER_PROTOCOL_F_STATUS.
> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
> close/reset.

This is incorrect: resetting the device on GET_VRING_BASE breaks
stop/cont, since you don't want to reset the VQs on stop/cont.

> I'm not involved in discussion about adding SET_STATUS in Vhost protocol.
> This feature is essential for vDPA(same as vhost-vdpa implements
> VHOST_VDPA_SET_STATUS).
>
> Thanks,
> Yajun
>
> >>>> Now, we could hand full control of the status byte to the guest, and that
> >>>> would make me content. But I feel like that doesn’t really work, because
> >>>> qemu needs to intercept the status byte anyway (it needs to know when there
> >>>> is a reset, probably wants to know when the device is configured, etc.), so
> >>>> I don’t think having the status byte in vhost-user really gains us much when
> >>>> qemu could translate status byte changes to/from other vhost-user commands.
> >>>>
> >>>> Hanna
> >>> well it intercepts it but I think it could pass it on unchanged.
> >>>
> >>>>> I guess symmetry was the
> >>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
> >>>>>
> >>>>> SET_STATUS was introduced by:
> >>>>>
> >>>>> commit 923b8921d210763359e96246a58658ac0db6c645
> >>>>> Author: Yajun Wu <yajunw@nvidia.com>
> >>>>> Date: Mon Oct 17 14:44:52 2022 +0800
> >>>>>
> >>>>> vhost-user: Support vhost_dev_start
> >>>>>
> >>>>> CC the author.

--
German
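Editorial sketch: the four-step REPLY_ACK message order quoted in this message (GET_FEATURES, GET_PROTOCOL_FEATURES, SET_PROTOCOL_FEATURES, SET_FEATURES with need_reply) could look roughly like this from the front-end side. `FakeBackend` is a hypothetical in-process stand-in for a real vhost-user socket peer, the bit positions are assumed from the vhost-user specification, and the convention that a zero reply means success follows my reading of the REPLY_ACK description there.

```python
VHOST_USER_F_PROTOCOL_FEATURES = 30   # virtio feature bit (assumed from the spec)
VHOST_USER_PROTOCOL_F_REPLY_ACK = 3   # protocol feature bit (assumed from the spec)


class FakeBackend:
    """Hypothetical in-process stand-in for a vhost-user back-end."""

    def __init__(self, supported_features):
        self.supported = supported_features | (1 << VHOST_USER_F_PROTOCOL_FEATURES)
        self.protocol_features = 0

    def get_features(self):
        return self.supported

    def get_protocol_features(self):
        return 1 << VHOST_USER_PROTOCOL_F_REPLY_ACK

    def set_protocol_features(self, features):
        self.protocol_features = features

    def set_features(self, features, need_reply):
        ok = (features & ~self.supported) == 0
        if need_reply:
            return 0 if ok else 1   # zero means success, non-zero failure
        return None                 # no reply requested, outcome unknown


def negotiate(backend, wanted):
    """Return True/False for accepted/rejected, None if the result is unknown."""
    features = backend.get_features()                              # step 1
    if not features & (1 << VHOST_USER_F_PROTOCOL_FEATURES):
        backend.set_features(wanted, need_reply=False)
        return None
    proto = backend.get_protocol_features()                        # step 2
    reply_ack = proto & (1 << VHOST_USER_PROTOCOL_F_REPLY_ACK)
    backend.set_protocol_features(reply_ack)                       # step 3
    reply = backend.set_features(wanted, need_reply=bool(reply_ack))  # step 4
    return None if reply is None else reply == 0
```

In this model the front-end learns about a failed feature negotiation from the SET_FEATURES reply alone, without reading back FEATURES_OK from the status byte.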
* Re: [Virtio-fs] (no subject) 2023-10-09 10:28 ` German Maglione @ 2023-10-10 2:56 ` Yajun Wu 2023-10-10 10:04 ` German Maglione 0 siblings, 1 reply; 53+ messages in thread From: Yajun Wu @ 2023-10-10 2:56 UTC (permalink / raw) To: German Maglione Cc: Michael S. Tsirkin, Hanna Czenczek, qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, maxime.coquelin@redhat.com, Parav Pandit, Anton Kuchin On 10/9/2023 6:28 PM, German Maglione wrote: > External email: Use caution opening links or attachments > > > On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote: >> >> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote: >>> External email: Use caution opening links or attachments >>> >>> >>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: >>>> On 06.10.23 11:26, Michael S. Tsirkin wrote: >>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: >>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: >>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: >>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: >>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: >>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: >>>>>>>>>>> There is no clearly defined purpose for the virtio status byte in >>>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio >>>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK >>>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors >>>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). >>>>>>>>>>> >>>>>>>>>>> As for implementations, SET_STATUS is not widely implemented. dpdk does >>>>>>>>>>> implement it, but only uses it to signal feature negotiation failure. 
>>>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively >>>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today >>>>>>>>>>> means the same thing as RESET_DEVICE). >>>>>>>>>>> >>>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not >>>>>>>>>>> forward the guest-set status byte, but instead just makes it up >>>>>>>>>>> internally, and actually completely ignores what the back-end returns, >>>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single >>>>>>>>>>> bits to it. Notably, after setting FEATURES_OK, it never reads it back >>>>>>>>>>> to see whether the flag is still set, which is the only way in which >>>>>>>>>>> dpdk uses the status byte. >>>>>>>>>>> >>>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this >>>>>>>>>>> field in a useful manner, and it also provides no practical use over >>>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly >>>>>>>>>>> defined. Deprecate it. >>>>>>>>>>> >>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>>>>>>>> --- >>>>>>>>>>> docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- >>>>>>>>>>> 1 file changed, 21 insertions(+), 7 deletions(-) >>>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. >>>>>>>>> The fact current backends never check errors does not mean they never >>>>>>>>> will. So no, not applying this. >>>>>>>> Can this not be done with REPLY_ACK? I.e., with the following message >>>>>>>> order: >>>>>>>> >>>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is >>>>>>>> present >>>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK >>>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK >>>>>>>> 4. 
SET_FEATURES with need_reply >>>>>>>> >>>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the >>>>>>>> vCPUs are stopped, which generally seems to request a device reset. If we >>>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will >>>>>>>> implement SET_STATUS later may break with at least these qemu versions. But >>>>>>>> documenting that a particular use of the status byte is to be ignored would >>>>>>>> be really strange. >>>>>>>> >>>>>>>> Hanna >>>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me... >>>>>>> vhost-user reconfigures the state fully on start. >>>>>> Not the internal device state, though. virtiofsd has internal state, and >>>>>> other devices like vhost-gpu back-ends would probably, too. >>>>>> >>>>>> Stefan has recently sent a series >>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to >>>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a >>>>>> reset). >>>>>> >>>>>> I really don’t like our current approach with the status byte. Following the >>>>>> virtio specification to me would mean that the guest directly controls this >>>>>> byte, which it does not. qemu makes up values as it deems appropriate, and >>>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e. >>>>>> when the guest really doesn’t want a device reset. >>>>>> >>>>>> That means that qemu does not treat this as a virtio device field (because >>>>>> that would mean exposing it to the guest driver), but instead treats it as >>>>>> part of the vhost(-user) protocol. It doesn’t feel right to me that we use >>>>>> a virtio-defined feature for communication on the vhost level, i.e. between >>>>>> front-end and back-end, and not between guest driver and device. I think >>>>>> all vhost-level protocol features should be fully defined in the vhost-user >>>>>> specification, which REPLY_ACK is. >>>>> Hmm that makes sense. 
Maybe we should have done what stefan's patch >>>>> is doing. >>>>> >>>>> Do look at the original commit that introduced it to understand why >>>>> it was added. >>>> I don’t understand why this was added to the stop/cont code, though. If it >>>> is time consuming to make these changes, why are they done every time the VM >>>> is paused >>>> and resumed? It makes sense that this would be done for the initial >>>> configuration (where a reset also wouldn’t hurt), but here it seems wrong. >>>> >>>> (To be clear, a reset in the stop/cont code is wrong, because it breaks >>>> stateful devices.) >>>> >>>> Also, note the newer commits 6f8be29ec17 and c3716f260bf. The reset as >>>> originally introduced was wrong even for non-stateful devices, because it >>>> occurred before we fetched the state (vring indices) so we could restore it >>>> later. I don’t know how 923b8921d21 was tested, but if the back-end used >>>> for testing implemented SET_STATUS 0 as a reset, it could not have survived >>>> either migration or a stop/cont in general, because the vring indices would >>>> have been reset to 0. >>>> >>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all >>>> devices that would implement them as per virtio spec, and even today it’s >>>> broken for stateful devices. The mentioned performance issue is likely >>>> real, but we can’t address it by making up SET_STATUS calls that are wrong. >>>> >>>> I concede that I didn’t think about DRIVER_OK. Personally, I would do all >>>> final configuration that would happen upon a DRIVER_OK once the first vring >>>> is started (i.e. receives a kick). That has the added benefit of being >>>> asynchronous because it doesn’t block any vhost-user messages (which are >>>> synchronous, and thus block downtime). >>>> >>>> Hanna >>> For better or worse kick is per ring. It's out of spec to start rings >>> that were not kicked but I guess you could do configuration ... >>> Seems somewhat asymmetrical though. 
>>> >>> Let's wait until next week, hopefully Yajun Wu will answer. >> The main motivation of adding VHOST_USER_SET_STATUS is to let backend >> DPDK know >> when DRIVER_OK bit is valid. It's an indication of all VQ configuration >> has sent, >> otherwise DPDK has to rely on first queue pair is ready, then >> receiving/applying >> VQ configuration one by one. >> >> During live migration, configuring VQ one by one is very time consuming. >> For VIRTIO >> net vDPA, HW needs to know how many VQs are enabled to set >> RSS(Receive-Side Scaling). >> >> If you don’t want SET_STATUS message, backend can remove protocol >> feature bit >> VHOST_USER_PROTOCOL_F_STATUS. >> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device >> close/reset. > This is incorrect, resetting the device on GET_VRING_BASE breaks > the stop/cont. Since you don't want to reset the VQs on stop/cont. Sorry for the misunderstanding, dpdk vhost backend framework doesn't have RESET concept(only device level .dev_conf and .dev_close). On receiving DRIVER_OK does dev_conf, on receiving GET_VRING_BASE does dev_close. For every VM suspend/resume, dpdk issues dev_close then dev_conf. > >> I'm not involved in discussion about adding SET_STATUS in Vhost >> protocol. This feature >> is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS). >> >> Thanks, >> Yajun >>>>>> Now, we could hand full control of the status byte to the guest, and that >>>>>> would make me content. But I feel like that doesn’t really work, because >>>>>> qemu needs to intercept the status byte anyway (it needs to know when there >>>>>> is a reset, probably wants to know when the device is configured, etc.), so >>>>>> I don’t think having the status byte in vhost-user really gains us much when >>>>>> qemu could translate status byte changes to/from other vhost-user commands. >>>>>> >>>>>> Hanna >>>>> well it intercepts it but I think it could pass it on unchanged. 
>>>>> >>>>> >>>>>>> I guess symmetry was the >>>>>>> point. So I don't see why SET_STATUS 0 has to be ignored. >>>>>>> >>>>>>> >>>>>>> SET_STATUS was introduced by: >>>>>>> >>>>>>> commit 923b8921d210763359e96246a58658ac0db6c645 >>>>>>> Author: Yajun Wu <yajunw@nvidia.com> >>>>>>> Date: Mon Oct 17 14:44:52 2022 +0800 >>>>>>> >>>>>>> vhost-user: Support vhost_dev_start >>>>>>> >>>>>>> CC the author. >>>>>>> >> _______________________________________________ >> Virtio-fs mailing list >> Virtio-fs@redhat.com >> https://listman.redhat.com/mailman/listinfo/virtio-fs > > > -- > German > ^ permalink raw reply [flat|nested] 53+ messages in thread
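The four-step message order quoted above (GET_FEATURES, GET_PROTOCOL_FEATURES, SET_PROTOCOL_FEATURES, then SET_FEATURES with need_reply) can be sketched roughly as follows. This is only an illustration: the macro and function names are made up for this example, and the bit positions are taken from the vhost-user specification as I understand it (feature bit 30 for VHOST_USER_F_PROTOCOL_FEATURES, protocol feature bit 3 for REPLY_ACK, header flag bit 3 for need_reply).

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as given in the vhost-user specification. */
#define VHOST_USER_F_PROTOCOL_FEATURES   30
#define VHOST_USER_PROTOCOL_F_REPLY_ACK  3

/* Header flags: bits 0-1 carry the version (0x1), bit 3 is need_reply.
 * These two macro names are illustrative, not taken from any code base. */
#define SKETCH_HDR_VERSION     0x1u
#define SKETCH_HDR_NEED_REPLY  (1u << 3)

/* Steps 1-3: REPLY_ACK is only usable if the back-end offers protocol
 * features at all and then offers REPLY_ACK among them. */
static bool can_use_reply_ack(uint64_t features, uint64_t protocol_features)
{
    if (!(features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES))) {
        return false;
    }
    return (protocol_features & (1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK)) != 0;
}

/* Step 4: header flags for SET_FEATURES, requesting a reply (and thus an
 * error indication) only when REPLY_ACK has been negotiated. */
static uint32_t set_features_header_flags(bool reply_ack_negotiated)
{
    uint32_t flags = SKETCH_HDR_VERSION;

    if (reply_ack_negotiated) {
        flags |= SKETCH_HDR_NEED_REPLY;
    }
    return flags;
}
```

With this, a back-end that cannot accept the feature set could report that in its reply to SET_FEATURES rather than through the status byte.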
* Re: [Virtio-fs] (no subject) 2023-10-10 2:56 ` Yajun Wu @ 2023-10-10 10:04 ` German Maglione 0 siblings, 0 replies; 53+ messages in thread From: German Maglione @ 2023-10-10 10:04 UTC (permalink / raw) To: Yajun Wu Cc: Michael S. Tsirkin, Hanna Czenczek, qemu-devel@nongnu.org, virtio-fs@redhat.com, Eugenio Pérez, maxime.coquelin@redhat.com, Parav Pandit, Anton Kuchin On Tue, Oct 10, 2023 at 4:57 AM Yajun Wu <yajunw@nvidia.com> wrote: > > > On 10/9/2023 6:28 PM, German Maglione wrote: > > External email: Use caution opening links or attachments > > > > > > On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote: > >> > >> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote: > >>> External email: Use caution opening links or attachments > >>> > >>> > >>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote: > >>>> On 06.10.23 11:26, Michael S. Tsirkin wrote: > >>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote: > >>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote: > >>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote: > >>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote: > >>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote: > >>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote: > >>>>>>>>>>> There is no clearly defined purpose for the virtio status byte in > >>>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio > >>>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK > >>>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors > >>>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES). > >>>>>>>>>>> > >>>>>>>>>>> As for implementations, SET_STATUS is not widely implemented. dpdk does > >>>>>>>>>>> implement it, but only uses it to signal feature negotiation failure. 
> >>>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively > >>>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today > >>>>>>>>>>> means the same thing as RESET_DEVICE). > >>>>>>>>>>> > >>>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not > >>>>>>>>>>> forward the guest-set status byte, but instead just makes it up > >>>>>>>>>>> internally, and actually completely ignores what the back-end returns, > >>>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single > >>>>>>>>>>> bits to it. Notably, after setting FEATURES_OK, it never reads it back > >>>>>>>>>>> to see whether the flag is still set, which is the only way in which > >>>>>>>>>>> dpdk uses the status byte. > >>>>>>>>>>> > >>>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this > >>>>>>>>>>> field in a useful manner, and it also provides no practical use over > >>>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly > >>>>>>>>>>> defined. Deprecate it. > >>>>>>>>>>> > >>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >>>>>>>>>>> --- > >>>>>>>>>>> docs/interop/vhost-user.rst | 28 +++++++++++++++++++++------- > >>>>>>>>>>> 1 file changed, 21 insertions(+), 7 deletions(-) > >>>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK. > >>>>>>>>> The fact current backends never check errors does not mean they never > >>>>>>>>> will. So no, not applying this. > >>>>>>>> Can this not be done with REPLY_ACK? I.e., with the following message > >>>>>>>> order: > >>>>>>>> > >>>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is > >>>>>>>> present > >>>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK > >>>>>>>> 3. 
SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK > >>>>>>>> 4. SET_FEATURES with need_reply > >>>>>>>> > >>>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the > >>>>>>>> vCPUs are stopped, which generally seems to request a device reset. If we > >>>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will > >>>>>>>> implement SET_STATUS later may break with at least these qemu versions. But > >>>>>>>> documenting that a particular use of the status byte is to be ignored would > >>>>>>>> be really strange. > >>>>>>>> > >>>>>>>> Hanna > >>>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me... > >>>>>>> vhost-user reconfigures the state fully on start. > >>>>>> Not the internal device state, though. virtiofsd has internal state, and > >>>>>> other devices like vhost-gpu back-ends would probably, too. > >>>>>> > >>>>>> Stefan has recently sent a series > >>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to > >>>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a > >>>>>> reset). > >>>>>> > >>>>>> I really don’t like our current approach with the status byte. Following the > >>>>>> virtio specification to me would mean that the guest directly controls this > >>>>>> byte, which it does not. qemu makes up values as it deems appropriate, and > >>>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e. > >>>>>> when the guest really doesn’t want a device reset. > >>>>>> > >>>>>> That means that qemu does not treat this as a virtio device field (because > >>>>>> that would mean exposing it to the guest driver), but instead treats it as > >>>>>> part of the vhost(-user) protocol. It doesn’t feel right to me that we use > >>>>>> a virtio-defined feature for communication on the vhost level, i.e. between > >>>>>> front-end and back-end, and not between guest driver and device. 
I think > >>>>>> all vhost-level protocol features should be fully defined in the vhost-user > >>>>>> specification, which REPLY_ACK is. > >>>>> Hmm that makes sense. Maybe we should have done what stefan's patch > >>>>> is doing. > >>>>> > >>>>> Do look at the original commit that introduced it to understand why > >>>>> it was added. > >>>> I don’t understand why this was added to the stop/cont code, though. If it > >>>> is time consuming to make these changes, why are they done every time the VM > >>>> is paused > >>>> and resumed? It makes sense that this would be done for the initial > >>>> configuration (where a reset also wouldn’t hurt), but here it seems wrong. > >>>> > >>>> (To be clear, a reset in the stop/cont code is wrong, because it breaks > >>>> stateful devices.) > >>>> > >>>> Also, note the newer commits 6f8be29ec17 and c3716f260bf. The reset as > >>>> originally introduced was wrong even for non-stateful devices, because it > >>>> occurred before we fetched the state (vring indices) so we could restore it > >>>> later. I don’t know how 923b8921d21 was tested, but if the back-end used > >>>> for testing implemented SET_STATUS 0 as a reset, it could not have survived > >>>> either migration or a stop/cont in general, because the vring indices would > >>>> have been reset to 0. > >>>> > >>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all > >>>> devices that would implement them as per virtio spec, and even today it’s > >>>> broken for stateful devices. The mentioned performance issue is likely > >>>> real, but we can’t address it by making up SET_STATUS calls that are wrong. > >>>> > >>>> I concede that I didn’t think about DRIVER_OK. Personally, I would do all > >>>> final configuration that would happen upon a DRIVER_OK once the first vring > >>>> is started (i.e. receives a kick). 
That has the added benefit of being > >>>> asynchronous because it doesn’t block any vhost-user messages (which are > >>>> synchronous, and thus block downtime). > >>>> > >>>> Hanna > >>> For better or worse kick is per ring. It's out of spec to start rings > >>> that were not kicked but I guess you could do configuration ... > >>> Seems somewhat asymmetrical though. > >>> > >>> Let's wait until next week, hopefully Yajun Wu will answer. > >> The main motivation of adding VHOST_USER_SET_STATUS is to let backend > >> DPDK know > >> when DRIVER_OK bit is valid. It's an indication of all VQ configuration > >> has sent, > >> otherwise DPDK has to rely on first queue pair is ready, then > >> receiving/applying > >> VQ configuration one by one. > >> > >> During live migration, configuring VQ one by one is very time consuming. > >> For VIRTIO > >> net vDPA, HW needs to know how many VQs are enabled to set > >> RSS(Receive-Side Scaling). > >> > >> If you don’t want SET_STATUS message, backend can remove protocol > >> feature bit > >> VHOST_USER_PROTOCOL_F_STATUS. > >> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device > >> close/reset. > > This is incorrect, resetting the device on GET_VRING_BASE breaks > > the stop/cont. Since you don't want to reset the VQs on stop/cont. > Sorry for the misunderstanding, dpdk vhost backend framework doesn't > have RESET concept(only device level .dev_conf and .dev_close). On > receiving DRIVER_OK does dev_conf, on receiving GET_VRING_BASE does > dev_close. For every VM suspend/resume, dpdk issues dev_close then dev_conf. (Sorry, I did not explain myself well.) I meant that resetting the VQs upon receiving GET_VRING_BASE makes the backend fail if qemu continues after a "stop". I notice that when dpdk receives a GET_VRING_BASE[0], it calls 'vring_invalidate(dev, vq);'[1], which resets the VQ[2]; doing that is incorrect.
[0] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2135 [1] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2201 [2] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost.c#L580 > > > >> I'm not involved in discussion about adding SET_STATUS in Vhost > >> protocol. This feature > >> is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS). > >> > >> Thanks, > >> Yajun > >>>>>> Now, we could hand full control of the status byte to the guest, and that > >>>>>> would make me content. But I feel like that doesn’t really work, because > >>>>>> qemu needs to intercept the status byte anyway (it needs to know when there > >>>>>> is a reset, probably wants to know when the device is configured, etc.), so > >>>>>> I don’t think having the status byte in vhost-user really gains us much when > >>>>>> qemu could translate status byte changes to/from other vhost-user commands. > >>>>>> > >>>>>> Hanna > >>>>> well it intercepts it but I think it could pass it on unchanged. > >>>>> > >>>>> > >>>>>>> I guess symmetry was the > >>>>>>> point. So I don't see why SET_STATUS 0 has to be ignored. > >>>>>>> > >>>>>>> > >>>>>>> SET_STATUS was introduced by: > >>>>>>> > >>>>>>> commit 923b8921d210763359e96246a58658ac0db6c645 > >>>>>>> Author: Yajun Wu <yajunw@nvidia.com> > >>>>>>> Date: Mon Oct 17 14:44:52 2022 +0800 > >>>>>>> > >>>>>>> vhost-user: Support vhost_dev_start > >>>>>>> > >>>>>>> CC the author. > >>>>>>> > >> _______________________________________________ > >> Virtio-fs mailing list > >> Virtio-fs@redhat.com > >> https://listman.redhat.com/mailman/listinfo/virtio-fs > > > > > > -- > > German > > > -- German ^ permalink raw reply [flat|nested] 53+ messages in thread
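To illustrate the point about stop/cont, here is a minimal sketch of a back-end that stops the vring on GET_VRING_BASE without resetting it. The struct and function names are hypothetical and do not correspond to DPDK or QEMU code; only a split virtqueue is considered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vring state of a back-end (split virtqueue only). */
struct stopcont_vring {
    bool     started;
    uint16_t last_avail_idx;  /* next Available Ring entry to process */
};

/* GET_VRING_BASE: stop processing and report the index, but keep the
 * vring state intact so that a later "cont" can resume from it. */
static uint32_t handle_get_vring_base(struct stopcont_vring *vq)
{
    vq->started = false;
    return vq->last_avail_idx;
}

/* SET_VRING_BASE: the front-end hands the saved index back. */
static void handle_set_vring_base(struct stopcont_vring *vq, uint32_t num)
{
    vq->last_avail_idx = (uint16_t)(num & 0xffff);
}
```

If the GET_VRING_BASE handler zeroed last_avail_idx instead, the value the front-end later passes to SET_VRING_BASE would still be correct, but any other per-vring state dropped along the way would be lost across the stop/cont cycle.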
* [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc 2023-10-04 12:58 [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek 2023-10-04 12:58 ` [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek @ 2023-10-04 12:58 ` Hanna Czenczek 2023-10-05 17:38 ` Stefan Hajnoczi 2023-10-04 12:58 ` [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek ` (6 subsequent siblings) 8 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-04 12:58 UTC (permalink / raw) To: qemu-devel, virtio-fs Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin GET_VRING_BASE does not mention that it stops the respective ring. Fix that. Furthermore, it is not fully clear what the "base offset" these commands' documentation refers to is; an offset could be many things. Be more precise and verbose about it, especially given that these commands use different payload structures depending on whether the vring is split or packed. 
Signed-off-by: Hanna Czenczek <hreitz@redhat.com> --- docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++--- 1 file changed, 62 insertions(+), 4 deletions(-) diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst index 2f68e67a1a..50f5acebe5 100644 --- a/docs/interop/vhost-user.rst +++ b/docs/interop/vhost-user.rst @@ -108,6 +108,37 @@ A vring state description :num: a 32-bit number +A vring descriptor index for split virtqueues +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++-------------+---------------------+ +| vring index | index in avail ring | ++-------------+---------------------+ + +:vring index: 32-bit index of the respective virtqueue + +:index in avail ring: 32-bit value, of which currently only the lower 16 + bits are used: + + - Bits 0–15: Next descriptor index in the *Available Ring* + - Bits 16–31: Reserved (set to zero) + +Vring descriptor indices for packed virtqueues +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++-------------+--------------------+ +| vring index | descriptor indices | ++-------------+--------------------+ + +:vring index: 32-bit index of the respective virtqueue + +:descriptor indices: 32-bit value: + + - Bits 0–14: Index in the *Available Ring* + - Bit 15: Driver (Available) Ring Wrap Counter + - Bits 16–30: Index in the *Used Ring* + - Bit 31: Device (Used) Ring Wrap Counter + A vring address description ^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -1031,18 +1062,45 @@ Front-end message types ``VHOST_USER_SET_VRING_BASE`` :id: 10 :equivalent ioctl: ``VHOST_SET_VRING_BASE`` - :request payload: vring state description + :request payload: vring descriptor index/indices :reply payload: N/A - Sets the base offset in the available vring. + Sets the next index to use for descriptors in this vring: + + * For a split virtqueue, sets only the next descriptor index in the + *Available Ring*. The device is supposed to read the next index in + the *Used Ring* from the respective vring structure in guest memory. 
+ + * For a packed virtqueue, both indices are supplied, as they are not + explicitly available in memory. + + Consequently, the payload type is specific to the type of virt queue + (*a vring descriptor index for split virtqueues* vs. *vring descriptor + indices for packed virtqueues*). ``VHOST_USER_GET_VRING_BASE`` :id: 11 :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` :request payload: vring state description - :reply payload: vring state description + :reply payload: vring descriptor index/indices + + Stops the vring and returns the current descriptor index or indices: + + * For a split virtqueue, returns only the 16-bit next descriptor + index in the *Available Ring*. The index in the *Used Ring* is + controlled by the guest driver and can be read from the vring + structure in memory, so is not covered. + + * For a packed virtqueue, neither index is explicitly available to + read from memory, so both indices (as maintained by the device) are + returned. + + Consequently, the payload type is specific to the type of virt queue + (*a vring descriptor index for split virtqueues* vs. *vring descriptor + indices for packed virtqueues*). - Get the available vring base offset. + The request payload’s *num* field is currently reserved and must be + set to 0. ``VHOST_USER_SET_VRING_KICK`` :id: 12 -- 2.41.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
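The two 32-bit payload layouts introduced by this patch can be sketched in C as below. The type and function names are made up for illustration; only the bit assignments come from the patch text (bits 0-14 and 16-30 for the indices, bits 15 and 31 for the wrap counters, and the lower 16 bits for a split virtqueue).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical unpacked view of the "descriptor indices" value for
 * packed virtqueues. */
struct packed_vring_indices {
    uint16_t avail_idx;   /* bits 0-14 */
    bool     avail_wrap;  /* bit 15: Driver (Available) Ring Wrap Counter */
    uint16_t used_idx;    /* bits 16-30 */
    bool     used_wrap;   /* bit 31: Device (Used) Ring Wrap Counter */
};

static uint32_t packed_indices_encode(struct packed_vring_indices s)
{
    return (uint32_t)(s.avail_idx & 0x7fff) |
           ((uint32_t)s.avail_wrap << 15) |
           ((uint32_t)(s.used_idx & 0x7fff) << 16) |
           ((uint32_t)s.used_wrap << 31);
}

static struct packed_vring_indices packed_indices_decode(uint32_t num)
{
    struct packed_vring_indices s = {
        .avail_idx  = (uint16_t)(num & 0x7fff),
        .avail_wrap = ((num >> 15) & 1) != 0,
        .used_idx   = (uint16_t)((num >> 16) & 0x7fff),
        .used_wrap  = ((num >> 31) & 1) != 0,
    };
    return s;
}

/* Split virtqueue: only the lower 16 bits are used, bits 16-31 stay zero. */
static uint32_t split_index_encode(uint16_t next_avail_idx)
{
    return next_avail_idx;
}
```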
* Re: [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc 2023-10-04 12:58 ` [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc Hanna Czenczek @ 2023-10-05 17:38 ` Stefan Hajnoczi 2023-10-06 7:53 ` [Virtio-fs] " Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Stefan Hajnoczi @ 2023-10-05 17:38 UTC (permalink / raw) To: Hanna Czenczek Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin [-- Attachment #1: Type: text/plain, Size: 4758 bytes --] On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote: > GET_VRING_BASE does not mention that it stops the respective ring. Fix > that. > > Furthermore, it is not fully clear what the "base offset" these > commands' documentation refers to is; an offset could be many things. > Be more precise and verbose about it, especially given that these > commands use different payload structures depending on whether the vring > is split or packed. > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++--- > 1 file changed, 62 insertions(+), 4 deletions(-) > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index 2f68e67a1a..50f5acebe5 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -108,6 +108,37 @@ A vring state description > > :num: a 32-bit number > > +A vring descriptor index for split virtqueues > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > ++-------------+---------------------+ > +| vring index | index in avail ring | > ++-------------+---------------------+ > + > +:vring index: 32-bit index of the respective virtqueue > + > +:index in avail ring: 32-bit value, of which currently only the lower 16 > + bits are used: > + > + - Bits 0–15: Next descriptor index in the *Available Ring* I think we need to say more to make this implementable just by reading the spec: Index of the next *Available Ring* descriptor 
that the back-end will process. This is a free-running index that is not wrapped by the ring size. Feel free to rephrase. > + - Bits 16–31: Reserved (set to zero) > + > +Vring descriptor indices for packed virtqueues > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > ++-------------+--------------------+ > +| vring index | descriptor indices | > ++-------------+--------------------+ > + > +:vring index: 32-bit index of the respective virtqueue > + > +:descriptor indices: 32-bit value: > + > + - Bits 0–14: Index in the *Available Ring* Same here. > + - Bit 15: Driver (Available) Ring Wrap Counter > + - Bits 16–30: Index in the *Used Ring* Same here. > + - Bit 31: Device (Used) Ring Wrap Counter > + > A vring address description > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > @@ -1031,18 +1062,45 @@ Front-end message types > ``VHOST_USER_SET_VRING_BASE`` > :id: 10 > :equivalent ioctl: ``VHOST_SET_VRING_BASE`` > - :request payload: vring state description > + :request payload: vring descriptor index/indices > :reply payload: N/A > > - Sets the base offset in the available vring. > + Sets the next index to use for descriptors in this vring: > + > + * For a split virtqueue, sets only the next descriptor index in the > + *Available Ring*. The device is supposed to read the next index in > + the *Used Ring* from the respective vring structure in guest memory. > + > + * For a packed virtqueue, both indices are supplied, as they are not > + explicitly available in memory. > + > + Consequently, the payload type is specific to the type of virt queue > + (*a vring descriptor index for split virtqueues* vs. *vring descriptor > + indices for packed virtqueues*). 
> > ``VHOST_USER_GET_VRING_BASE`` > :id: 11 > :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` > :request payload: vring state description > - :reply payload: vring state description > + :reply payload: vring descriptor index/indices > + > + Stops the vring and returns the current descriptor index or indices: > + > + * For a split virtqueue, returns only the 16-bit next descriptor > + index in the *Available Ring*. The index in the *Used Ring* is > + controlled by the guest driver and can be read from the vring I find "is controlled by the guest driver" confusing. The device writes the Used Ring index. The driver only reads it. The device is the active party here. The sentence can be shortened to omit the "controlled by the guest driver" part. > + structure in memory, so is not covered. > + > + * For a packed virtqueue, neither index is explicitly available to > + read from memory, so both indices (as maintained by the device) are > + returned. > + > + Consequently, the payload type is specific to the type of virt queue > + (*a vring descriptor index for split virtqueues* vs. *vring descriptor > + indices for packed virtqueues*). > > - Get the available vring base offset. > + The request payload’s *num* field is currently reserved and must be > + set to 0. > > ``VHOST_USER_SET_VRING_KICK`` > :id: 12 > -- > 2.41.0 > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 53+ messages in thread
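As a small illustration of the "free-running index that is not wrapped by the ring size" wording suggested above: for a split virtqueue, the 16-bit index counts up modulo 65536, and the ring slot is only derived from it when the ring is accessed. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ring slot addressed by a free-running 16-bit index. */
static uint16_t ring_slot(uint16_t free_running_idx, uint16_t queue_size)
{
    assert(queue_size != 0);
    return free_running_idx % queue_size;
}

/* Entries the driver has made available that the back-end has not yet
 * processed. The cast back to uint16_t makes the subtraction behave
 * correctly when the driver's index has wrapped past 65535. */
static uint16_t pending_entries(uint16_t driver_avail_idx,
                                uint16_t last_avail_idx)
{
    return (uint16_t)(driver_avail_idx - last_avail_idx);
}
```

So a value of 257 in a 256-entry ring refers to slot 1, and it is the free-running value, not the slot, that would travel in the payload.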
* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc 2023-10-05 17:38 ` Stefan Hajnoczi @ 2023-10-06 7:53 ` Hanna Czenczek 2023-10-06 8:49 ` Michael S. Tsirkin 0 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-06 7:53 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Michael S . Tsirkin, qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin On 05.10.23 19:38, Stefan Hajnoczi wrote: > On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote: >> GET_VRING_BASE does not mention that it stops the respective ring. Fix >> that. >> >> Furthermore, it is not fully clear what the "base offset" these >> commands' documentation refers to is; an offset could be many things. >> Be more precise and verbose about it, especially given that these >> commands use different payload structures depending on whether the vring >> is split or packed. >> >> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >> --- >> docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++--- >> 1 file changed, 62 insertions(+), 4 deletions(-) >> >> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst >> index 2f68e67a1a..50f5acebe5 100644 >> --- a/docs/interop/vhost-user.rst >> +++ b/docs/interop/vhost-user.rst >> @@ -108,6 +108,37 @@ A vring state description >> >> :num: a 32-bit number >> >> +A vring descriptor index for split virtqueues >> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> + >> ++-------------+---------------------+ >> +| vring index | index in avail ring | >> ++-------------+---------------------+ >> + >> +:vring index: 32-bit index of the respective virtqueue >> + >> +:index in avail ring: 32-bit value, of which currently only the lower 16 >> + bits are used: >> + >> + - Bits 0–15: Next descriptor index in the *Available Ring* > I think we need to say more to make this implementable just by reading > the spec: > > Index of the next *Available Ring* descriptor that the back-end will > process. 
This is a free-running index that is not wrapped by the ring > size. Sure, thanks. > Feel free to rephrase. > >> + - Bits 16–31: Reserved (set to zero) >> + >> +Vring descriptor indices for packed virtqueues >> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> + >> ++-------------+--------------------+ >> +| vring index | descriptor indices | >> ++-------------+--------------------+ >> + >> +:vring index: 32-bit index of the respective virtqueue >> + >> +:descriptor indices: 32-bit value: >> + >> + - Bits 0–14: Index in the *Available Ring* > Same here. > >> + - Bit 15: Driver (Available) Ring Wrap Counter >> + - Bits 16–30: Index in the *Used Ring* > Same here. > >> + - Bit 31: Device (Used) Ring Wrap Counter >> + >> A vring address description >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> >> @@ -1031,18 +1062,45 @@ Front-end message types >> ``VHOST_USER_SET_VRING_BASE`` >> :id: 10 >> :equivalent ioctl: ``VHOST_SET_VRING_BASE`` >> - :request payload: vring state description >> + :request payload: vring descriptor index/indices >> :reply payload: N/A >> >> - Sets the base offset in the available vring. >> + Sets the next index to use for descriptors in this vring: >> + >> + * For a split virtqueue, sets only the next descriptor index in the >> + *Available Ring*. The device is supposed to read the next index in >> + the *Used Ring* from the respective vring structure in guest memory. >> + >> + * For a packed virtqueue, both indices are supplied, as they are not >> + explicitly available in memory. >> + >> + Consequently, the payload type is specific to the type of virt queue >> + (*a vring descriptor index for split virtqueues* vs. *vring descriptor >> + indices for packed virtqueues*). 
>> >> ``VHOST_USER_GET_VRING_BASE`` >> :id: 11 >> :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` >> :request payload: vring state description >> - :reply payload: vring state description >> + :reply payload: vring descriptor index/indices >> + >> + Stops the vring and returns the current descriptor index or indices: >> + >> + * For a split virtqueue, returns only the 16-bit next descriptor >> + index in the *Available Ring*. The index in the *Used Ring* is >> + controlled by the guest driver and can be read from the vring > I find "is controlled by the guest driver" confusing. The device writes > the Used Ring index. The driver only reads it. The device is the active > party here. Er, good point. That breaks the whole reasoning. Then I don’t understand why we do get/set the available ring index and not the used ring index. Do you know why? > The sentence can be shortened to omit the "controlled by the guest > driver" part. I don’t want to shorten it, because I would like to know why we don’t get/set both indices for split virtqueues, too. Hanna >> + structure in memory, so is not covered. >> + >> + * For a packed virtqueue, neither index is explicitly available to >> + read from memory, so both indices (as maintained by the device) are >> + returned. >> + >> + Consequently, the payload type is specific to the type of virt queue >> + (*a vring descriptor index for split virtqueues* vs. *vring descriptor >> + indices for packed virtqueues*). >> >> - Get the available vring base offset. >> + The request payload’s *num* field is currently reserved and must be >> + set to 0. >> >> ``VHOST_USER_SET_VRING_KICK`` >> :id: 12 >> -- >> 2.41.0 >> >> >> _______________________________________________ >> Virtio-fs mailing list >> Virtio-fs@redhat.com >> https://listman.redhat.com/mailman/listinfo/virtio-fs ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc 2023-10-06 7:53 ` [Virtio-fs] " Hanna Czenczek @ 2023-10-06 8:49 ` Michael S. Tsirkin 2023-10-06 13:55 ` Hanna Czenczek 0 siblings, 1 reply; 53+ messages in thread From: Michael S. Tsirkin @ 2023-10-06 8:49 UTC (permalink / raw) To: Hanna Czenczek Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Eugenio Pérez, Anton Kuchin On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote: > On 05.10.23 19:38, Stefan Hajnoczi wrote: > > On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote: > > > GET_VRING_BASE does not mention that it stops the respective ring. Fix > > > that. > > > > > > Furthermore, it is not fully clear what the "base offset" these > > > commands' documentation refers to is; an offset could be many things. > > > Be more precise and verbose about it, especially given that these > > > commands use different payload structures depending on whether the vring > > > is split or packed. > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > --- > > > docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++--- > > > 1 file changed, 62 insertions(+), 4 deletions(-) > > > > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > > > index 2f68e67a1a..50f5acebe5 100644 > > > --- a/docs/interop/vhost-user.rst > > > +++ b/docs/interop/vhost-user.rst > > > @@ -108,6 +108,37 @@ A vring state description > > > :num: a 32-bit number > > > +A vring descriptor index for split virtqueues > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > + > > > ++-------------+---------------------+ > > > +| vring index | index in avail ring | > > > ++-------------+---------------------+ > > > + > > > +:vring index: 32-bit index of the respective virtqueue > > > + > > > +:index in avail ring: 32-bit value, of which currently only the lower 16 > > > + bits are used: > > > + > > > + - Bits 0–15: Next descriptor index in the *Available Ring* > > I think 
we need to say more to make this implementable just by reading > > the spec: > > > > Index of the next *Available Ring* descriptor that the back-end will > > process. This is a free-running index that is not wrapped by the ring > > size. > > Sure, thanks. > > > Feel free to rephrase. > > > > > + - Bits 16–31: Reserved (set to zero) > > > + > > > +Vring descriptor indices for packed virtqueues > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > + > > > ++-------------+--------------------+ > > > +| vring index | descriptor indices | > > > ++-------------+--------------------+ > > > + > > > +:vring index: 32-bit index of the respective virtqueue > > > + > > > +:descriptor indices: 32-bit value: > > > + > > > + - Bits 0–14: Index in the *Available Ring* > > Same here. > > > > > + - Bit 15: Driver (Available) Ring Wrap Counter > > > + - Bits 16–30: Index in the *Used Ring* > > Same here. > > > > > + - Bit 31: Device (Used) Ring Wrap Counter > > > + > > > A vring address description > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > @@ -1031,18 +1062,45 @@ Front-end message types > > > ``VHOST_USER_SET_VRING_BASE`` > > > :id: 10 > > > :equivalent ioctl: ``VHOST_SET_VRING_BASE`` > > > - :request payload: vring state description > > > + :request payload: vring descriptor index/indices > > > :reply payload: N/A > > > - Sets the base offset in the available vring. > > > + Sets the next index to use for descriptors in this vring: > > > + > > > + * For a split virtqueue, sets only the next descriptor index in the > > > + *Available Ring*. The device is supposed to read the next index in > > > + the *Used Ring* from the respective vring structure in guest memory. > > > + > > > + * For a packed virtqueue, both indices are supplied, as they are not > > > + explicitly available in memory. > > > + > > > + Consequently, the payload type is specific to the type of virt queue > > > + (*a vring descriptor index for split virtqueues* vs. 
*vring descriptor > > > + indices for packed virtqueues*). > > > ``VHOST_USER_GET_VRING_BASE`` > > > :id: 11 > > > :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` > > > :request payload: vring state description > > > - :reply payload: vring state description > > > + :reply payload: vring descriptor index/indices > > > + > > > + Stops the vring and returns the current descriptor index or indices: > > > + > > > + * For a split virtqueue, returns only the 16-bit next descriptor > > > + index in the *Available Ring*. The index in the *Used Ring* is > > > + controlled by the guest driver and can be read from the vring > > I find "is controlled by the guest driver" confusing. The device writes > > the Used Ring index. The driver only reads it. The device is the active > > party here. > > Er, good point. That breaks the whole reasoning. Then I don’t understand > why we do get/set the available ring index and not the used ring index. Do > you know why? It's simple. used ring index in memory is controlled by the device and reflects device state. device can just read it back to restore. available ring index in memory is controlled by driver and does not reflect device state. > > The sentence can be shortened to omit the "controlled by the guest > > driver" part. > > I don’t want to shorten it, because I would like to know why we don’t > get/set both indices for split virtqueues, too. > > Hanna > > > > + structure in memory, so is not covered. > > > + > > > + * For a packed virtqueue, neither index is explicitly available to > > > + read from memory, so both indices (as maintained by the device) are > > > + returned. > > > + > > > + Consequently, the payload type is specific to the type of virt queue > > > + (*a vring descriptor index for split virtqueues* vs. *vring descriptor > > > + indices for packed virtqueues*). > > > - Get the available vring base offset. > > > + The request payload’s *num* field is currently reserved and must be > > > + set to 0. 
> > > ``VHOST_USER_SET_VRING_KICK`` > > > :id: 12 > > > -- > > > 2.41.0 ^ permalink raw reply [flat|nested] 53+ messages in thread
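As an illustration of the two payload layouts proposed in the quoted patch, here is a minimal Python sketch of how the 32-bit values could be packed and unpacked. The function names are hypothetical and do not come from QEMU or any vhost-user library; only the bit positions are taken from the quoted documentation text.

```python
def pack_split_index(avail_idx):
    """Split virtqueue: bits 0-15 carry the index, bits 16-31 are reserved (zero)."""
    if not 0 <= avail_idx <= 0xFFFF:
        raise ValueError("split-ring index must fit in 16 bits")
    return avail_idx


def pack_packed_indices(avail_idx, avail_wrap, used_idx, used_wrap):
    """Packed virtqueue: two 15-bit indices, each followed by a wrap counter bit."""
    if not 0 <= avail_idx <= 0x7FFF or not 0 <= used_idx <= 0x7FFF:
        raise ValueError("packed-ring indices must fit in 15 bits")
    return (
        avail_idx                    # bits 0-14: index in the Available Ring
        | (int(bool(avail_wrap)) << 15)  # bit 15: driver ring wrap counter
        | (used_idx << 16)           # bits 16-30: index in the Used Ring
        | (int(bool(used_wrap)) << 31)   # bit 31: device ring wrap counter
    )


def unpack_packed_indices(value):
    """Inverse of pack_packed_indices(); returns a 4-tuple."""
    return (
        value & 0x7FFF,
        bool((value >> 15) & 1),
        (value >> 16) & 0x7FFF,
        bool((value >> 31) & 1),
    )
```

A front-end would place such a value next to the 32-bit vring index in the message payload; how the payload is serialized on the socket is outside the scope of this sketch.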
* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc 2023-10-06 8:49 ` Michael S. Tsirkin @ 2023-10-06 13:55 ` Hanna Czenczek 2023-10-06 13:58 ` Hanna Czenczek 2023-10-07 21:27 ` Michael S. Tsirkin 0 siblings, 2 replies; 53+ messages in thread From: Hanna Czenczek @ 2023-10-06 13:55 UTC (permalink / raw) To: Michael S. Tsirkin Cc: virtio-fs, Eugenio Pérez, Anton Kuchin, qemu-devel, Stefan Hajnoczi On 06.10.23 10:49, Michael S. Tsirkin wrote: > On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote: >> On 05.10.23 19:38, Stefan Hajnoczi wrote: >>> On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote: >>>> GET_VRING_BASE does not mention that it stops the respective ring. Fix >>>> that. >>>> >>>> Furthermore, it is not fully clear what the "base offset" these >>>> commands' documentation refers to is; an offset could be many things. >>>> Be more precise and verbose about it, especially given that these >>>> commands use different payload structures depending on whether the vring >>>> is split or packed. 
>>>> >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>> --- >>>> docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++--- >>>> 1 file changed, 62 insertions(+), 4 deletions(-) >>>> >>>> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst >>>> index 2f68e67a1a..50f5acebe5 100644 >>>> --- a/docs/interop/vhost-user.rst >>>> +++ b/docs/interop/vhost-user.rst >>>> @@ -108,6 +108,37 @@ A vring state description >>>> :num: a 32-bit number >>>> +A vring descriptor index for split virtqueues >>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >>>> + >>>> ++-------------+---------------------+ >>>> +| vring index | index in avail ring | >>>> ++-------------+---------------------+ >>>> + >>>> +:vring index: 32-bit index of the respective virtqueue >>>> + >>>> +:index in avail ring: 32-bit value, of which currently only the lower 16 >>>> + bits are used: >>>> + >>>> + - Bits 0–15: Next descriptor index in the *Available Ring* >>> I think we need to say more to make this implementable just by reading >>> the spec: >>> >>> Index of the next *Available Ring* descriptor that the back-end will >>> process. This is a free-running index that is not wrapped by the ring >>> size. >> Sure, thanks. >> >>> Feel free to rephrase. >>> >>>> + - Bits 16–31: Reserved (set to zero) >>>> + >>>> +Vring descriptor indices for packed virtqueues >>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >>>> + >>>> ++-------------+--------------------+ >>>> +| vring index | descriptor indices | >>>> ++-------------+--------------------+ >>>> + >>>> +:vring index: 32-bit index of the respective virtqueue >>>> + >>>> +:descriptor indices: 32-bit value: >>>> + >>>> + - Bits 0–14: Index in the *Available Ring* >>> Same here. >>> >>>> + - Bit 15: Driver (Available) Ring Wrap Counter >>>> + - Bits 16–30: Index in the *Used Ring* >>> Same here. 
>>> >>>> + - Bit 31: Device (Used) Ring Wrap Counter >>>> + >>>> A vring address description >>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^ >>>> @@ -1031,18 +1062,45 @@ Front-end message types >>>> ``VHOST_USER_SET_VRING_BASE`` >>>> :id: 10 >>>> :equivalent ioctl: ``VHOST_SET_VRING_BASE`` >>>> - :request payload: vring state description >>>> + :request payload: vring descriptor index/indices >>>> :reply payload: N/A >>>> - Sets the base offset in the available vring. >>>> + Sets the next index to use for descriptors in this vring: >>>> + >>>> + * For a split virtqueue, sets only the next descriptor index in the >>>> + *Available Ring*. The device is supposed to read the next index in >>>> + the *Used Ring* from the respective vring structure in guest memory. >>>> + >>>> + * For a packed virtqueue, both indices are supplied, as they are not >>>> + explicitly available in memory. >>>> + >>>> + Consequently, the payload type is specific to the type of virt queue >>>> + (*a vring descriptor index for split virtqueues* vs. *vring descriptor >>>> + indices for packed virtqueues*). >>>> ``VHOST_USER_GET_VRING_BASE`` >>>> :id: 11 >>>> :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` >>>> :request payload: vring state description >>>> - :reply payload: vring state description >>>> + :reply payload: vring descriptor index/indices >>>> + >>>> + Stops the vring and returns the current descriptor index or indices: >>>> + >>>> + * For a split virtqueue, returns only the 16-bit next descriptor >>>> + index in the *Available Ring*. The index in the *Used Ring* is >>>> + controlled by the guest driver and can be read from the vring >>> I find "is controlled by the guest driver" confusing. The device writes >>> the Used Ring index. The driver only reads it. The device is the active >>> party here. >> Er, good point. That breaks the whole reasoning. Then I don’t understand >> why we do get/set the available ring index and not the used ring index. Do >> you know why? > It's simple. 
used ring index in memory is controlled by the device and > reflects device state. Exactly, it’s device state, that’s why I thought the front-end needs to ensure it’s read and restored around the reset we currently have in vhost_dev_stop()/start(). > device can just read it back to restore. I find it strange that the device is supposed to read its own state from memory. > available ring index in memory is controlled by driver and does > not reflect device state. Why can’t the device read the available index from memory? That value is put into memory by the driver precisely so the device can read it from there. Hanna ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc 2023-10-06 13:55 ` Hanna Czenczek @ 2023-10-06 13:58 ` Hanna Czenczek 2023-10-07 21:29 ` Michael S. Tsirkin 2023-10-07 21:27 ` Michael S. Tsirkin 1 sibling, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-06 13:58 UTC (permalink / raw) To: Michael S. Tsirkin Cc: virtio-fs, Eugenio Pérez, Anton Kuchin, qemu-devel, Stefan Hajnoczi On 06.10.23 15:55, Hanna Czenczek wrote: > On 06.10.23 10:49, Michael S. Tsirkin wrote: >> On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote: >>> On 05.10.23 19:38, Stefan Hajnoczi wrote: >>>> On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote: [...] >>>> ``VHOST_USER_GET_VRING_BASE`` >>>> :id: 11 >>>> :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` >>>> :request payload: vring state description >>>> - :reply payload: vring state description >>>> + :reply payload: vring descriptor index/indices >>>> + >>>> + Stops the vring and returns the current descriptor index or >>>> indices: >>>> + >>>> + * For a split virtqueue, returns only the 16-bit next descriptor >>>> + index in the *Available Ring*. The index in the *Used Ring* is >>>> + controlled by the guest driver and can be read from the vring >>>> I find "is controlled by the guest driver" confusing. The device >>>> writes >>>> the Used Ring index. The driver only reads it. The device is the >>>> active >>>> party here. >>> Er, good point. That breaks the whole reasoning. Then I don’t >>> understand >>> why we do get/set the available ring index and not the used ring >>> index. Do >>> you know why? >> It's simple. used ring index in memory is controlled by the device and >> reflects device state. > > Exactly, it’s device state, that’s why I thought the front-end needs > to ensure its read and restored around the reset we currently have in > vhost_dev_stop()/start(). > >> device can just read it back to restore. 
> > I find it strange that the device is supposed to read its own state > from memory. > >> available ring index in memory is controlled by driver and does >> not reflect device state. > > Why can’t the device read the available index from memory? That value > is put into memory by the driver precisely so the device can read it > from there. Ah, wait, is the idea that the device may have an internal available index counter that reflects which descriptors it has already fetched? I.e. this index will lag behind the one in memory, and the difference is the set of new descriptors that the device still needs to read? If that internal counter is the index that’s get/set here, then yes, that makes a lot of sense. Hanna ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc 2023-10-06 13:58 ` Hanna Czenczek @ 2023-10-07 21:29 ` Michael S. Tsirkin 0 siblings, 0 replies; 53+ messages in thread From: Michael S. Tsirkin @ 2023-10-07 21:29 UTC (permalink / raw) To: Hanna Czenczek Cc: virtio-fs, Eugenio Pérez, Anton Kuchin, qemu-devel, Stefan Hajnoczi On Fri, Oct 06, 2023 at 03:58:44PM +0200, Hanna Czenczek wrote: > On 06.10.23 15:55, Hanna Czenczek wrote: > > On 06.10.23 10:49, Michael S. Tsirkin wrote: > > > On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote: > > > > On 05.10.23 19:38, Stefan Hajnoczi wrote: > > > > > On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote: > > [...] > > > > > > ``VHOST_USER_GET_VRING_BASE`` > > > > > :id: 11 > > > > > :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` > > > > > :request payload: vring state description > > > > > - :reply payload: vring state description > > > > > + :reply payload: vring descriptor index/indices > > > > > + > > > > > + Stops the vring and returns the current descriptor index > > > > > or indices: > > > > > + > > > > > + * For a split virtqueue, returns only the 16-bit next descriptor > > > > > + index in the *Available Ring*. The index in the *Used Ring* is > > > > > + controlled by the guest driver and can be read from the vring > > > > > I find "is controlled by the guest driver" confusing. The > > > > > device writes > > > > > the Used Ring index. The driver only reads it. The device is > > > > > the active > > > > > party here. > > > > Er, good point. That breaks the whole reasoning. Then I don’t > > > > understand > > > > why we do get/set the available ring index and not the used ring > > > > index. Do > > > > you know why? > > > It's simple. used ring index in memory is controlled by the device and > > > reflects device state. 
> > > > Exactly, it’s device state, that’s why I thought the front-end needs to > > ensure its read and restored around the reset we currently have in > > vhost_dev_stop()/start(). > > > > > device can just read it back to restore. > > > > I find it strange that the device is supposed to read its own state from > > memory. > > > > > available ring index in memory is controlled by driver and does > > > not reflect device state. > > > > Why can’t the device read the available index from memory? That value > > is put into memory by the driver precisely so the device can read it > > from there. > > Ah, wait, is the idea that the device may have an internal available index > counter that reflects what descriptor it has already fetched? I.e. this > index will lag behind the one in memory, and the difference are new > descriptors that the device still needs to read? If that internal counter is > the index that’s get/set here, then yes, that makes a lot of sense. > > Hanna Exactly. And this gets eventually written out as used index. -- MST ^ permalink raw reply [flat|nested] 53+ messages in thread
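The distinction confirmed here, between the available index the driver publishes in memory and the back-end's internal counter of what it has already fetched, can be sketched as a toy model. This is an assumption-laden illustration: the class and method names are hypothetical, and the counter is modelled as a free-running 16-bit value as suggested earlier in the thread.

```python
class SplitRingBackend:
    """Toy model of a back-end's private view of one split virtqueue."""

    def __init__(self, last_avail_idx=0):
        # Free-running 16-bit counter: next Available Ring entry to fetch.
        # This is the value a SET_VRING_BASE-style call would restore.
        self.last_avail_idx = last_avail_idx & 0xFFFF

    def pending(self, avail_idx_in_memory):
        """Entries the driver has published that were not fetched yet."""
        return (avail_idx_in_memory - self.last_avail_idx) & 0xFFFF

    def fetch_one(self, avail_idx_in_memory):
        """Consume one entry and return the index that was fetched."""
        if self.pending(avail_idx_in_memory) == 0:
            raise IndexError("no new descriptors available")
        fetched = self.last_avail_idx
        self.last_avail_idx = (self.last_avail_idx + 1) & 0xFFFF
        return fetched

    def ring_slot(self, queue_size):
        """Position in the ring; the counter itself is not wrapped by size."""
        return self.last_avail_idx % queue_size

    def get_vring_base(self):
        """Value a GET_VRING_BASE-style call would report in this model."""
        return self.last_avail_idx
```

The modular subtraction in `pending()` is what keeps the lag correct when the 16-bit counters wrap around.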
* Re: [Virtio-fs] [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc 2023-10-06 13:55 ` Hanna Czenczek 2023-10-06 13:58 ` Hanna Czenczek @ 2023-10-07 21:27 ` Michael S. Tsirkin 1 sibling, 0 replies; 53+ messages in thread From: Michael S. Tsirkin @ 2023-10-07 21:27 UTC (permalink / raw) To: Hanna Czenczek Cc: virtio-fs, Eugenio Pérez, Anton Kuchin, qemu-devel, Stefan Hajnoczi On Fri, Oct 06, 2023 at 03:55:56PM +0200, Hanna Czenczek wrote: > On 06.10.23 10:49, Michael S. Tsirkin wrote: > > On Fri, Oct 06, 2023 at 09:53:53AM +0200, Hanna Czenczek wrote: > > > On 05.10.23 19:38, Stefan Hajnoczi wrote: > > > > On Wed, Oct 04, 2023 at 02:58:58PM +0200, Hanna Czenczek wrote: > > > > > GET_VRING_BASE does not mention that it stops the respective ring. Fix > > > > > that. > > > > > > > > > > Furthermore, it is not fully clear what the "base offset" these > > > > > commands' documentation refers to is; an offset could be many things. > > > > > Be more precise and verbose about it, especially given that these > > > > > commands use different payload structures depending on whether the vring > > > > > is split or packed. 
> > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > --- > > > > > docs/interop/vhost-user.rst | 66 ++++++++++++++++++++++++++++++++++--- > > > > > 1 file changed, 62 insertions(+), 4 deletions(-) > > > > > > > > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > > > > > index 2f68e67a1a..50f5acebe5 100644 > > > > > --- a/docs/interop/vhost-user.rst > > > > > +++ b/docs/interop/vhost-user.rst > > > > > @@ -108,6 +108,37 @@ A vring state description > > > > > :num: a 32-bit number > > > > > +A vring descriptor index for split virtqueues > > > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > + > > > > > ++-------------+---------------------+ > > > > > +| vring index | index in avail ring | > > > > > ++-------------+---------------------+ > > > > > + > > > > > +:vring index: 32-bit index of the respective virtqueue > > > > > + > > > > > +:index in avail ring: 32-bit value, of which currently only the lower 16 > > > > > + bits are used: > > > > > + > > > > > + - Bits 0–15: Next descriptor index in the *Available Ring* > > > > I think we need to say more to make this implementable just by reading > > > > the spec: > > > > > > > > Index of the next *Available Ring* descriptor that the back-end will > > > > process. This is a free-running index that is not wrapped by the ring > > > > size. > > > Sure, thanks. > > > > > > > Feel free to rephrase. > > > > > > > > > + - Bits 16–31: Reserved (set to zero) > > > > > + > > > > > +Vring descriptor indices for packed virtqueues > > > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > + > > > > > ++-------------+--------------------+ > > > > > +| vring index | descriptor indices | > > > > > ++-------------+--------------------+ > > > > > + > > > > > +:vring index: 32-bit index of the respective virtqueue > > > > > + > > > > > +:descriptor indices: 32-bit value: > > > > > + > > > > > + - Bits 0–14: Index in the *Available Ring* > > > > Same here. 
> > > > > > > > > + - Bit 15: Driver (Available) Ring Wrap Counter > > > > > + - Bits 16–30: Index in the *Used Ring* > > > > Same here. > > > > > > > > > + - Bit 31: Device (Used) Ring Wrap Counter > > > > > + > > > > > A vring address description > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > @@ -1031,18 +1062,45 @@ Front-end message types > > > > > ``VHOST_USER_SET_VRING_BASE`` > > > > > :id: 10 > > > > > :equivalent ioctl: ``VHOST_SET_VRING_BASE`` > > > > > - :request payload: vring state description > > > > > + :request payload: vring descriptor index/indices > > > > > :reply payload: N/A > > > > > - Sets the base offset in the available vring. > > > > > + Sets the next index to use for descriptors in this vring: > > > > > + > > > > > + * For a split virtqueue, sets only the next descriptor index in the > > > > > + *Available Ring*. The device is supposed to read the next index in > > > > > + the *Used Ring* from the respective vring structure in guest memory. > > > > > + > > > > > + * For a packed virtqueue, both indices are supplied, as they are not > > > > > + explicitly available in memory. > > > > > + > > > > > + Consequently, the payload type is specific to the type of virt queue > > > > > + (*a vring descriptor index for split virtqueues* vs. *vring descriptor > > > > > + indices for packed virtqueues*). > > > > > ``VHOST_USER_GET_VRING_BASE`` > > > > > :id: 11 > > > > > :equivalent ioctl: ``VHOST_USER_GET_VRING_BASE`` > > > > > :request payload: vring state description > > > > > - :reply payload: vring state description > > > > > + :reply payload: vring descriptor index/indices > > > > > + > > > > > + Stops the vring and returns the current descriptor index or indices: > > > > > + > > > > > + * For a split virtqueue, returns only the 16-bit next descriptor > > > > > + index in the *Available Ring*. 
The index in the *Used Ring* is > > > > > + controlled by the guest driver and can be read from the vring > > > > I find "is controlled by the guest driver" confusing. The device writes > > > > the Used Ring index. The driver only reads it. The device is the active > > > > party here. > > > Er, good point. That breaks the whole reasoning. Then I don’t understand > > > why we do get/set the available ring index and not the used ring index. Do > > > you know why? > > It's simple. used ring index in memory is controlled by the device and > > reflects device state. > > Exactly, it’s device state, that’s why I thought the front-end needs to > ensure its read and restored around the reset we currently have in > vhost_dev_stop()/start(). > > > device can just read it back to restore. > > I find it strange that the device is supposed to read its own state from > memory. /me shrugs. It puts it there, why not read it back. Duplicating state is not usually a good idea - leads to bugs. > > available ring index in memory is controlled by driver and does > > not reflect device state. > > Why can’t the device read the available index from memory? That value is > put into memory by the driver precisely so the device can read it from > there. > > Hanna Consider an example of RX ring for net device. buffers might be available but device does not use them until packets arrive. what I think you could say is that actually just the used index should be sufficient. So I think main thing GET_BASE does is stop the ring. As for the value returned, we can if we want to validate that it matches used ring index. -- MST ^ permalink raw reply [flat|nested] 53+ messages in thread
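The RX ring example can be made concrete with a small simulation. It rests on assumptions not stated in the thread: the back-end completes buffers in order and has nothing in flight when the ring is stopped, so its internal available counter equals the used index. Both function names are hypothetical.

```python
def simulate_rx_ring(buffers_posted, packets_arrived):
    """Return (avail_idx_in_memory, last_avail_idx, used_idx) for a toy RX ring.

    The driver posts receive buffers up front; the device only fetches one
    when a packet actually arrives, and marks it used once it is filled.
    """
    consumed = min(buffers_posted, packets_arrived)
    avail_idx_in_memory = buffers_posted & 0xFFFF  # written by the driver
    last_avail_idx = consumed & 0xFFFF             # device-internal counter
    used_idx = consumed & 0xFFFF                   # written by the device
    return avail_idx_in_memory, last_avail_idx, used_idx


def base_matches_used_idx(vring_base, used_idx):
    """Optional sanity check: does the reported base equal the used index?"""
    return (vring_base & 0xFFFF) == (used_idx & 0xFFFF)
```

With eight buffers posted and three packets received, the index in memory is 8 while the device-internal counter and the used index are both 3, which is why reading the available index from memory would not restore the device's position.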
* [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings 2023-10-04 12:58 [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek 2023-10-04 12:58 ` [PATCH v4 1/8] vhost-user.rst: Deprecate [GS]ET_STATUS Hanna Czenczek 2023-10-04 12:58 ` [PATCH v4 2/8] vhost-user.rst: Improve [GS]ET_VRING_BASE doc Hanna Czenczek @ 2023-10-04 12:58 ` Hanna Czenczek 2023-10-05 17:43 ` Stefan Hajnoczi 2023-10-18 12:14 ` Michael S. Tsirkin 2023-10-04 12:59 ` [PATCH v4 4/8] vhost-user.rst: Introduce suspended state Hanna Czenczek ` (5 subsequent siblings) 8 siblings, 2 replies; 53+ messages in thread From: Hanna Czenczek @ 2023-10-04 12:58 UTC (permalink / raw) To: qemu-devel, virtio-fs Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin Currently, the vhost-user documentation says that rings are to be initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is negotiated. However, by the time of feature negotiation, all rings have already been initialized, so it is not entirely clear what this means. At least the vhost-user-backend Rust crate's implementation interpreted it to mean that whenever this feature is negotiated, all rings are to be put into a disabled state, which means that every SET_FEATURES call would disable all rings, effectively halting the device. This is problematic because the VHOST_F_LOG_ALL feature is also set or cleared this way, which happens during migration. Doing so should not halt the device. Other implementations have interpreted this to mean that the device is to be initialized with all rings disabled, and a subsequent SET_FEATURES call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of them. Here, SET_FEATURES will never disable any ring. This interpretation does not suffer from the problem of unintentionally halting the device whenever features are set or cleared, so it seems better and more reasonable.
We can clarify this in the documentation by making it explicit that the enabled/disabled state is tracked even while the vring is stopped. Every vring is initialized in a disabled state, and SET_FEATURES without VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all vrings. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> --- docs/interop/vhost-user.rst | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst index 50f5acebe5..9f4940a036 100644 --- a/docs/interop/vhost-user.rst +++ b/docs/interop/vhost-user.rst @@ -395,31 +395,33 @@ negotiation. Ring states ----------- -Rings can be in one of three states: +Rings have two independent states: started/stopped, and enabled/disabled. -* stopped: the back-end must not process the ring at all. +* While a ring is stopped, the back-end must not process the ring at + all, regardless of whether it is enabled or disabled. The + enabled/disabled state should still be tracked, though, so it can come + into effect once the ring is started. -* started but disabled: the back-end must process the ring without +* started and disabled: The back-end must process the ring without causing any side effects. For example, for a networking device, in the disabled state the back-end must not supply any new RX packets, but must process and discard any TX packets. -* started and enabled. +* started and enabled: The back-end must process the ring normally, i.e. + process all requests and execute them. -Each ring is initialized in a stopped state. The back-end must start -ring upon receiving a kick (that is, detecting that file descriptor is -readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK`` -or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated, -and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``. +Each ring is initialized in a stopped and disabled state. 
The back-end +must start a ring upon receiving a kick (that is, detecting that file +descriptor is readable) on the descriptor specified by +``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message +``VHOST_USER_VRING_KICK`` if negotiated, and stop a ring upon receiving +``VHOST_USER_GET_VRING_BASE``. Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``. -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the -ring starts directly in the enabled state. - -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is -initialized in a disabled state and is enabled by -``VHOST_USER_SET_VRING_ENABLE`` with parameter 1. +In addition, upon receiving a ``VHOST_USER_SET_FEATURES`` message from +the front-end without ``VHOST_USER_F_PROTOCOL_FEATURES`` set, the +back-end must enable all rings immediately. While processing the rings (whether they are enabled or not), the back-end must support changing some configuration aspects on the fly. -- 2.41.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
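The two independent ring states described in this patch can be sketched as a small state model. This is only an illustration, not back-end code from any project: the class and handler names are hypothetical, and the feature bit numbers (30 for ``VHOST_USER_F_PROTOCOL_FEATURES``, 26 for ``VHOST_F_LOG_ALL``) are the values used by existing implementations as far as I know, so treat them as assumptions.

```python
VHOST_F_LOG_ALL = 26                  # assumed bit number
VHOST_USER_F_PROTOCOL_FEATURES = 30   # assumed bit number


class Ring:
    def __init__(self):
        # Each ring is initialized stopped and disabled.
        self.started = False
        self.enabled = False

    def must_process(self):
        """A stopped ring is never processed, enabled or not."""
        return self.started

    def may_cause_side_effects(self):
        """Only a started and enabled ring executes requests normally."""
        return self.started and self.enabled


class Backend:
    def __init__(self, num_rings):
        self.rings = [Ring() for _ in range(num_rings)]

    def on_kick(self, index):
        self.rings[index].started = True

    def on_get_vring_base(self, index):
        # Stops the ring; the enabled/disabled state stays tracked.
        self.rings[index].started = False

    def on_set_vring_enable(self, index, enable):
        self.rings[index].enabled = bool(enable)

    def on_set_features(self, features):
        # Without the protocol-features bit, enable all rings.
        # With it, leave every ring's enabled state untouched.
        if not features & (1 << VHOST_USER_F_PROTOCOL_FEATURES):
            for ring in self.rings:
                ring.enabled = True
```

In this model, a ``SET_FEATURES`` call that toggles ``VHOST_F_LOG_ALL`` while keeping the protocol-features bit set changes no ring state, which is the behaviour the commit message argues for.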
* Re: [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings 2023-10-04 12:58 ` [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek @ 2023-10-05 17:43 ` Stefan Hajnoczi 2023-10-18 12:14 ` Michael S. Tsirkin 1 sibling, 0 replies; 53+ messages in thread From: Stefan Hajnoczi @ 2023-10-05 17:43 UTC (permalink / raw) To: Hanna Czenczek Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin On Wed, Oct 04, 2023 at 02:58:59PM +0200, Hanna Czenczek wrote: > Currently, the vhost-user documentation says that rings are to be > initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is > negotiated. However, by the time of feature negotiation, all rings have > already been initialized, so it is not entirely clear what this means. > > At least the vhost-user-backend Rust crate's implementation interpreted > it to mean that whenever this feature is negotiated, all rings are to > put into a disabled state, which means that every SET_FEATURES call > would disable all rings, effectively halting the device. This is > problematic because the VHOST_F_LOG_ALL feature is also set or cleared > this way, which happens during migration. Doing so should not halt the > device. > > Other implementations have interpreted this to mean that the device is > to be initialized with all rings disabled, and a subsequent SET_FEATURES > call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of > them. Here, SET_FEATURES will never disable any ring. > > This interpretation does not suffer the problem of unintentionally > halting the device whenever features are set or cleared, so it seems > better and more reasonable. > > We can clarify this in the documentation by making it explicit that the > enabled/disabled state is tracked even while the vring is stopped.
> Every vring is initialized in a disabled state, and SET_FEATURES without > VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all > vrings. > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > docs/interop/vhost-user.rst | 32 +++++++++++++++++--------------- > 1 file changed, 17 insertions(+), 15 deletions(-) Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings 2023-10-04 12:58 ` [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek 2023-10-05 17:43 ` Stefan Hajnoczi @ 2023-10-18 12:14 ` Michael S. Tsirkin 2023-10-18 16:17 ` Hanna Czenczek 1 sibling, 1 reply; 53+ messages in thread From: Michael S. Tsirkin @ 2023-10-18 12:14 UTC (permalink / raw) To: Hanna Czenczek Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin On Wed, Oct 04, 2023 at 02:58:59PM +0200, Hanna Czenczek wrote: > Currently, the vhost-user documentation says that rings are to be > initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is > negotiated. However, by the time of feature negotiation, all rings have > already been initialized, so it is not entirely clear what this means. > > At least the vhost-user-backend Rust crate's implementation interpreted > it to mean that whenever this feature is negotiated, all rings are to > put into a disabled state, which means that every SET_FEATURES call > would disable all rings, effectively halting the device. This is > problematic because the VHOST_F_LOG_ALL feature is also set or cleared > this way, which happens during migration. Doing so should not halt the > device. > > Other implementations have interpreted this to mean that the device is > to be initialized with all rings disabled, and a subsequent SET_FEATURES > call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of > them. Here, SET_FEATURES will never disable any ring. > > This interpretation does not suffer the problem of unintentionally > halting the device whenever features are set or cleared, so it seems > better and more reasonable. > > We can clarify this in the documentation by making it explicit that the > enabled/disabled state is tracked even while the vring is stopped. 
> Every vring is initialized in a disabled state, and SET_FEATURES without > VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all > vrings. > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> OK so I am expecting v5. My advice is to move patch 1 to end of patchset so we can defer it if we want to. > --- > docs/interop/vhost-user.rst | 32 +++++++++++++++++--------------- > 1 file changed, 17 insertions(+), 15 deletions(-) > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index 50f5acebe5..9f4940a036 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -395,31 +395,33 @@ negotiation. > Ring states > ----------- > > -Rings can be in one of three states: > +Rings have two independent states: started/stopped, and enabled/disabled. > > -* stopped: the back-end must not process the ring at all. > +* While a ring is stopped, the back-end must not process the ring at > + all, regardless of whether it is enabled or disabled. The > + enabled/disabled state should still be tracked, though, so it can come > + into effect once the ring is started. > > -* started but disabled: the back-end must process the ring without > +* started and disabled: The back-end must process the ring without > causing any side effects. For example, for a networking device, > in the disabled state the back-end must not supply any new RX packets, > but must process and discard any TX packets. > > -* started and enabled. > +* started and enabled: The back-end must process the ring normally, i.e. > + process all requests and execute them. > > -Each ring is initialized in a stopped state. The back-end must start > -ring upon receiving a kick (that is, detecting that file descriptor is > -readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK`` > -or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated, > -and stop ring upon receiving ``VHOST_USER_GET_VRING_BASE``. 
> +Each ring is initialized in a stopped and disabled state. The back-end > +must start a ring upon receiving a kick (that is, detecting that file > +descriptor is readable) on the descriptor specified by > +``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message > +``VHOST_USER_VRING_KICK`` if negotiated, and stop a ring upon receiving > +``VHOST_USER_GET_VRING_BASE``. > > Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``. > > -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the > -ring starts directly in the enabled state. > - > -If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is > -initialized in a disabled state and is enabled by > -``VHOST_USER_SET_VRING_ENABLE`` with parameter 1. > +In addition, upon receiving a ``VHOST_USER_SET_FEATURES`` message from > +the front-end without ``VHOST_USER_F_PROTOCOL_FEATURES`` set, the > +back-end must enable all rings immediately. > > While processing the rings (whether they are enabled or not), the back-end > must support changing some configuration aspects on the fly. > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:00PM +0200, Hanna Czenczek wrote: > In vDPA, GET_VRING_BASE does not stop the queried vring, which is why > SUSPEND was introduced so that the returned index would be stable. In > vhost-user, it does stop the vring, so under the same reasoning, it can > get away without SUSPEND. > > Still, we do want to clarify that if the device is completely stopped, > i.e. all vrings are stopped, the back-end should cease to modify any > state relating to the guest. Do this by calling it "suspended". 
> > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > docs/interop/vhost-user.rst | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index 9f4940a036..d282155562 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -426,6 +426,19 @@ back-end must enable all rings immediately. > While processing the rings (whether they are enabled or not), the back-end > must support changing some configuration aspects on the fly. > > +.. _suspended_device_state: > + > +Suspended device state > +^^^^^^^^^^^^^^^^^^^^^^ > + > +While all vrings are stopped, the device is *suspended*. In addition to > +not processing any vring (because they are stopped), the device must: > + > +* not write to any guest memory regions, > +* not send any notifications to the guest, > +* not send any messages to the front-end, > +* still process and reply to messages from the front-end. > + > Multiple queue support > ---------------------- > > @@ -513,7 +526,8 @@ ancillary data, it may be used to inform the front-end that the log has > been modified. > > Once the source has finished migration, rings will be stopped by the > -source. No further update must be done before rings are restarted. > +source (:ref:`Suspended device state <suspended_device_state>`). No > +further update must be done before rings are restarted. > > In postcopy migration the back-end is started before all the memory has > been received from the source host, and care must be taken to avoid > @@ -1101,6 +1115,10 @@ Front-end message types > (*a vring descriptor index for split virtqueues* vs. *vring descriptor > indices for packed virtqueues*). > > + When and as long as all of a device’s vrings are stopped, it is > + *suspended*, see :ref:`Suspended device state > + <suspended_device_state>`. 
> + > The request payload’s *num* field is currently reserved and must be > set to 0. > > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:01PM +0200, Hanna Czenczek wrote: > For vhost-user devices, qemu can migrate the virtio state, but not the > back-end's internal state. To do so, we need to be able to transfer > this internal state between front-end (qemu) and back-end. > > At this point, this new feature is added for the purpose of virtio-fs > migration. Because virtiofsd's internal state will not be too large, we > believe it is best to transfer it as a single binary blob after the > streaming phase. > > These are the additions to the protocol: > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file > descriptor over which to transfer the state. > - CHECK_DEVICE_STATE: After the state has been transferred through the > file descriptor, the front-end invokes this function to verify > success. There is no in-band way (through the file descriptor) to > indicate failure, so we need to check explicitly. > > Once the transfer FD has been established via SET_DEVICE_STATE_FD > (which includes establishing the direction of transfer and migration > phase), the sending side writes its data into it, and the reading side > reads it until it sees an EOF. Then, the front-end will check for > success via CHECK_DEVICE_STATE, which on the destination side includes > checking for integrity (i.e. errors during deserialization). 
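For readers following the "reads it until it sees an EOF" part: below is a minimal sketch of what the reading end could look like, assuming the channel is a plain POSIX pipe. `read_state_until_eof()` is a hypothetical helper made up for illustration, it is not part of this series. It collects the whole blob in memory, which seems acceptable under the commit message's premise that the state is not too large.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): read from @fd until EOF, i.e.
 * until the writing side has closed its end of the channel.
 *
 * On success, returns the number of bytes read and stores a malloc()ed
 * buffer in *out (caller frees it). Returns -1 on error.
 */
static ssize_t read_state_until_eof(int fd, unsigned char **out)
{
    size_t len = 0, cap = 4096;
    unsigned char *buf = malloc(cap);

    if (!buf) {
        return -1;
    }

    for (;;) {
        ssize_t n;

        /* Grow the buffer whenever it is full */
        if (len == cap) {
            unsigned char *tmp = realloc(buf, cap * 2);
            if (!tmp) {
                free(buf);
                return -1;
            }
            buf = tmp;
            cap *= 2;
        }

        n = read(fd, buf + len, cap - len);
        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted, just retry */
            }
            free(buf);
            return -1;
        }
        if (n == 0) {
            break; /* EOF: the writer has closed its end */
        }
        len += (size_t)n;
    }

    *out = buf;
    return (ssize_t)len;
}
```

Note that EOF only says the writer closed its end; it does not say the writer finished successfully, which is why the protocol still needs the explicit CHECK_DEVICE_STATE step afterwards.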
> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > docs/interop/vhost-user.rst | 172 ++++++++++++++++++++++++++++++++++++ > 1 file changed, 172 insertions(+) > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index d282155562..aa91e2b34e 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -306,6 +306,32 @@ Inflight description > > :queue size: a 16-bit size of virtqueues > > +Device state transfer parameters > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + > ++--------------------+-----------------+ > +| transfer direction | migration phase | > ++--------------------+-----------------+ > + > +:transfer direction: a 32-bit enum, describing the direction in which > + the state is transferred: > + > + - 0: Save: Transfer the state from the back-end to the front-end, > + which happens on the source side of migration > + - 1: Load: Transfer the state from the front-end to the back-end, > + which happens on the destination side of migration > + > +:migration phase: a 32-bit enum, describing the state in which the VM > + guest and devices are: > + > + - 0: Stopped (in the period after the transfer of memory-mapped > + regions before switch-over to the destination): The VM guest is > + stopped, and the vhost-user device is suspended (see > + :ref:`Suspended device state <suspended_device_state>`). > + > + In the future, additional phases might be added e.g. to allow > + iterative migration while the device is running. > + > C structure > ----------- > > @@ -365,6 +391,7 @@ in the ancillary data: > * ``VHOST_USER_SET_VRING_ERR`` > * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``) > * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) > +* ``VHOST_USER_SET_DEVICE_STATE_FD`` > > If *front-end* is unable to send the full message or receives a wrong > reply it will close the connection. 
An optional reconnection mechanism > @@ -539,6 +566,80 @@ it performs WAKE ioctl's on the userfaultfd to wake the stalled > back-end. The front-end indicates support for this via the > ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature. > > +.. _migrating_backend_state: > + > +Migrating back-end state > +^^^^^^^^^^^^^^^^^^^^^^^^ > + > +Migrating device state involves transferring the state from one > +back-end, called the source, to another back-end, called the > +destination. After migration, the destination transparently resumes > +operation without requiring the driver to re-initialize the device at > +the VIRTIO level. If the migration fails, then the source can > +transparently resume operation until another migration attempt is made. > + > +Generally, the front-end is connected to a virtual machine guest (which > +contains the driver), which has its own state to transfer between source > +and destination, and therefore will have an implementation-specific > +mechanism to do so. The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature > +provides functionality to have the front-end include the back-end's > +state in this transfer operation so the back-end does not need to > +implement its own mechanism, and so the virtual machine may have its > +complete state, including vhost-user devices' states, contained within a > +single stream of data. > + > +To do this, the back-end state is transferred from back-end to front-end > +on the source side, and vice versa on the destination side. This > +transfer happens over a channel that is negotiated using the > +``VHOST_USER_SET_DEVICE_STATE_FD`` message. This message has two > +parameters: > + > +* Direction of transfer: On the source, the data is saved, transferring > + it from the back-end to the front-end. On the destination, the data > + is loaded, transferring it from the front-end to the back-end. 
> + > +* Migration phase: Currently, the only supported phase is the period > + after the transfer of memory-mapped regions before switch-over to the > + destination, when both the source and destination devices are > + suspended (:ref:`Suspended device state <suspended_device_state>`). > + In the future, additional phases might be supported to allow iterative > + migration while the device is running. > + > +The nature of the channel is implementation-defined, but it must > +generally behave like a pipe: The writing end will write all the data it > +has into it, signalling the end of data by closing its end. The reading > +end must read all of this data (until encountering the end of file) and > +process it. > + > +* When saving, the writing end is the source back-end, and the reading > + end is the source front-end. After reading the state data from the > + channel, the source front-end must transfer it to the destination > + front-end through an implementation-defined mechanism. > + > +* When loading, the writing end is the destination front-end, and the > + reading end is the destination back-end. After reading the state data > + from the channel, the destination back-end must deserialize its > + internal state from that data and set itself up to allow the driver to > + seamlessly resume operation on the VIRTIO level. > + > +Seamlessly resuming operation means that the migration must be > +transparent to the guest driver, which operates on the VIRTIO level. > +This driver will not perform any re-initialization steps, but continue > +to use the device as if no migration had occurred. 
The vhost-user > +front-end, however, will re-initialize the vhost state on the > +destination, following the usual protocol for establishing a connection > +to a vhost-user back-end: This includes, for example, setting up memory > +mappings and kick and call FDs as necessary, negotiating protocol > +features, or setting the initial vring base indices (to the same value > +as on the source side, so that operation can resume). > + > +Both on the source and on the destination side, after the respective > +front-end has seen all data transferred (when the transfer FD has been > +closed), it sends the ``VHOST_USER_CHECK_DEVICE_STATE`` message to > +verify that data transfer was successful in the back-end, too. The > +back-end responds once it knows whether the transfer and processing was > +successful or not. > + > Memory access > ------------- > > @@ -932,6 +1033,7 @@ Protocol features > #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15 > #define VHOST_USER_PROTOCOL_F_STATUS 16 > #define VHOST_USER_PROTOCOL_F_XEN_MMAP 17 > + #define VHOST_USER_PROTOCOL_F_DEVICE_STATE 18 > > Front-end message types > ----------------------- > @@ -1532,6 +1634,76 @@ Front-end message types > back-end for its device status as defined in the Virtio specification. > Deprecated together with VHOST_USER_SET_STATUS. > > +``VHOST_USER_SET_DEVICE_STATE_FD`` > + :id: 41 > + :equivalent ioctl: N/A > + :request payload: device state transfer parameters > + :reply payload: ``u64`` > + > + Front-end and back-end negotiate a channel over which to transfer the > + back-end’s internal state during migration. Either side (front-end or > + back-end) may create the channel. The nature of this channel is not > + restricted or defined in this document, but whichever side creates it > + must create a file descriptor that is provided to the respectively > + other side, allowing access to the channel. 
This FD must behave as > + follows: > + > + * For the writing end, it must allow writing the whole back-end state > + sequentially. Closing the file descriptor signals the end of > + transfer. > + > + * For the reading end, it must allow reading the whole back-end state > + sequentially. The end of file signals the end of the transfer. > + > + For example, the channel may be a pipe, in which case the two ends of > + the pipe fulfill these requirements respectively. > + > + Initially, the front-end creates a channel along with such an FD. It > + passes the FD to the back-end as ancillary data of a > + ``VHOST_USER_SET_DEVICE_STATE_FD`` message. The back-end may create a > + different transfer channel, passing the respective FD back to the > + front-end as ancillary data of the reply. If so, the front-end must > + then discard its channel and use the one provided by the back-end. > + > + Whether the back-end should decide to use its own channel is decided > + based on efficiency: If the channel is a pipe, both ends will most > + likely need to copy data into and out of it. Any channel that allows > + for more efficient processing on at least one end, e.g. through > + zero-copy, is considered more efficient and thus preferred. If the > + back-end can provide such a channel, it should decide to use it. > + > + The request payload contains parameters for the subsequent data > + transfer, as described in the :ref:`Migrating back-end state > + <migrating_backend_state>` section. > + > + The value returned is both an indication for success, and whether a > + file descriptor for a back-end-provided channel is returned: Bits 0–7 > + are 0 on success, and non-zero on error. Bit 8 is the invalid FD > + flag; this flag is set when there is no file descriptor returned. > + When this flag is not set, the front-end must use the returned file > + descriptor as its end of the transfer channel. The back-end must not > + both indicate an error and return a file descriptor. 
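To make the reply encoding above concrete, here is a small sketch of how a front-end might classify the returned ``u64``. The macro and enum names are illustrative assumptions only; the actual patch reuses VHOST_USER_VRING_NOFD_MASK and open-codes the checks.

```c
#include <stdint.h>

/* Illustrative names, not taken from the patch */
#define STATE_FD_REPLY_ERR_MASK  UINT64_C(0xff)      /* bits 0-7: error code */
#define STATE_FD_REPLY_NOFD_FLAG (UINT64_C(1) << 8)  /* bit 8: no FD returned */

typedef enum {
    STATE_FD_REPLY_ERROR,          /* back-end rejected the transfer */
    STATE_FD_REPLY_KEEP_OWN_FD,    /* success, front-end keeps its channel */
    STATE_FD_REPLY_USE_BACKEND_FD, /* success, use the FD from the reply */
} StateFdReplyKind;

static StateFdReplyKind classify_state_fd_reply(uint64_t reply)
{
    /* An error takes precedence: no FD may accompany an error reply */
    if ((reply & STATE_FD_REPLY_ERR_MASK) != 0) {
        return STATE_FD_REPLY_ERROR;
    }
    /* Invalid-FD flag set: nothing was attached to the reply */
    if (reply & STATE_FD_REPLY_NOFD_FLAG) {
        return STATE_FD_REPLY_KEEP_OWN_FD;
    }
    /* Flag clear: the back-end attached its own channel FD */
    return STATE_FD_REPLY_USE_BACKEND_FD;
}
```

Under this reading, a reply of 0x100 means "success, keep your own channel", and a reply of 0 means "success, switch to the FD in the ancillary data".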
> + > + Using this function requires prior negotiation of the > + ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature. > + > +``VHOST_USER_CHECK_DEVICE_STATE`` > + :id: 42 > + :equivalent ioctl: N/A > + :request payload: N/A > + :reply payload: ``u64`` > + > + After transferring the back-end’s internal state during migration (see > + the :ref:`Migrating back-end state <migrating_backend_state>` > + section), check whether the back-end was able to successfully fully > + process the state. > + > + The value returned indicates success or error; 0 is success, any > + non-zero value is an error. > + > + Using this function requires prior negotiation of the > + ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature. > + > > Back-end message types > ---------------------- > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:02PM +0200, Hanna Czenczek wrote: > Add the interface for transferring the back-end's state during migration > as defined previously in vhost-user.rst. > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > include/hw/virtio/vhost-backend.h | 24 +++++ > include/hw/virtio/vhost.h | 78 ++++++++++++++++ > hw/virtio/vhost-user.c | 148 ++++++++++++++++++++++++++++++ > hw/virtio/vhost.c | 37 ++++++++ > 4 files changed, 287 insertions(+) > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > index 31a251a9f5..b6eee7e9fd 100644 > --- a/include/hw/virtio/vhost-backend.h > +++ b/include/hw/virtio/vhost-backend.h > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > } VhostSetConfigType; > > +typedef enum VhostDeviceStateDirection { > + /* Transfer state from back-end (device) to front-end */ > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > + /* Transfer state from front-end to back-end (device) */ > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > +} VhostDeviceStateDirection; > + > +typedef enum VhostDeviceStatePhase { > + /* The device (and all its 
vrings) is stopped */ > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > +} VhostDeviceStatePhase; > + > struct vhost_inflight; > struct vhost_dev; > struct vhost_log; > @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev, > > typedef void (*vhost_reset_status_op)(struct vhost_dev *dev); > > +typedef bool (*vhost_supports_device_state_op)(struct vhost_dev *dev); > +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp); > +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp); > + > typedef struct VhostOps { > VhostBackendType backend_type; > vhost_backend_init vhost_backend_init; > @@ -181,6 +202,9 @@ typedef struct VhostOps { > vhost_force_iommu_op vhost_force_iommu; > vhost_set_config_call_op vhost_set_config_call; > vhost_reset_status_op vhost_reset_status; > + vhost_supports_device_state_op vhost_supports_device_state; > + vhost_set_device_state_fd_op vhost_set_device_state_fd; > + vhost_check_device_state_op vhost_check_device_state; > } VhostOps; > > int vhost_backend_update_device_iotlb(struct vhost_dev *dev, > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h > index 14621f9e79..a0d03c9fdf 100644 > --- a/include/hw/virtio/vhost.h > +++ b/include/hw/virtio/vhost.h > @@ -348,4 +348,82 @@ static inline int vhost_reset_device(struct vhost_dev *hdev) > } > #endif /* CONFIG_VHOST */ > > +/** > + * vhost_supports_device_state(): Checks whether the back-end supports > + * transferring internal device state for the purpose of migration. > + * Support for this feature is required for vhost_set_device_state_fd() > + * and vhost_check_device_state(). > + * > + * @dev: The vhost device > + * > + * Returns true if the device supports these commands, and false if it > + * does not. 
> + */ > +bool vhost_supports_device_state(struct vhost_dev *dev); > + > +/** > + * vhost_set_device_state_fd(): Begin transfer of internal state from/to > + * the back-end for the purpose of migration. Data is to be transferred > + * over a pipe according to @direction and @phase. The sending end must > + * only write to the pipe, and the receiving end must only read from it. > + * Once the sending end is done, it closes its FD. The receiving end > + * must take this as the end-of-transfer signal and close its FD, too. > + * > + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the > + * read FD for LOAD. This function transfers ownership of @fd to the > + * back-end, i.e. closes it in the front-end. > + * > + * The back-end may optionally reply with an FD of its own, if this > + * improves efficiency on its end. In this case, the returned FD is > + * stored in *reply_fd. The back-end will discard the FD sent to it, > + * and the front-end must use *reply_fd for transferring state to/from > + * the back-end. > + * > + * @dev: The vhost device > + * @direction: The direction in which the state is to be transferred. > + * For outgoing migrations, this is SAVE, and data is read > + * from the back-end and stored by the front-end in the > + * migration stream. > + * For incoming migrations, this is LOAD, and data is read > + * by the front-end from the migration stream and sent to > + * the back-end to restore the saved state. > + * @phase: Which migration phase we are in. Currently, there is only > + * STOPPED (device and all vrings are stopped), in the future, > + * more phases such as PRE_COPY or POST_COPY may be added. > + * @fd: Back-end's end of the pipe through which to transfer state; note > + * that ownership is transferred to the back-end, so this function > + * closes @fd in the front-end. > + * @reply_fd: If the back-end wishes to use a different pipe for state > + * transfer, this will contain an FD for the front-end to > + * use. 
Otherwise, -1 is stored here. > + * @errp: Potential error description > + * > + * Returns 0 on success, and -errno on failure. > + */ > +int vhost_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp); > + > +/** > + * vhost_check_device_state(): After transferring state from/to the > + * back-end via vhost_set_device_state_fd(), i.e. once the sending end > + * has closed the pipe, ask the back-end to report any potential > + * errors that have occurred on its side. This allows detecting errors > + * like: > + * - During outgoing migration, when the source side had already started > + * to produce its state, something went wrong and it failed to finish > + * - During incoming migration, when the received state is somehow > + * invalid and cannot be processed by the back-end > + * > + * @dev: The vhost device > + * @errp: Potential error description > + * > + * Returns 0 when the back-end reports successful state transfer and > + * processing, and -errno when an error occurred somewhere. > + */ > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp); > + > #endif > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c > index 7bed9ad7d5..7096b148a9 100644 > --- a/hw/virtio/vhost-user.c > +++ b/hw/virtio/vhost-user.c > @@ -74,6 +74,8 @@ enum VhostUserProtocolFeature { > /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */ > VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15, > VHOST_USER_PROTOCOL_F_STATUS = 16, > + /* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP.
*/ > + VHOST_USER_PROTOCOL_F_DEVICE_STATE = 18, > VHOST_USER_PROTOCOL_F_MAX > }; > > @@ -121,6 +123,8 @@ typedef enum VhostUserRequest { > VHOST_USER_REM_MEM_REG = 38, > VHOST_USER_SET_STATUS = 39, > VHOST_USER_GET_STATUS = 40, > + VHOST_USER_SET_DEVICE_STATE_FD = 41, > + VHOST_USER_CHECK_DEVICE_STATE = 42, > VHOST_USER_MAX > } VhostUserRequest; > > @@ -212,6 +216,12 @@ typedef struct { > uint32_t size; /* the following payload size */ > } QEMU_PACKED VhostUserHeader; > > +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */ > +typedef struct VhostUserTransferDeviceState { > + uint32_t direction; > + uint32_t phase; > +} VhostUserTransferDeviceState; > + > typedef union { > #define VHOST_USER_VRING_IDX_MASK (0xff) > #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) > @@ -226,6 +236,7 @@ typedef union { > VhostUserCryptoSession session; > VhostUserVringArea area; > VhostUserInflight inflight; > + VhostUserTransferDeviceState transfer_state; > } VhostUserPayload; > > typedef struct VhostUserMsg { > @@ -2746,6 +2757,140 @@ static void vhost_user_reset_status(struct vhost_dev *dev) > } > } > > +static bool vhost_user_supports_device_state(struct vhost_dev *dev) > +{ > + return virtio_has_feature(dev->protocol_features, > + VHOST_USER_PROTOCOL_F_DEVICE_STATE); > +} > + > +static int vhost_user_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp) > +{ > + int ret; > + struct vhost_user *vu = dev->opaque; > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_SET_DEVICE_STATE_FD, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.transfer_state), > + }, > + .payload.transfer_state = { > + .direction = direction, > + .phase = phase, > + }, > + }; > + > + *reply_fd = -1; > + > + if (!vhost_user_supports_device_state(dev)) { > + close(fd); > + error_setg(errp, "Back-end does not support migration state transfer"); > + return -ENOTSUP; > + } > 
+ > + ret = vhost_user_write(dev, &msg, &fd, 1); > + close(fd); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to send SET_DEVICE_STATE_FD message"); > + return ret; > + } > + > + ret = vhost_user_read(dev, &msg); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to receive SET_DEVICE_STATE_FD reply"); > + return ret; > + } > + > + if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) { > + error_setg(errp, > + "Received unexpected message type, expected %d, received %d", > + VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request); > + return -EPROTO; > + } > + > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > + error_setg(errp, > + "Received bad message size, expected %zu, received %" PRIu32, > + sizeof(msg.payload.u64), msg.hdr.size); > + return -EPROTO; > + } > + > + if ((msg.payload.u64 & 0xff) != 0) { > + error_setg(errp, "Back-end did not accept migration state transfer"); > + return -EIO; > + } > + > + if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) { > + *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr); > + if (*reply_fd < 0) { > + error_setg(errp, > + "Failed to get back-end-provided transfer pipe FD"); > + *reply_fd = -1; > + return -EIO; > + } > + } > + > + return 0; > +} > + > +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp) > +{ > + int ret; > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_CHECK_DEVICE_STATE, > + .flags = VHOST_USER_VERSION, > + .size = 0, > + }, > + }; > + > + if (!vhost_user_supports_device_state(dev)) { > + error_setg(errp, "Back-end does not support migration state transfer"); > + return -ENOTSUP; > + } > + > + ret = vhost_user_write(dev, &msg, NULL, 0); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to send CHECK_DEVICE_STATE message"); > + return ret; > + } > + > + ret = vhost_user_read(dev, &msg); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to receive CHECK_DEVICE_STATE reply"); > + return ret; > + } > + > + if 
(msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) { > + error_setg(errp, > + "Received unexpected message type, expected %d, received %d", > + VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request); > + return -EPROTO; > + } > + > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > + error_setg(errp, > + "Received bad message size, expected %zu, received %" PRIu32, > + sizeof(msg.payload.u64), msg.hdr.size); > + return -EPROTO; > + } > + > + if (msg.payload.u64 != 0) { > + error_setg(errp, "Back-end failed to process its internal state"); > + return -EIO; > + } > + > + return 0; > +} > + > const VhostOps user_ops = { > .backend_type = VHOST_BACKEND_TYPE_USER, > .vhost_backend_init = vhost_user_backend_init, > @@ -2782,4 +2927,7 @@ const VhostOps user_ops = { > .vhost_set_inflight_fd = vhost_user_set_inflight_fd, > .vhost_dev_start = vhost_user_dev_start, > .vhost_reset_status = vhost_user_reset_status, > + .vhost_supports_device_state = vhost_user_supports_device_state, > + .vhost_set_device_state_fd = vhost_user_set_device_state_fd, > + .vhost_check_device_state = vhost_user_check_device_state, > }; > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 6003e50e83..85e199f0aa 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -2096,3 +2096,40 @@ int vhost_reset_device(struct vhost_dev *hdev) > > return -ENOSYS; > } > + > +bool vhost_supports_device_state(struct vhost_dev *dev) > +{ > + if (dev->vhost_ops->vhost_supports_device_state) { > + return dev->vhost_ops->vhost_supports_device_state(dev); > + } > + > + return false; > +} > + > +int vhost_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp) > +{ > + if (dev->vhost_ops->vhost_set_device_state_fd) { > + return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase, > + fd, reply_fd, errp); > + } > + > + error_setg(errp, > + "vhost transport does not support migration state 
transfer"); > + return -ENOSYS; > +} > + > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp) > +{ > + if (dev->vhost_ops->vhost_check_device_state) { > + return dev->vhost_ops->vhost_check_device_state(dev, errp); > + } > + > + error_setg(errp, > + "vhost transport does not support migration state transfer"); > + return -ENOSYS; > +} > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:04PM +0200, Hanna Czenczek wrote: > A virtio-fs device's VM state consists of: > - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE) > - the back-end's (virtiofsd's) internal state > > We get/set the latter via the new vhost operations to transfer migratory > state. It is its own dedicated subsection, so that for external > migration, it can be disabled. > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > hw/virtio/vhost-user-fs.c | 101 +++++++++++++++++++++++++++++++++++++- > 1 file changed, 100 insertions(+), 1 deletion(-) > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c > index 49d699ffc2..eb91723855 100644 > --- a/hw/virtio/vhost-user-fs.c > +++ b/hw/virtio/vhost-user-fs.c > @@ -298,9 +298,108 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice *vdev) > return &fs->vhost_dev; > } > > +/** > + * Fetch the internal state from virtiofsd and save it to `f`. 
> + */ > +static int vuf_save_state(QEMUFile *f, void *pv, size_t size, > + const VMStateField *field, JSONWriter *vmdesc) > +{ > + VirtIODevice *vdev = pv; > + VHostUserFS *fs = VHOST_USER_FS(vdev); > + Error *local_error = NULL; > + int ret; > + > + ret = vhost_save_backend_state(&fs->vhost_dev, f, &local_error); > + if (ret < 0) { > + error_reportf_err(local_error, > + "Error saving back-end state of %s device %s " > + "(tag: \"%s\"): ", > + vdev->name, vdev->parent_obj.canonical_path, > + fs->conf.tag ?: "<none>"); > + return ret; > + } > + > + return 0; > +} > + > +/** > + * Load virtiofsd's internal state from `f` and send it over to virtiofsd. > + */ > +static int vuf_load_state(QEMUFile *f, void *pv, size_t size, > + const VMStateField *field) > +{ > + VirtIODevice *vdev = pv; > + VHostUserFS *fs = VHOST_USER_FS(vdev); > + Error *local_error = NULL; > + int ret; > + > + ret = vhost_load_backend_state(&fs->vhost_dev, f, &local_error); > + if (ret < 0) { > + error_reportf_err(local_error, > + "Error loading back-end state of %s device %s " > + "(tag: \"%s\"): ", > + vdev->name, vdev->parent_obj.canonical_path, > + fs->conf.tag ?: "<none>"); > + return ret; > + } > + > + return 0; > +} > + > +static bool vuf_is_internal_migration(void *opaque) > +{ > + /* TODO: Return false when an external migration is requested */ > + return true; > +} > + > +static int vuf_check_migration_support(void *opaque) > +{ > + VirtIODevice *vdev = opaque; > + VHostUserFS *fs = VHOST_USER_FS(vdev); > + > + if (!vhost_supports_device_state(&fs->vhost_dev)) { > + error_report("Back-end of %s device %s (tag: \"%s\") does not support " > + "migration through qemu", > + vdev->name, vdev->parent_obj.canonical_path, > + fs->conf.tag ?: "<none>"); > + return -ENOTSUP; > + } > + > + return 0; > +} > + > +static const VMStateDescription vuf_backend_vmstate; > + > static const VMStateDescription vuf_vmstate = { > .name = "vhost-user-fs", > - .unmigratable = 1, > + .version_id = 0, > + .fields 
= (VMStateField[]) { > + VMSTATE_VIRTIO_DEVICE, > + VMSTATE_END_OF_LIST() > + }, > + .subsections = (const VMStateDescription * []) { > + &vuf_backend_vmstate, > + NULL, > + } > +}; > + > +static const VMStateDescription vuf_backend_vmstate = { > + .name = "vhost-user-fs-backend", > + .version_id = 0, > + .needed = vuf_is_internal_migration, > + .pre_load = vuf_check_migration_support, > + .pre_save = vuf_check_migration_support, > + .fields = (VMStateField[]) { > + { > + .name = "back-end", > + .info = &(const VMStateInfo) { > + .name = "virtio-fs back-end state", > + .get = vuf_load_state, > + .put = vuf_save_state, > + }, > + }, > + VMSTATE_END_OF_LIST() > + }, > }; > > static Property vuf_properties[] = { > -- > 2.41.0 On Wed, Oct 04, 2023 at 02:59:03PM +0200, Hanna Czenczek wrote: > vhost_save_backend_state() and vhost_load_backend_state() can be used by > vhost front-ends to easily save and load the back-end's state to/from > the migration stream. > > Because we do not know the full state size ahead of time, > vhost_save_backend_state() simply reads the data in 1 MB chunks, and > writes each chunk consecutively into the migration stream, prefixed by > its length. EOF is indicated by a 0-length chunk. > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > include/hw/virtio/vhost.h | 35 +++++++ > hw/virtio/vhost.c | 204 ++++++++++++++++++++++++++++++++++++++ > 2 files changed, 239 insertions(+) > > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h > index a0d03c9fdf..100fcc874d 100644 > --- a/include/hw/virtio/vhost.h > +++ b/include/hw/virtio/vhost.h > @@ -426,4 +426,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev, > */ > int vhost_check_device_state(struct vhost_dev *dev, Error **errp); > > +/** > + * vhost_save_backend_state(): High-level function to receive a vhost > + * back-end's state, and save it in @f. 
Uses > + * `vhost_set_device_state_fd()` to get the data from the back-end, and > + * stores it in consecutive chunks that are each prefixed by their > + * respective length (be32). The end is marked by a 0-length chunk. > + * > + * Must only be called while the device and all its vrings are stopped > + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`). > + * > + * @dev: The vhost device from which to save the state > + * @f: Migration stream in which to save the state > + * @errp: Potential error message > + * > + * Returns 0 on success, and -errno otherwise. > + */ > +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp); > + > +/** > + * vhost_load_backend_state(): High-level function to load a vhost > + * back-end's state from @f, and send it over to the back-end. Reads > + * the data from @f in the format used by `vhost_save_backend_state()`, and uses > + * `vhost_set_device_state_fd()` to transfer it to the back-end. > + * > + * Must only be called while the device and all its vrings are stopped > + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`). > + * > + * @dev: The vhost device to which to send the state > + * @f: Migration stream from which to load the state > + * @errp: Potential error message > + * > + * Returns 0 on success, and -errno otherwise.
> + */ > +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp); > + > #endif > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 85e199f0aa..1465adf13a 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -2133,3 +2133,207 @@ int vhost_check_device_state(struct vhost_dev *dev, Error **errp) > "vhost transport does not support migration state transfer"); > return -ENOSYS; > } > + > +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp) > +{ > + /* Maximum chunk size in which to transfer the state */ > + const size_t chunk_size = 1 * 1024 * 1024; > + g_autofree void *transfer_buf = NULL; > + g_autoptr(GError) g_err = NULL; > + int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1; > + int ret; > + > + /* [0] for reading (our end), [1] for writing (back-end's end) */ > + if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) { > + error_setg(errp, "Failed to set up state transfer pipe: %s", > + g_err->message); > + ret = -EINVAL; > + goto fail; > + } > + > + read_fd = pipe_fds[0]; > + write_fd = pipe_fds[1]; > + > + /* > + * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped. > + * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for > + * vhost-user, so just check that it is stopped at all. 
> + */ > + assert(!dev->started); > + > + /* Transfer ownership of write_fd to the back-end */ > + ret = vhost_set_device_state_fd(dev, > + VHOST_TRANSFER_STATE_DIRECTION_SAVE, > + VHOST_TRANSFER_STATE_PHASE_STOPPED, > + write_fd, > + &reply_fd, > + errp); > + if (ret < 0) { > + error_prepend(errp, "Failed to initiate state transfer: "); > + goto fail; > + } > + > + /* If the back-end wishes to use a different pipe, switch over */ > + if (reply_fd >= 0) { > + close(read_fd); > + read_fd = reply_fd; > + } > + > + transfer_buf = g_malloc(chunk_size); > + > + while (true) { > + ssize_t read_ret; > + > + read_ret = RETRY_ON_EINTR(read(read_fd, transfer_buf, chunk_size)); > + if (read_ret < 0) { > + ret = -errno; > + error_setg_errno(errp, -ret, "Failed to receive state"); > + goto fail; > + } > + > + assert(read_ret <= chunk_size); > + qemu_put_be32(f, read_ret); > + > + if (read_ret == 0) { > + /* EOF */ > + break; > + } > + > + qemu_put_buffer(f, transfer_buf, read_ret); > + } > + > + /* > + * Back-end will not really care, but be clean and close our end of the pipe > + * before inquiring the back-end about whether transfer was successful > + */ > + close(read_fd); > + read_fd = -1; > + > + /* Also, verify that the device is still stopped */ > + assert(!dev->started); > + > + ret = vhost_check_device_state(dev, errp); > + if (ret < 0) { > + goto fail; > + } > + > + ret = 0; > +fail: > + if (read_fd >= 0) { > + close(read_fd); > + } > + > + return ret; > +} > + > +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp) > +{ > + size_t transfer_buf_size = 0; > + g_autofree void *transfer_buf = NULL; > + g_autoptr(GError) g_err = NULL; > + int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1; > + int ret; > + > + /* [0] for reading (back-end's end), [1] for writing (our end) */ > + if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) { > + error_setg(errp, "Failed to set up state transfer pipe: %s", > + g_err->message); > + ret = -EINVAL; > 
+ goto fail; > + } > + > + read_fd = pipe_fds[0]; > + write_fd = pipe_fds[1]; > + > + /* > + * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped. > + * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for > + * vhost-user, so just check that it is stopped at all. > + */ > + assert(!dev->started); > + > + /* Transfer ownership of read_fd to the back-end */ > + ret = vhost_set_device_state_fd(dev, > + VHOST_TRANSFER_STATE_DIRECTION_LOAD, > + VHOST_TRANSFER_STATE_PHASE_STOPPED, > + read_fd, > + &reply_fd, > + errp); > + if (ret < 0) { > + error_prepend(errp, "Failed to initiate state transfer: "); > + goto fail; > + } > + > + /* If the back-end wishes to use a different pipe, switch over */ > + if (reply_fd >= 0) { > + close(write_fd); > + write_fd = reply_fd; > + } > + > + while (true) { > + size_t this_chunk_size = qemu_get_be32(f); > + ssize_t write_ret; > + const uint8_t *transfer_pointer; > + > + if (this_chunk_size == 0) { > + /* End of state */ > + break; > + } > + > + if (transfer_buf_size < this_chunk_size) { > + transfer_buf = g_realloc(transfer_buf, this_chunk_size); > + transfer_buf_size = this_chunk_size; > + } > + > + if (qemu_get_buffer(f, transfer_buf, this_chunk_size) < > + this_chunk_size) > + { > + error_setg(errp, "Failed to read state"); > + ret = -EINVAL; > + goto fail; > + } > + > + transfer_pointer = transfer_buf; > + while (this_chunk_size > 0) { > + write_ret = RETRY_ON_EINTR( > + write(write_fd, transfer_pointer, this_chunk_size) > + ); > + if (write_ret < 0) { > + ret = -errno; > + error_setg_errno(errp, -ret, "Failed to send state"); > + goto fail; > + } else if (write_ret == 0) { > + error_setg(errp, "Failed to send state: Connection is closed"); > + ret = -ECONNRESET; > + goto fail; > + } > + > + assert(write_ret <= this_chunk_size); > + this_chunk_size -= write_ret; > + transfer_pointer += write_ret; > + } > + } > + > + /* > + * Close our end, thus ending transfer, before inquiring the back-end about 
> +     * whether transfer was successful
> +     */
> +    close(write_fd);
> +    write_fd = -1;
> +
> +    /* Also, verify that the device is still stopped */
> +    assert(!dev->started);
> +
> +    ret = vhost_check_device_state(dev, errp);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    ret = 0;
> +fail:
> +    if (write_fd >= 0) {
> +        close(write_fd);
> +    }
> +
> +    return ret;
> +}
> --
> 2.41.0
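The stream format described in the quoted patch (chunks prefixed by a be32 length, terminated by a 0-length chunk) can be illustrated in isolation. The following is only a minimal sketch that works on in-memory buffers instead of a QEMUFile and a pipe; `put_be32()`, `get_be32()`, `frame_state()` and `unframe_state()` are hypothetical names, not QEMU functions, and the fixed-size chunking is a simplification (the patch emits one chunk per `read()` result, whatever its size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit value in big-endian byte order. */
static void put_be32(uint8_t *out, uint32_t v)
{
    assert(out != NULL);
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Hypothetical helper: load a big-endian 32-bit value. */
static uint32_t get_be32(const uint8_t *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/*
 * Frame `len` bytes of state as length-prefixed chunks of at most
 * `chunk_size` bytes, followed by a 0-length terminator chunk.
 * Returns the number of bytes written to `out`, or 0 if `out_cap` is too
 * small or `chunk_size` is unusable.
 */
static size_t frame_state(const uint8_t *state, size_t len, size_t chunk_size,
                          uint8_t *out, size_t out_cap)
{
    size_t pos = 0;

    if (chunk_size == 0 || chunk_size > UINT32_MAX) {
        return 0;
    }

    while (len > 0) {
        size_t n = len < chunk_size ? len : chunk_size;

        if (out_cap - pos < 4 || out_cap - pos - 4 < n) {
            return 0;
        }
        put_be32(out + pos, (uint32_t)n);
        memcpy(out + pos + 4, state, n);
        pos += 4 + n;
        state += n;
        len -= n;
    }

    if (out_cap - pos < 4) {
        return 0;
    }
    put_be32(out + pos, 0); /* end-of-state marker */
    return pos + 4;
}

/*
 * Parse a framed stream back into `state`.
 * Returns the number of state bytes recovered, or (size_t)-1 if the input is
 * truncated or `state_cap` is too small.
 */
static size_t unframe_state(const uint8_t *in, size_t in_len,
                            uint8_t *state, size_t state_cap)
{
    size_t pos = 0, total = 0;

    for (;;) {
        uint32_t n;

        if (in_len - pos < 4) {
            return (size_t)-1; /* missing length prefix or terminator */
        }
        n = get_be32(in + pos);
        pos += 4;
        if (n == 0) {
            return total; /* 0-length chunk: end of state */
        }
        if (in_len - pos < n || state_cap - total < n) {
            return (size_t)-1;
        }
        memcpy(state + total, in + pos, n);
        pos += n;
        total += n;
    }
}
```

The explicit terminator is what lets the loading side stop without knowing the total state size in advance; a stream that ends without it is treated as an error in this sketch.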
* Re: [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings 2023-10-18 12:14 ` Michael S. Tsirkin @ 2023-10-18 16:17 ` Hanna Czenczek 0 siblings, 0 replies; 53+ messages in thread From: Hanna Czenczek @ 2023-10-18 16:17 UTC (permalink / raw) To: Michael S. Tsirkin Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin On 18.10.23 14:14, Michael S. Tsirkin wrote: > On Wed, Oct 04, 2023 at 02:58:59PM +0200, Hanna Czenczek wrote: >> Currently, the vhost-user documentation says that rings are to be >> initialized in a disabled state when VHOST_USER_F_PROTOCOL_FEATURES is >> negotiated. However, by the time of feature negotiation, all rings have >> already been initialized, so it is not entirely clear what this means. >> >> At least the vhost-user-backend Rust crate's implementation interpreted >> it to mean that whenever this feature is negotiated, all rings are to >> put into a disabled state, which means that every SET_FEATURES call >> would disable all rings, effectively halting the device. This is >> problematic because the VHOST_F_LOG_ALL feature is also set or cleared >> this way, which happens during migration. Doing so should not halt the >> device. >> >> Other implementations have interpreted this to mean that the device is >> to be initialized with all rings disabled, and a subsequent SET_FEATURES >> call that does not set VHOST_USER_F_PROTOCOL_FEATURES will enable all of >> them. Here, SET_FEATURES will never disable any ring. >> >> This interpretation does not suffer the problem of unintentionally >> halting the device whenever features are set or cleared, so it seems >> better and more reasonable. >> >> We can clarify this in the documentation by making it explicit that the >> enabled/disabled state is tracked even while the vring is stopped. >> Every vring is initialized in a disabled state, and SET_FEATURES without >> VHOST_USER_F_PROTOCOL_FEATURES simply becomes one way to enable all >> vrings. 
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>
> OK so I am expecting v5. My advice is to move patch 1 to end of patchset
> so we can defer it if we want to.

Already sent – I’ve just dropped patch 1, since it doesn’t add anything
to the objective of the patch series itself:

https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg04727.html

Hanna
* [PATCH v4 4/8] vhost-user.rst: Introduce suspended state 2023-10-04 12:58 [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek ` (2 preceding siblings ...) 2023-10-04 12:58 ` [PATCH v4 3/8] vhost-user.rst: Clarify enabling/disabling vrings Hanna Czenczek @ 2023-10-04 12:59 ` Hanna Czenczek 2023-10-05 17:44 ` Stefan Hajnoczi 2023-10-04 12:59 ` [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state Hanna Czenczek ` (4 subsequent siblings) 8 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-04 12:59 UTC (permalink / raw) To: qemu-devel, virtio-fs Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin In vDPA, GET_VRING_BASE does not stop the queried vring, which is why SUSPEND was introduced so that the returned index would be stable. In vhost-user, it does stop the vring, so under the same reasoning, it can get away without SUSPEND. Still, we do want to clarify that if the device is completely stopped, i.e. all vrings are stopped, the back-end should cease to modify any state relating to the guest. Do this by calling it "suspended". Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> --- docs/interop/vhost-user.rst | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst index 9f4940a036..d282155562 100644 --- a/docs/interop/vhost-user.rst +++ b/docs/interop/vhost-user.rst @@ -426,6 +426,19 @@ back-end must enable all rings immediately. While processing the rings (whether they are enabled or not), the back-end must support changing some configuration aspects on the fly. +.. _suspended_device_state: + +Suspended device state +^^^^^^^^^^^^^^^^^^^^^^ + +While all vrings are stopped, the device is *suspended*. 
In addition to +not processing any vring (because they are stopped), the device must: + +* not write to any guest memory regions, +* not send any notifications to the guest, +* not send any messages to the front-end, +* still process and reply to messages from the front-end. + Multiple queue support ---------------------- @@ -513,7 +526,8 @@ ancillary data, it may be used to inform the front-end that the log has been modified. Once the source has finished migration, rings will be stopped by the -source. No further update must be done before rings are restarted. +source (:ref:`Suspended device state <suspended_device_state>`). No +further update must be done before rings are restarted. In postcopy migration the back-end is started before all the memory has been received from the source host, and care must be taken to avoid @@ -1101,6 +1115,10 @@ Front-end message types (*a vring descriptor index for split virtqueues* vs. *vring descriptor indices for packed virtqueues*). + When and as long as all of a device’s vrings are stopped, it is + *suspended*, see :ref:`Suspended device state + <suspended_device_state>`. + The request payload’s *num* field is currently reserved and must be set to 0. -- 2.41.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
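As an illustration of the definition in this patch, a back-end could derive the suspended state purely from its per-vring bookkeeping. The sketch below assumes a hypothetical `struct vring_state` that tracks started/stopped and enabled/disabled independently (as patch 3 of this series proposes); none of these names come from QEMU or an existing back-end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vring bookkeeping; both flags are tracked independently. */
struct vring_state {
    bool started;
    bool enabled;
};

/*
 * The device is suspended while all of its vrings are stopped. The
 * enabled/disabled flag plays no role here. A device without any vrings is
 * trivially considered suspended by this sketch.
 */
static bool device_is_suspended(const struct vring_state *vrings, size_t n)
{
    assert(vrings != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (vrings[i].started) {
            return false;
        }
    }
    return true;
}

/*
 * While suspended, the back-end must not write guest memory, notify the
 * guest, or send messages to the front-end. It must still process and
 * reply to front-end messages, which this check does not restrict.
 */
static bool may_touch_guest(const struct vring_state *vrings, size_t n)
{
    return !device_is_suspended(vrings, n);
}
```

A back-end might consult `may_touch_guest()` before any guest memory write or notification.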
* Re: [PATCH v4 4/8] vhost-user.rst: Introduce suspended state
  2023-10-04 12:59 ` [PATCH v4 4/8] vhost-user.rst: Introduce suspended state Hanna Czenczek
@ 2023-10-05 17:44   ` Stefan Hajnoczi
  0 siblings, 0 replies; 53+ messages in thread
From: Stefan Hajnoczi @ 2023-10-05 17:44 UTC
To: Hanna Czenczek
Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin

On Wed, Oct 04, 2023 at 02:59:00PM +0200, Hanna Czenczek wrote:
> In vDPA, GET_VRING_BASE does not stop the queried vring, which is why
> SUSPEND was introduced so that the returned index would be stable. In
> vhost-user, it does stop the vring, so under the same reasoning, it can
> get away without SUSPEND.
>
> Still, we do want to clarify that if the device is completely stopped,
> i.e. all vrings are stopped, the back-end should cease to modify any
> state relating to the guest. Do this by calling it "suspended".
>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state 2023-10-04 12:58 [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek ` (3 preceding siblings ...) 2023-10-04 12:59 ` [PATCH v4 4/8] vhost-user.rst: Introduce suspended state Hanna Czenczek @ 2023-10-04 12:59 ` Hanna Czenczek 2023-10-05 17:46 ` Stefan Hajnoczi 2023-10-04 12:59 ` [PATCH v4 6/8] vhost-user: Interface for migration state transfer Hanna Czenczek ` (3 subsequent siblings) 8 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-04 12:59 UTC (permalink / raw) To: qemu-devel, virtio-fs Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin For vhost-user devices, qemu can migrate the virtio state, but not the back-end's internal state. To do so, we need to be able to transfer this internal state between front-end (qemu) and back-end. At this point, this new feature is added for the purpose of virtio-fs migration. Because virtiofsd's internal state will not be too large, we believe it is best to transfer it as a single binary blob after the streaming phase. These are the additions to the protocol: - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file descriptor over which to transfer the state. - CHECK_DEVICE_STATE: After the state has been transferred through the file descriptor, the front-end invokes this function to verify success. There is no in-band way (through the file descriptor) to indicate failure, so we need to check explicitly. Once the transfer FD has been established via SET_DEVICE_STATE_FD (which includes establishing the direction of transfer and migration phase), the sending side writes its data into it, and the reading side reads it until it sees an EOF. Then, the front-end will check for success via CHECK_DEVICE_STATE, which on the destination side includes checking for integrity (i.e. 
errors during deserialization). Signed-off-by: Hanna Czenczek <hreitz@redhat.com> --- docs/interop/vhost-user.rst | 172 ++++++++++++++++++++++++++++++++++++ 1 file changed, 172 insertions(+) diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst index d282155562..aa91e2b34e 100644 --- a/docs/interop/vhost-user.rst +++ b/docs/interop/vhost-user.rst @@ -306,6 +306,32 @@ Inflight description :queue size: a 16-bit size of virtqueues +Device state transfer parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + ++--------------------+-----------------+ +| transfer direction | migration phase | ++--------------------+-----------------+ + +:transfer direction: a 32-bit enum, describing the direction in which + the state is transferred: + + - 0: Save: Transfer the state from the back-end to the front-end, + which happens on the source side of migration + - 1: Load: Transfer the state from the front-end to the back-end, + which happens on the destination side of migration + +:migration phase: a 32-bit enum, describing the state in which the VM + guest and devices are: + + - 0: Stopped (in the period after the transfer of memory-mapped + regions before switch-over to the destination): The VM guest is + stopped, and the vhost-user device is suspended (see + :ref:`Suspended device state <suspended_device_state>`). + + In the future, additional phases might be added e.g. to allow + iterative migration while the device is running. + C structure ----------- @@ -365,6 +391,7 @@ in the ancillary data: * ``VHOST_USER_SET_VRING_ERR`` * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``) * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``) +* ``VHOST_USER_SET_DEVICE_STATE_FD`` If *front-end* is unable to send the full message or receives a wrong reply it will close the connection. An optional reconnection mechanism @@ -539,6 +566,80 @@ it performs WAKE ioctl's on the userfaultfd to wake the stalled back-end. 
The front-end indicates support for this via the ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature. +.. _migrating_backend_state: + +Migrating back-end state +^^^^^^^^^^^^^^^^^^^^^^^^ + +Migrating device state involves transferring the state from one +back-end, called the source, to another back-end, called the +destination. After migration, the destination transparently resumes +operation without requiring the driver to re-initialize the device at +the VIRTIO level. If the migration fails, then the source can +transparently resume operation until another migration attempt is made. + +Generally, the front-end is connected to a virtual machine guest (which +contains the driver), which has its own state to transfer between source +and destination, and therefore will have an implementation-specific +mechanism to do so. The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature +provides functionality to have the front-end include the back-end's +state in this transfer operation so the back-end does not need to +implement its own mechanism, and so the virtual machine may have its +complete state, including vhost-user devices' states, contained within a +single stream of data. + +To do this, the back-end state is transferred from back-end to front-end +on the source side, and vice versa on the destination side. This +transfer happens over a channel that is negotiated using the +``VHOST_USER_SET_DEVICE_STATE_FD`` message. This message has two +parameters: + +* Direction of transfer: On the source, the data is saved, transferring + it from the back-end to the front-end. On the destination, the data + is loaded, transferring it from the front-end to the back-end. + +* Migration phase: Currently, the only supported phase is the period + after the transfer of memory-mapped regions before switch-over to the + destination, when both the source and destination devices are + suspended (:ref:`Suspended device state <suspended_device_state>`). 
+ In the future, additional phases might be supported to allow iterative + migration while the device is running. + +The nature of the channel is implementation-defined, but it must +generally behave like a pipe: The writing end will write all the data it +has into it, signalling the end of data by closing its end. The reading +end must read all of this data (until encountering the end of file) and +process it. + +* When saving, the writing end is the source back-end, and the reading + end is the source front-end. After reading the state data from the + channel, the source front-end must transfer it to the destination + front-end through an implementation-defined mechanism. + +* When loading, the writing end is the destination front-end, and the + reading end is the destination back-end. After reading the state data + from the channel, the destination back-end must deserialize its + internal state from that data and set itself up to allow the driver to + seamlessly resume operation on the VIRTIO level. + +Seamlessly resuming operation means that the migration must be +transparent to the guest driver, which operates on the VIRTIO level. +This driver will not perform any re-initialization steps, but continue +to use the device as if no migration had occurred. The vhost-user +front-end, however, will re-initialize the vhost state on the +destination, following the usual protocol for establishing a connection +to a vhost-user back-end: This includes, for example, setting up memory +mappings and kick and call FDs as necessary, negotiating protocol +features, or setting the initial vring base indices (to the same value +as on the source side, so that operation can resume). + +Both on the source and on the destination side, after the respective +front-end has seen all data transferred (when the transfer FD has been +closed), it sends the ``VHOST_USER_CHECK_DEVICE_STATE`` message to +verify that data transfer was successful in the back-end, too. 
The +back-end responds once it knows whether the transfer and processing was +successful or not. + Memory access ------------- @@ -932,6 +1033,7 @@ Protocol features #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15 #define VHOST_USER_PROTOCOL_F_STATUS 16 #define VHOST_USER_PROTOCOL_F_XEN_MMAP 17 + #define VHOST_USER_PROTOCOL_F_DEVICE_STATE 18 Front-end message types ----------------------- @@ -1532,6 +1634,76 @@ Front-end message types back-end for its device status as defined in the Virtio specification. Deprecated together with VHOST_USER_SET_STATUS. +``VHOST_USER_SET_DEVICE_STATE_FD`` + :id: 41 + :equivalent ioctl: N/A + :request payload: device state transfer parameters + :reply payload: ``u64`` + + Front-end and back-end negotiate a channel over which to transfer the + back-end’s internal state during migration. Either side (front-end or + back-end) may create the channel. The nature of this channel is not + restricted or defined in this document, but whichever side creates it + must create a file descriptor that is provided to the respectively + other side, allowing access to the channel. This FD must behave as + follows: + + * For the writing end, it must allow writing the whole back-end state + sequentially. Closing the file descriptor signals the end of + transfer. + + * For the reading end, it must allow reading the whole back-end state + sequentially. The end of file signals the end of the transfer. + + For example, the channel may be a pipe, in which case the two ends of + the pipe fulfill these requirements respectively. + + Initially, the front-end creates a channel along with such an FD. It + passes the FD to the back-end as ancillary data of a + ``VHOST_USER_SET_DEVICE_STATE_FD`` message. The back-end may create a + different transfer channel, passing the respective FD back to the + front-end as ancillary data of the reply. If so, the front-end must + then discard its channel and use the one provided by the back-end. 
+ + Whether the back-end should decide to use its own channel is decided + based on efficiency: If the channel is a pipe, both ends will most + likely need to copy data into and out of it. Any channel that allows + for more efficient processing on at least one end, e.g. through + zero-copy, is considered more efficient and thus preferred. If the + back-end can provide such a channel, it should decide to use it. + + The request payload contains parameters for the subsequent data + transfer, as described in the :ref:`Migrating back-end state + <migrating_backend_state>` section. + + The value returned is both an indication for success, and whether a + file descriptor for a back-end-provided channel is returned: Bits 0–7 + are 0 on success, and non-zero on error. Bit 8 is the invalid FD + flag; this flag is set when there is no file descriptor returned. + When this flag is not set, the front-end must use the returned file + descriptor as its end of the transfer channel. The back-end must not + both indicate an error and return a file descriptor. + + Using this function requires prior negotiation of the + ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature. + +``VHOST_USER_CHECK_DEVICE_STATE`` + :id: 42 + :equivalent ioctl: N/A + :request payload: N/A + :reply payload: ``u64`` + + After transferring the back-end’s internal state during migration (see + the :ref:`Migrating back-end state <migrating_backend_state>` + section), check whether the back-end was able to successfully fully + process the state. + + The value returned indicates success or error; 0 is success, any + non-zero value is an error. + + Using this function requires prior negotiation of the + ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature. + Back-end message types ---------------------- -- 2.41.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
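The `u64` reply of ``VHOST_USER_SET_DEVICE_STATE_FD`` packs two pieces of information, which a front-end has to take apart before deciding which file descriptor to use. The following is only a sketch of that decoding as the text above describes it; the macro names, `enum reply_kind` and `classify_reply()` are hypothetical and do not claim to mirror any existing implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 0-7: 0 on success, non-zero on error. */
#define STATE_FD_REPLY_ERROR_MASK 0xffULL
/* Bit 8: "invalid FD" flag, set when the back-end returns no FD. */
#define STATE_FD_REPLY_NOFD_FLAG  (1ULL << 8)

enum reply_kind {
    REPLY_ERROR,          /* back-end refused the transfer */
    REPLY_KEEP_OWN_FD,    /* success, front-end keeps using its own channel */
    REPLY_USE_BACKEND_FD, /* success, front-end must switch to the returned FD */
};

/*
 * Classify the reply value. An error takes precedence: the back-end must
 * not return an FD together with an error, so bit 8 is not consulted then.
 */
static enum reply_kind classify_reply(uint64_t reply)
{
    if ((reply & STATE_FD_REPLY_ERROR_MASK) != 0) {
        return REPLY_ERROR;
    }
    if (reply & STATE_FD_REPLY_NOFD_FLAG) {
        return REPLY_KEEP_OWN_FD;
    }
    return REPLY_USE_BACKEND_FD;
}
```

In the `REPLY_USE_BACKEND_FD` case the front-end would additionally have to fetch the FD from the reply's ancillary data and discard its own end of the original channel; that part is transport-specific and left out here.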
* Re: [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state 2023-10-04 12:59 ` [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state Hanna Czenczek @ 2023-10-05 17:46 ` Stefan Hajnoczi 0 siblings, 0 replies; 53+ messages in thread From: Stefan Hajnoczi @ 2023-10-05 17:46 UTC (permalink / raw) To: Hanna Czenczek Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin [-- Attachment #1: Type: text/plain, Size: 1689 bytes --] On Wed, Oct 04, 2023 at 02:59:01PM +0200, Hanna Czenczek wrote: > For vhost-user devices, qemu can migrate the virtio state, but not the > back-end's internal state. To do so, we need to be able to transfer > this internal state between front-end (qemu) and back-end. > > At this point, this new feature is added for the purpose of virtio-fs > migration. Because virtiofsd's internal state will not be too large, we > believe it is best to transfer it as a single binary blob after the > streaming phase. > > These are the additions to the protocol: > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file > descriptor over which to transfer the state. > - CHECK_DEVICE_STATE: After the state has been transferred through the > file descriptor, the front-end invokes this function to verify > success. There is no in-band way (through the file descriptor) to > indicate failure, so we need to check explicitly. > > Once the transfer FD has been established via SET_DEVICE_STATE_FD > (which includes establishing the direction of transfer and migration > phase), the sending side writes its data into it, and the reading side > reads it until it sees an EOF. Then, the front-end will check for > success via CHECK_DEVICE_STATE, which on the destination side includes > checking for integrity (i.e. errors during deserialization). 
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 172 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 172 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
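The commit message quoted above says that the reading side reads the state "until it sees an EOF", since the channel carries no in-band length or error indication. A minimal sketch of such a reading end on a POSIX system might look as follows; `read_state_until_eof()` is a hypothetical name, and the fixed-capacity buffer is an assumption made to keep the example short (a real back-end would more likely process the data as a stream).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Read from `fd` until end of file, storing at most `cap` bytes in `buf`.
 * Returns the number of bytes read, or -1 on a read error or if the writer
 * sent more data than fits into `buf` (errno is set to ENOSPC then).
 */
static ssize_t read_state_until_eof(int fd, uint8_t *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL || cap == 0);

    for (;;) {
        uint8_t probe;
        ssize_t n;

        if (total < cap) {
            n = read(fd, buf + total, cap - total);
        } else {
            /* Buffer is full: only check whether EOF follows */
            n = read(fd, &probe, 1);
        }

        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        if (n == 0) {
            /* Writer closed its end: the transfer is complete */
            return (ssize_t)total;
        }
        if (total == cap) {
            errno = ENOSPC;
            return -1;
        }
        total += (size_t)n;
    }
}
```

Seeing EOF only means that the writer closed its end, not that the state was complete or valid, which is why the protocol adds the separate CHECK_DEVICE_STATE step afterwards.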
* [PATCH v4 6/8] vhost-user: Interface for migration state transfer 2023-10-04 12:58 [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek ` (4 preceding siblings ...) 2023-10-04 12:59 ` [PATCH v4 5/8] vhost-user.rst: Migrating back-end-internal state Hanna Czenczek @ 2023-10-04 12:59 ` Hanna Czenczek 2023-10-05 17:46 ` Stefan Hajnoczi 2023-10-04 12:59 ` [PATCH v4 7/8] vhost: Add high-level state save/load functions Hanna Czenczek ` (2 subsequent siblings) 8 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-04 12:59 UTC (permalink / raw) To: qemu-devel, virtio-fs Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin Add the interface for transferring the back-end's state during migration as defined previously in vhost-user.rst. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> --- include/hw/virtio/vhost-backend.h | 24 +++++ include/hw/virtio/vhost.h | 78 ++++++++++++++++ hw/virtio/vhost-user.c | 148 ++++++++++++++++++++++++++++++ hw/virtio/vhost.c | 37 ++++++++ 4 files changed, 287 insertions(+) diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h index 31a251a9f5..b6eee7e9fd 100644 --- a/include/hw/virtio/vhost-backend.h +++ b/include/hw/virtio/vhost-backend.h @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { VHOST_SET_CONFIG_TYPE_MIGRATION = 1, } VhostSetConfigType; +typedef enum VhostDeviceStateDirection { + /* Transfer state from back-end (device) to front-end */ + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, + /* Transfer state from front-end to back-end (device) */ + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, +} VhostDeviceStateDirection; + +typedef enum VhostDeviceStatePhase { + /* The device (and all its vrings) is stopped */ + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, +} VhostDeviceStatePhase; + struct vhost_inflight; struct vhost_dev; struct vhost_log; @@ -133,6 +145,15 @@ typedef int 
(*vhost_set_config_call_op)(struct vhost_dev *dev, typedef void (*vhost_reset_status_op)(struct vhost_dev *dev); +typedef bool (*vhost_supports_device_state_op)(struct vhost_dev *dev); +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev, + VhostDeviceStateDirection direction, + VhostDeviceStatePhase phase, + int fd, + int *reply_fd, + Error **errp); +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp); + typedef struct VhostOps { VhostBackendType backend_type; vhost_backend_init vhost_backend_init; @@ -181,6 +202,9 @@ typedef struct VhostOps { vhost_force_iommu_op vhost_force_iommu; vhost_set_config_call_op vhost_set_config_call; vhost_reset_status_op vhost_reset_status; + vhost_supports_device_state_op vhost_supports_device_state; + vhost_set_device_state_fd_op vhost_set_device_state_fd; + vhost_check_device_state_op vhost_check_device_state; } VhostOps; int vhost_backend_update_device_iotlb(struct vhost_dev *dev, diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h index 14621f9e79..a0d03c9fdf 100644 --- a/include/hw/virtio/vhost.h +++ b/include/hw/virtio/vhost.h @@ -348,4 +348,82 @@ static inline int vhost_reset_device(struct vhost_dev *hdev) } #endif /* CONFIG_VHOST */ +/** + * vhost_supports_device_state(): Checks whether the back-end supports + * transferring internal device state for the purpose of migration. + * Support for this feature is required for vhost_set_device_state_fd() + * and vhost_check_device_state(). + * + * @dev: The vhost device + * + * Returns true if the device supports these commands, and false if it + * does not. + */ +bool vhost_supports_device_state(struct vhost_dev *dev); + +/** + * vhost_set_device_state_fd(): Begin transfer of internal state from/to + * the back-end for the purpose of migration. Data is to be transferred + * over a pipe according to @direction and @phase. The sending end must + * only write to the pipe, and the receiving end must only read from it. 
+ * Once the sending end is done, it closes its FD. The receiving end + * must take this as the end-of-transfer signal and close its FD, too. + * + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the + * read FD for LOAD. This function transfers ownership of @fd to the + * back-end, i.e. closes it in the front-end. + * + * The back-end may optionally reply with an FD of its own, if this + * improves efficiency on its end. In this case, the returned FD is + * stored in *reply_fd. The back-end will discard the FD sent to it, + * and the front-end must use *reply_fd for transferring state to/from + * the back-end. + * + * @dev: The vhost device + * @direction: The direction in which the state is to be transferred. + * For outgoing migrations, this is SAVE, and data is read + * from the back-end and stored by the front-end in the + * migration stream. + * For incoming migrations, this is LOAD, and data is read + * by the front-end from the migration stream and sent to + * the back-end to restore the saved state. + * @phase: Which migration phase we are in. Currently, there is only + * STOPPED (device and all vrings are stopped); in the future, + * more phases such as PRE_COPY or POST_COPY may be added. + * @fd: Back-end's end of the pipe through which to transfer state; note + * that ownership is transferred to the back-end, so this function + * closes @fd in the front-end. + * @reply_fd: If the back-end wishes to use a different pipe for state + * transfer, this will contain an FD for the front-end to + * use. Otherwise, -1 is stored here. + * @errp: Potential error description + * + * Returns 0 on success, and -errno on failure. + */ +int vhost_set_device_state_fd(struct vhost_dev *dev, + VhostDeviceStateDirection direction, + VhostDeviceStatePhase phase, + int fd, + int *reply_fd, + Error **errp); + +/** + * vhost_check_device_state(): After transferring state from/to the + * back-end via vhost_set_device_state_fd(), i.e.
once the sending end + * has closed the pipe, inquire the back-end to report any potential + * errors that have occurred on its side. This allows to sense errors + * like: + * - During outgoing migration, when the source side had already started + * to produce its state, something went wrong and it failed to finish + * - During incoming migration, when the received state is somehow + * invalid and cannot be processed by the back-end + * + * @dev: The vhost device + * @errp: Potential error description + * + * Returns 0 when the back-end reports successful state transfer and + * processing, and -errno when an error occurred somewhere. + */ +int vhost_check_device_state(struct vhost_dev *dev, Error **errp); + #endif diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index 7bed9ad7d5..7096b148a9 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -74,6 +74,8 @@ enum VhostUserProtocolFeature { /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */ VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15, VHOST_USER_PROTOCOL_F_STATUS = 16, + /* Feature 17 reserved for VHOST_USER_PROTOCOL_F_XEN_MMAP. 
*/ + VHOST_USER_PROTOCOL_F_DEVICE_STATE = 18, VHOST_USER_PROTOCOL_F_MAX }; @@ -121,6 +123,8 @@ typedef enum VhostUserRequest { VHOST_USER_REM_MEM_REG = 38, VHOST_USER_SET_STATUS = 39, VHOST_USER_GET_STATUS = 40, + VHOST_USER_SET_DEVICE_STATE_FD = 41, + VHOST_USER_CHECK_DEVICE_STATE = 42, VHOST_USER_MAX } VhostUserRequest; @@ -212,6 +216,12 @@ typedef struct { uint32_t size; /* the following payload size */ } QEMU_PACKED VhostUserHeader; +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */ +typedef struct VhostUserTransferDeviceState { + uint32_t direction; + uint32_t phase; +} VhostUserTransferDeviceState; + typedef union { #define VHOST_USER_VRING_IDX_MASK (0xff) #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) @@ -226,6 +236,7 @@ typedef union { VhostUserCryptoSession session; VhostUserVringArea area; VhostUserInflight inflight; + VhostUserTransferDeviceState transfer_state; } VhostUserPayload; typedef struct VhostUserMsg { @@ -2746,6 +2757,140 @@ static void vhost_user_reset_status(struct vhost_dev *dev) } } +static bool vhost_user_supports_device_state(struct vhost_dev *dev) +{ + return virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_DEVICE_STATE); +} + +static int vhost_user_set_device_state_fd(struct vhost_dev *dev, + VhostDeviceStateDirection direction, + VhostDeviceStatePhase phase, + int fd, + int *reply_fd, + Error **errp) +{ + int ret; + struct vhost_user *vu = dev->opaque; + VhostUserMsg msg = { + .hdr = { + .request = VHOST_USER_SET_DEVICE_STATE_FD, + .flags = VHOST_USER_VERSION, + .size = sizeof(msg.payload.transfer_state), + }, + .payload.transfer_state = { + .direction = direction, + .phase = phase, + }, + }; + + *reply_fd = -1; + + if (!vhost_user_supports_device_state(dev)) { + close(fd); + error_setg(errp, "Back-end does not support migration state transfer"); + return -ENOTSUP; + } + + ret = vhost_user_write(dev, &msg, &fd, 1); + close(fd); + if (ret < 0) { + error_setg_errno(errp, -ret, + "Failed to send SET_DEVICE_STATE_FD 
message"); + return ret; + } + + ret = vhost_user_read(dev, &msg); + if (ret < 0) { + error_setg_errno(errp, -ret, + "Failed to receive SET_DEVICE_STATE_FD reply"); + return ret; + } + + if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) { + error_setg(errp, + "Received unexpected message type, expected %d, received %d", + VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request); + return -EPROTO; + } + + if (msg.hdr.size != sizeof(msg.payload.u64)) { + error_setg(errp, + "Received bad message size, expected %zu, received %" PRIu32, + sizeof(msg.payload.u64), msg.hdr.size); + return -EPROTO; + } + + if ((msg.payload.u64 & 0xff) != 0) { + error_setg(errp, "Back-end did not accept migration state transfer"); + return -EIO; + } + + if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) { + *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr); + if (*reply_fd < 0) { + error_setg(errp, + "Failed to get back-end-provided transfer pipe FD"); + *reply_fd = -1; + return -EIO; + } + } + + return 0; +} + +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp) +{ + int ret; + VhostUserMsg msg = { + .hdr = { + .request = VHOST_USER_CHECK_DEVICE_STATE, + .flags = VHOST_USER_VERSION, + .size = 0, + }, + }; + + if (!vhost_user_supports_device_state(dev)) { + error_setg(errp, "Back-end does not support migration state transfer"); + return -ENOTSUP; + } + + ret = vhost_user_write(dev, &msg, NULL, 0); + if (ret < 0) { + error_setg_errno(errp, -ret, + "Failed to send CHECK_DEVICE_STATE message"); + return ret; + } + + ret = vhost_user_read(dev, &msg); + if (ret < 0) { + error_setg_errno(errp, -ret, + "Failed to receive CHECK_DEVICE_STATE reply"); + return ret; + } + + if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) { + error_setg(errp, + "Received unexpected message type, expected %d, received %d", + VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request); + return -EPROTO; + } + + if (msg.hdr.size != sizeof(msg.payload.u64)) { + error_setg(errp, + "Received bad message 
size, expected %zu, received %" PRIu32, + sizeof(msg.payload.u64), msg.hdr.size); + return -EPROTO; + } + + if (msg.payload.u64 != 0) { + error_setg(errp, "Back-end failed to process its internal state"); + return -EIO; + } + + return 0; +} + const VhostOps user_ops = { .backend_type = VHOST_BACKEND_TYPE_USER, .vhost_backend_init = vhost_user_backend_init, @@ -2782,4 +2927,7 @@ const VhostOps user_ops = { .vhost_set_inflight_fd = vhost_user_set_inflight_fd, .vhost_dev_start = vhost_user_dev_start, .vhost_reset_status = vhost_user_reset_status, + .vhost_supports_device_state = vhost_user_supports_device_state, + .vhost_set_device_state_fd = vhost_user_set_device_state_fd, + .vhost_check_device_state = vhost_user_check_device_state, }; diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 6003e50e83..85e199f0aa 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -2096,3 +2096,40 @@ int vhost_reset_device(struct vhost_dev *hdev) return -ENOSYS; } + +bool vhost_supports_device_state(struct vhost_dev *dev) +{ + if (dev->vhost_ops->vhost_supports_device_state) { + return dev->vhost_ops->vhost_supports_device_state(dev); + } + + return false; +} + +int vhost_set_device_state_fd(struct vhost_dev *dev, + VhostDeviceStateDirection direction, + VhostDeviceStatePhase phase, + int fd, + int *reply_fd, + Error **errp) +{ + if (dev->vhost_ops->vhost_set_device_state_fd) { + return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase, + fd, reply_fd, errp); + } + + error_setg(errp, + "vhost transport does not support migration state transfer"); + return -ENOSYS; +} + +int vhost_check_device_state(struct vhost_dev *dev, Error **errp) +{ + if (dev->vhost_ops->vhost_check_device_state) { + return dev->vhost_ops->vhost_check_device_state(dev, errp); + } + + error_setg(errp, + "vhost transport does not support migration state transfer"); + return -ENOSYS; +} -- 2.41.0
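Editorial sketch, not part of the patch: the generic wrappers in hw/virtio/vhost.c dispatch through optional function pointers and fall back to "unsupported" when a transport leaves them unset. The pattern can be modelled in isolation roughly as follows. All demo_* names and the two stub callbacks are hypothetical stand-ins for struct vhost_dev and VhostOps, and the Error **errp plumbing is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for struct vhost_dev and VhostOps. */
struct demo_dev;
typedef bool (*demo_supports_state_op)(struct demo_dev *dev);
typedef int (*demo_check_state_op)(struct demo_dev *dev);

struct demo_ops {
    demo_supports_state_op supports_device_state; /* may be NULL */
    demo_check_state_op check_device_state;       /* may be NULL */
};

struct demo_dev {
    const struct demo_ops *ops;
};

/* A missing callback means "feature not supported", never a crash. */
static bool demo_supports_device_state(struct demo_dev *dev)
{
    if (dev->ops->supports_device_state) {
        return dev->ops->supports_device_state(dev);
    }
    return false;
}

/* A missing callback is reported as -ENOSYS, as in the patch. */
static int demo_check_device_state(struct demo_dev *dev)
{
    if (dev->ops->check_device_state) {
        return dev->ops->check_device_state(dev);
    }
    return -ENOSYS;
}

/* Stub callbacks for a transport that implements both operations. */
static bool demo_always_yes(struct demo_dev *dev)
{
    (void)dev;
    return true;
}

static int demo_always_ok(struct demo_dev *dev)
{
    (void)dev;
    return 0;
}
```

With this shape, a transport that does not implement state transfer needs no changes at all; callers are expected to probe with the "supports" wrapper before relying on the other operations.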
* Re: [PATCH v4 6/8] vhost-user: Interface for migration state transfer
From: Stefan Hajnoczi @ 2023-10-05 17:46 UTC
To: Hanna Czenczek
Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin

On Wed, Oct 04, 2023 at 02:59:02PM +0200, Hanna Czenczek wrote:
> Add the interface for transferring the back-end's state during migration
> as defined previously in vhost-user.rst.
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  24 +++++
>  include/hw/virtio/vhost.h         |  78 ++++++++++++++++
>  hw/virtio/vhost-user.c            | 148 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  37 ++++++++
>  4 files changed, 287 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
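Editorial sketch, not part of the thread: vhost_user_set_device_state_fd() in patch 6 interprets the u64 reply payload in two steps. A non-zero low byte means the back-end rejected the transfer, and a clear bit 8 (VHOST_USER_VRING_NOFD_MASK) means the reply carries a file descriptor that the front-end must use instead of its own pipe end. A minimal model of that decoding, assuming only what the patch shows, could look like this. The StateFdReply type and decode_state_fd_reply() are hypothetical helpers, not QEMU API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as in hw/virtio/vhost-user.c. */
#define VHOST_USER_VRING_NOFD_MASK (0x1 << 8)

/* Hypothetical helper type, not part of QEMU. */
typedef struct {
    bool accepted;   /* low byte of the reply payload is zero */
    bool carries_fd; /* accepted, and the NOFD bit is clear */
} StateFdReply;

/* Decode the u64 payload of a SET_DEVICE_STATE_FD reply. */
static StateFdReply decode_state_fd_reply(uint64_t u64)
{
    StateFdReply r;

    r.accepted = (u64 & 0xff) == 0;
    /* Only look for an FD if the back-end accepted the request. */
    r.carries_fd = r.accepted && !(u64 & VHOST_USER_VRING_NOFD_MASK);
    return r;
}
```

In the patch itself the FD is then fetched with qemu_chr_fe_get_msgfd(); that step is omitted here because it needs the QEMU character device layer.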
* [PATCH v4 7/8] vhost: Add high-level state save/load functions 2023-10-04 12:58 [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek ` (5 preceding siblings ...) 2023-10-04 12:59 ` [PATCH v4 6/8] vhost-user: Interface for migration state transfer Hanna Czenczek @ 2023-10-04 12:59 ` Hanna Czenczek 2023-10-05 17:46 ` Stefan Hajnoczi 2023-10-04 12:59 ` [PATCH v4 8/8] vhost-user-fs: Implement internal migration Hanna Czenczek 2023-10-05 17:48 ` [PATCH v4 0/8] vhost-user: Back-end state migration Stefan Hajnoczi 8 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-04 12:59 UTC (permalink / raw) To: qemu-devel, virtio-fs Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin vhost_save_backend_state() and vhost_load_backend_state() can be used by vhost front-ends to easily save and load the back-end's state to/from the migration stream. Because we do not know the full state size ahead of time, vhost_save_backend_state() simply reads the data in 1 MB chunks, and writes each chunk consecutively into the migration stream, prefixed by its length. EOF is indicated by a 0-length chunk. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> --- include/hw/virtio/vhost.h | 35 +++++++ hw/virtio/vhost.c | 204 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 239 insertions(+) diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h index a0d03c9fdf..100fcc874d 100644 --- a/include/hw/virtio/vhost.h +++ b/include/hw/virtio/vhost.h @@ -426,4 +426,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev, */ int vhost_check_device_state(struct vhost_dev *dev, Error **errp); +/** + * vhost_save_backend_state(): High-level function to receive a vhost + * back-end's state, and save it in @f. 
Uses + * `vhost_set_device_state_fd()` to get the data from the back-end, and + * stores it in consecutive chunks that are each prefixed by their + * respective length (be32). The end is marked by a 0-length chunk. + * + * Must only be called while the device and all its vrings are stopped + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`). + * + * @dev: The vhost device from which to save the state + * @f: Migration stream in which to save the state + * @errp: Potential error message + * + * Returns 0 on success, and -errno otherwise. + */ +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp); + +/** + * vhost_load_backend_state(): High-level function to load a vhost + * back-end's state from @f, and send it over to the back-end. Reads + * the data from @f in the format used by `vhost_save_backend_state()`, and uses + * `vhost_set_device_state_fd()` to transfer it to the back-end. + * + * Must only be called while the device and all its vrings are stopped + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`). + * + * @dev: The vhost device to which to send the state + * @f: Migration stream from which to load the state + * @errp: Potential error message + * + * Returns 0 on success, and -errno otherwise.
+ */ +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp); + #endif diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 85e199f0aa..1465adf13a 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -2133,3 +2133,207 @@ int vhost_check_device_state(struct vhost_dev *dev, Error **errp) "vhost transport does not support migration state transfer"); return -ENOSYS; } + +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp) +{ + /* Maximum chunk size in which to transfer the state */ + const size_t chunk_size = 1 * 1024 * 1024; + g_autofree void *transfer_buf = NULL; + g_autoptr(GError) g_err = NULL; + int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1; + int ret; + + /* [0] for reading (our end), [1] for writing (back-end's end) */ + if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) { + error_setg(errp, "Failed to set up state transfer pipe: %s", + g_err->message); + ret = -EINVAL; + goto fail; + } + + read_fd = pipe_fds[0]; + write_fd = pipe_fds[1]; + + /* + * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped. + * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for + * vhost-user, so just check that it is stopped at all. 
+ */ + assert(!dev->started); + + /* Transfer ownership of write_fd to the back-end */ + ret = vhost_set_device_state_fd(dev, + VHOST_TRANSFER_STATE_DIRECTION_SAVE, + VHOST_TRANSFER_STATE_PHASE_STOPPED, + write_fd, + &reply_fd, + errp); + if (ret < 0) { + error_prepend(errp, "Failed to initiate state transfer: "); + goto fail; + } + + /* If the back-end wishes to use a different pipe, switch over */ + if (reply_fd >= 0) { + close(read_fd); + read_fd = reply_fd; + } + + transfer_buf = g_malloc(chunk_size); + + while (true) { + ssize_t read_ret; + + read_ret = RETRY_ON_EINTR(read(read_fd, transfer_buf, chunk_size)); + if (read_ret < 0) { + ret = -errno; + error_setg_errno(errp, -ret, "Failed to receive state"); + goto fail; + } + + assert(read_ret <= chunk_size); + qemu_put_be32(f, read_ret); + + if (read_ret == 0) { + /* EOF */ + break; + } + + qemu_put_buffer(f, transfer_buf, read_ret); + } + + /* + * Back-end will not really care, but be clean and close our end of the pipe + * before inquiring the back-end about whether transfer was successful + */ + close(read_fd); + read_fd = -1; + + /* Also, verify that the device is still stopped */ + assert(!dev->started); + + ret = vhost_check_device_state(dev, errp); + if (ret < 0) { + goto fail; + } + + ret = 0; +fail: + if (read_fd >= 0) { + close(read_fd); + } + + return ret; +} + +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp) +{ + size_t transfer_buf_size = 0; + g_autofree void *transfer_buf = NULL; + g_autoptr(GError) g_err = NULL; + int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1; + int ret; + + /* [0] for reading (back-end's end), [1] for writing (our end) */ + if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) { + error_setg(errp, "Failed to set up state transfer pipe: %s", + g_err->message); + ret = -EINVAL; + goto fail; + } + + read_fd = pipe_fds[0]; + write_fd = pipe_fds[1]; + + /* + * VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped. 
+ * Ideally, it is suspended, but SUSPEND/RESUME currently do not exist for + * vhost-user, so just check that it is stopped at all. + */ + assert(!dev->started); + + /* Transfer ownership of read_fd to the back-end */ + ret = vhost_set_device_state_fd(dev, + VHOST_TRANSFER_STATE_DIRECTION_LOAD, + VHOST_TRANSFER_STATE_PHASE_STOPPED, + read_fd, + &reply_fd, + errp); + if (ret < 0) { + error_prepend(errp, "Failed to initiate state transfer: "); + goto fail; + } + + /* If the back-end wishes to use a different pipe, switch over */ + if (reply_fd >= 0) { + close(write_fd); + write_fd = reply_fd; + } + + while (true) { + size_t this_chunk_size = qemu_get_be32(f); + ssize_t write_ret; + const uint8_t *transfer_pointer; + + if (this_chunk_size == 0) { + /* End of state */ + break; + } + + if (transfer_buf_size < this_chunk_size) { + transfer_buf = g_realloc(transfer_buf, this_chunk_size); + transfer_buf_size = this_chunk_size; + } + + if (qemu_get_buffer(f, transfer_buf, this_chunk_size) < + this_chunk_size) + { + error_setg(errp, "Failed to read state"); + ret = -EINVAL; + goto fail; + } + + transfer_pointer = transfer_buf; + while (this_chunk_size > 0) { + write_ret = RETRY_ON_EINTR( + write(write_fd, transfer_pointer, this_chunk_size) + ); + if (write_ret < 0) { + ret = -errno; + error_setg_errno(errp, -ret, "Failed to send state"); + goto fail; + } else if (write_ret == 0) { + error_setg(errp, "Failed to send state: Connection is closed"); + ret = -ECONNRESET; + goto fail; + } + + assert(write_ret <= this_chunk_size); + this_chunk_size -= write_ret; + transfer_pointer += write_ret; + } + } + + /* + * Close our end, thus ending transfer, before inquiring the back-end about + * whether transfer was successful + */ + close(write_fd); + write_fd = -1; + + /* Also, verify that the device is still stopped */ + assert(!dev->started); + + ret = vhost_check_device_state(dev, errp); + if (ret < 0) { + goto fail; + } + + ret = 0; +fail: + if (write_fd >= 0) { + close(write_fd); 
+ } + + return ret; +} -- 2.41.0
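Editorial sketch, not part of the patch: the stream format used by vhost_save_backend_state() and vhost_load_backend_state() is a sequence of chunks, each prefixed by its length as a big-endian 32-bit integer, terminated by a 0-length chunk. The framing can be tried out on a plain memory buffer as below. The helper names are hypothetical, and a fixed chunk size is assumed for simplicity; the real save path emits whatever each read() from the pipe returns, up to 1 MB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append a big-endian 32-bit value to out[] at *pos. */
static void put_be32(uint8_t *out, size_t *pos, uint32_t v)
{
    out[(*pos)++] = (uint8_t)(v >> 24);
    out[(*pos)++] = (uint8_t)(v >> 16);
    out[(*pos)++] = (uint8_t)(v >> 8);
    out[(*pos)++] = (uint8_t)v;
}

/* Read a big-endian 32-bit value from in[] at *pos. */
static uint32_t get_be32(const uint8_t *in, size_t *pos)
{
    uint32_t v = (uint32_t)in[*pos] << 24 | (uint32_t)in[*pos + 1] << 16 |
                 (uint32_t)in[*pos + 2] << 8 | (uint32_t)in[*pos + 3];
    *pos += 4;
    return v;
}

/*
 * Frame len bytes of state into chunks of at most chunk bytes.
 * The caller must provide a large enough out[]. Returns bytes written.
 */
static size_t frame_state(const uint8_t *state, size_t len, size_t chunk,
                          uint8_t *out)
{
    size_t pos = 0;

    while (len > 0) {
        size_t n = len < chunk ? len : chunk;

        put_be32(out, &pos, (uint32_t)n);
        memcpy(out + pos, state, n);
        pos += n;
        state += n;
        len -= n;
    }
    put_be32(out, &pos, 0); /* 0-length chunk marks the end */
    return pos;
}

/*
 * Reverse of frame_state(). Returns the number of payload bytes
 * recovered, or (size_t)-1 if the input is truncated or does not fit.
 */
static size_t unframe_state(const uint8_t *in, size_t in_len,
                            uint8_t *state, size_t cap)
{
    size_t pos = 0, total = 0;

    for (;;) {
        uint32_t n;

        if (in_len - pos < 4) {
            return (size_t)-1;
        }
        n = get_be32(in, &pos);
        if (n == 0) {
            return total;
        }
        if (n > in_len - pos || n > cap - total) {
            return (size_t)-1;
        }
        memcpy(state + total, in + pos, n);
        pos += n;
        total += n;
    }
}
```

The explicit terminator is what lets the loader stop without knowing the total state size in advance, which matches the motivation given in the commit message.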
* Re: [PATCH v4 7/8] vhost: Add high-level state save/load functions
From: Stefan Hajnoczi @ 2023-10-05 17:46 UTC
To: Hanna Czenczek
Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin

On Wed, Oct 04, 2023 at 02:59:03PM +0200, Hanna Czenczek wrote:
> vhost_save_backend_state() and vhost_load_backend_state() can be used by
> vhost front-ends to easily save and load the back-end's state to/from
> the migration stream.
>
> Because we do not know the full state size ahead of time,
> vhost_save_backend_state() simply reads the data in 1 MB chunks, and
> writes each chunk consecutively into the migration stream, prefixed by
> its length.  EOF is indicated by a 0-length chunk.
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost.h |  35 +++++++
>  hw/virtio/vhost.c         | 204 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 239 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
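Editorial sketch, not part of the thread: both high-level functions rely on plain pipe semantics. The sender may see short writes and has to loop, and the receiver learns that the transfer is over when read() returns 0 after the sender has closed its end. The two loops can be reproduced outside QEMU roughly as follows. write_all() and read_until_eof() are hypothetical names, and EINTR is retried by hand where the patch uses RETRY_ON_EINTR().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, looping over short writes. Returns 0 or -errno. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -errno;
        }
        if (n == 0) {
            /* Treated as a closed connection, as in the patch. */
            return -ECONNRESET;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Read until the writer closes its end. buf must be larger than the
 * data, otherwise -ENOBUFS is returned. Returns bytes read or -errno.
 */
static ssize_t read_until_eof(int fd, void *buf, size_t cap)
{
    char *p = buf;
    size_t total = 0;

    for (;;) {
        ssize_t n;

        if (total == cap) {
            return -ENOBUFS;
        }
        n = read(fd, p + total, cap - total);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -errno;
        }
        if (n == 0) {
            /* EOF: the sending end has closed its FD. */
            return (ssize_t)total;
        }
        total += (size_t)n;
    }
}
```

Closing the write end is the only end-of-transfer signal at this level, which is why the series adds a separate CHECK_DEVICE_STATE request to learn whether the back-end considered the transfer successful.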
* [PATCH v4 8/8] vhost-user-fs: Implement internal migration 2023-10-04 12:58 [PATCH v4 0/8] vhost-user: Back-end state migration Hanna Czenczek ` (6 preceding siblings ...) 2023-10-04 12:59 ` [PATCH v4 7/8] vhost: Add high-level state save/load functions Hanna Czenczek @ 2023-10-04 12:59 ` Hanna Czenczek 2023-10-05 17:46 ` Stefan Hajnoczi 2023-10-05 17:48 ` [PATCH v4 0/8] vhost-user: Back-end state migration Stefan Hajnoczi 8 siblings, 1 reply; 53+ messages in thread From: Hanna Czenczek @ 2023-10-04 12:59 UTC (permalink / raw) To: qemu-devel, virtio-fs Cc: Hanna Czenczek, Michael S . Tsirkin, Stefan Hajnoczi, German Maglione, Eugenio Pérez, Anton Kuchin A virtio-fs device's VM state consists of: - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE) - the back-end's (virtiofsd's) internal state We get/set the latter via the new vhost operations to transfer migratory state. It is its own dedicated subsection, so that for external migration, it can be disabled. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> --- hw/virtio/vhost-user-fs.c | 101 +++++++++++++++++++++++++++++++++++++- 1 file changed, 100 insertions(+), 1 deletion(-) diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c index 49d699ffc2..eb91723855 100644 --- a/hw/virtio/vhost-user-fs.c +++ b/hw/virtio/vhost-user-fs.c @@ -298,9 +298,108 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice *vdev) return &fs->vhost_dev; } +/** + * Fetch the internal state from virtiofsd and save it to `f`. 
+ */ +static int vuf_save_state(QEMUFile *f, void *pv, size_t size, + const VMStateField *field, JSONWriter *vmdesc) +{ + VirtIODevice *vdev = pv; + VHostUserFS *fs = VHOST_USER_FS(vdev); + Error *local_error = NULL; + int ret; + + ret = vhost_save_backend_state(&fs->vhost_dev, f, &local_error); + if (ret < 0) { + error_reportf_err(local_error, + "Error saving back-end state of %s device %s " + "(tag: \"%s\"): ", + vdev->name, vdev->parent_obj.canonical_path, + fs->conf.tag ?: "<none>"); + return ret; + } + + return 0; +} + +/** + * Load virtiofsd's internal state from `f` and send it over to virtiofsd. + */ +static int vuf_load_state(QEMUFile *f, void *pv, size_t size, + const VMStateField *field) +{ + VirtIODevice *vdev = pv; + VHostUserFS *fs = VHOST_USER_FS(vdev); + Error *local_error = NULL; + int ret; + + ret = vhost_load_backend_state(&fs->vhost_dev, f, &local_error); + if (ret < 0) { + error_reportf_err(local_error, + "Error loading back-end state of %s device %s " + "(tag: \"%s\"): ", + vdev->name, vdev->parent_obj.canonical_path, + fs->conf.tag ?: "<none>"); + return ret; + } + + return 0; +} + +static bool vuf_is_internal_migration(void *opaque) +{ + /* TODO: Return false when an external migration is requested */ + return true; +} + +static int vuf_check_migration_support(void *opaque) +{ + VirtIODevice *vdev = opaque; + VHostUserFS *fs = VHOST_USER_FS(vdev); + + if (!vhost_supports_device_state(&fs->vhost_dev)) { + error_report("Back-end of %s device %s (tag: \"%s\") does not support " + "migration through qemu", + vdev->name, vdev->parent_obj.canonical_path, + fs->conf.tag ?: "<none>"); + return -ENOTSUP; + } + + return 0; +} + +static const VMStateDescription vuf_backend_vmstate; + static const VMStateDescription vuf_vmstate = { .name = "vhost-user-fs", - .unmigratable = 1, + .version_id = 0, + .fields = (VMStateField[]) { + VMSTATE_VIRTIO_DEVICE, + VMSTATE_END_OF_LIST() + }, + .subsections = (const VMStateDescription * []) { + &vuf_backend_vmstate, 
+ NULL, + } +}; + +static const VMStateDescription vuf_backend_vmstate = { + .name = "vhost-user-fs-backend", + .version_id = 0, + .needed = vuf_is_internal_migration, + .pre_load = vuf_check_migration_support, + .pre_save = vuf_check_migration_support, + .fields = (VMStateField[]) { + { + .name = "back-end", + .info = &(const VMStateInfo) { + .name = "virtio-fs back-end state", + .get = vuf_load_state, + .put = vuf_save_state, + }, + }, + VMSTATE_END_OF_LIST() + }, }; static Property vuf_properties[] = { -- 2.41.0
* Re: [PATCH v4 8/8] vhost-user-fs: Implement internal migration
From: Stefan Hajnoczi @ 2023-10-05 17:46 UTC
To: Hanna Czenczek
Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin

On Wed, Oct 04, 2023 at 02:59:04PM +0200, Hanna Czenczek wrote:
> A virtio-fs device's VM state consists of:
> - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
> - the back-end's (virtiofsd's) internal state
>
> We get/set the latter via the new vhost operations to transfer migratory
> state.  It is its own dedicated subsection, so that for external
> migration, it can be disabled.
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  hw/virtio/vhost-user-fs.c | 101 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 100 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH v4 0/8] vhost-user: Back-end state migration
From: Stefan Hajnoczi @ 2023-10-05 17:48 UTC
To: Hanna Czenczek
Cc: qemu-devel, virtio-fs, Michael S . Tsirkin, German Maglione, Eugenio Pérez, Anton Kuchin

On Wed, Oct 04, 2023 at 02:58:56PM +0200, Hanna Czenczek wrote:
> RFC:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
>
> v1:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg01575.html
>
> v2:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02604.html
>
> v3:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-09/msg03750.html
>
>
> Based-on: <20231004014532.1228637-1-stefanha@redhat.com>
>           ([PATCH v2 0/3] vhost: clean up device reset)
>
>
> Hi,
>
> This v4 includes largely unchanged patches from v3.  The main
> addition/change is what came out of the discussion between Stefan and me
> around how to proceed without SUSPEND/RESUME, which is that this series
> is now based on his reset fix, and it includes more documentation
> changes.

This looks good. I posted some minor comments on the new patches.

Stefan

>
> Changes in detail:
>
> - Patch 1: Fall-out from the reset fix: Currently, the status byte is
>   effectively unused (qemu only uses it for resetting, which all
>   back-ends ignore; DPDK uses it to announce potential feature
>   negotiation failure, which qemu ignores).  It is also not defined what
>   exactly front-end or back-end should do with this byte, except
>   pointing at the virtio spec, which however naturally does not say how
>   this integrates with vhost-user’s RESET_DEVICE or [GS]ET_FEATURES.
>   Furthermore, there does not seem to be a use for this; we have
>   RESET_DEVICE for resetting, and we have [GS]ET_FEATURES (and
>   REPLY_ACK, which can be used on SET_FEATURES) for feature
>   negotiation.
>   Therefore, deprecate the status byte, pointing to those other commands
>   instead.
>
> - Patch 2: Patch 4 defines a suspended state for the whole back-end if
>   all vrings are stopped.  I think this should be mentioned in
>   GET_VRING_BASE, but upon trying to add it, I found that it does not
>   even mention that it stops the vring (mentioned only in the Ring
>   States section), and remembered that the whole description of both
>   GET_VRING_BASE and SET_VRING_BASE really was not helpful when trying
>   to implement a vhost-user back-end.  Took the opportunity to overhaul
>   both.
>
> - Patch 3: This one’s from v3, but quite heavily modified.  Stefan
>   suggested consistently defining the started/stopped and
>   enabled/disabled states to be independent, and indeed doing so
>   simplifies a whole lot of stuff.  Specifically, it makes the magic
>   “enabled/disabled when started” go away.  Basically, I found this
>   change alone is enough to remove the confusion I had with the existing
>   documentation.
>
> - Patch 4: As suggested by Stefan, just define a suspended state without
>   introducing SUSPEND.  vDPA needs SUSPEND because its GET_VRING_BASE
>   does not stop the vring, but vhost-user’s does, so we can define the
>   suspended state to be when all vrings are stopped.
>
> - Patch 5: Reference the suspended state.
>
> - Patches 6 through 8: Unmodified, except for them being rebased on
>   Stefan’s series.
>
>
> Hanna Czenczek (8):
>   vhost-user.rst: Deprecate [GS]ET_STATUS
>   vhost-user.rst: Improve [GS]ET_VRING_BASE doc
>   vhost-user.rst: Clarify enabling/disabling vrings
>   vhost-user.rst: Introduce suspended state
>   vhost-user.rst: Migrating back-end-internal state
>   vhost-user: Interface for migration state transfer
>   vhost: Add high-level state save/load functions
>   vhost-user-fs: Implement internal migration
>
>  docs/interop/vhost-user.rst       | 318 +++++++++++++++++++++++++++---
>  include/hw/virtio/vhost-backend.h |  24 +++
>  include/hw/virtio/vhost.h         | 113 +++++++++++
>  hw/virtio/vhost-user-fs.c         | 101 +++++++++-
>  hw/virtio/vhost-user.c            | 148 ++++++++++++++
>  hw/virtio/vhost.c                 | 241 ++++++++++++++++++++++
>  6 files changed, 917 insertions(+), 28 deletions(-)
>
> --
> 2.41.0
>