* [PATCH 0/4] vhost-user-fs: Internal migration
@ 2023-04-11 15:05 Hanna Czenczek
  2023-04-11 15:05 ` [PATCH 1/4] vhost: Re-enable vrings after setting features Hanna Czenczek
                   ` (6 more replies)
  0 siblings, 7 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-11 15:05 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Stefan Hajnoczi, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
RFC:
https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
Hi,
Patch 2 of this series adds new vhost methods (only for vhost-user at
this point) for transferring the back-end’s internal state to/from qemu
during migration, so that this state can be stored in the migration
stream.  (This is what we call “internal migration”, because the state
is internally available to qemu; this is in contrast to “external
migration”, which Anton is working on, where the back-end’s state is
handled by the back-end itself without involving qemu.)
For this, the state is handled as a binary blob by qemu, and it is
transferred over a pipe that is established via a new vhost method.
Patch 3 adds two high-level helper functions to (A) fetch any vhost
back-end’s internal state and store it in a migration stream (a
`QEMUFile`), and (B) load such state from a migration stream and send
it to a vhost back-end.  These build on the low-level interface
introduced in patch 2.
Patch 4 then uses these functions to implement internal migration for
vhost-user-fs.  Note that this of course depends on support in the
back-end (virtiofsd), which is not yet ready.
Finally, patch 1 fixes a bug around migrating vhost-user devices: To
enable/disable logging[1], the VHOST_F_LOG_ALL feature must be
set/cleared, via the SET_FEATURES call.  Another, technically unrelated,
feature exists, VHOST_USER_F_PROTOCOL_FEATURES, which indicates support
for vhost-user protocol features.  Naturally, qemu wants to keep that
other feature enabled, so it will set it (when possible) in every
SET_FEATURES call.  However, a side effect of setting
VHOST_USER_F_PROTOCOL_FEATURES is that all vrings are disabled.  This
causes any enabling (done at the start of migration) or disabling (done
on the source after a cancelled/failed migration) of logging to make the
back-end hang.  Without patch 1, therefore, starting a migration will
have any vhost-user back-end that supports both VHOST_F_LOG_ALL and
VHOST_USER_F_PROTOCOL_FEATURES immediately hang completely, and unless
execution is transferred to the destination, it will continue to hang.
[1] Logging here means logging writes to guest memory pages in a dirty
bitmap so that these dirty pages are flushed to the destination.  qemu
cannot monitor the back-end’s writes to guest memory, so the back-end
has to do so itself, and log its writes in a dirty bitmap shared with
qemu.
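As a rough illustration of the dirty logging described in [1], the
following is a minimal sketch of a back-end marking pages in a bitmap
for every write it makes to guest memory.  The page size, the one bit
per page layout, and all sketch_* names are assumptions made for this
example; the real vhost log uses its own layout and atomic updates,
which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page size, chosen only for this illustration */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Mark every page touched by a write of @len bytes at guest address
 * @addr as dirty.  One bit per page, least significant bit first.
 * The caller must provide a bitmap large enough for the address range.
 */
static void sketch_log_write(uint8_t *bitmap, uint64_t addr, uint64_t len)
{
    uint64_t first, last, page;

    if (len == 0) {
        return;
    }

    first = addr / SKETCH_PAGE_SIZE;
    last = (addr + len - 1) / SKETCH_PAGE_SIZE;
    for (page = first; page <= last; page++) {
        bitmap[page / 8] |= (uint8_t)(1u << (page % 8));
    }
}

/* Return non-zero if @page has been marked dirty */
static int sketch_page_is_dirty(const uint8_t *bitmap, uint64_t page)
{
    return (bitmap[page / 8] >> (page % 8)) & 1;
}
```

A write that straddles a page boundary marks both pages, which is why
the helper computes the first and last page instead of only looking at
the start address.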
Changes in v1 compared to the RFC:
- Patch 1 added
- Patch 2: Interface is different, now uses a pipe instead of shared
  memory (as suggested by Stefan); also, this is now a generic
  vhost-user interface, and not just for vhost-user-fs
- Patches 3 and 4: Because this is now supposed to be a generic
  migration method for vhost-user back-ends, most of the migration code
  has been moved from vhost-user-fs.c to vhost.c so it can be shared
  between different back-ends.  The vhost-user-fs code is now a rather
  thin wrapper around the common code.
  - Note also (as suggested by Anton) that the back-end’s migration
    state is now in a subsection, and that it is technically optional.
    “Technically” means that with this series, it is always used (unless
    the back-end doesn’t support migration, in which case migration is
    just blocked), but Anton’s series for external migration would make
    it optional.  (I.e., the subsection would be skipped for external
    migration, and mandatorily included for internal migration.)
Hanna Czenczek (4):
  vhost: Re-enable vrings after setting features
  vhost-user: Interface for migration state transfer
  vhost: Add high-level state save/load functions
  vhost-user-fs: Implement internal migration
 include/hw/virtio/vhost-backend.h |  24 +++
 include/hw/virtio/vhost.h         | 124 +++++++++++++++
 hw/virtio/vhost-user-fs.c         | 101 +++++++++++-
 hw/virtio/vhost-user.c            | 147 ++++++++++++++++++
 hw/virtio/vhost.c                 | 246 ++++++++++++++++++++++++++++++
 5 files changed, 641 insertions(+), 1 deletion(-)
-- 
2.39.1
* [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-11 15:05 [PATCH 0/4] vhost-user-fs: Internal migration Hanna Czenczek
@ 2023-04-11 15:05 ` Hanna Czenczek
  2023-04-12 10:55   ` German Maglione
                     ` (3 more replies)
  2023-04-11 15:05 ` [PATCH 2/4] vhost-user: Interface for migration state transfer Hanna Czenczek
                   ` (5 subsequent siblings)
  6 siblings, 4 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-11 15:05 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Stefan Hajnoczi, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
setting the vhost features will set this feature, too.  Doing so
disables all vrings, which may not be intended.
For example, enabling or disabling logging during migration requires
setting those features (to set or unset VHOST_F_LOG_ALL), which will
automatically disable all vrings.  In either case, the VM is running
(disabling logging is done after a failed or cancelled migration, and
only once the VM is running again, see comment in
memory_global_dirty_log_stop()), so the vrings should really be enabled.
As a result, the back-end seems to hang.
To fix this, we must remember whether the vrings are supposed to be
enabled, and, if so, re-enable them after a SET_FEATURES call that set
VHOST_USER_F_PROTOCOL_FEATURES.
It seems less than ideal that there is a short period in which the VM is
running but the vrings will be stopped (between SET_FEATURES and
SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
e.g. by introducing a new flag or vhost-user protocol feature that keeps
vrings enabled when VHOST_USER_F_PROTOCOL_FEATURES is set, or by adding
new functions for setting/clearing individual feature bits (so that
F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
Even with such a potential addition to the protocol, we still need this
fix here, because we cannot expect that back-ends will implement this
addition.
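The remember-and-re-enable approach described above can be modelled in
isolation.  The sketch below is not the qemu code: the sketch_* types
and functions are hypothetical stand-ins, and the back-end is reduced
to a struct whose rings get disabled whenever the protocol-features bit
is part of a SET_FEATURES call.  The bit positions are taken from the
vhost headers as I understand them and are only used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bits, used here for illustration only */
#define SKETCH_F_LOG_ALL           (1ull << 26)
#define SKETCH_F_PROTOCOL_FEATURES (1ull << 30)

/* Hypothetical stand-in for the back-end's view of the device */
struct sketch_backend {
    uint64_t features;
    bool rings_enabled;
};

/* Hypothetical stand-in for the front-end's vhost_dev */
struct sketch_dev {
    struct sketch_backend be;
    bool enable_vqs; /* whether the rings are supposed to be enabled */
};

/* Back-end side: setting F_PROTOCOL_FEATURES disables all rings */
static void sketch_backend_set_features(struct sketch_backend *be,
                                        uint64_t features)
{
    be->features = features;
    if (features & SKETCH_F_PROTOCOL_FEATURES) {
        be->rings_enabled = false;
    }
}

/* Front-end side: remember the intended state, then apply it */
static void sketch_set_vring_enable(struct sketch_dev *dev, bool enable)
{
    dev->enable_vqs = enable;
    dev->be.rings_enabled = enable;
}

/* Front-end side: set features, then undo the unintended disabling */
static void sketch_set_features(struct sketch_dev *dev, uint64_t features)
{
    sketch_backend_set_features(&dev->be, features);
    if (dev->enable_vqs) {
        sketch_set_vring_enable(dev, true);
    }
}
```

In this model, toggling SKETCH_F_LOG_ALL while the rings are enabled
leaves them enabled afterwards, whereas a device whose rings were
never enabled stays disabled.  The short window between the two calls
in sketch_set_features() corresponds to the period mentioned above in
which the VM runs with stopped vrings.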
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost.h | 10 ++++++++++
 hw/virtio/vhost.c         | 13 +++++++++++++
 2 files changed, 23 insertions(+)
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a52f273347..2fe02ed5d4 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -90,6 +90,16 @@ struct vhost_dev {
     int vq_index_end;
     /* if non-zero, minimum required value for max_queues */
     int num_queues;
+
+    /*
+     * Whether the virtqueues are supposed to be enabled (via
+     * SET_VRING_ENABLE).  Setting the features (e.g. for
+     * enabling/disabling logging) will disable all virtqueues if
+     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
+     * re-enable them if this field is set.
+     */
+    bool enable_vqs;
+
     /**
      * vhost feature handling requires matching the feature set
      * offered by a backend which may be a subset of the total
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index a266396576..cbff589efa 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -50,6 +50,8 @@ static unsigned int used_memslots;
 static QLIST_HEAD(, vhost_dev) vhost_devices =
     QLIST_HEAD_INITIALIZER(vhost_devices);
 
+static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
+
 bool vhost_has_free_slot(void)
 {
     unsigned int slots_limit = ~0U;
@@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
         }
     }
 
+    if (dev->enable_vqs) {
+        /*
+         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
+         * virtqueues, even if that was not intended; re-enable them if
+         * necessary.
+         */
+        vhost_dev_set_vring_enable(dev, true);
+    }
+
 out:
     return r;
 }
@@ -1896,6 +1907,8 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
 
 static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
 {
+    hdev->enable_vqs = enable;
+
     if (!hdev->vhost_ops->vhost_set_vring_enable) {
         return 0;
     }
-- 
2.39.1
* [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-11 15:05 [PATCH 0/4] vhost-user-fs: Internal migration Hanna Czenczek
  2023-04-11 15:05 ` [PATCH 1/4] vhost: Re-enable vrings after setting features Hanna Czenczek
@ 2023-04-11 15:05 ` Hanna Czenczek
  2023-04-12 21:06   ` Stefan Hajnoczi
  2023-04-13  8:50   ` Eugenio Perez Martin
  2023-04-11 15:05 ` [PATCH 3/4] vhost: Add high-level state save/load functions Hanna Czenczek
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-11 15:05 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Stefan Hajnoczi, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
So-called "internal" virtio-fs migration refers to transporting the
back-end's (virtiofsd's) state through qemu's migration stream.  To do
this, we need to be able to transfer virtiofsd's internal state to and
from virtiofsd.
Because virtiofsd's internal state will not be too large, we believe it
is best to transfer it as a single binary blob after the streaming
phase.  Because this method should be useful to other vhost-user
implementations, too, it is introduced as a general-purpose addition to
the protocol, not limited to vhost-user-fs.
These are the additions to the protocol:
- New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
  This feature signals support for transferring state, and is added so
  that migration can fail early when the back-end has no support.
- SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
  over which to transfer the state.  The front-end sends an FD to the
  back-end into/from which it can write/read its state, and the back-end
  can decide to either use it, or reply with a different FD for the
  front-end to override the front-end's choice.
  The front-end creates a simple pipe to transfer the state, but maybe
  the back-end already has an FD into/from which it has to write/read
  its state, in which case it will want to override the simple pipe.
  Conversely, maybe in the future we find a way to have the front-end
  get an immediate FD for the migration stream (in some cases), in which
  case we will want to send this to the back-end instead of creating a
  pipe.
  Hence the negotiation: If one side has a better idea than a plain
  pipe, we will want to use that.
- CHECK_DEVICE_STATE: After the state has been transferred through the
  pipe (the end indicated by EOF), the front-end invokes this function
  to verify success.  There is no in-band way (through the pipe) to
  indicate failure, so we need to check explicitly.
Once the transfer pipe has been established via SET_DEVICE_STATE_FD
(which includes establishing the direction of transfer and migration
phase), the sending side writes its data into the pipe, and the reading
side reads it until it sees an EOF.  Then, the front-end will check for
success via CHECK_DEVICE_STATE, which on the destination side includes
checking for integrity (i.e. errors during deserialization).
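To make the EOF-terminated transfer concrete, here is a minimal sketch
that pushes a small state blob through a plain POSIX pipe within a
single process.  It only illustrates the "write, close, read until EOF"
pattern; there is no vhost-user messaging, no FD passing, and no
CHECK_DEVICE_STATE step, and the sketch_* names are hypothetical.  It
also assumes the state is small enough to fit into the pipe buffer, as
otherwise a single-process writer would block.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Read from @fd until EOF or until @buf is full.
 * Returns the number of bytes read, or -errno on failure.
 */
static ssize_t sketch_read_until_eof(int fd, void *buf, size_t cap)
{
    size_t total = 0;

    while (total < cap) {
        ssize_t n = read(fd, (char *)buf + total, cap - total);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -errno;
        }
        if (n == 0) {
            break; /* EOF: the sender has closed its end */
        }
        total += (size_t)n;
    }

    return (ssize_t)total;
}

/*
 * Send @len bytes of @state through a pipe and read them back into
 * @out.  Closing the write end is what signals end-of-transfer.
 */
static ssize_t sketch_transfer(const void *state, size_t len,
                               void *out, size_t cap)
{
    int fds[2];
    ssize_t written, received;
    int write_err;

    if (pipe(fds) < 0) {
        return -errno;
    }

    /* Sending side: write the state, then close to signal EOF */
    written = write(fds[1], state, len);
    write_err = written < 0 ? errno : 0;
    close(fds[1]);
    if (written < 0 || (size_t)written != len) {
        close(fds[0]);
        return write_err ? -write_err : -EIO;
    }

    /* Receiving side: read until EOF, then close our end, too */
    received = sketch_read_until_eof(fds[0], out, cap);
    close(fds[0]);
    return received;
}
```

Note that EOF alone cannot distinguish a complete state from a sender
that gave up halfway, which is the reason the protocol adds an explicit
CHECK_DEVICE_STATE call after the pipe has been drained.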
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost-backend.h |  24 +++++
 include/hw/virtio/vhost.h         |  79 ++++++++++++++++
 hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 |  37 ++++++++
 4 files changed, 287 insertions(+)
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index ec3fbae58d..5935b32fe3 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
 } VhostSetConfigType;
 
+typedef enum VhostDeviceStateDirection {
+    /* Transfer state from back-end (device) to front-end */
+    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
+    /* Transfer state from front-end to back-end (device) */
+    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
+} VhostDeviceStateDirection;
+
+typedef enum VhostDeviceStatePhase {
+    /* The device (and all its vrings) is stopped */
+    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
+} VhostDeviceStatePhase;
+
 struct vhost_inflight;
 struct vhost_dev;
 struct vhost_log;
@@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
 
 typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
 
+typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
+typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
+                                            VhostDeviceStateDirection direction,
+                                            VhostDeviceStatePhase phase,
+                                            int fd,
+                                            int *reply_fd,
+                                            Error **errp);
+typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -181,6 +202,9 @@ typedef struct VhostOps {
     vhost_force_iommu_op vhost_force_iommu;
     vhost_set_config_call_op vhost_set_config_call;
     vhost_reset_status_op vhost_reset_status;
+    vhost_supports_migratory_state_op vhost_supports_migratory_state;
+    vhost_set_device_state_fd_op vhost_set_device_state_fd;
+    vhost_check_device_state_op vhost_check_device_state;
 } VhostOps;
 
 int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 2fe02ed5d4..29449e0fe2 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
                            struct vhost_inflight *inflight);
 int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
                            struct vhost_inflight *inflight);
+
+/**
+ * vhost_supports_migratory_state(): Checks whether the back-end
+ * supports transferring internal state for the purpose of migration.
+ * Support for this feature is required for vhost_set_device_state_fd()
+ * and vhost_check_device_state().
+ *
+ * @dev: The vhost device
+ *
+ * Returns true if the device supports these commands, and false if it
+ * does not.
+ */
+bool vhost_supports_migratory_state(struct vhost_dev *dev);
+
+/**
+ * vhost_set_device_state_fd(): Begin transfer of internal state from/to
+ * the back-end for the purpose of migration.  Data is to be transferred
+ * over a pipe according to @direction and @phase.  The sending end must
+ * only write to the pipe, and the receiving end must only read from it.
+ * Once the sending end is done, it closes its FD.  The receiving end
+ * must take this as the end-of-transfer signal and close its FD, too.
+ *
+ * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
+ * read FD for LOAD.  This function transfers ownership of @fd to the
+ * back-end, i.e. closes it in the front-end.
+ *
+ * The back-end may optionally reply with an FD of its own, if this
+ * improves efficiency on its end.  In this case, the returned FD is
+ * stored in *reply_fd.  The back-end will discard the FD sent to it,
+ * and the front-end must use *reply_fd for transferring state to/from
+ * the back-end.
+ *
+ * @dev: The vhost device
+ * @direction: The direction in which the state is to be transferred.
+ *             For outgoing migrations, this is SAVE, and data is read
+ *             from the back-end and stored by the front-end in the
+ *             migration stream.
+ *             For incoming migrations, this is LOAD, and data is read
+ *             by the front-end from the migration stream and sent to
+ *             the back-end to restore the saved state.
+ * @phase: Which migration phase we are in.  Currently, there is only
+ *         STOPPED (device and all vrings are stopped), in the future,
+ *         more phases such as PRE_COPY or POST_COPY may be added.
+ * @fd: Back-end's end of the pipe through which to transfer state; note
+ *      that ownership is transferred to the back-end, so this function
+ *      closes @fd in the front-end.
+ * @reply_fd: If the back-end wishes to use a different pipe for state
+ *            transfer, this will contain an FD for the front-end to
+ *            use.  Otherwise, -1 is stored here.
+ * @errp: Potential error description
+ *
+ * Returns 0 on success, and -errno on failure.
+ */
+int vhost_set_device_state_fd(struct vhost_dev *dev,
+                              VhostDeviceStateDirection direction,
+                              VhostDeviceStatePhase phase,
+                              int fd,
+                              int *reply_fd,
+                              Error **errp);
+
+/**
+ * vhost_check_device_state(): After transferring state from/to the
+ * back-end via vhost_set_device_state_fd(), i.e. once the sending end
+ * has closed the pipe, ask the back-end to report any potential
+ * errors that have occurred on its side.  This allows sensing errors
+ * like:
+ * - During outgoing migration, when the source side had already started
+ *   to produce its state, something went wrong and it failed to finish
+ * - During incoming migration, when the received state is somehow
+ *   invalid and cannot be processed by the back-end
+ *
+ * @dev: The vhost device
+ * @errp: Potential error description
+ *
+ * Returns 0 when the back-end reports successful state transfer and
+ * processing, and -errno when an error occurred somewhere.
+ */
+int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
+
 #endif
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index e5285df4ba..93d8f2494a 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -83,6 +83,7 @@ enum VhostUserProtocolFeature {
     /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
     VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
     VHOST_USER_PROTOCOL_F_STATUS = 16,
+    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -130,6 +131,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_REM_MEM_REG = 38,
     VHOST_USER_SET_STATUS = 39,
     VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_SET_DEVICE_STATE_FD = 41,
+    VHOST_USER_CHECK_DEVICE_STATE = 42,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -210,6 +213,12 @@ typedef struct {
     uint32_t size; /* the following payload size */
 } QEMU_PACKED VhostUserHeader;
 
+/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
+typedef struct VhostUserTransferDeviceState {
+    uint32_t direction;
+    uint32_t phase;
+} VhostUserTransferDeviceState;
+
 typedef union {
 #define VHOST_USER_VRING_IDX_MASK   (0xff)
 #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
@@ -224,6 +233,7 @@ typedef union {
         VhostUserCryptoSession session;
         VhostUserVringArea area;
         VhostUserInflight inflight;
+        VhostUserTransferDeviceState transfer_state;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
     }
 }
 
+static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
+{
+    return virtio_has_feature(dev->protocol_features,
+                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
+}
+
+static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
+                                          VhostDeviceStateDirection direction,
+                                          VhostDeviceStatePhase phase,
+                                          int fd,
+                                          int *reply_fd,
+                                          Error **errp)
+{
+    int ret;
+    struct vhost_user *vu = dev->opaque;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_SET_DEVICE_STATE_FD,
+            .flags = VHOST_USER_VERSION,
+            .size = sizeof(msg.payload.transfer_state),
+        },
+        .payload.transfer_state = {
+            .direction = direction,
+            .phase = phase,
+        },
+    };
+
+    *reply_fd = -1;
+
+    if (!vhost_user_supports_migratory_state(dev)) {
+        close(fd);
+        error_setg(errp, "Back-end does not support migration state transfer");
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, &fd, 1);
+    close(fd);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to send SET_DEVICE_STATE_FD message");
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to receive SET_DEVICE_STATE_FD reply");
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
+        error_setg(errp,
+                   "Received unexpected message type, expected %d, received %d",
+                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp,
+                   "Received bad message size, expected %zu, received %" PRIu32,
+                   sizeof(msg.payload.u64), msg.hdr.size);
+        return -EPROTO;
+    }
+
+    if ((msg.payload.u64 & 0xff) != 0) {
+        error_setg(errp, "Back-end did not accept migration state transfer");
+        return -EIO;
+    }
+
+    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
+        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
+        if (*reply_fd < 0) {
+            error_setg(errp,
+                       "Failed to get back-end-provided transfer pipe FD");
+            *reply_fd = -1;
+            return -EIO;
+        }
+    }
+
+    return 0;
+}
+
+static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
+{
+    int ret;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_CHECK_DEVICE_STATE,
+            .flags = VHOST_USER_VERSION,
+            .size = 0,
+        },
+    };
+
+    if (!vhost_user_supports_migratory_state(dev)) {
+        error_setg(errp, "Back-end does not support migration state transfer");
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, NULL, 0);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to send CHECK_DEVICE_STATE message");
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to receive CHECK_DEVICE_STATE reply");
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
+        error_setg(errp,
+                   "Received unexpected message type, expected %d, received %d",
+                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp,
+                   "Received bad message size, expected %zu, received %" PRIu32,
+                   sizeof(msg.payload.u64), msg.hdr.size);
+        return -EPROTO;
+    }
+
+    if (msg.payload.u64 != 0) {
+        error_setg(errp, "Back-end failed to process its internal state");
+        return -EIO;
+    }
+
+    return 0;
+}
+
 const VhostOps user_ops = {
         .backend_type = VHOST_BACKEND_TYPE_USER,
         .vhost_backend_init = vhost_user_backend_init,
@@ -2716,4 +2860,7 @@ const VhostOps user_ops = {
         .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
         .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
         .vhost_dev_start = vhost_user_dev_start,
+        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
+        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
+        .vhost_check_device_state = vhost_user_check_device_state,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index cbff589efa..90099d8f6a 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 
     return -ENOSYS;
 }
+
+bool vhost_supports_migratory_state(struct vhost_dev *dev)
+{
+    if (dev->vhost_ops->vhost_supports_migratory_state) {
+        return dev->vhost_ops->vhost_supports_migratory_state(dev);
+    }
+
+    return false;
+}
+
+int vhost_set_device_state_fd(struct vhost_dev *dev,
+                              VhostDeviceStateDirection direction,
+                              VhostDeviceStatePhase phase,
+                              int fd,
+                              int *reply_fd,
+                              Error **errp)
+{
+    if (dev->vhost_ops->vhost_set_device_state_fd) {
+        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
+                                                         fd, reply_fd, errp);
+    }
+
+    error_setg(errp,
+               "vhost transport does not support migration state transfer");
+    return -ENOSYS;
+}
+
+int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
+{
+    if (dev->vhost_ops->vhost_check_device_state) {
+        return dev->vhost_ops->vhost_check_device_state(dev, errp);
+    }
+
+    error_setg(errp,
+               "vhost transport does not support migration state transfer");
+    return -ENOSYS;
+}
-- 
2.39.1
* [PATCH 3/4] vhost: Add high-level state save/load functions
  2023-04-11 15:05 [PATCH 0/4] vhost-user-fs: Internal migration Hanna Czenczek
  2023-04-11 15:05 ` [PATCH 1/4] vhost: Re-enable vrings after setting features Hanna Czenczek
  2023-04-11 15:05 ` [PATCH 2/4] vhost-user: Interface for migration state transfer Hanna Czenczek
@ 2023-04-11 15:05 ` Hanna Czenczek
  2023-04-12 21:14   ` Stefan Hajnoczi
  2023-04-11 15:05 ` [PATCH 4/4] vhost-user-fs: Implement internal migration Hanna Czenczek
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-11 15:05 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Stefan Hajnoczi, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
vhost_save_backend_state() and vhost_load_backend_state() can be used by
vhost front-ends to easily save and load the back-end's state to/from
the migration stream.
Because we do not know the full state size ahead of time,
vhost_save_backend_state() simply reads the data in 1 MB chunks, and
writes each chunk consecutively into the migration stream, prefixed by
its length.  EOF is indicated by a 0-length chunk.
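The framing described above (length-prefixed chunks, terminated by a
0-length chunk) can be sketched without any qemu infrastructure.  The
example below writes into and reads from a plain memory buffer instead
of a `QEMUFile`, and the chunk size is a parameter rather than the
fixed 1 MB; all sketch_* names are hypothetical and only meant to show
the on-stream layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store @v at @p in big-endian byte order */
static void sketch_put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a big-endian 32-bit value from @p */
static uint32_t sketch_get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Frame @len bytes of @src into @dst as [be32 length][data] chunks of
 * at most @chunk bytes, followed by a 0-length terminator.
 * Returns the number of bytes written, or 0 if the arguments are
 * invalid or @dst is too small.
 */
static size_t sketch_frame(const uint8_t *src, size_t len, size_t chunk,
                           uint8_t *dst, size_t cap)
{
    size_t off = 0;

    if (chunk == 0 || chunk > UINT32_MAX) {
        return 0;
    }

    while (len > 0) {
        size_t n = len < chunk ? len : chunk;

        if (cap - off < 4 || cap - off - 4 < n) {
            return 0;
        }
        sketch_put_be32(dst + off, (uint32_t)n);
        memcpy(dst + off + 4, src, n);
        off += 4 + n;
        src += n;
        len -= n;
    }

    if (cap - off < 4) {
        return 0;
    }
    sketch_put_be32(dst + off, 0); /* end-of-state marker */
    return off + 4;
}

/*
 * Reverse of sketch_frame(): concatenate all chunks into @dst.
 * Returns the payload size, or (size_t)-1 on malformed input or if
 * @dst is too small.
 */
static size_t sketch_unframe(const uint8_t *src, size_t len,
                             uint8_t *dst, size_t cap)
{
    size_t in = 0, out = 0;

    for (;;) {
        uint32_t n;

        if (len - in < 4) {
            return (size_t)-1; /* truncated before the terminator */
        }
        n = sketch_get_be32(src + in);
        in += 4;
        if (n == 0) {
            return out;
        }
        if (len - in < n || cap - out < n) {
            return (size_t)-1;
        }
        memcpy(dst + out, src + in, n);
        in += n;
        out += n;
    }
}
```

Because every chunk carries its own length, the writer never needs to
know the total state size in advance, and the reader can detect a
stream that ends before the 0-length terminator.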
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost.h |  35 +++++++
 hw/virtio/vhost.c         | 196 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 231 insertions(+)
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 29449e0fe2..d1f1e9e1f3 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -425,4 +425,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev,
  */
 int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
 
+/**
+ * vhost_save_backend_state(): High-level function to receive a vhost
+ * back-end's state, and save it in `f`.  Uses
+ * `vhost_set_device_state_fd()` to get the data from the back-end, and
+ * stores it in consecutive chunks that are each prefixed by their
+ * respective length (be32).  The end is marked by a 0-length chunk.
+ *
+ * Must only be called while the device and all its vrings are stopped
+ * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
+ *
+ * @dev: The vhost device from which to save the state
+ * @f: Migration stream in which to save the state
+ * @errp: Potential error message
+ *
+ * Returns 0 on success, and -errno otherwise.
+ */
+int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
+
+/**
+ * vhost_load_backend_state(): High-level function to load a vhost
+ * back-end's state from `f`, and send it over to the back-end.  Reads
+ * the data from `f` in the format used by `vhost_save_backend_state()`, and
+ * uses `vhost_set_device_state_fd()` to transfer it to the back-end.
+ *
+ * Must only be called while the device and all its vrings are stopped
+ * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
+ *
+ * @dev: The vhost device to which to send the state
+ * @f: Migration stream from which to load the state
+ * @errp: Potential error message
+ *
+ * Returns 0 on success, and -errno otherwise.
+ */
+int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
+
 #endif
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 90099d8f6a..d08849c691 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2125,3 +2125,199 @@ int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
                "vhost transport does not support migration state transfer");
     return -ENOSYS;
 }
+
+int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
+{
+    /* Maximum chunk size in which to transfer the state */
+    const size_t chunk_size = 1 * 1024 * 1024;
+    void *transfer_buf = NULL;
+    g_autoptr(GError) g_err = NULL;
+    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
+    int ret;
+
+    /* [0] for reading (our end), [1] for writing (back-end's end) */
+    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
+        error_setg(errp, "Failed to set up state transfer pipe: %s",
+                   g_err->message);
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    read_fd = pipe_fds[0];
+    write_fd = pipe_fds[1];
+
+    /* VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped */
+    assert(!dev->started && !dev->enable_vqs);
+
+    /* Transfer ownership of write_fd to the back-end */
+    ret = vhost_set_device_state_fd(dev,
+                                    VHOST_TRANSFER_STATE_DIRECTION_SAVE,
+                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
+                                    write_fd,
+                                    &reply_fd,
+                                    errp);
+    if (ret < 0) {
+        error_prepend(errp, "Failed to initiate state transfer: ");
+        goto fail;
+    }
+
+    /* If the back-end wishes to use a different pipe, switch over */
+    if (reply_fd >= 0) {
+        close(read_fd);
+        read_fd = reply_fd;
+    }
+
+    transfer_buf = g_malloc(chunk_size);
+
+    while (true) {
+        ssize_t read_ret;
+
+        read_ret = read(read_fd, transfer_buf, chunk_size);
+        if (read_ret < 0) {
+            ret = -errno;
+            error_setg_errno(errp, -ret, "Failed to receive state");
+            goto fail;
+        }
+
+        assert(read_ret <= chunk_size);
+        qemu_put_be32(f, read_ret);
+
+        if (read_ret == 0) {
+            /* EOF */
+            break;
+        }
+
+        qemu_put_buffer(f, transfer_buf, read_ret);
+    }
+
+    /*
+     * Back-end will not really care, but be clean and close our end of the pipe
+     * before inquiring the back-end about whether transfer was successful
+     */
+    close(read_fd);
+    read_fd = -1;
+
+    /* Also, verify that the device is still stopped */
+    assert(!dev->started && !dev->enable_vqs);
+
+    ret = vhost_check_device_state(dev, errp);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    ret = 0;
+fail:
+    g_free(transfer_buf);
+    if (read_fd >= 0) {
+        close(read_fd);
+    }
+
+    return ret;
+}
+
+int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
+{
+    size_t transfer_buf_size = 0;
+    void *transfer_buf = NULL;
+    g_autoptr(GError) g_err = NULL;
+    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
+    int ret;
+
+    /* [0] for reading (back-end's end), [1] for writing (our end) */
+    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
+        error_setg(errp, "Failed to set up state transfer pipe: %s",
+                   g_err->message);
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    read_fd = pipe_fds[0];
+    write_fd = pipe_fds[1];
+
+    /* VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped */
+    assert(!dev->started && !dev->enable_vqs);
+
+    /* Transfer ownership of read_fd to the back-end */
+    ret = vhost_set_device_state_fd(dev,
+                                    VHOST_TRANSFER_STATE_DIRECTION_LOAD,
+                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
+                                    read_fd,
+                                    &reply_fd,
+                                    errp);
+    if (ret < 0) {
+        error_prepend(errp, "Failed to initiate state transfer: ");
+        goto fail;
+    }
+
+    /* If the back-end wishes to use a different pipe, switch over */
+    if (reply_fd >= 0) {
+        close(write_fd);
+        write_fd = reply_fd;
+    }
+
+    while (true) {
+        size_t this_chunk_size = qemu_get_be32(f);
+        ssize_t write_ret;
+        const uint8_t *transfer_pointer;
+
+        if (this_chunk_size == 0) {
+            /* End of state */
+            break;
+        }
+
+        if (transfer_buf_size < this_chunk_size) {
+            transfer_buf = g_realloc(transfer_buf, this_chunk_size);
+            transfer_buf_size = this_chunk_size;
+        }
+
+        if (qemu_get_buffer(f, transfer_buf, this_chunk_size) <
+                this_chunk_size)
+        {
+            error_setg(errp, "Failed to read state");
+            ret = -EINVAL;
+            goto fail;
+        }
+
+        transfer_pointer = transfer_buf;
+        while (this_chunk_size > 0) {
+            write_ret = write(write_fd, transfer_pointer, this_chunk_size);
+            if (write_ret < 0) {
+                ret = -errno;
+                error_setg_errno(errp, -ret, "Failed to send state");
+                goto fail;
+            } else if (write_ret == 0) {
+                error_setg(errp, "Failed to send state: Connection is closed");
+                ret = -ECONNRESET;
+                goto fail;
+            }
+
+            assert(write_ret <= this_chunk_size);
+            this_chunk_size -= write_ret;
+            transfer_pointer += write_ret;
+        }
+    }
+
+    /*
+     * Close our end, thus ending transfer, before inquiring the back-end about
+     * whether transfer was successful
+     */
+    close(write_fd);
+    write_fd = -1;
+
+    /* Also, verify that the device is still stopped */
+    assert(!dev->started && !dev->enable_vqs);
+
+    ret = vhost_check_device_state(dev, errp);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    ret = 0;
+fail:
+    g_free(transfer_buf);
+    if (write_fd >= 0) {
+        close(write_fd);
+    }
+
+    return ret;
+}
-- 
2.39.1
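The framing that vhost_save_backend_state() and vhost_load_backend_state() put into the migration stream can be tried out in isolation. What follows is only a sketch under assumptions: `DemoStream` and the `demo_*` helpers are hypothetical stand-ins for `QEMUFile`, qemu_put_be32(), qemu_get_be32() and the buffer functions, and are not part of qemu. It shows the format itself: every chunk is a 32-bit big-endian length followed by that many bytes, and a zero-length chunk ends the state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a migration stream (not QEMUFile) */
typedef struct {
    uint8_t data[256];
    size_t len; /* number of bytes written so far */
    size_t pos; /* read cursor */
} DemoStream;

/* Append a 32-bit big-endian integer to the stream */
static void demo_put_be32(DemoStream *s, uint32_t v)
{
    assert(s->len + 4 <= sizeof(s->data));
    s->data[s->len++] = (uint8_t)(v >> 24);
    s->data[s->len++] = (uint8_t)(v >> 16);
    s->data[s->len++] = (uint8_t)(v >> 8);
    s->data[s->len++] = (uint8_t)v;
}

/* Read a 32-bit big-endian integer; asserts if the header is truncated */
static uint32_t demo_get_be32(DemoStream *s)
{
    uint32_t v;

    assert(s->pos + 4 <= s->len);
    v = ((uint32_t)s->data[s->pos] << 24) |
        ((uint32_t)s->data[s->pos + 1] << 16) |
        ((uint32_t)s->data[s->pos + 2] << 8) |
        (uint32_t)s->data[s->pos + 3];
    s->pos += 4;
    return v;
}

/*
 * Save side: split the state into chunks of at most chunk_size bytes,
 * prefix each with its length, and finish with a zero-length chunk.
 */
static void demo_save(DemoStream *s, const uint8_t *state, size_t size,
                      size_t chunk_size)
{
    assert(chunk_size > 0);
    while (size > 0) {
        size_t n = size < chunk_size ? size : chunk_size;

        demo_put_be32(s, (uint32_t)n);
        assert(s->len + n <= sizeof(s->data));
        memcpy(s->data + s->len, state, n);
        s->len += n;
        state += n;
        size -= n;
    }
    demo_put_be32(s, 0); /* end-of-state marker */
}

/*
 * Load side: read chunks until the zero-length terminator.  Returns the
 * total number of bytes restored, or (size_t)-1 if a chunk is truncated
 * or does not fit into out.
 */
static size_t demo_load(DemoStream *s, uint8_t *out, size_t out_size)
{
    size_t total = 0;

    for (;;) {
        uint32_t n = demo_get_be32(s);

        if (n == 0) {
            return total;
        }
        if (n > s->len - s->pos || n > out_size - total) {
            return (size_t)-1;
        }
        memcpy(out + total, s->data + s->pos, n);
        s->pos += n;
        total += n;
    }
}
```

Saving 16 bytes with a chunk size of 5 gives chunks of 5, 5, 5 and 1 bytes plus the terminator, i.e. 36 bytes in the stream. The length prefix is what lets the load side size its transfer buffer per chunk without knowing the total state size in advance.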
^ permalink raw reply related	[flat|nested] 93+ messages in thread
* [PATCH 4/4] vhost-user-fs: Implement internal migration
  2023-04-11 15:05 [PATCH 0/4] vhost-user-fs: Internal migration Hanna Czenczek
                   ` (2 preceding siblings ...)
  2023-04-11 15:05 ` [PATCH 3/4] vhost: Add high-level state save/load functions Hanna Czenczek
@ 2023-04-11 15:05 ` Hanna Czenczek
  2023-04-12 21:00 ` [PATCH 0/4] vhost-user-fs: Internal migration Stefan Hajnoczi
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-11 15:05 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Hanna Czenczek, Stefan Hajnoczi, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
A virtio-fs device's VM state consists of:
- the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE)
- the back-end's (virtiofsd's) internal state
We get/set the latter via the new vhost operations to transfer migratory
state.  It is its own dedicated subsection, so that for external
migration, it can be disabled.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 hw/virtio/vhost-user-fs.c | 101 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 1 deletion(-)
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 83fc20e49e..f9c19f7a3d 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -298,9 +298,108 @@ static struct vhost_dev *vuf_get_vhost(VirtIODevice *vdev)
     return &fs->vhost_dev;
 }
 
+/**
+ * Fetch the internal state from virtiofsd and save it to `f`.
+ */
+static int vuf_save_state(QEMUFile *f, void *pv, size_t size,
+                          const VMStateField *field, JSONWriter *vmdesc)
+{
+    VirtIODevice *vdev = pv;
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+    Error *local_error = NULL;
+    int ret;
+
+    ret = vhost_save_backend_state(&fs->vhost_dev, f, &local_error);
+    if (ret < 0) {
+        error_reportf_err(local_error,
+                          "Error saving back-end state of %s device %s "
+                          "(tag: \"%s\"): ",
+                          vdev->name, vdev->parent_obj.canonical_path,
+                          fs->conf.tag ?: "<none>");
+        return ret;
+    }
+
+    return 0;
+}
+
+/**
+ * Load virtiofsd's internal state from `f` and send it over to virtiofsd.
+ */
+static int vuf_load_state(QEMUFile *f, void *pv, size_t size,
+                          const VMStateField *field)
+{
+    VirtIODevice *vdev = pv;
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+    Error *local_error = NULL;
+    int ret;
+
+    ret = vhost_load_backend_state(&fs->vhost_dev, f, &local_error);
+    if (ret < 0) {
+        error_reportf_err(local_error,
+                          "Error loading back-end state of %s device %s "
+                          "(tag: \"%s\"): ",
+                          vdev->name, vdev->parent_obj.canonical_path,
+                          fs->conf.tag ?: "<none>");
+        return ret;
+    }
+
+    return 0;
+}
+
+static bool vuf_is_internal_migration(void *opaque)
+{
+    /* TODO: Return false when an external migration is requested */
+    return true;
+}
+
+static int vuf_check_migration_support(void *opaque)
+{
+    VirtIODevice *vdev = opaque;
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+
+    if (!vhost_supports_migratory_state(&fs->vhost_dev)) {
+        error_report("Back-end of %s device %s (tag: \"%s\") does not support "
+                     "migration through qemu",
+                     vdev->name, vdev->parent_obj.canonical_path,
+                     fs->conf.tag ?: "<none>");
+        return -ENOTSUP;
+    }
+
+    return 0;
+}
+
+static const VMStateDescription vuf_backend_vmstate;
+
 static const VMStateDescription vuf_vmstate = {
     .name = "vhost-user-fs",
-    .unmigratable = 1,
+    .version_id = 0,
+    .fields = (VMStateField[]) {
+        VMSTATE_VIRTIO_DEVICE,
+        VMSTATE_END_OF_LIST()
+    },
+    .subsections = (const VMStateDescription * []) {
+        &vuf_backend_vmstate,
+        NULL,
+    }
+};
+
+static const VMStateDescription vuf_backend_vmstate = {
+    .name = "vhost-user-fs-backend",
+    .version_id = 0,
+    .needed = vuf_is_internal_migration,
+    .pre_load = vuf_check_migration_support,
+    .pre_save = vuf_check_migration_support,
+    .fields = (VMStateField[]) {
+        {
+            .name = "back-end",
+            .info = &(const VMStateInfo) {
+                .name = "virtio-fs back-end state",
+                .get = vuf_load_state,
+                .put = vuf_save_state,
+            },
+        },
+        VMSTATE_END_OF_LIST()
+    },
 };
 
 static Property vuf_properties[] = {
-- 
2.39.1
^ permalink raw reply related	[flat|nested] 93+ messages in thread
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-11 15:05 ` [PATCH 1/4] vhost: Re-enable vrings after setting features Hanna Czenczek
@ 2023-04-12 10:55   ` German Maglione
  2023-04-12 12:18     ` Hanna Czenczek
  2023-04-12 20:51   ` Stefan Hajnoczi
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 93+ messages in thread
From: German Maglione @ 2023-04-12 10:55 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
On Tue, Apr 11, 2023 at 5:05 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
> setting the vhost features will set this feature, too.  Doing so
> disables all vrings, which may not be intended.
>
> For example, enabling or disabling logging during migration requires
> setting those features (to set or unset VHOST_F_LOG_ALL), which will
> automatically disable all vrings.  In either case, the VM is running
> (disabling logging is done after a failed or cancelled migration, and
> only once the VM is running again, see comment in
> memory_global_dirty_log_stop()), so the vrings should really be enabled.
> As a result, the back-end seems to hang.
>
> To fix this, we must remember whether the vrings are supposed to be
> enabled, and, if so, re-enable them after a SET_FEATURES call that set
> VHOST_USER_F_PROTOCOL_FEATURES.
>
> It seems less than ideal that there is a short period in which the VM is
> running but the vrings will be stopped (between SET_FEATURES and
> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
> e.g. by introducing a new flag or vhost-user protocol feature to disable
> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
> new functions for setting/clearing singular feature bits (so that
> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
>
Could it be the other way around? I mean, before commit
02b61f38d3574900fb4cc4c450b17c75956a6a04 (so until v7.2rc0) we didn't
have this problem with VHOST_USER_F_PROTOCOL_FEATURES, so "it seems"
it's fine to start with the vrings enabled. Would it be possible to go
back to that behavior (reflecting that in the spec) and add a new flag
to start with vrings disabled?
> Even with such a potential addition to the protocol, we still need this
> fix here, because we cannot expect that back-ends will implement this
> addition.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost.h | 10 ++++++++++
>  hw/virtio/vhost.c         | 13 +++++++++++++
>  2 files changed, 23 insertions(+)
>
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a52f273347..2fe02ed5d4 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -90,6 +90,16 @@ struct vhost_dev {
>      int vq_index_end;
>      /* if non-zero, minimum required value for max_queues */
>      int num_queues;
> +
> +    /*
> +     * Whether the virtqueues are supposed to be enabled (via
> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
> +     * enabling/disabling logging) will disable all virtqueues if
> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
> +     * re-enable them if this field is set.
> +     */
> +    bool enable_vqs;
> +
>      /**
>       * vhost feature handling requires matching the feature set
>       * offered by a backend which may be a subset of the total
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index a266396576..cbff589efa 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
>  static QLIST_HEAD(, vhost_dev) vhost_devices =
>      QLIST_HEAD_INITIALIZER(vhost_devices);
>
> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
> +
>  bool vhost_has_free_slot(void)
>  {
>      unsigned int slots_limit = ~0U;
> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>          }
>      }
>
> +    if (dev->enable_vqs) {
> +        /*
> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
> +         * virtqueues, even if that was not intended; re-enable them if
> +         * necessary.
> +         */
> +        vhost_dev_set_vring_enable(dev, true);
> +    }
> +
>  out:
>      return r;
>  }
> @@ -1896,6 +1907,8 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>
>  static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
>  {
> +    hdev->enable_vqs = enable;
> +
>      if (!hdev->vhost_ops->vhost_set_vring_enable) {
>          return 0;
>      }
> --
> 2.39.1
>
-- 
German
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-12 10:55   ` German Maglione
@ 2023-04-12 12:18     ` Hanna Czenczek
  0 siblings, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-12 12:18 UTC (permalink / raw)
  To: German Maglione
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
On 12.04.23 12:55, German Maglione wrote:
> On Tue, Apr 11, 2023 at 5:05 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
>> setting the vhost features will set this feature, too.  Doing so
>> disables all vrings, which may not be intended.
>>
>> For example, enabling or disabling logging during migration requires
>> setting those features (to set or unset VHOST_F_LOG_ALL), which will
>> automatically disable all vrings.  In either case, the VM is running
>> (disabling logging is done after a failed or cancelled migration, and
>> only once the VM is running again, see comment in
>> memory_global_dirty_log_stop()), so the vrings should really be enabled.
>> As a result, the back-end seems to hang.
>>
>> To fix this, we must remember whether the vrings are supposed to be
>> enabled, and, if so, re-enable them after a SET_FEATURES call that set
>> VHOST_USER_F_PROTOCOL_FEATURES.
>>
>> It seems less than ideal that there is a short period in which the VM is
>> running but the vrings will be stopped (between SET_FEATURES and
>> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
>> e.g. by introducing a new flag or vhost-user protocol feature to disable
>> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
>> new functions for setting/clearing singular feature bits (so that
>> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
>>
> Could it be the other way around? I mean, before commit
> 02b61f38d3574900fb4cc4c450b17c75956a6a04 (so until v7.2rc0) we didn't
> have this problem with VHOST_USER_F_PROTOCOL_FEATURES, so "it seems"
> it's fine to start with the vrings enabled. Would it be possible to go
> back to that behavior (reflecting that in the spec) and add a new flag
> to start with vrings disabled?
I’m not a fan of retroactively changing a public specification in an 
incompatible manner.  Also, “seems fine” isn’t enough of an argument to 
do so. :)  I’m not sure whether finding out if it’s actually fine is 
easy.  But in general, I try to abstain from retroactive spec changes...
I see the problem of qemu apparently not really caring for the specified 
meaning of the flag, indicating that this specified behavior is not 
optimal.  But the ideal way to fix this, to me, seems to be to add new
flags to change the meaning to something more broadly useful.
But I’m not convinced either way.
Hanna
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-11 15:05 ` [PATCH 1/4] vhost: Re-enable vrings after setting features Hanna Czenczek
  2023-04-12 10:55   ` German Maglione
@ 2023-04-12 20:51   ` Stefan Hajnoczi
  2023-04-13  7:17     ` Maxime Coquelin
  2023-04-13  8:19     ` Hanna Czenczek
  2023-04-13 11:03   ` Stefan Hajnoczi
  2023-04-13 13:19   ` Michael S. Tsirkin
  3 siblings, 2 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-12 20:51 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella,
	Roman Kagan, Maxime Coquelin, Marc-André Lureau
On Tue, Apr 11, 2023 at 05:05:12PM +0200, Hanna Czenczek wrote:
> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
> setting the vhost features will set this feature, too.  Doing so
> disables all vrings, which may not be intended.
> 
> For example, enabling or disabling logging during migration requires
> setting those features (to set or unset VHOST_F_LOG_ALL), which will
> automatically disable all vrings.  In either case, the VM is running
> (disabling logging is done after a failed or cancelled migration, and
> only once the VM is running again, see comment in
> memory_global_dirty_log_stop()), so the vrings should really be enabled.
> As a result, the back-end seems to hang.
> 
> To fix this, we must remember whether the vrings are supposed to be
> enabled, and, if so, re-enable them after a SET_FEATURES call that set
> VHOST_USER_F_PROTOCOL_FEATURES.
> 
> It seems less than ideal that there is a short period in which the VM is
> running but the vrings will be stopped (between SET_FEATURES and
> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
> e.g. by introducing a new flag or vhost-user protocol feature to disable
> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
> new functions for setting/clearing singular feature bits (so that
> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
> 
> Even with such a potential addition to the protocol, we still need this
> fix here, because we cannot expect that back-ends will implement this
> addition.
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost.h | 10 ++++++++++
>  hw/virtio/vhost.c         | 13 +++++++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a52f273347..2fe02ed5d4 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -90,6 +90,16 @@ struct vhost_dev {
>      int vq_index_end;
>      /* if non-zero, minimum required value for max_queues */
>      int num_queues;
> +
> +    /*
> +     * Whether the virtqueues are supposed to be enabled (via
> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
> +     * enabling/disabling logging) will disable all virtqueues if
> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
> +     * re-enable them if this field is set.
> +     */
> +    bool enable_vqs;
> +
>      /**
>       * vhost feature handling requires matching the feature set
>       * offered by a backend which may be a subset of the total
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index a266396576..cbff589efa 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
>  static QLIST_HEAD(, vhost_dev) vhost_devices =
>      QLIST_HEAD_INITIALIZER(vhost_devices);
>  
> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
> +
>  bool vhost_has_free_slot(void)
>  {
>      unsigned int slots_limit = ~0U;
> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>          }
>      }
>  
> +    if (dev->enable_vqs) {
> +        /*
> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
> +         * virtqueues, even if that was not intended; re-enable them if
> +         * necessary.
> +         */
> +        vhost_dev_set_vring_enable(dev, true);
> +    }
> +
>  out:
>      return r;
>  }
> @@ -1896,6 +1907,8 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>  
>  static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
>  {
> +    hdev->enable_vqs = enable;
> +
>      if (!hdev->vhost_ops->vhost_set_vring_enable) {
>          return 0;
>      }
The vhost-user spec doesn't say that VHOST_F_LOG_ALL needs to be toggled
at runtime and I don't think VHOST_USER_SET_PROTOCOL_FEATURES is
intended to be used like that. This issue shows why doing so is a bad
idea.
VHOST_F_LOG_ALL does not need to be toggled to control logging. Logging
is controlled at runtime by the presence of the dirty log
(VHOST_USER_SET_LOG_BASE) and the per-vring logging flag
(VHOST_VRING_F_LOG).
I suggest permanently enabling VHOST_F_LOG_ALL upon connection when the
backend supports it. No spec changes are required.
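As an illustration of the suggestion above (a sketch only, not a tested change to qemu): the feature set would be computed once at connection time, with VHOST_F_LOG_ALL included whenever the back-end offers it, so that no later SET_FEATURES call is needed just to toggle logging. `demo_features_at_connect()` is a hypothetical helper invented for this example; only the two bit numbers are taken from the vhost and vhost-user definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bit numbers as defined for vhost and vhost-user */
#define VHOST_F_LOG_ALL                 26
#define VHOST_USER_F_PROTOCOL_FEATURES  30

/*
 * Hypothetical helper: compute the features to acknowledge once, upon
 * connection.  `wanted` is what the front-end would ask for anyway;
 * VHOST_F_LOG_ALL is added on top whenever the back-end offers it, so
 * it never has to be set or cleared later at runtime.
 */
static uint64_t demo_features_at_connect(uint64_t backend_features,
                                         uint64_t wanted)
{
    const uint64_t log_all = 1ULL << VHOST_F_LOG_ALL;
    uint64_t acked = wanted & backend_features;

    if (backend_features & log_all) {
        acked |= log_all;
    }
    return acked;
}
```

With the bit permanently set, logging would then be switched on and off purely through the dirty log (VHOST_USER_SET_LOG_BASE) and the per-vring logging flag, as described above.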
libvhost-user looks like it will work. I didn't look at DPDK/SPDK, but
checking that it works there is important too.
I have CCed people who may be interested in this issue. This is the
first time I've looked at vhost-user logging, so this idea may not work.
Stefan
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-04-11 15:05 [PATCH 0/4] vhost-user-fs: Internal migration Hanna Czenczek
                   ` (3 preceding siblings ...)
  2023-04-11 15:05 ` [PATCH 4/4] vhost-user-fs: Implement internal migration Hanna Czenczek
@ 2023-04-12 21:00 ` Stefan Hajnoczi
  2023-04-13  8:20   ` Hanna Czenczek
  2023-04-13 16:11 ` Michael S. Tsirkin
  2023-05-04 16:05 ` Hanna Czenczek
  6 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-12 21:00 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
Hi,
Is there a vhost-user.rst spec patch?
Thanks,
Stefan
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-11 15:05 ` [PATCH 2/4] vhost-user: Interface for migration state transfer Hanna Czenczek
@ 2023-04-12 21:06   ` Stefan Hajnoczi
  2023-04-13  9:24     ` Hanna Czenczek
  2023-04-13 10:14     ` Eugenio Perez Martin
  2023-04-13  8:50   ` Eugenio Perez Martin
  1 sibling, 2 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-12 21:06 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella,
	Eugenio Pérez
On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> So-called "internal" virtio-fs migration refers to transporting the
> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> this, we need to be able to transfer virtiofsd's internal state to and
> from virtiofsd.
> 
> Because virtiofsd's internal state will not be too large, we believe it
> is best to transfer it as a single binary blob after the streaming
> phase.  Because this method should be useful to other vhost-user
> implementations, too, it is introduced as a general-purpose addition to
> the protocol, not limited to vhost-user-fs.
> 
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>   This feature signals support for transferring state, and is added so
>   that migration can fail early when the back-end has no support.
> 
> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>   over which to transfer the state.  The front-end sends an FD to the
>   back-end into/from which it can write/read its state, and the back-end
>   can decide to either use it, or reply with a different FD for the
>   front-end to override the front-end's choice.
>   The front-end creates a simple pipe to transfer the state, but maybe
>   the back-end already has an FD into/from which it has to write/read
>   its state, in which case it will want to override the simple pipe.
>   Conversely, maybe in the future we find a way to have the front-end
>   get an immediate FD for the migration stream (in some cases), in which
>   case we will want to send this to the back-end instead of creating a
>   pipe.
>   Hence the negotiation: If one side has a better idea than a plain
>   pipe, we will want to use that.
> 
> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   pipe (the end indicated by EOF), the front-end invokes this function
>   to verify success.  There is no in-band way (through the pipe) to
>   indicate failure, so we need to check explicitly.
> 
> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into the pipe, and the reading
> side reads it until it sees an EOF.  Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
> 
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  24 +++++
>  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  37 ++++++++
>  4 files changed, 287 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..5935b32fe3 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>  } VhostSetConfigType;
>  
> +typedef enum VhostDeviceStateDirection {
> +    /* Transfer state from back-end (device) to front-end */
> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> +    /* Transfer state from front-end to back-end (device) */
> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> +} VhostDeviceStateDirection;
> +
> +typedef enum VhostDeviceStatePhase {
> +    /* The device (and all its vrings) is stopped */
> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> +} VhostDeviceStatePhase;
vDPA has:
  /* Suspend a device so it does not process virtqueue requests anymore
   *
   * After the return of ioctl the device must preserve all the necessary state
   * (the virtqueue vring base plus the possible device specific states) that is
   * required for restoring in the future. The device must not change its
   * configuration after that point.
   */
  #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
  /* Resume a device so it can resume processing virtqueue requests
   *
   * After the return of this ioctl the device will have restored all the
   * necessary states and it is fully operational to continue processing the
   * virtqueue descriptors.
   */
  #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
I wonder if it makes sense to import these into vhost-user so that the
difference between kernel vhost and vhost-user is minimized. It's okay
if one of them is ahead of the other, but it would be nice to avoid
overlapping/duplicated functionality.
(And I hope vDPA will import the device state vhost-user messages
introduced in this series.)
> +
>  struct vhost_inflight;
>  struct vhost_dev;
>  struct vhost_log;
> @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>  
>  typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>  
> +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
> +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
> +                                            VhostDeviceStateDirection direction,
> +                                            VhostDeviceStatePhase phase,
> +                                            int fd,
> +                                            int *reply_fd,
> +                                            Error **errp);
> +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
> +
>  typedef struct VhostOps {
>      VhostBackendType backend_type;
>      vhost_backend_init vhost_backend_init;
> @@ -181,6 +202,9 @@ typedef struct VhostOps {
>      vhost_force_iommu_op vhost_force_iommu;
>      vhost_set_config_call_op vhost_set_config_call;
>      vhost_reset_status_op vhost_reset_status;
> +    vhost_supports_migratory_state_op vhost_supports_migratory_state;
> +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
> +    vhost_check_device_state_op vhost_check_device_state;
>  } VhostOps;
>  
>  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 2fe02ed5d4..29449e0fe2 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
>                             struct vhost_inflight *inflight);
>  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>                             struct vhost_inflight *inflight);
> +
> +/**
> + * vhost_supports_migratory_state(): Checks whether the back-end
> + * supports transferring internal state for the purpose of migration.
> + * Support for this feature is required for vhost_set_device_state_fd()
> + * and vhost_check_device_state().
> + *
> + * @dev: The vhost device
> + *
> + * Returns true if the device supports these commands, and false if it
> + * does not.
> + */
> +bool vhost_supports_migratory_state(struct vhost_dev *dev);
> +
> +/**
> + * vhost_set_device_state_fd(): Begin transfer of internal state from/to
> + * the back-end for the purpose of migration.  Data is to be transferred
> + * over a pipe according to @direction and @phase.  The sending end must
> + * only write to the pipe, and the receiving end must only read from it.
> + * Once the sending end is done, it closes its FD.  The receiving end
> + * must take this as the end-of-transfer signal and close its FD, too.
> + *
> + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
> + * read FD for LOAD.  This function transfers ownership of @fd to the
> + * back-end, i.e. closes it in the front-end.
> + *
> + * The back-end may optionally reply with an FD of its own, if this
> + * improves efficiency on its end.  In this case, the returned FD is
> + * stored in *reply_fd.  The back-end will discard the FD sent to it,
> + * and the front-end must use *reply_fd for transferring state to/from
> + * the back-end.
> + *
> + * @dev: The vhost device
> + * @direction: The direction in which the state is to be transferred.
> + *             For outgoing migrations, this is SAVE, and data is read
> + *             from the back-end and stored by the front-end in the
> + *             migration stream.
> + *             For incoming migrations, this is LOAD, and data is read
> + *             by the front-end from the migration stream and sent to
> + *             the back-end to restore the saved state.
> + * @phase: Which migration phase we are in.  Currently, there is only
> + *         STOPPED (device and all vrings are stopped), in the future,
> + *         more phases such as PRE_COPY or POST_COPY may be added.
> + * @fd: Back-end's end of the pipe through which to transfer state; note
> + *      that ownership is transferred to the back-end, so this function
> + *      closes @fd in the front-end.
> + * @reply_fd: If the back-end wishes to use a different pipe for state
> + *            transfer, this will contain an FD for the front-end to
> + *            use.  Otherwise, -1 is stored here.
> + * @errp: Potential error description
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp);
> +
> +/**
> + * vhost_check_device_state(): After transferring state from/to the
> + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
> + * has closed the pipe, ask the back-end to report any potential
> + * errors that have occurred on its side.  This allows detecting
> + * errors like:
> + * - During outgoing migration, when the source side had already started
> + *   to produce its state, something went wrong and it failed to finish
> + * - During incoming migration, when the received state is somehow
> + *   invalid and cannot be processed by the back-end
> + *
> + * @dev: The vhost device
> + * @errp: Potential error description
> + *
> + * Returns 0 when the back-end reports successful state transfer and
> + * processing, and -errno when an error occurred somewhere.
> + */
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
> +
>  #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e5285df4ba..93d8f2494a 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -83,6 +83,7 @@ enum VhostUserProtocolFeature {
>      /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
>      VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
>      VHOST_USER_PROTOCOL_F_STATUS = 16,
> +    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
>      VHOST_USER_PROTOCOL_F_MAX
>  };
>  
> @@ -130,6 +131,8 @@ typedef enum VhostUserRequest {
>      VHOST_USER_REM_MEM_REG = 38,
>      VHOST_USER_SET_STATUS = 39,
>      VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_SET_DEVICE_STATE_FD = 41,
> +    VHOST_USER_CHECK_DEVICE_STATE = 42,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -210,6 +213,12 @@ typedef struct {
>      uint32_t size; /* the following payload size */
>  } QEMU_PACKED VhostUserHeader;
>  
> +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
> +typedef struct VhostUserTransferDeviceState {
> +    uint32_t direction;
> +    uint32_t phase;
> +} VhostUserTransferDeviceState;
> +
>  typedef union {
>  #define VHOST_USER_VRING_IDX_MASK   (0xff)
>  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> @@ -224,6 +233,7 @@ typedef union {
>          VhostUserCryptoSession session;
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
> +        VhostUserTransferDeviceState transfer_state;
>  } VhostUserPayload;
>  
>  typedef struct VhostUserMsg {
> @@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
>      }
>  }
>  
> +static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
> +{
> +    return virtio_has_feature(dev->protocol_features,
> +                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
> +}
> +
> +static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
> +                                          VhostDeviceStateDirection direction,
> +                                          VhostDeviceStatePhase phase,
> +                                          int fd,
> +                                          int *reply_fd,
> +                                          Error **errp)
> +{
> +    int ret;
> +    struct vhost_user *vu = dev->opaque;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_SET_DEVICE_STATE_FD,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.transfer_state),
> +        },
> +        .payload.transfer_state = {
> +            .direction = direction,
> +            .phase = phase,
> +        },
> +    };
> +
> +    *reply_fd = -1;
> +
> +    if (!vhost_user_supports_migratory_state(dev)) {
> +        close(fd);
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, &fd, 1);
> +    close(fd);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send SET_DEVICE_STATE_FD message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive SET_DEVICE_STATE_FD reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if ((msg.payload.u64 & 0xff) != 0) {
> +        error_setg(errp, "Back-end did not accept migration state transfer");
> +        return -EIO;
> +    }
> +
> +    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
> +        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
> +        if (*reply_fd < 0) {
> +            error_setg(errp,
> +                       "Failed to get back-end-provided transfer pipe FD");
> +            *reply_fd = -1;
> +            return -EIO;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    int ret;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_CHECK_DEVICE_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = 0,
> +        },
> +    };
> +
> +    if (!vhost_user_supports_migratory_state(dev)) {
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send CHECK_DEVICE_STATE message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive CHECK_DEVICE_STATE reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.u64 != 0) {
> +        error_setg(errp, "Back-end failed to process its internal state");
> +        return -EIO;
> +    }
> +
> +    return 0;
> +}
> +
>  const VhostOps user_ops = {
>          .backend_type = VHOST_BACKEND_TYPE_USER,
>          .vhost_backend_init = vhost_user_backend_init,
> @@ -2716,4 +2860,7 @@ const VhostOps user_ops = {
>          .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
>          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>          .vhost_dev_start = vhost_user_dev_start,
> +        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
> +        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
> +        .vhost_check_device_state = vhost_user_check_device_state,
>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index cbff589efa..90099d8f6a 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>  
>      return -ENOSYS;
>  }
> +
> +bool vhost_supports_migratory_state(struct vhost_dev *dev)
> +{
> +    if (dev->vhost_ops->vhost_supports_migratory_state) {
> +        return dev->vhost_ops->vhost_supports_migratory_state(dev);
> +    }
> +
> +    return false;
> +}
> +
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_set_device_state_fd) {
> +        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
> +                                                         fd, reply_fd, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> +
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_check_device_state) {
> +        return dev->vhost_ops->vhost_check_device_state(dev, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> -- 
> 2.39.1
> 
* Re: [PATCH 3/4] vhost: Add high-level state save/load functions
  2023-04-11 15:05 ` [PATCH 3/4] vhost: Add high-level state save/load functions Hanna Czenczek
@ 2023-04-12 21:14   ` Stefan Hajnoczi
  2023-04-13  9:04     ` Hanna Czenczek
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-12 21:14 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
On Tue, Apr 11, 2023 at 05:05:14PM +0200, Hanna Czenczek wrote:
> vhost_save_backend_state() and vhost_load_backend_state() can be used by
> vhost front-ends to easily save and load the back-end's state to/from
> the migration stream.
> 
> Because we do not know the full state size ahead of time,
> vhost_save_backend_state() simply reads the data in 1 MB chunks, and
> writes each chunk consecutively into the migration stream, prefixed by
> its length.  EOF is indicated by a 0-length chunk.
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost.h |  35 +++++++
>  hw/virtio/vhost.c         | 196 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 231 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 29449e0fe2..d1f1e9e1f3 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -425,4 +425,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev,
>   */
>  int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
>  
> +/**
> + * vhost_save_backend_state(): High-level function to receive a vhost
> + * back-end's state, and save it in `f`.  Uses
> + * `vhost_set_device_state_fd()` to get the data from the back-end, and
> + * stores it in consecutive chunks that are each prefixed by their
> + * respective length (be32).  The end is marked by a 0-length chunk.
> + *
> + * Must only be called while the device and all its vrings are stopped
> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
> + *
> + * @dev: The vhost device from which to save the state
> + * @f: Migration stream in which to save the state
> + * @errp: Potential error message
> + *
> + * Returns 0 on success, and -errno otherwise.
> + */
> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
> +
> +/**
> + * vhost_load_backend_state(): High-level function to load a vhost
> + * back-end's state from `f`, and send it over to the back-end.  Reads
> + * the data from `f` in the format of `vhost_save_backend_state()`, and
> + * uses `vhost_set_device_state_fd()` to transfer it to the back-end.
> + *
> + * Must only be called while the device and all its vrings are stopped
> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
> + *
> + * @dev: The vhost device to which to send the state
> + * @f: Migration stream from which to load the state
> + * @errp: Potential error message
> + *
> + * Returns 0 on success, and -errno otherwise.
> + */
> +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
> +
>  #endif
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 90099d8f6a..d08849c691 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2125,3 +2125,199 @@ int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
>                 "vhost transport does not support migration state transfer");
>      return -ENOSYS;
>  }
> +
> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
> +{
> +    /* Maximum chunk size in which to transfer the state */
> +    const size_t chunk_size = 1 * 1024 * 1024;
> +    void *transfer_buf = NULL;
> +    g_autoptr(GError) g_err = NULL;
> +    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
> +    int ret;
> +
> +    /* [0] for reading (our end), [1] for writing (back-end's end) */
> +    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
> +        error_setg(errp, "Failed to set up state transfer pipe: %s",
> +                   g_err->message);
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    read_fd = pipe_fds[0];
> +    write_fd = pipe_fds[1];
> +
> +    /* VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped */
> +    assert(!dev->started && !dev->enable_vqs);
> +
> +    /* Transfer ownership of write_fd to the back-end */
> +    ret = vhost_set_device_state_fd(dev,
> +                                    VHOST_TRANSFER_STATE_DIRECTION_SAVE,
> +                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
> +                                    write_fd,
> +                                    &reply_fd,
> +                                    errp);
> +    if (ret < 0) {
> +        error_prepend(errp, "Failed to initiate state transfer: ");
> +        goto fail;
> +    }
> +
> +    /* If the back-end wishes to use a different pipe, switch over */
> +    if (reply_fd >= 0) {
> +        close(read_fd);
> +        read_fd = reply_fd;
> +    }
> +
> +    transfer_buf = g_malloc(chunk_size);
> +
> +    while (true) {
> +        ssize_t read_ret;
> +
> +        read_ret = read(read_fd, transfer_buf, chunk_size);
> +        if (read_ret < 0) {
> +            ret = -errno;
> +            error_setg_errno(errp, -ret, "Failed to receive state");
> +            goto fail;
> +        }
> +
> +        assert(read_ret <= chunk_size);
> +        qemu_put_be32(f, read_ret);
> +
> +        if (read_ret == 0) {
> +            /* EOF */
> +            break;
> +        }
> +
> +        qemu_put_buffer(f, transfer_buf, read_ret);
> +    }
I think this synchronous approach with a single contiguous stream of
chunks is okay for now.
Does this make the QEMU monitor unresponsive if the backend is slow?
In the future the interface could be extended to allow participating in
the iterative phase of migration. Then chunks from multiple backends
(plus guest RAM) would be interleaved and there would be some
parallelism.
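For readers following along, the framing used by the loop above (each chunk prefixed by its be32 length, a 0-length chunk marking the end) can be sketched on plain memory buffers as below. This is only an illustration: frame_state() and unframe_state() are hypothetical names, and the sketch uses byte arrays in place of the QEMUFile and the pipe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store/load a 32-bit value in big-endian byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: frame `len` bytes of state into `dst` as chunks of
 * at most `chunk_size` bytes, each prefixed by its be32 length, followed by
 * a 0-length terminator.  Returns bytes written, or -1 if `dst` is too small.
 */
static long frame_state(const uint8_t *src, size_t len, size_t chunk_size,
                        uint8_t *dst, size_t dst_cap)
{
    size_t out = 0;

    assert(chunk_size > 0 && chunk_size <= UINT32_MAX);
    while (len > 0) {
        size_t n = len < chunk_size ? len : chunk_size;

        if (dst_cap - out < 4 || dst_cap - out - 4 < n) {
            return -1;
        }
        put_be32(dst + out, (uint32_t)n);
        memcpy(dst + out + 4, src, n);
        out += 4 + n;
        src += n;
        len -= n;
    }
    if (dst_cap - out < 4) {
        return -1;
    }
    put_be32(dst + out, 0); /* end-of-state marker */
    return (long)(out + 4);
}

/*
 * Hypothetical helper: inverse of frame_state().  Returns the number of
 * payload bytes recovered, or -1 on truncated input or a too-small `dst`.
 */
static long unframe_state(const uint8_t *src, size_t src_len,
                          uint8_t *dst, size_t dst_cap)
{
    size_t in = 0, out = 0;

    for (;;) {
        uint32_t n;

        if (src_len - in < 4) {
            return -1; /* truncated length prefix */
        }
        n = get_be32(src + in);
        in += 4;
        if (n == 0) {
            return (long)out; /* 0-length chunk: end of state */
        }
        if (src_len - in < n || dst_cap - out < n) {
            return -1;
        }
        memcpy(dst + out, src + in, n);
        in += n;
        out += n;
    }
}
```

Note that the reader never needs to know the total state size up front, which is the point of the terminator.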
> +
> +    /*
> +     * Back-end will not really care, but be clean and close our end of the pipe
> +     * before inquiring the back-end about whether transfer was successful
> +     */
> +    close(read_fd);
> +    read_fd = -1;
> +
> +    /* Also, verify that the device is still stopped */
> +    assert(!dev->started && !dev->enable_vqs);
> +
> +    ret = vhost_check_device_state(dev, errp);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    ret = 0;
> +fail:
> +    g_free(transfer_buf);
> +    if (read_fd >= 0) {
> +        close(read_fd);
> +    }
> +
> +    return ret;
> +}
> +
> +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
> +{
> +    size_t transfer_buf_size = 0;
> +    void *transfer_buf = NULL;
> +    g_autoptr(GError) g_err = NULL;
> +    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
> +    int ret;
> +
> +    /* [0] for reading (back-end's end), [1] for writing (our end) */
> +    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
> +        error_setg(errp, "Failed to set up state transfer pipe: %s",
> +                   g_err->message);
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    read_fd = pipe_fds[0];
> +    write_fd = pipe_fds[1];
> +
> +    /* VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped */
> +    assert(!dev->started && !dev->enable_vqs);
> +
> +    /* Transfer ownership of read_fd to the back-end */
> +    ret = vhost_set_device_state_fd(dev,
> +                                    VHOST_TRANSFER_STATE_DIRECTION_LOAD,
> +                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
> +                                    read_fd,
> +                                    &reply_fd,
> +                                    errp);
> +    if (ret < 0) {
> +        error_prepend(errp, "Failed to initiate state transfer: ");
> +        goto fail;
> +    }
> +
> +    /* If the back-end wishes to use a different pipe, switch over */
> +    if (reply_fd >= 0) {
> +        close(write_fd);
> +        write_fd = reply_fd;
> +    }
> +
> +    while (true) {
> +        size_t this_chunk_size = qemu_get_be32(f);
> +        ssize_t write_ret;
> +        const uint8_t *transfer_pointer;
> +
> +        if (this_chunk_size == 0) {
> +            /* End of state */
> +            break;
> +        }
> +
> +        if (transfer_buf_size < this_chunk_size) {
> +            transfer_buf = g_realloc(transfer_buf, this_chunk_size);
> +            transfer_buf_size = this_chunk_size;
> +        }
> +
> +        if (qemu_get_buffer(f, transfer_buf, this_chunk_size) <
> +                this_chunk_size)
> +        {
> +            error_setg(errp, "Failed to read state");
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +
> +        transfer_pointer = transfer_buf;
> +        while (this_chunk_size > 0) {
> +            write_ret = write(write_fd, transfer_pointer, this_chunk_size);
> +            if (write_ret < 0) {
> +                ret = -errno;
> +                error_setg_errno(errp, -ret, "Failed to send state");
> +                goto fail;
> +            } else if (write_ret == 0) {
> +                error_setg(errp, "Failed to send state: Connection is closed");
> +                ret = -ECONNRESET;
> +                goto fail;
> +            }
> +
> +            assert(write_ret <= this_chunk_size);
> +            this_chunk_size -= write_ret;
> +            transfer_pointer += write_ret;
> +        }
> +    }
> +
> +    /*
> +     * Close our end, thus ending transfer, before inquiring the back-end about
> +     * whether transfer was successful
> +     */
> +    close(write_fd);
> +    write_fd = -1;
> +
> +    /* Also, verify that the device is still stopped */
> +    assert(!dev->started && !dev->enable_vqs);
> +
> +    ret = vhost_check_device_state(dev, errp);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    ret = 0;
> +fail:
> +    g_free(transfer_buf);
> +    if (write_fd >= 0) {
> +        close(write_fd);
> +    }
> +
> +    return ret;
> +}
> -- 
> 2.39.1
> 
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-12 20:51   ` Stefan Hajnoczi
@ 2023-04-13  7:17     ` Maxime Coquelin
  2023-04-13  8:19     ` Hanna Czenczek
  1 sibling, 0 replies; 93+ messages in thread
From: Maxime Coquelin @ 2023-04-13  7:17 UTC (permalink / raw)
  To: Stefan Hajnoczi, Hanna Czenczek
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella,
	Roman Kagan, Marc-André Lureau
On 4/12/23 22:51, Stefan Hajnoczi wrote:
> On Tue, Apr 11, 2023 at 05:05:12PM +0200, Hanna Czenczek wrote:
>> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
>> setting the vhost features will set this feature, too.  Doing so
>> disables all vrings, which may not be intended.
>>
>> For example, enabling or disabling logging during migration requires
>> setting those features (to set or unset VHOST_F_LOG_ALL), which will
>> automatically disable all vrings.  In either case, the VM is running
>> (disabling logging is done after a failed or cancelled migration, and
>> only once the VM is running again, see comment in
>> memory_global_dirty_log_stop()), so the vrings should really be enabled.
>> As a result, the back-end seems to hang.
>>
>> To fix this, we must remember whether the vrings are supposed to be
>> enabled, and, if so, re-enable them after a SET_FEATURES call that set
>> VHOST_USER_F_PROTOCOL_FEATURES.
>>
>> It seems less than ideal that there is a short period in which the VM is
>> running but the vrings will be stopped (between SET_FEATURES and
>> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
>> e.g. by introducing a new flag or vhost-user protocol feature to disable
>> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
>> new functions for setting/clearing singular feature bits (so that
>> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
>>
>> Even with such a potential addition to the protocol, we still need this
>> fix here, because we cannot expect that back-ends will implement this
>> addition.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost.h | 10 ++++++++++
>>   hw/virtio/vhost.c         | 13 +++++++++++++
>>   2 files changed, 23 insertions(+)
>>
>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>> index a52f273347..2fe02ed5d4 100644
>> --- a/include/hw/virtio/vhost.h
>> +++ b/include/hw/virtio/vhost.h
>> @@ -90,6 +90,16 @@ struct vhost_dev {
>>       int vq_index_end;
>>       /* if non-zero, minimum required value for max_queues */
>>       int num_queues;
>> +
>> +    /*
>> +     * Whether the virtqueues are supposed to be enabled (via
>> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
>> +     * enabling/disabling logging) will disable all virtqueues if
>> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
>> +     * re-enable them if this field is set.
>> +     */
>> +    bool enable_vqs;
>> +
>>       /**
>>        * vhost feature handling requires matching the feature set
>>        * offered by a backend which may be a subset of the total
>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>> index a266396576..cbff589efa 100644
>> --- a/hw/virtio/vhost.c
>> +++ b/hw/virtio/vhost.c
>> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
>>   static QLIST_HEAD(, vhost_dev) vhost_devices =
>>       QLIST_HEAD_INITIALIZER(vhost_devices);
>>   
>> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
>> +
>>   bool vhost_has_free_slot(void)
>>   {
>>       unsigned int slots_limit = ~0U;
>> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>>           }
>>       }
>>   
>> +    if (dev->enable_vqs) {
>> +        /*
>> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
>> +         * virtqueues, even if that was not intended; re-enable them if
>> +         * necessary.
>> +         */
>> +        vhost_dev_set_vring_enable(dev, true);
>> +    }
>> +
>>   out:
>>       return r;
>>   }
>> @@ -1896,6 +1907,8 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>>   
>>   static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
>>   {
>> +    hdev->enable_vqs = enable;
>> +
>>       if (!hdev->vhost_ops->vhost_set_vring_enable) {
>>           return 0;
>>       }
> 
> The vhost-user spec doesn't say that VHOST_F_LOG_ALL needs to be toggled
> at runtime and I don't think VHOST_USER_SET_PROTOCOL_FEATURES is
> intended to be used like that. This issue shows why doing so is a bad
> idea.
> 
> VHOST_F_LOG_ALL does not need to be toggled to control logging. Logging
> is controlled at runtime by the presence of the dirty log
> (VHOST_USER_SET_LOG_BASE) and the per-vring logging flag
> (VHOST_VRING_F_LOG).
> 
> I suggest permanently enabling VHOST_F_LOG_ALL upon connection when
> the backend supports it. No spec changes are required.
> 
> libvhost-user looks like it will work. I didn't look at DPDK/SPDK, but
> checking that it works there is important too.
> 
> I have CCed people who may be interested in this issue. This is the
> first time I've looked at vhost-user logging, so this idea may not work.
In the case of DPDK, we rely on VHOST_F_LOG_ALL to be set to know
whether we should do dirty pages logging or not [0], so setting this
feature at init time will cause performance degradation. The check on
whether the log base address has been set is done afterwards.
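To illustrate the ordering described here (feature bit first as the cheap early exit, log base afterwards), a rough sketch follows. It is not DPDK's code: the struct and function names are made up, and the one-dirty-bit-per-4-KiB-page bitmap layout is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_F_LOG_ALL 26     /* bit number of VHOST_F_LOG_ALL */
#define SKETCH_LOG_PAGE  4096u  /* assumed: one dirty bit per 4 KiB page */

/* Hypothetical, minimal view of a back-end's logging state. */
struct backend_sketch {
    uint64_t features;  /* negotiated feature bits */
    uint8_t *log_base;  /* dirty bitmap, NULL if no log area was set */
    size_t log_size;    /* size of the bitmap in bytes */
};

/* Returns true if the write was logged, false if logging was skipped. */
static bool log_write_sketch(struct backend_sketch *b,
                             uint64_t addr, uint64_t len)
{
    uint64_t page, last;

    /* Fast path: without F_LOG_ALL, nothing else is looked at */
    if (!(b->features & (1ULL << SKETCH_F_LOG_ALL))) {
        return false;
    }
    /* Only afterwards check whether a log area is present at all */
    if (!b->log_base || len == 0) {
        return false;
    }

    last = (addr + len - 1) / SKETCH_LOG_PAGE;
    if (last / 8 >= b->log_size) {
        return false; /* write lies beyond the bitmap */
    }
    for (page = addr / SKETCH_LOG_PAGE; page <= last; page++) {
        b->log_base[page / 8] |= (uint8_t)(1u << (page % 8));
    }
    return true;
}
```

With F_LOG_ALL permanently set, every write would fall through to the second check and the bitmap update whenever a log area is present, which is where the extra cost would come from.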
Regards,
Maxime
> Stefan
[0]: https://git.dpdk.org/dpdk/tree/lib/vhost/vhost.h#n594
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-12 20:51   ` Stefan Hajnoczi
  2023-04-13  7:17     ` Maxime Coquelin
@ 2023-04-13  8:19     ` Hanna Czenczek
  2023-04-13 11:03       ` Stefan Hajnoczi
  1 sibling, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13  8:19 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella,
	Roman Kagan, Maxime Coquelin, Marc-André Lureau
On 12.04.23 22:51, Stefan Hajnoczi wrote:
> On Tue, Apr 11, 2023 at 05:05:12PM +0200, Hanna Czenczek wrote:
>> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
>> setting the vhost features will set this feature, too.  Doing so
>> disables all vrings, which may not be intended.
>>
>> For example, enabling or disabling logging during migration requires
>> setting those features (to set or unset VHOST_F_LOG_ALL), which will
>> automatically disable all vrings.  In either case, the VM is running
>> (disabling logging is done after a failed or cancelled migration, and
>> only once the VM is running again, see comment in
>> memory_global_dirty_log_stop()), so the vrings should really be enabled.
>> As a result, the back-end seems to hang.
>>
>> To fix this, we must remember whether the vrings are supposed to be
>> enabled, and, if so, re-enable them after a SET_FEATURES call that set
>> VHOST_USER_F_PROTOCOL_FEATURES.
>>
>> It seems less than ideal that there is a short period in which the VM is
>> running but the vrings will be stopped (between SET_FEATURES and
>> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
>> e.g. by introducing a new flag or vhost-user protocol feature to disable
>> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
>> new functions for setting/clearing singular feature bits (so that
>> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
>>
>> Even with such a potential addition to the protocol, we still need this
>> fix here, because we cannot expect that back-ends will implement this
>> addition.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost.h | 10 ++++++++++
>>   hw/virtio/vhost.c         | 13 +++++++++++++
>>   2 files changed, 23 insertions(+)
>>
>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>> index a52f273347..2fe02ed5d4 100644
>> --- a/include/hw/virtio/vhost.h
>> +++ b/include/hw/virtio/vhost.h
>> @@ -90,6 +90,16 @@ struct vhost_dev {
>>       int vq_index_end;
>>       /* if non-zero, minimum required value for max_queues */
>>       int num_queues;
>> +
>> +    /*
>> +     * Whether the virtqueues are supposed to be enabled (via
>> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
>> +     * enabling/disabling logging) will disable all virtqueues if
>> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
>> +     * re-enable them if this field is set.
>> +     */
>> +    bool enable_vqs;
>> +
>>       /**
>>        * vhost feature handling requires matching the feature set
>>        * offered by a backend which may be a subset of the total
>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>> index a266396576..cbff589efa 100644
>> --- a/hw/virtio/vhost.c
>> +++ b/hw/virtio/vhost.c
>> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
>>   static QLIST_HEAD(, vhost_dev) vhost_devices =
>>       QLIST_HEAD_INITIALIZER(vhost_devices);
>>   
>> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
>> +
>>   bool vhost_has_free_slot(void)
>>   {
>>       unsigned int slots_limit = ~0U;
>> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>>           }
>>       }
>>   
>> +    if (dev->enable_vqs) {
>> +        /*
>> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
>> +         * virtqueues, even if that was not intended; re-enable them if
>> +         * necessary.
>> +         */
>> +        vhost_dev_set_vring_enable(dev, true);
>> +    }
>> +
>>   out:
>>       return r;
>>   }
>> @@ -1896,6 +1907,8 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>>   
>>   static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
>>   {
>> +    hdev->enable_vqs = enable;
>> +
>>       if (!hdev->vhost_ops->vhost_set_vring_enable) {
>>           return 0;
>>       }
> The vhost-user spec doesn't say that VHOST_F_LOG_ALL needs to be toggled
> at runtime and I don't think VHOST_USER_SET_PROTOCOL_FEATURES is
> intended to be used like that. This issue shows why doing so is a bad
> idea.
>
> VHOST_F_LOG_ALL does not need to be toggled to control logging. Logging
> is controlled at runtime by the presence of the dirty log
> (VHOST_USER_SET_LOG_BASE) and the per-vring logging flag
> (VHOST_VRING_F_LOG).
Technically, the spec doesn’t say that SET_LOG_BASE is required.  It says:
“To start/stop logging of data/used ring writes, the front-end may send 
messages VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and 
VHOST_USER_SET_VRING_ADDR with VHOST_VRING_F_LOG in ring’s flags set to 
1/0, respectively.”
(So the spec also very much does imply that toggling F_LOG_ALL at 
runtime is a valid way to enable/disable logging.  If we were to no 
longer do that, we should clarify it there.)
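For reference, toggling logging this way while keeping F_PROTOCOL_FEATURES set, and the re-enable rule from the commit message of patch 1, can be sketched as below. The helper names are hypothetical and this is not the qemu code; the bit numbers are those of VHOST_F_LOG_ALL (26) and VHOST_USER_F_PROTOCOL_FEATURES (30).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_F_LOG_ALL           26  /* VHOST_F_LOG_ALL */
#define SKETCH_F_PROTOCOL_FEATURES 30  /* VHOST_USER_F_PROTOCOL_FEATURES */

/*
 * Hypothetical helper: compute the feature word to send via SET_FEATURES
 * when logging is to be enabled or disabled.
 */
static uint64_t features_to_send(uint64_t acked, uint64_t backend_features,
                                 bool enable_log)
{
    uint64_t f = acked & ~(1ULL << SKETCH_F_LOG_ALL);

    if (enable_log) {
        f |= 1ULL << SKETCH_F_LOG_ALL;
    }
    /* Keep F_PROTOCOL_FEATURES set whenever the back-end offers it */
    if (backend_features & (1ULL << SKETCH_F_PROTOCOL_FEATURES)) {
        f |= 1ULL << SKETCH_F_PROTOCOL_FEATURES;
    }
    return f;
}

/*
 * Rule from the commit message: if the vrings are supposed to be enabled,
 * re-enable them after a SET_FEATURES call that set F_PROTOCOL_FEATURES.
 */
static bool must_reenable_vrings(uint64_t sent_features, bool enable_vqs)
{
    return enable_vqs &&
           (sent_features & (1ULL << SKETCH_F_PROTOCOL_FEATURES)) != 0;
}
```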
I mean, naturally, logging without a shared memory area to log in to 
isn’t much fun, so we could clarify that SET_LOG_BASE is also a 
requirement, but it looks to me as if we can’t use SET_LOG_BASE to 
disable logging, because it’s supposed to always pass a valid FD (at 
least libvhost-user expects this: 
https://gitlab.com/qemu-project/qemu/-/blob/master/subprojects/libvhost-user/libvhost-user.c#L1044). 
So after a cancelled migration, the dirty bitmap SHM will stay around 
indefinitely (which is already not great, but if we were to use the 
presence of that bitmap as an indicator as to whether we should log or 
not, it would be worse).
So the VRING_F_LOG flag remains.
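For illustration only, here is a rough sketch of what the quoted spec sentence implies for the front-end.  This is not QEMU's actual code: the helper names are made up, and the bit numbers are the ones I believe the Linux vhost headers and the vhost-user spec define (VHOST_F_LOG_ALL = 26, VHOST_USER_F_PROTOCOL_FEATURES = 30, VHOST_VRING_F_LOG = 0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers, assumed to match linux/vhost_types.h and the vhost-user spec */
#define VHOST_F_LOG_ALL                 26
#define VHOST_USER_F_PROTOCOL_FEATURES  30
#define VHOST_VRING_F_LOG               0

/*
 * Hypothetical helper: compute the feature word to pass to SET_FEATURES.
 * Logging is switched by VHOST_F_LOG_ALL only; the protocol-features bit
 * is kept set whenever the back-end offers it.
 */
static uint64_t set_features_word(uint64_t acked_features,
                                  uint64_t backend_features,
                                  bool enable_log)
{
    uint64_t features = acked_features & ~(1ULL << VHOST_F_LOG_ALL);

    if (enable_log) {
        features |= 1ULL << VHOST_F_LOG_ALL;
    }
    if (backend_features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) {
        features |= 1ULL << VHOST_USER_F_PROTOCOL_FEATURES;
    }
    return features;
}

/* Hypothetical helper: per-vring flags to pass to SET_VRING_ADDR */
static uint32_t vring_addr_flags(bool enable_log)
{
    return enable_log ? (1U << VHOST_VRING_F_LOG) : 0;
}
```

Under this reading, starting and stopping logging both consist of one SET_FEATURES call plus one SET_VRING_ADDR call per vring, with only the two log bits changing between them.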
> I suggest permanently enabling VHOST_F_LOG_ALL upon connection when the
> backend supports it. No spec changes are required.
>
> libvhost-user looks like it will work. I didn't look at DPDK/SPDK, but
> checking that it works there is important too.
I can’t find VRING_F_LOG in libvhost-user, so what protocol do you mean 
exactly that would work in libvhost-user?  Because SET_LOG_BASE won’t 
work, as you can’t use it to disable logging.
(For DPDK, I’m not sure.  It looks like it sometimes takes VRING_F_LOG 
into account, but I think only when it comes to logging accesses to the 
vring specifically, i.e. not DMA to other areas of guest memory?  I 
think only places that use vq->log_guest_addr implicitly check it, 
others don’t.  So for example, 
vhost_log_write_iova()/vhost_log_cache_write_iova() don’t seem to check 
VRING_F_LOG, which seem to be the functions generally used for writes to 
memory outside of the vrings.  So here, even if VRING_F_LOG is disabled 
for all vrings, as long as a log SHM is set, all writes to memory 
outside of the immediate vrings seem to cause logging (as long as 
F_LOG_ALL is set).)
Hanna
> I have CCed people who may be interested in this issue. This is the
> first time I've looked at vhost-user logging, so this idea may not work.
>
> Stefan
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-04-12 21:00 ` [PATCH 0/4] vhost-user-fs: Internal migration Stefan Hajnoczi
@ 2023-04-13  8:20   ` Hanna Czenczek
  0 siblings, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13  8:20 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
On 12.04.23 23:00, Stefan Hajnoczi wrote:
> Hi,
> Is there a vhost-user.rst spec patch?
Ah, right, I forgot.
Will add!
Hanna
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-11 15:05 ` [PATCH 2/4] vhost-user: Interface for migration state transfer Hanna Czenczek
  2023-04-12 21:06   ` Stefan Hajnoczi
@ 2023-04-13  8:50   ` Eugenio Perez Martin
  2023-04-13  9:25     ` Hanna Czenczek
  1 sibling, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-13  8:50 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, Apr 11, 2023 at 5:33 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> So-called "internal" virtio-fs migration refers to transporting the
> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> this, we need to be able to transfer virtiofsd's internal state to and
> from virtiofsd.
>
> Because virtiofsd's internal state will not be too large, we believe it
> is best to transfer it as a single binary blob after the streaming
> phase.  Because this method should be useful to other vhost-user
> implementations, too, it is introduced as a general-purpose addition to
> the protocol, not limited to vhost-user-fs.
>
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>   This feature signals support for transferring state, and is added so
>   that migration can fail early when the back-end has no support.
>
> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>   over which to transfer the state.  The front-end sends an FD to the
>   back-end into/from which it can write/read its state, and the back-end
>   can decide to either use it, or reply with a different FD for the
>   front-end to override the front-end's choice.
>   The front-end creates a simple pipe to transfer the state, but maybe
>   the back-end already has an FD into/from which it has to write/read
>   its state, in which case it will want to override the simple pipe.
>   Conversely, maybe in the future we find a way to have the front-end
>   get an immediate FD for the migration stream (in some cases), in which
>   case we will want to send this to the back-end instead of creating a
>   pipe.
>   Hence the negotiation: If one side has a better idea than a plain
>   pipe, we will want to use that.
>
> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   pipe (the end indicated by EOF), the front-end invokes this function
>   to verify success.  There is no in-band way (through the pipe) to
>   indicate failure, so we need to check explicitly.
>
> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into the pipe, and the reading
> side reads it until it sees an EOF.  Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  24 +++++
>  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  37 ++++++++
>  4 files changed, 287 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..5935b32fe3 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>  } VhostSetConfigType;
>
> +typedef enum VhostDeviceStateDirection {
> +    /* Transfer state from back-end (device) to front-end */
> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> +    /* Transfer state from front-end to back-end (device) */
> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> +} VhostDeviceStateDirection;
> +
> +typedef enum VhostDeviceStatePhase {
> +    /* The device (and all its vrings) is stopped */
> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> +} VhostDeviceStatePhase;
> +
>  struct vhost_inflight;
>  struct vhost_dev;
>  struct vhost_log;
> @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>
>  typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>
> +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
> +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
> +                                            VhostDeviceStateDirection direction,
> +                                            VhostDeviceStatePhase phase,
> +                                            int fd,
> +                                            int *reply_fd,
> +                                            Error **errp);
> +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
> +
>  typedef struct VhostOps {
>      VhostBackendType backend_type;
>      vhost_backend_init vhost_backend_init;
> @@ -181,6 +202,9 @@ typedef struct VhostOps {
>      vhost_force_iommu_op vhost_force_iommu;
>      vhost_set_config_call_op vhost_set_config_call;
>      vhost_reset_status_op vhost_reset_status;
> +    vhost_supports_migratory_state_op vhost_supports_migratory_state;
> +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
> +    vhost_check_device_state_op vhost_check_device_state;
>  } VhostOps;
>
>  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 2fe02ed5d4..29449e0fe2 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
>                             struct vhost_inflight *inflight);
>  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>                             struct vhost_inflight *inflight);
> +
> +/**
> + * vhost_supports_migratory_state(): Checks whether the back-end
> + * supports transferring internal state for the purpose of migration.
> + * Support for this feature is required for vhost_set_device_state_fd()
> + * and vhost_check_device_state().
> + *
> + * @dev: The vhost device
> + *
> + * Returns true if the device supports these commands, and false if it
> + * does not.
> + */
> +bool vhost_supports_migratory_state(struct vhost_dev *dev);
> +
> +/**
> + * vhost_set_device_state_fd(): Begin transfer of internal state from/to
> + * the back-end for the purpose of migration.  Data is to be transferred
> + * over a pipe according to @direction and @phase.  The sending end must
> + * only write to the pipe, and the receiving end must only read from it.
> + * Once the sending end is done, it closes its FD.  The receiving end
> + * must take this as the end-of-transfer signal and close its FD, too.
> + *
> + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
> + * read FD for LOAD.  This function transfers ownership of @fd to the
> + * back-end, i.e. closes it in the front-end.
> + *
> + * The back-end may optionally reply with an FD of its own, if this
> + * improves efficiency on its end.  In this case, the returned FD is
> + * stored in *reply_fd.  The back-end will discard the FD sent to it,
> + * and the front-end must use *reply_fd for transferring state to/from
> + * the back-end.
> + *
> + * @dev: The vhost device
> + * @direction: The direction in which the state is to be transferred.
> + *             For outgoing migrations, this is SAVE, and data is read
> + *             from the back-end and stored by the front-end in the
> + *             migration stream.
> + *             For incoming migrations, this is LOAD, and data is read
> + *             by the front-end from the migration stream and sent to
> + *             the back-end to restore the saved state.
> + * @phase: Which migration phase we are in.  Currently, there is only
> + *         STOPPED (device and all vrings are stopped), in the future,
> + *         more phases such as PRE_COPY or POST_COPY may be added.
> + * @fd: Back-end's end of the pipe through which to transfer state; note
> + *      that ownership is transferred to the back-end, so this function
> + *      closes @fd in the front-end.
> + * @reply_fd: If the back-end wishes to use a different pipe for state
> + *            transfer, this will contain an FD for the front-end to
> + *            use.  Otherwise, -1 is stored here.
> + * @errp: Potential error description
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp);
> +
> +/**
> + * vhost_set_device_state_fd(): After transferring state from/to the
Nitpick: This function doc is for vhost_check_device_state not
vhost_set_device_state_fd.
Thanks!
> + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
> + * has closed the pipe, inquire the back-end to report any potential
> + * errors that have occurred on its side.  This allows detecting errors
> + * like:
> + * - During outgoing migration, when the source side had already started
> + *   to produce its state, something went wrong and it failed to finish
> + * - During incoming migration, when the received state is somehow
> + *   invalid and cannot be processed by the back-end
> + *
> + * @dev: The vhost device
> + * @errp: Potential error description
> + *
> + * Returns 0 when the back-end reports successful state transfer and
> + * processing, and -errno when an error occurred somewhere.
> + */
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
> +
>  #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e5285df4ba..93d8f2494a 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -83,6 +83,7 @@ enum VhostUserProtocolFeature {
>      /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
>      VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
>      VHOST_USER_PROTOCOL_F_STATUS = 16,
> +    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
>      VHOST_USER_PROTOCOL_F_MAX
>  };
>
> @@ -130,6 +131,8 @@ typedef enum VhostUserRequest {
>      VHOST_USER_REM_MEM_REG = 38,
>      VHOST_USER_SET_STATUS = 39,
>      VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_SET_DEVICE_STATE_FD = 41,
> +    VHOST_USER_CHECK_DEVICE_STATE = 42,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>
> @@ -210,6 +213,12 @@ typedef struct {
>      uint32_t size; /* the following payload size */
>  } QEMU_PACKED VhostUserHeader;
>
> +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
> +typedef struct VhostUserTransferDeviceState {
> +    uint32_t direction;
> +    uint32_t phase;
> +} VhostUserTransferDeviceState;
> +
>  typedef union {
>  #define VHOST_USER_VRING_IDX_MASK   (0xff)
>  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> @@ -224,6 +233,7 @@ typedef union {
>          VhostUserCryptoSession session;
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
> +        VhostUserTransferDeviceState transfer_state;
>  } VhostUserPayload;
>
>  typedef struct VhostUserMsg {
> @@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
>      }
>  }
>
> +static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
> +{
> +    return virtio_has_feature(dev->protocol_features,
> +                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
> +}
> +
> +static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
> +                                          VhostDeviceStateDirection direction,
> +                                          VhostDeviceStatePhase phase,
> +                                          int fd,
> +                                          int *reply_fd,
> +                                          Error **errp)
> +{
> +    int ret;
> +    struct vhost_user *vu = dev->opaque;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_SET_DEVICE_STATE_FD,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.transfer_state),
> +        },
> +        .payload.transfer_state = {
> +            .direction = direction,
> +            .phase = phase,
> +        },
> +    };
> +
> +    *reply_fd = -1;
> +
> +    if (!vhost_user_supports_migratory_state(dev)) {
> +        close(fd);
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, &fd, 1);
> +    close(fd);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send SET_DEVICE_STATE_FD message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive SET_DEVICE_STATE_FD reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if ((msg.payload.u64 & 0xff) != 0) {
> +        error_setg(errp, "Back-end did not accept migration state transfer");
> +        return -EIO;
> +    }
> +
> +    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
> +        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
> +        if (*reply_fd < 0) {
> +            error_setg(errp,
> +                       "Failed to get back-end-provided transfer pipe FD");
> +            *reply_fd = -1;
> +            return -EIO;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    int ret;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_CHECK_DEVICE_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = 0,
> +        },
> +    };
> +
> +    if (!vhost_user_supports_migratory_state(dev)) {
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send CHECK_DEVICE_STATE message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive CHECK_DEVICE_STATE reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.u64 != 0) {
> +        error_setg(errp, "Back-end failed to process its internal state");
> +        return -EIO;
> +    }
> +
> +    return 0;
> +}
> +
>  const VhostOps user_ops = {
>          .backend_type = VHOST_BACKEND_TYPE_USER,
>          .vhost_backend_init = vhost_user_backend_init,
> @@ -2716,4 +2860,7 @@ const VhostOps user_ops = {
>          .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
>          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>          .vhost_dev_start = vhost_user_dev_start,
> +        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
> +        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
> +        .vhost_check_device_state = vhost_user_check_device_state,
>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index cbff589efa..90099d8f6a 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>
>      return -ENOSYS;
>  }
> +
> +bool vhost_supports_migratory_state(struct vhost_dev *dev)
> +{
> +    if (dev->vhost_ops->vhost_supports_migratory_state) {
> +        return dev->vhost_ops->vhost_supports_migratory_state(dev);
> +    }
> +
> +    return false;
> +}
> +
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_set_device_state_fd) {
> +        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
> +                                                         fd, reply_fd, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> +
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_check_device_state) {
> +        return dev->vhost_ops->vhost_check_device_state(dev, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> --
> 2.39.1
>
>
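The pipe-and-EOF part of the protocol described in the commit message above can be sketched in plain POSIX C.  This is only an illustration under assumptions: `send_state()` and `recv_state()` are made-up names, not functions from the patch, and the sketch leaves out the vhost-user messages themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Sending end: write the whole blob, then close the FD to signal EOF. */
static int send_state(int write_fd, const void *buf, size_t len)
{
    const char *p = buf;
    int ret = 0;

    while (len > 0) {
        ssize_t n = write(write_fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            ret = -errno;
            break;
        }
        p += n;
        len -= (size_t)n;
    }
    close(write_fd);
    return ret;
}

/*
 * Receiving end: read until EOF (read() returns 0), then close the FD.
 * Returns the number of bytes received, -ENOSPC if the state does not
 * fit into the buffer, or another -errno on read failure.
 */
static ssize_t recv_state(int read_fd, void *buf, size_t cap)
{
    char *p = buf;
    size_t total = 0;
    ssize_t ret;

    for (;;) {
        char extra;
        /* Once the buffer is full, probe one more byte to detect overflow */
        void *dst = total < cap ? (void *)(p + total) : (void *)&extra;
        size_t want = total < cap ? cap - total : 1;
        ssize_t n = read(read_fd, dst, want);

        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            ret = -errno;
            break;
        }
        if (n == 0) {
            ret = (ssize_t)total; /* EOF: the sender closed its end */
            break;
        }
        if (total == cap) {
            ret = -ENOSPC;
            break;
        }
        total += (size_t)n;
    }
    close(read_fd);
    return ret;
}
```

EOF only says that the sender closed its end, not that it finished successfully, which is why the commit message adds the separate CHECK_DEVICE_STATE step after the transfer.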
* Re: [PATCH 3/4] vhost: Add high-level state save/load functions
  2023-04-12 21:14   ` Stefan Hajnoczi
@ 2023-04-13  9:04     ` Hanna Czenczek
  2023-04-13 11:22       ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13  9:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
On 12.04.23 23:14, Stefan Hajnoczi wrote:
> On Tue, Apr 11, 2023 at 05:05:14PM +0200, Hanna Czenczek wrote:
>> vhost_save_backend_state() and vhost_load_backend_state() can be used by
>> vhost front-ends to easily save and load the back-end's state to/from
>> the migration stream.
>>
>> Because we do not know the full state size ahead of time,
>> vhost_save_backend_state() simply reads the data in 1 MB chunks, and
>> writes each chunk consecutively into the migration stream, prefixed by
>> its length.  EOF is indicated by a 0-length chunk.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost.h |  35 +++++++
>>   hw/virtio/vhost.c         | 196 ++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 231 insertions(+)
>>
>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>> index 29449e0fe2..d1f1e9e1f3 100644
>> --- a/include/hw/virtio/vhost.h
>> +++ b/include/hw/virtio/vhost.h
>> @@ -425,4 +425,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev,
>>    */
>>   int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
>>   
>> +/**
>> + * vhost_save_backend_state(): High-level function to receive a vhost
>> + * back-end's state, and save it in `f`.  Uses
>> + * `vhost_set_device_state_fd()` to get the data from the back-end, and
>> + * stores it in consecutive chunks that are each prefixed by their
>> + * respective length (be32).  The end is marked by a 0-length chunk.
>> + *
>> + * Must only be called while the device and all its vrings are stopped
>> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
>> + *
>> + * @dev: The vhost device from which to save the state
>> + * @f: Migration stream in which to save the state
>> + * @errp: Potential error message
>> + *
>> + * Returns 0 on success, and -errno otherwise.
>> + */
>> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
>> +
>> +/**
>> + * vhost_load_backend_state(): High-level function to load a vhost
>> + * back-end's state from `f`, and send it over to the back-end.  Reads
>> + * the data from `f` in the format used by `vhost_save_backend_state()`,
>> + * and uses `vhost_set_device_state_fd()` to transfer it to the back-end.
>> + *
>> + * Must only be called while the device and all its vrings are stopped
>> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
>> + *
>> + * @dev: The vhost device to which to send the state
>> + * @f: Migration stream from which to load the state
>> + * @errp: Potential error message
>> + *
>> + * Returns 0 on success, and -errno otherwise.
>> + */
>> +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
>> +
>>   #endif
>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>> index 90099d8f6a..d08849c691 100644
>> --- a/hw/virtio/vhost.c
>> +++ b/hw/virtio/vhost.c
>> @@ -2125,3 +2125,199 @@ int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
>>                  "vhost transport does not support migration state transfer");
>>       return -ENOSYS;
>>   }
>> +
>> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
>> +{
>> +    /* Maximum chunk size in which to transfer the state */
>> +    const size_t chunk_size = 1 * 1024 * 1024;
>> +    void *transfer_buf = NULL;
>> +    g_autoptr(GError) g_err = NULL;
>> +    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
>> +    int ret;
>> +
>> +    /* [0] for reading (our end), [1] for writing (back-end's end) */
>> +    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
>> +        error_setg(errp, "Failed to set up state transfer pipe: %s",
>> +                   g_err->message);
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +
>> +    read_fd = pipe_fds[0];
>> +    write_fd = pipe_fds[1];
>> +
>> +    /* VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped */
>> +    assert(!dev->started && !dev->enable_vqs);
>> +
>> +    /* Transfer ownership of write_fd to the back-end */
>> +    ret = vhost_set_device_state_fd(dev,
>> +                                    VHOST_TRANSFER_STATE_DIRECTION_SAVE,
>> +                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
>> +                                    write_fd,
>> +                                    &reply_fd,
>> +                                    errp);
>> +    if (ret < 0) {
>> +        error_prepend(errp, "Failed to initiate state transfer: ");
>> +        goto fail;
>> +    }
>> +
>> +    /* If the back-end wishes to use a different pipe, switch over */
>> +    if (reply_fd >= 0) {
>> +        close(read_fd);
>> +        read_fd = reply_fd;
>> +    }
>> +
>> +    transfer_buf = g_malloc(chunk_size);
>> +
>> +    while (true) {
>> +        ssize_t read_ret;
>> +
>> +        read_ret = read(read_fd, transfer_buf, chunk_size);
>> +        if (read_ret < 0) {
>> +            ret = -errno;
>> +            error_setg_errno(errp, -ret, "Failed to receive state");
>> +            goto fail;
>> +        }
>> +
>> +        assert(read_ret <= chunk_size);
>> +        qemu_put_be32(f, read_ret);
>> +
>> +        if (read_ret == 0) {
>> +            /* EOF */
>> +            break;
>> +        }
>> +
>> +        qemu_put_buffer(f, transfer_buf, read_ret);
>> +    }
> I think this synchronous approach with a single contiguous stream of
> chunks is okay for now.
>
> Does this make the QEMU monitor unresponsive if the backend is slow?
Oh, absolutely.  But as far as I can tell that’s also the case if the 
back-end doesn’t respond (or responds slowly) to vhost-user messages, 
because they’re generally sent/received synchronously. (So, notably, 
these synchronous read()/write() calls aren’t worse than the previous 
model of transferring data through shared memory, but awaiting the 
back-end via vhost-user calls between each chunk.)
I don’t know whether it’s even possible to do it better (while keeping 
it all in the switch-over phase).  The VMState functions aren’t 
coroutines or AIO, so I can’t think of a way to improve this.
(Well, except:)
> In the future the interface could be extended to allow participating in
> the iterative phase of migration. Then chunks from multiple backends
> (plus guest RAM) would be interleaved and there would be some
> parallelism.
Sure.  That would also definitely help with an unintentionally slow 
back-end.  If the back-end making qemu unresponsive is a deal-breaker, 
then we’d have to do this now.
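For reference, the stream format from the quoted commit message (chunks prefixed by a be32 length, terminated by a 0-length chunk) can be sketched as below.  This works on in-memory buffers instead of a `QEMUFile`, and `frame_state()`/`unframe_state()` are hypothetical names, not part of the patch.  Note that the patch writes whatever each read() returns as one chunk, whereas this sketch always fills chunks up to `chunk_size`; a reader of the format does not need to care about the difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void put_be32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v >> 24);
    dst[1] = (uint8_t)(v >> 16);
    dst[2] = (uint8_t)(v >> 8);
    dst[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *src)
{
    return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] << 8) | (uint32_t)src[3];
}

/*
 * Write `len` bytes of state into `out` as length-prefixed chunks of at
 * most `chunk_size` bytes, followed by a 0-length terminator chunk.
 * Returns the number of bytes written, or 0 if `out` is too small.
 */
static size_t frame_state(const uint8_t *state, size_t len, size_t chunk_size,
                          uint8_t *out, size_t out_cap)
{
    size_t pos = 0;

    assert(chunk_size > 0 && chunk_size <= UINT32_MAX);
    while (len > 0) {
        size_t n = len < chunk_size ? len : chunk_size;
        if (out_cap - pos < 4 + n) {
            return 0;
        }
        put_be32(out + pos, (uint32_t)n);
        memcpy(out + pos + 4, state, n);
        pos += 4 + n;
        state += n;
        len -= n;
    }
    if (out_cap - pos < 4) {
        return 0;
    }
    put_be32(out + pos, 0); /* end-of-state marker */
    return pos + 4;
}

/*
 * Reassemble the state from a framed stream.  Returns the payload length,
 * or -1 if the stream is truncated or `dst` is too small.
 */
static long unframe_state(const uint8_t *in, size_t in_len,
                          uint8_t *dst, size_t dst_cap)
{
    size_t pos = 0, total = 0;

    for (;;) {
        uint32_t n;

        if (in_len - pos < 4) {
            return -1;
        }
        n = get_be32(in + pos);
        pos += 4;
        if (n == 0) {
            return (long)total;
        }
        if (in_len - pos < n || dst_cap - total < n) {
            return -1;
        }
        memcpy(dst + total, in + pos, n);
        pos += n;
        total += n;
    }
}
```

The explicit terminator is what lets the loading side detect a truncated stream: without the final 0-length chunk, `unframe_state()` reports an error instead of returning a partial state.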
Hanna
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-12 21:06   ` Stefan Hajnoczi
@ 2023-04-13  9:24     ` Hanna Czenczek
  2023-04-13 11:38       ` Stefan Hajnoczi
  2023-04-13 10:14     ` Eugenio Perez Martin
  1 sibling, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13  9:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella,
	Eugenio Pérez
On 12.04.23 23:06, Stefan Hajnoczi wrote:
> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>> So-called "internal" virtio-fs migration refers to transporting the
>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>> this, we need to be able to transfer virtiofsd's internal state to and
>> from virtiofsd.
>>
>> Because virtiofsd's internal state will not be too large, we believe it
>> is best to transfer it as a single binary blob after the streaming
>> phase.  Because this method should be useful to other vhost-user
>> implementations, too, it is introduced as a general-purpose addition to
>> the protocol, not limited to vhost-user-fs.
>>
>> These are the additions to the protocol:
>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>    This feature signals support for transferring state, and is added so
>>    that migration can fail early when the back-end has no support.
>>
>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>    over which to transfer the state.  The front-end sends an FD to the
>>    back-end into/from which it can write/read its state, and the back-end
>>    can decide to either use it, or reply with a different FD for the
>>    front-end to override the front-end's choice.
>>    The front-end creates a simple pipe to transfer the state, but maybe
>>    the back-end already has an FD into/from which it has to write/read
>>    its state, in which case it will want to override the simple pipe.
>>    Conversely, maybe in the future we find a way to have the front-end
>>    get an immediate FD for the migration stream (in some cases), in which
>>    case we will want to send this to the back-end instead of creating a
>>    pipe.
>>    Hence the negotiation: If one side has a better idea than a plain
>>    pipe, we will want to use that.
>>
>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>    pipe (the end indicated by EOF), the front-end invokes this function
>>    to verify success.  There is no in-band way (through the pipe) to
>>    indicate failure, so we need to check explicitly.
>>
>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>> (which includes establishing the direction of transfer and migration
>> phase), the sending side writes its data into the pipe, and the reading
>> side reads it until it sees an EOF.  Then, the front-end will check for
>> success via CHECK_DEVICE_STATE, which on the destination side includes
>> checking for integrity (i.e. errors during deserialization).
>>
>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>   hw/virtio/vhost.c                 |  37 ++++++++
>>   4 files changed, 287 insertions(+)
>>
>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>> index ec3fbae58d..5935b32fe3 100644
>> --- a/include/hw/virtio/vhost-backend.h
>> +++ b/include/hw/virtio/vhost-backend.h
>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>   } VhostSetConfigType;
>>   
>> +typedef enum VhostDeviceStateDirection {
>> +    /* Transfer state from back-end (device) to front-end */
>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>> +    /* Transfer state from front-end to back-end (device) */
>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>> +} VhostDeviceStateDirection;
>> +
>> +typedef enum VhostDeviceStatePhase {
>> +    /* The device (and all its vrings) is stopped */
>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>> +} VhostDeviceStatePhase;
> vDPA has:
>
>    /* Suspend a device so it does not process virtqueue requests anymore
>     *
>     * After the return of ioctl the device must preserve all the necessary state
>     * (the virtqueue vring base plus the possible device specific states) that is
>     * required for restoring in the future. The device must not change its
>     * configuration after that point.
>     */
>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>
>    /* Resume a device so it can resume processing virtqueue requests
>     *
>     * After the return of this ioctl the device will have restored all the
>     * necessary states and it is fully operational to continue processing the
>     * virtqueue descriptors.
>     */
>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>
> I wonder if it makes sense to import these into vhost-user so that the
> difference between kernel vhost and vhost-user is minimized. It's okay
> if one of them is ahead of the other, but it would be nice to avoid
> overlapping/duplicated functionality.
>
> (And I hope vDPA will import the device state vhost-user messages
> introduced in this series.)
I don’t understand your suggestion.  (Like, I very simply don’t 
understand :))
These are vhost messages, right?  What purpose do you have in mind for 
them in vhost-user for internal migration?  They’re different from the 
state transfer messages, because they don’t transfer state to/from the 
front-end.  Also, the state transfer stuff is supposed to be distinct 
from starting/stopping the device; right now, it just requires the 
device to be stopped beforehand (or started only afterwards).  And in 
the future, new VhostDeviceStatePhase values may allow the messages to 
be used on devices that aren’t stopped.
So they seem to serve very different purposes.  I can imagine using the
VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
working on), but they don’t really help with the internal migration
implemented here.  If I were to add them, they’d just be sent in
addition to the new messages added in this patch, i.e. SUSPEND on the
source before SET_DEVICE_STATE_FD, and RESUME on the destination after
CHECK_DEVICE_STATE.  (We could use RESUME in place of CHECK_DEVICE_STATE
on the destination, but not on the source, so we would still need
CHECK_DEVICE_STATE.)
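To make that ordering concrete, here is a minimal sketch of which
messages each side would see if SUSPEND/RESUME were sent in addition to
the messages from this series.  It is purely illustrative: the enum,
send_msg() and the trace buffer are made up for this mail, and
SUSPEND/RESUME do not exist as vhost-user messages today.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical message identifiers.  Only SET_DEVICE_STATE_FD and
 * CHECK_DEVICE_STATE are introduced by this series; SUSPEND and RESUME
 * are assumed vhost-user analogues of VHOST_VDPA_{SUSPEND,RESUME}.
 */
enum msg {
    MSG_SUSPEND,
    MSG_SET_DEVICE_STATE_FD,
    MSG_CHECK_DEVICE_STATE,
    MSG_RESUME,
};

/* Records the order in which messages were "sent" */
struct trace {
    enum msg sent[8];
    size_t n;
};

static void send_msg(struct trace *t, enum msg m)
{
    assert(t->n < sizeof(t->sent) / sizeof(t->sent[0]));
    t->sent[t->n++] = m;
}

/* Source side: stop the device first, then pull its state out */
static void migrate_source(struct trace *t)
{
    send_msg(t, MSG_SUSPEND);
    send_msg(t, MSG_SET_DEVICE_STATE_FD);   /* direction: SAVE */
    /* ... front-end reads the state from the pipe until EOF ... */
    send_msg(t, MSG_CHECK_DEVICE_STATE);
}

/* Destination side: push the state in, verify it, then start the device */
static void migrate_destination(struct trace *t)
{
    send_msg(t, MSG_SET_DEVICE_STATE_FD);   /* direction: LOAD */
    /* ... front-end writes the state into the pipe and closes it ... */
    send_msg(t, MSG_CHECK_DEVICE_STATE);
    send_msg(t, MSG_RESUME);
}
```

In this model, CHECK_DEVICE_STATE appears on both sides, which is the
point: RESUME could only ever stand in for it on the destination.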
Hanna
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13  8:50   ` Eugenio Perez Martin
@ 2023-04-13  9:25     ` Hanna Czenczek
  0 siblings, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13  9:25 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 13.04.23 10:50, Eugenio Perez Martin wrote:
> On Tue, Apr 11, 2023 at 5:33 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>> So-called "internal" virtio-fs migration refers to transporting the
>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>> this, we need to be able to transfer virtiofsd's internal state to and
>> from virtiofsd.
>>
>> Because virtiofsd's internal state will not be too large, we believe it
>> is best to transfer it as a single binary blob after the streaming
>> phase.  Because this method should be useful to other vhost-user
>> implementations, too, it is introduced as a general-purpose addition to
>> the protocol, not limited to vhost-user-fs.
>>
>> These are the additions to the protocol:
>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>    This feature signals support for transferring state, and is added so
>>    that migration can fail early when the back-end has no support.
>>
>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>    over which to transfer the state.  The front-end sends an FD to the
>>    back-end into/from which it can write/read its state, and the back-end
>>    can decide to either use it, or reply with a different FD for the
>>    front-end to override the front-end's choice.
>>    The front-end creates a simple pipe to transfer the state, but maybe
>>    the back-end already has an FD into/from which it has to write/read
>>    its state, in which case it will want to override the simple pipe.
>>    Conversely, maybe in the future we find a way to have the front-end
>>    get an immediate FD for the migration stream (in some cases), in which
>>    case we will want to send this to the back-end instead of creating a
>>    pipe.
>>    Hence the negotiation: If one side has a better idea than a plain
>>    pipe, we will want to use that.
>>
>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>    pipe (the end indicated by EOF), the front-end invokes this function
>>    to verify success.  There is no in-band way (through the pipe) to
>>    indicate failure, so we need to check explicitly.
>>
>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>> (which includes establishing the direction of transfer and migration
>> phase), the sending side writes its data into the pipe, and the reading
>> side reads it until it sees an EOF.  Then, the front-end will check for
>> success via CHECK_DEVICE_STATE, which on the destination side includes
>> checking for integrity (i.e. errors during deserialization).
>>
>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>   hw/virtio/vhost.c                 |  37 ++++++++
>>   4 files changed, 287 insertions(+)
[...]
>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>> index 2fe02ed5d4..29449e0fe2 100644
>> --- a/include/hw/virtio/vhost.h
>> +++ b/include/hw/virtio/vhost.h
>> @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
[...]
>> +/**
>> + * vhost_set_device_state_fd(): After transferring state from/to the
> Nitpick: This function doc is for vhost_check_device_state not
> vhost_set_device_state_fd.
>
> Thanks!
Oops, right, thanks!
Hanna
>> + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
>> + * has closed the pipe, inquire the back-end to report any potential
>> + * errors that have occurred on its side.  This allows to sense errors
>> + * like:
>> + * - During outgoing migration, when the source side had already started
>> + *   to produce its state, something went wrong and it failed to finish
>> + * - During incoming migration, when the received state is somehow
>> + *   invalid and cannot be processed by the back-end
>> + *
>> + * @dev: The vhost device
>> + * @errp: Potential error description
>> + *
>> + * Returns 0 when the back-end reports successful state transfer and
>> + * processing, and -errno when an error occurred somewhere.
>> + */
>> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
>> +
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-12 21:06   ` Stefan Hajnoczi
  2023-04-13  9:24     ` Hanna Czenczek
@ 2023-04-13 10:14     ` Eugenio Perez Martin
  2023-04-13 11:07       ` Stefan Hajnoczi
                         ` (3 more replies)
  1 sibling, 4 replies; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-13 10:14 UTC (permalink / raw)
  To: Hanna Czenczek, Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > So-called "internal" virtio-fs migration refers to transporting the
> > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > this, we need to be able to transfer virtiofsd's internal state to and
> > from virtiofsd.
> >
> > Because virtiofsd's internal state will not be too large, we believe it
> > is best to transfer it as a single binary blob after the streaming
> > phase.  Because this method should be useful to other vhost-user
> > implementations, too, it is introduced as a general-purpose addition to
> > the protocol, not limited to vhost-user-fs.
> >
> > These are the additions to the protocol:
> > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >   This feature signals support for transferring state, and is added so
> >   that migration can fail early when the back-end has no support.
> >
> > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >   over which to transfer the state.  The front-end sends an FD to the
> >   back-end into/from which it can write/read its state, and the back-end
> >   can decide to either use it, or reply with a different FD for the
> >   front-end to override the front-end's choice.
> >   The front-end creates a simple pipe to transfer the state, but maybe
> >   the back-end already has an FD into/from which it has to write/read
> >   its state, in which case it will want to override the simple pipe.
> >   Conversely, maybe in the future we find a way to have the front-end
> >   get an immediate FD for the migration stream (in some cases), in which
> >   case we will want to send this to the back-end instead of creating a
> >   pipe.
> >   Hence the negotiation: If one side has a better idea than a plain
> >   pipe, we will want to use that.
> >
> > - CHECK_DEVICE_STATE: After the state has been transferred through the
> >   pipe (the end indicated by EOF), the front-end invokes this function
> >   to verify success.  There is no in-band way (through the pipe) to
> >   indicate failure, so we need to check explicitly.
> >
> > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > (which includes establishing the direction of transfer and migration
> > phase), the sending side writes its data into the pipe, and the reading
> > side reads it until it sees an EOF.  Then, the front-end will check for
> > success via CHECK_DEVICE_STATE, which on the destination side includes
> > checking for integrity (i.e. errors during deserialization).
> >
> > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > ---
> >  include/hw/virtio/vhost-backend.h |  24 +++++
> >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >  hw/virtio/vhost.c                 |  37 ++++++++
> >  4 files changed, 287 insertions(+)
> >
> > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > index ec3fbae58d..5935b32fe3 100644
> > --- a/include/hw/virtio/vhost-backend.h
> > +++ b/include/hw/virtio/vhost-backend.h
> > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >  } VhostSetConfigType;
> >
> > +typedef enum VhostDeviceStateDirection {
> > +    /* Transfer state from back-end (device) to front-end */
> > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > +    /* Transfer state from front-end to back-end (device) */
> > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > +} VhostDeviceStateDirection;
> > +
> > +typedef enum VhostDeviceStatePhase {
> > +    /* The device (and all its vrings) is stopped */
> > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > +} VhostDeviceStatePhase;
>
> vDPA has:
>
>   /* Suspend a device so it does not process virtqueue requests anymore
>    *
>    * After the return of ioctl the device must preserve all the necessary state
>    * (the virtqueue vring base plus the possible device specific states) that is
>    * required for restoring in the future. The device must not change its
>    * configuration after that point.
>    */
>   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>
>   /* Resume a device so it can resume processing virtqueue requests
>    *
>    * After the return of this ioctl the device will have restored all the
>    * necessary states and it is fully operational to continue processing the
>    * virtqueue descriptors.
>    */
>   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>
> I wonder if it makes sense to import these into vhost-user so that the
> difference between kernel vhost and vhost-user is minimized. It's okay
> if one of them is ahead of the other, but it would be nice to avoid
> overlapping/duplicated functionality.
>
That's what I had in mind in the first versions. I proposed VHOST_STOP
instead of VHOST_VDPA_STOP for this very reason. Later it was changed
to SUSPEND.
In my opinion, it is generally better to make the interface less
parametrized and to rely on the messages and their semantics. In other
words, instead of
vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
the vhost-user equivalent of VHOST_VDPA_SUSPEND as a separate command.
The same reasoning applies to the "direction" parameter. Maybe it is
better to split it into "set_state_fd" and "get_state_fd"?
In that case, reusing the ioctls as vhost-user messages would be ok.
But that moves this proposal further from the VFIO code, which uses
"migration_set_state(state)", and maybe that approach is better when
the number of states is high.
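A rough sketch of what such a split could look like, assuming that
"get" maps to SAVE and "set" maps to LOAD (that mapping is my reading,
not something settled in this thread). All names below are
hypothetical, and the back-end is a stand-in struct, not qemu code:

```c
/* Mirrors VhostDeviceStateDirection from the patch */
typedef enum {
    DIR_SAVE = 0,   /* back-end -> front-end */
    DIR_LOAD = 1,   /* front-end -> back-end */
} Direction;

/* Stand-in for a back-end; it only records what it was last asked to do */
struct toy_backend {
    Direction last_direction;
    int last_fd;
};

/* Parametrized form, as in the patch: one call, direction as an argument */
static int toy_set_device_state_fd(struct toy_backend *be,
                                   Direction direction, int fd)
{
    be->last_direction = direction;
    be->last_fd = fd;
    return 0;
}

/* Split form (hypothetical): the message itself implies the direction */
static int toy_get_state_fd(struct toy_backend *be, int fd)
{
    return toy_set_device_state_fd(be, DIR_SAVE, fd);
}

static int toy_set_state_fd(struct toy_backend *be, int fd)
{
    return toy_set_device_state_fd(be, DIR_LOAD, fd);
}
```

Both forms carry the same information; the difference is only whether
the direction is a parameter or part of the message name.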
BTW, is there any usage for *reply_fd at this moment from the backend?
> (And I hope vDPA will import the device state vhost-user messages
> introduced in this series.)
>
I guess they will be needed for vdpa-fs devices? Is there any emulated
virtio-fs in qemu?
Thanks!
> > +
> >  struct vhost_inflight;
> >  struct vhost_dev;
> >  struct vhost_log;
> > @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
> >
> >  typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
> >
> > +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
> > +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
> > +                                            VhostDeviceStateDirection direction,
> > +                                            VhostDeviceStatePhase phase,
> > +                                            int fd,
> > +                                            int *reply_fd,
> > +                                            Error **errp);
> > +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
> > +
> >  typedef struct VhostOps {
> >      VhostBackendType backend_type;
> >      vhost_backend_init vhost_backend_init;
> > @@ -181,6 +202,9 @@ typedef struct VhostOps {
> >      vhost_force_iommu_op vhost_force_iommu;
> >      vhost_set_config_call_op vhost_set_config_call;
> >      vhost_reset_status_op vhost_reset_status;
> > +    vhost_supports_migratory_state_op vhost_supports_migratory_state;
> > +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
> > +    vhost_check_device_state_op vhost_check_device_state;
> >  } VhostOps;
> >
> >  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> > index 2fe02ed5d4..29449e0fe2 100644
> > --- a/include/hw/virtio/vhost.h
> > +++ b/include/hw/virtio/vhost.h
> > @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
> >                             struct vhost_inflight *inflight);
> >  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
> >                             struct vhost_inflight *inflight);
> > +
> > +/**
> > + * vhost_supports_migratory_state(): Checks whether the back-end
> > + * supports transferring internal state for the purpose of migration.
> > + * Support for this feature is required for vhost_set_device_state_fd()
> > + * and vhost_check_device_state().
> > + *
> > + * @dev: The vhost device
> > + *
> > + * Returns true if the device supports these commands, and false if it
> > + * does not.
> > + */
> > +bool vhost_supports_migratory_state(struct vhost_dev *dev);
> > +
> > +/**
> > + * vhost_set_device_state_fd(): Begin transfer of internal state from/to
> > + * the back-end for the purpose of migration.  Data is to be transferred
> > + * over a pipe according to @direction and @phase.  The sending end must
> > + * only write to the pipe, and the receiving end must only read from it.
> > + * Once the sending end is done, it closes its FD.  The receiving end
> > + * must take this as the end-of-transfer signal and close its FD, too.
> > + *
> > + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
> > + * read FD for LOAD.  This function transfers ownership of @fd to the
> > + * back-end, i.e. closes it in the front-end.
> > + *
> > + * The back-end may optionally reply with an FD of its own, if this
> > + * improves efficiency on its end.  In this case, the returned FD is
> > + * stored in *reply_fd.  The back-end will discard the FD sent to it,
> > + * and the front-end must use *reply_fd for transferring state to/from
> > + * the back-end.
> > + *
> > + * @dev: The vhost device
> > + * @direction: The direction in which the state is to be transferred.
> > + *             For outgoing migrations, this is SAVE, and data is read
> > + *             from the back-end and stored by the front-end in the
> > + *             migration stream.
> > + *             For incoming migrations, this is LOAD, and data is read
> > + *             by the front-end from the migration stream and sent to
> > + *             the back-end to restore the saved state.
> > + * @phase: Which migration phase we are in.  Currently, there is only
> > + *         STOPPED (device and all vrings are stopped), in the future,
> > + *         more phases such as PRE_COPY or POST_COPY may be added.
> > + * @fd: Back-end's end of the pipe through which to transfer state; note
> > + *      that ownership is transferred to the back-end, so this function
> > + *      closes @fd in the front-end.
> > + * @reply_fd: If the back-end wishes to use a different pipe for state
> > + *            transfer, this will contain an FD for the front-end to
> > + *            use.  Otherwise, -1 is stored here.
> > + * @errp: Potential error description
> > + *
> > + * Returns 0 on success, and -errno on failure.
> > + */
> > +int vhost_set_device_state_fd(struct vhost_dev *dev,
> > +                              VhostDeviceStateDirection direction,
> > +                              VhostDeviceStatePhase phase,
> > +                              int fd,
> > +                              int *reply_fd,
> > +                              Error **errp);
> > +
> > +/**
> > + * vhost_set_device_state_fd(): After transferring state from/to the
> > + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
> > + * has closed the pipe, inquire the back-end to report any potential
> > + * errors that have occurred on its side.  This allows to sense errors
> > + * like:
> > + * - During outgoing migration, when the source side had already started
> > + *   to produce its state, something went wrong and it failed to finish
> > + * - During incoming migration, when the received state is somehow
> > + *   invalid and cannot be processed by the back-end
> > + *
> > + * @dev: The vhost device
> > + * @errp: Potential error description
> > + *
> > + * Returns 0 when the back-end reports successful state transfer and
> > + * processing, and -errno when an error occurred somewhere.
> > + */
> > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
> > +
> >  #endif
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index e5285df4ba..93d8f2494a 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -83,6 +83,7 @@ enum VhostUserProtocolFeature {
> >      /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
> >      VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
> >      VHOST_USER_PROTOCOL_F_STATUS = 16,
> > +    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
> >      VHOST_USER_PROTOCOL_F_MAX
> >  };
> >
> > @@ -130,6 +131,8 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_REM_MEM_REG = 38,
> >      VHOST_USER_SET_STATUS = 39,
> >      VHOST_USER_GET_STATUS = 40,
> > +    VHOST_USER_SET_DEVICE_STATE_FD = 41,
> > +    VHOST_USER_CHECK_DEVICE_STATE = 42,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >
> > @@ -210,6 +213,12 @@ typedef struct {
> >      uint32_t size; /* the following payload size */
> >  } QEMU_PACKED VhostUserHeader;
> >
> > +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
> > +typedef struct VhostUserTransferDeviceState {
> > +    uint32_t direction;
> > +    uint32_t phase;
> > +} VhostUserTransferDeviceState;
> > +
> >  typedef union {
> >  #define VHOST_USER_VRING_IDX_MASK   (0xff)
> >  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> > @@ -224,6 +233,7 @@ typedef union {
> >          VhostUserCryptoSession session;
> >          VhostUserVringArea area;
> >          VhostUserInflight inflight;
> > +        VhostUserTransferDeviceState transfer_state;
> >  } VhostUserPayload;
> >
> >  typedef struct VhostUserMsg {
> > @@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
> >      }
> >  }
> >
> > +static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
> > +{
> > +    return virtio_has_feature(dev->protocol_features,
> > +                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
> > +}
> > +
> > +static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
> > +                                          VhostDeviceStateDirection direction,
> > +                                          VhostDeviceStatePhase phase,
> > +                                          int fd,
> > +                                          int *reply_fd,
> > +                                          Error **errp)
> > +{
> > +    int ret;
> > +    struct vhost_user *vu = dev->opaque;
> > +    VhostUserMsg msg = {
> > +        .hdr = {
> > +            .request = VHOST_USER_SET_DEVICE_STATE_FD,
> > +            .flags = VHOST_USER_VERSION,
> > +            .size = sizeof(msg.payload.transfer_state),
> > +        },
> > +        .payload.transfer_state = {
> > +            .direction = direction,
> > +            .phase = phase,
> > +        },
> > +    };
> > +
> > +    *reply_fd = -1;
> > +
> > +    if (!vhost_user_supports_migratory_state(dev)) {
> > +        close(fd);
> > +        error_setg(errp, "Back-end does not support migration state transfer");
> > +        return -ENOTSUP;
> > +    }
> > +
> > +    ret = vhost_user_write(dev, &msg, &fd, 1);
> > +    close(fd);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "Failed to send SET_DEVICE_STATE_FD message");
> > +        return ret;
> > +    }
> > +
> > +    ret = vhost_user_read(dev, &msg);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "Failed to receive SET_DEVICE_STATE_FD reply");
> > +        return ret;
> > +    }
> > +
> > +    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
> > +        error_setg(errp,
> > +                   "Received unexpected message type, expected %d, received %d",
> > +                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
> > +        return -EPROTO;
> > +    }
> > +
> > +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> > +        error_setg(errp,
> > +                   "Received bad message size, expected %zu, received %" PRIu32,
> > +                   sizeof(msg.payload.u64), msg.hdr.size);
> > +        return -EPROTO;
> > +    }
> > +
> > +    if ((msg.payload.u64 & 0xff) != 0) {
> > +        error_setg(errp, "Back-end did not accept migration state transfer");
> > +        return -EIO;
> > +    }
> > +
> > +    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
> > +        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
> > +        if (*reply_fd < 0) {
> > +            error_setg(errp,
> > +                       "Failed to get back-end-provided transfer pipe FD");
> > +            *reply_fd = -1;
> > +            return -EIO;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
> > +{
> > +    int ret;
> > +    VhostUserMsg msg = {
> > +        .hdr = {
> > +            .request = VHOST_USER_CHECK_DEVICE_STATE,
> > +            .flags = VHOST_USER_VERSION,
> > +            .size = 0,
> > +        },
> > +    };
> > +
> > +    if (!vhost_user_supports_migratory_state(dev)) {
> > +        error_setg(errp, "Back-end does not support migration state transfer");
> > +        return -ENOTSUP;
> > +    }
> > +
> > +    ret = vhost_user_write(dev, &msg, NULL, 0);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "Failed to send CHECK_DEVICE_STATE message");
> > +        return ret;
> > +    }
> > +
> > +    ret = vhost_user_read(dev, &msg);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "Failed to receive CHECK_DEVICE_STATE reply");
> > +        return ret;
> > +    }
> > +
> > +    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
> > +        error_setg(errp,
> > +                   "Received unexpected message type, expected %d, received %d",
> > +                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
> > +        return -EPROTO;
> > +    }
> > +
> > +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> > +        error_setg(errp,
> > +                   "Received bad message size, expected %zu, received %" PRIu32,
> > +                   sizeof(msg.payload.u64), msg.hdr.size);
> > +        return -EPROTO;
> > +    }
> > +
> > +    if (msg.payload.u64 != 0) {
> > +        error_setg(errp, "Back-end failed to process its internal state");
> > +        return -EIO;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  const VhostOps user_ops = {
> >          .backend_type = VHOST_BACKEND_TYPE_USER,
> >          .vhost_backend_init = vhost_user_backend_init,
> > @@ -2716,4 +2860,7 @@ const VhostOps user_ops = {
> >          .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
> >          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
> >          .vhost_dev_start = vhost_user_dev_start,
> > +        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
> > +        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
> > +        .vhost_check_device_state = vhost_user_check_device_state,
> >  };
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index cbff589efa..90099d8f6a 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
> >
> >      return -ENOSYS;
> >  }
> > +
> > +bool vhost_supports_migratory_state(struct vhost_dev *dev)
> > +{
> > +    if (dev->vhost_ops->vhost_supports_migratory_state) {
> > +        return dev->vhost_ops->vhost_supports_migratory_state(dev);
> > +    }
> > +
> > +    return false;
> > +}
> > +
> > +int vhost_set_device_state_fd(struct vhost_dev *dev,
> > +                              VhostDeviceStateDirection direction,
> > +                              VhostDeviceStatePhase phase,
> > +                              int fd,
> > +                              int *reply_fd,
> > +                              Error **errp)
> > +{
> > +    if (dev->vhost_ops->vhost_set_device_state_fd) {
> > +        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
> > +                                                         fd, reply_fd, errp);
> > +    }
> > +
> > +    error_setg(errp,
> > +               "vhost transport does not support migration state transfer");
> > +    return -ENOSYS;
> > +}
> > +
> > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
> > +{
> > +    if (dev->vhost_ops->vhost_check_device_state) {
> > +        return dev->vhost_ops->vhost_check_device_state(dev, errp);
> > +    }
> > +
> > +    error_setg(errp,
> > +               "vhost transport does not support migration state transfer");
> > +    return -ENOSYS;
> > +}
> > --
> > 2.39.1
> >
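For reference, the pipe semantics described in the quoted commit
message (the writer closes its end, the reader reads until EOF) can be
sketched with plain POSIX calls. This is a single-process toy that
assumes the blob is small enough to fit in the pipe buffer; the
function names are made up, and none of this is qemu or virtiofsd code:

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read from fd until EOF.  Returns the number of bytes read, or -1 on a
 * read error or if the buffer fills up before EOF is seen.
 */
static ssize_t read_until_eof(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        if (n == 0) {
            return (ssize_t)total;   /* EOF: the writer closed its end */
        }
        total += (size_t)n;
    }
    return -1;   /* buffer exhausted without seeing EOF */
}

/*
 * Send a small state blob through a pipe and read it back.  Closing the
 * write end is the only end-of-transfer signal, as in the patch.
 */
static int demo_transfer(const char *state, size_t len,
                         char *out, size_t cap, ssize_t *got)
{
    int fds[2];
    ssize_t w;

    if (pipe(fds) < 0) {
        return -1;
    }

    w = write(fds[1], state, len);   /* sending side */
    close(fds[1]);                   /* produces EOF for the reader */
    if (w != (ssize_t)len) {
        close(fds[0]);
        return -1;
    }

    *got = read_until_eof(fds[0], out, cap);   /* receiving side */
    close(fds[0]);
    return *got < 0 ? -1 : 0;
}
```

Note that the reader sees the same EOF whether the writer finished its
blob or gave up halfway, which is why a separate CHECK_DEVICE_STATE
step is needed.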
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-11 15:05 ` [PATCH 1/4] vhost: Re-enable vrings after setting features Hanna Czenczek
  2023-04-12 10:55   ` German Maglione
  2023-04-12 20:51   ` Stefan Hajnoczi
@ 2023-04-13 11:03   ` Stefan Hajnoczi
  2023-04-13 17:32     ` Hanna Czenczek
  2023-04-13 13:19   ` Michael S. Tsirkin
  3 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-13 11:03 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, 11 Apr 2023 at 11:05, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
> setting the vhost features will set this feature, too.  Doing so
> disables all vrings, which may not be intended.
>
> For example, enabling or disabling logging during migration requires
> setting those features (to set or unset VHOST_F_LOG_ALL), which will
> automatically disable all vrings.  In either case, the VM is running
> (disabling logging is done after a failed or cancelled migration, and
> only once the VM is running again, see comment in
> memory_global_dirty_log_stop()), so the vrings should really be enabled.
> As a result, the back-end seems to hang.
>
> To fix this, we must remember whether the vrings are supposed to be
> enabled, and, if so, re-enable them after a SET_FEATURES call that set
> VHOST_USER_F_PROTOCOL_FEATURES.
>
> It seems less than ideal that there is a short period in which the VM is
> running but the vrings will be stopped (between SET_FEATURES and
> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
> e.g. by introducing a new flag or vhost-user protocol feature to disable
> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
> new functions for setting/clearing singular feature bits (so that
> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
>
> Even with such a potential addition to the protocol, we still need this
> fix here, because we cannot expect that back-ends will implement this
> addition.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost.h | 10 ++++++++++
>  hw/virtio/vhost.c         | 13 +++++++++++++
>  2 files changed, 23 insertions(+)
>
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a52f273347..2fe02ed5d4 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -90,6 +90,16 @@ struct vhost_dev {
>      int vq_index_end;
>      /* if non-zero, minimum required value for max_queues */
>      int num_queues;
> +
> +    /*
> +     * Whether the virtqueues are supposed to be enabled (via
> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
> +     * enabling/disabling logging) will disable all virtqueues if
> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
> +     * re-enable them if this field is set.
> +     */
> +    bool enable_vqs;
> +
>      /**
>       * vhost feature handling requires matching the feature set
>       * offered by a backend which may be a subset of the total
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index a266396576..cbff589efa 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
>  static QLIST_HEAD(, vhost_dev) vhost_devices =
>      QLIST_HEAD_INITIALIZER(vhost_devices);
>
> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
> +
>  bool vhost_has_free_slot(void)
>  {
>      unsigned int slots_limit = ~0U;
> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>          }
>      }
>
> +    if (dev->enable_vqs) {
> +        /*
> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
Is there a reason to put this vhost-user-specific workaround in
vhost.c instead of vhost-user.c?
Stefan
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-13  8:19     ` Hanna Czenczek
@ 2023-04-13 11:03       ` Stefan Hajnoczi
  2023-04-13 14:24         ` Anton Kuchin
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-13 11:03 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella, Roman Kagan, Maxime Coquelin,
	Marc-André Lureau
On Thu, 13 Apr 2023 at 04:20, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 12.04.23 22:51, Stefan Hajnoczi wrote:
> > On Tue, Apr 11, 2023 at 05:05:12PM +0200, Hanna Czenczek wrote:
> >> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
> >> setting the vhost features will set this feature, too.  Doing so
> >> disables all vrings, which may not be intended.
> >>
> >> For example, enabling or disabling logging during migration requires
> >> setting those features (to set or unset VHOST_F_LOG_ALL), which will
> >> automatically disable all vrings.  In either case, the VM is running
> >> (disabling logging is done after a failed or cancelled migration, and
> >> only once the VM is running again, see comment in
> >> memory_global_dirty_log_stop()), so the vrings should really be enabled.
> >> As a result, the back-end seems to hang.
> >>
> >> To fix this, we must remember whether the vrings are supposed to be
> >> enabled, and, if so, re-enable them after a SET_FEATURES call that set
> >> VHOST_USER_F_PROTOCOL_FEATURES.
> >>
> >> It seems less than ideal that there is a short period in which the VM is
> >> running but the vrings will be stopped (between SET_FEATURES and
> >> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
> >> e.g. by introducing a new flag or vhost-user protocol feature to disable
> >> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
> >> new functions for setting/clearing singular feature bits (so that
> >> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
> >>
> >> Even with such a potential addition to the protocol, we still need this
> >> fix here, because we cannot expect that back-ends will implement this
> >> addition.
> >>
> >> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >> ---
> >>   include/hw/virtio/vhost.h | 10 ++++++++++
> >>   hw/virtio/vhost.c         | 13 +++++++++++++
> >>   2 files changed, 23 insertions(+)
> >>
> >> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> >> index a52f273347..2fe02ed5d4 100644
> >> --- a/include/hw/virtio/vhost.h
> >> +++ b/include/hw/virtio/vhost.h
> >> @@ -90,6 +90,16 @@ struct vhost_dev {
> >>       int vq_index_end;
> >>       /* if non-zero, minimum required value for max_queues */
> >>       int num_queues;
> >> +
> >> +    /*
> >> +     * Whether the virtqueues are supposed to be enabled (via
> >> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
> >> +     * enabling/disabling logging) will disable all virtqueues if
> >> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
> >> +     * re-enable them if this field is set.
> >> +     */
> >> +    bool enable_vqs;
> >> +
> >>       /**
> >>        * vhost feature handling requires matching the feature set
> >>        * offered by a backend which may be a subset of the total
> >> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >> index a266396576..cbff589efa 100644
> >> --- a/hw/virtio/vhost.c
> >> +++ b/hw/virtio/vhost.c
> >> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
> >>   static QLIST_HEAD(, vhost_dev) vhost_devices =
> >>       QLIST_HEAD_INITIALIZER(vhost_devices);
> >>
> >> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
> >> +
> >>   bool vhost_has_free_slot(void)
> >>   {
> >>       unsigned int slots_limit = ~0U;
> >> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
> >>           }
> >>       }
> >>
> >> +    if (dev->enable_vqs) {
> >> +        /*
> >> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
> >> +         * virtqueues, even if that was not intended; re-enable them if
> >> +         * necessary.
> >> +         */
> >> +        vhost_dev_set_vring_enable(dev, true);
> >> +    }
> >> +
> >>   out:
> >>       return r;
> >>   }
> >> @@ -1896,6 +1907,8 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
> >>
> >>   static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
> >>   {
> >> +    hdev->enable_vqs = enable;
> >> +
> >>       if (!hdev->vhost_ops->vhost_set_vring_enable) {
> >>           return 0;
> >>       }
> > The vhost-user spec doesn't say that VHOST_F_LOG_ALL needs to be toggled
> > at runtime and I don't think VHOST_USER_SET_PROTOCOL_FEATURES is
> > intended to be used like that. This issue shows why doing so is a bad
> > idea.
> >
> > VHOST_F_LOG_ALL does not need to be toggled to control logging. Logging
> > is controlled at runtime by the presence of the dirty log
> > (VHOST_USER_SET_LOG_BASE) and the per-vring logging flag
> > (VHOST_VRING_F_LOG).
>
> Technically, the spec doesn’t say that SET_LOG_BASE is required.  It says:
>
> “To start/stop logging of data/used ring writes, the front-end may send
> messages VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and
> VHOST_USER_SET_VRING_ADDR with VHOST_VRING_F_LOG in ring’s flags set to
> 1/0, respectively.”
>
> (So the spec also very much does imply that toggling F_LOG_ALL at
> runtime is a valid way to enable/disable logging.  If we were to no
> longer do that, we should clarify it there.)
I missed that VHOST_VRING_F_LOG only controls logging used ring writes
while writes to descriptors are always logged when VHOST_F_LOG_ALL is
set. I agree that the spec does require VHOST_F_LOG_ALL to be toggled
at runtime.
What I suggested won't work.
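
To make the interaction concrete, here is a minimal sketch of why toggling
VHOST_F_LOG_ALL through SET_FEATURES always re-sends
VHOST_USER_F_PROTOCOL_FEATURES as well.  The bit numbers (26 and 30) are the
ones from the vhost headers; the helper name and its parameters are
hypothetical and only illustrate the idea, they are not QEMU's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as defined in the vhost / vhost-user headers. */
#define VHOST_F_LOG_ALL                 26
#define VHOST_USER_F_PROTOCOL_FEATURES  30

/*
 * Hypothetical helper: build the feature word sent with SET_FEATURES.
 * 'acked' is the currently negotiated feature set, 'backend' is the set
 * of features the back-end offers.
 */
static uint64_t features_for_set_features(uint64_t acked, uint64_t backend,
                                          bool enable_log)
{
    uint64_t features = acked;

    /* Logging is switched by setting or clearing a single bit... */
    if (enable_log) {
        features |= UINT64_C(1) << VHOST_F_LOG_ALL;
    } else {
        features &= ~(UINT64_C(1) << VHOST_F_LOG_ALL);
    }

    /*
     * ...but the whole word is sent, so the protocol-features bit goes
     * along with it whenever the back-end offers it.
     */
    if (backend & (UINT64_C(1) << VHOST_USER_F_PROTOCOL_FEATURES)) {
        features |= UINT64_C(1) << VHOST_USER_F_PROTOCOL_FEATURES;
    }

    return features;
}
```

Because SET_FEATURES carries the full feature word, there is no way in this
model to flip only the logging bit, which is what a per-bit set/clear message
would change.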
> I mean, naturally, logging without a shared memory area to log in to
> isn’t much fun, so we could clarify that SET_LOG_BASE is also a
> requirement, but it looks to me as if we can’t use SET_LOG_BASE to
> disable logging, because it’s supposed to always pass a valid FD (at
> least libvhost-user expects this:
> https://gitlab.com/qemu-project/qemu/-/blob/master/subprojects/libvhost-user/libvhost-user.c#L1044).
As an aside: I don't understand how logging without an fd is supposed
to work in QEMU's code or in the vhost-user spec. QEMU does not
support that case even though it's written as if shmfd were optional.
> So after a cancelled migration, the dirty bitmap SHM will stay around
> indefinitely (which is already not great, but if we were to use the
> presence of that bitmap as an indicator as to whether we should log or
> not, it would be worse).
Yes, continuing to log forever is worse.
>
> So the VRING_F_LOG flag remains.
>
> > > I suggest permanently enabling VHOST_F_LOG_ALL upon connection when the
> > > backend supports it. No spec changes are required.
> >
> > libvhost-user looks like it will work. I didn't look at DPDK/SPDK, but
> > checking that it works there is important too.
>
> I can’t find VRING_F_LOG in libvhost-user, so what protocol do you mean
> exactly that would work in libvhost-user?  Because SET_LOG_BASE won’t
> work, as you can’t use it to disable logging.
That's true. There is no way to disable logging.
> (For DPDK, I’m not sure.  It looks like it sometimes takes VRING_F_LOG
> into account, but I think only when it comes to logging accesses to the
> vring specifically, i.e. not DMA to other areas of guest memory?  I
> think only places that use vq->log_guest_addr implicitly check it,
> others don’t.  So for example,
> vhost_log_write_iova()/vhost_log_cache_write_iova() don’t seem to check
> VRING_F_LOG, which seem to be the functions generally used for writes to
> memory outside of the vrings.  So here, even if VRING_F_LOG is disabled
> for all vrings, as long as a log SHM is set, all writes to memory
> outside of the immediate vrings seem to cause logging (as long as
> F_LOG_ALL is set).)
>
> Hanna
>
> > I have CCed people who may be interested in this issue. This is the
> > first time I've looked at vhost-user logging, so this idea may not work.
> >
> > Stefan
>
>
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 10:14     ` Eugenio Perez Martin
@ 2023-04-13 11:07       ` Stefan Hajnoczi
  2023-04-13 17:31       ` Hanna Czenczek
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-13 11:07 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Hanna Czenczek, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, 13 Apr 2023 at 06:15, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > (And I hope vDPA will import the device state vhost-user messages
> > introduced in this series.)
> >
>
> I guess they will be needed for vdpa-fs devices? Is there any emulated
> virtio-fs in qemu?
Maybe also virtio-gpu or virtio-crypto, if someone decides to create
hardware or in-kernel implementations.
virtiofs is not built into QEMU; there are only vhost-user implementations.
Stefan
* Re: [PATCH 3/4] vhost: Add high-level state save/load functions
  2023-04-13  9:04     ` Hanna Czenczek
@ 2023-04-13 11:22       ` Stefan Hajnoczi
  0 siblings, 0 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-13 11:22 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, 13 Apr 2023 at 05:04, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 12.04.23 23:14, Stefan Hajnoczi wrote:
> > On Tue, Apr 11, 2023 at 05:05:14PM +0200, Hanna Czenczek wrote:
> >> vhost_save_backend_state() and vhost_load_backend_state() can be used by
> >> vhost front-ends to easily save and load the back-end's state to/from
> >> the migration stream.
> >>
> >> Because we do not know the full state size ahead of time,
> >> vhost_save_backend_state() simply reads the data in 1 MB chunks, and
> >> writes each chunk consecutively into the migration stream, prefixed by
> >> its length.  EOF is indicated by a 0-length chunk.
> >>
> >> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >> ---
> >>   include/hw/virtio/vhost.h |  35 +++++++
> >>   hw/virtio/vhost.c         | 196 ++++++++++++++++++++++++++++++++++++++
> >>   2 files changed, 231 insertions(+)
> >>
> >> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> >> index 29449e0fe2..d1f1e9e1f3 100644
> >> --- a/include/hw/virtio/vhost.h
> >> +++ b/include/hw/virtio/vhost.h
> >> @@ -425,4 +425,39 @@ int vhost_set_device_state_fd(struct vhost_dev *dev,
> >>    */
> >>   int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
> >>
> >> +/**
> >> + * vhost_save_backend_state(): High-level function to receive a vhost
> >> + * back-end's state, and save it in `f`.  Uses
> >> + * `vhost_set_device_state_fd()` to get the data from the back-end, and
> >> + * stores it in consecutive chunks that are each prefixed by their
> >> + * respective length (be32).  The end is marked by a 0-length chunk.
> >> + *
> >> + * Must only be called while the device and all its vrings are stopped
> >> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
> >> + *
> >> + * @dev: The vhost device from which to save the state
> >> + * @f: Migration stream in which to save the state
> >> + * @errp: Potential error message
> >> + *
> >> + * Returns 0 on success, and -errno otherwise.
> >> + */
> >> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
> >> +
> >> +/**
> >> + * vhost_load_backend_state(): High-level function to load a vhost
> >> + * back-end's state from `f`, and send it over to the back-end.  Reads
> >> + * the data from `f` in the format used by `vhost_save_backend_state()`, and
> >> + * uses `vhost_set_device_state_fd()` to transfer it to the back-end.
> >> + *
> >> + * Must only be called while the device and all its vrings are stopped
> >> + * (`VHOST_TRANSFER_STATE_PHASE_STOPPED`).
> >> + *
> >> + * @dev: The vhost device to which to send the state
> >> + * @f: Migration stream from which to load the state
> >> + * @errp: Potential error message
> >> + *
> >> + * Returns 0 on success, and -errno otherwise.
> >> + */
> >> +int vhost_load_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp);
> >> +
> >>   #endif
> >> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >> index 90099d8f6a..d08849c691 100644
> >> --- a/hw/virtio/vhost.c
> >> +++ b/hw/virtio/vhost.c
> >> @@ -2125,3 +2125,199 @@ int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
> >>                  "vhost transport does not support migration state transfer");
> >>       return -ENOSYS;
> >>   }
> >> +
> >> +int vhost_save_backend_state(struct vhost_dev *dev, QEMUFile *f, Error **errp)
> >> +{
> >> +    /* Maximum chunk size in which to transfer the state */
> >> +    const size_t chunk_size = 1 * 1024 * 1024;
> >> +    void *transfer_buf = NULL;
> >> +    g_autoptr(GError) g_err = NULL;
> >> +    int pipe_fds[2], read_fd = -1, write_fd = -1, reply_fd = -1;
> >> +    int ret;
> >> +
> >> +    /* [0] for reading (our end), [1] for writing (back-end's end) */
> >> +    if (!g_unix_open_pipe(pipe_fds, FD_CLOEXEC, &g_err)) {
> >> +        error_setg(errp, "Failed to set up state transfer pipe: %s",
> >> +                   g_err->message);
> >> +        ret = -EINVAL;
> >> +        goto fail;
> >> +    }
> >> +
> >> +    read_fd = pipe_fds[0];
> >> +    write_fd = pipe_fds[1];
> >> +
> >> +    /* VHOST_TRANSFER_STATE_PHASE_STOPPED means the device must be stopped */
> >> +    assert(!dev->started && !dev->enable_vqs);
> >> +
> >> +    /* Transfer ownership of write_fd to the back-end */
> >> +    ret = vhost_set_device_state_fd(dev,
> >> +                                    VHOST_TRANSFER_STATE_DIRECTION_SAVE,
> >> +                                    VHOST_TRANSFER_STATE_PHASE_STOPPED,
> >> +                                    write_fd,
> >> +                                    &reply_fd,
> >> +                                    errp);
> >> +    if (ret < 0) {
> >> +        error_prepend(errp, "Failed to initiate state transfer: ");
> >> +        goto fail;
> >> +    }
> >> +
> >> +    /* If the back-end wishes to use a different pipe, switch over */
> >> +    if (reply_fd >= 0) {
> >> +        close(read_fd);
> >> +        read_fd = reply_fd;
> >> +    }
> >> +
> >> +    transfer_buf = g_malloc(chunk_size);
> >> +
> >> +    while (true) {
> >> +        ssize_t read_ret;
> >> +
> >> +        read_ret = read(read_fd, transfer_buf, chunk_size);
> >> +        if (read_ret < 0) {
> >> +            ret = -errno;
> >> +            error_setg_errno(errp, -ret, "Failed to receive state");
> >> +            goto fail;
> >> +        }
> >> +
> >> +        assert(read_ret <= chunk_size);
> >> +        qemu_put_be32(f, read_ret);
> >> +
> >> +        if (read_ret == 0) {
> >> +            /* EOF */
> >> +            break;
> >> +        }
> >> +
> >> +        qemu_put_buffer(f, transfer_buf, read_ret);
> >> +    }
> > I think this synchronous approach with a single contiguous stream of
> > chunks is okay for now.
> >
> > Does this make the QEMU monitor unresponsive if the backend is slow?
>
> Oh, absolutely.  But as far as I can tell that’s also the case if the
> back-end doesn’t respond (or responds slowly) to vhost-user messages,
> because they’re generally sent/received synchronously. (So, notably,
> these synchronous read()/write() calls aren’t worse than the previous
> model of transferring data through shared memory, but awaiting the
> back-end via vhost-user calls between each chunk.)
>
> I don’t know whether it’s even possible to do it better (while keeping
> it all in the switch-over phase).  The VMState functions aren’t
> coroutines or AIO, so I can’t think of a way to improve this.
>
> (Well, except:)
>
> > In the future the interface could be extended to allow participating in
> > the iterative phase of migration. Then chunks from multiple backends
> > (plus guest RAM) would be interleaved and there would be some
> > parallelism.
>
> Sure.  That would also definitely help with an unintentionally slow
> back-end.  If the back-end making qemu unresponsive is a deal-breaker,
> then we’d have to do this now.
Yes, vhost-user trusts the backend to respond within a reasonable
amount of time.
I think iterating over multiple devices can wait until iteration is needed.
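
For reference, the chunk framing described in the commit message (each chunk
prefixed by its be32 length, a 0-length chunk marking EOF) can be sketched as
below.  This is only an illustration under simplifying assumptions: it writes
into a plain memory buffer instead of a `QEMUFile`, and `put_chunk()` and
`get_state()` are hypothetical names, not functions from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append one chunk at offset 'off': a 4-byte big-endian length followed by
 * the payload.  A chunk with len == 0 marks the end of the state.
 * The caller must ensure 'out' is large enough.  Returns the new offset.
 */
static size_t put_chunk(uint8_t *out, size_t off, const void *data,
                        uint32_t len)
{
    out[off + 0] = (uint8_t)(len >> 24);
    out[off + 1] = (uint8_t)(len >> 16);
    out[off + 2] = (uint8_t)(len >> 8);
    out[off + 3] = (uint8_t)len;
    if (len > 0) {
        memcpy(out + off + 4, data, len);
    }
    return off + 4 + len;
}

/*
 * Read chunks from 'in' until the 0-length terminator and concatenate the
 * payloads into 'state'.  Returns the number of payload bytes, or -1 if the
 * stream is truncated or 'state' is too small.
 */
static long get_state(const uint8_t *in, size_t in_len,
                      uint8_t *state, size_t state_cap)
{
    size_t off = 0, total = 0;

    for (;;) {
        uint32_t len;

        if (in_len - off < 4) {
            return -1;          /* no room for a length prefix */
        }
        len = ((uint32_t)in[off] << 24) | ((uint32_t)in[off + 1] << 16) |
              ((uint32_t)in[off + 2] << 8) | (uint32_t)in[off + 3];
        off += 4;

        if (len == 0) {
            return (long)total; /* EOF marker */
        }
        if (len > in_len - off || len > state_cap - total) {
            return -1;          /* truncated stream or state too large */
        }
        memcpy(state + total, in + off, len);
        off += len;
        total += len;
    }
}
```

The explicit terminator is what lets the loading side stop without knowing
the total state size ahead of time.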
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13  9:24     ` Hanna Czenczek
@ 2023-04-13 11:38       ` Stefan Hajnoczi
  2023-04-13 17:55         ` Hanna Czenczek
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-13 11:38 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella, Eugenio Pérez
On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >> So-called "internal" virtio-fs migration refers to transporting the
> >> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >> this, we need to be able to transfer virtiofsd's internal state to and
> >> from virtiofsd.
> >>
> >> Because virtiofsd's internal state will not be too large, we believe it
> >> is best to transfer it as a single binary blob after the streaming
> >> phase.  Because this method should be useful to other vhost-user
> >> implementations, too, it is introduced as a general-purpose addition to
> >> the protocol, not limited to vhost-user-fs.
> >>
> >> These are the additions to the protocol:
> >> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>    This feature signals support for transferring state, and is added so
> >>    that migration can fail early when the back-end has no support.
> >>
> >> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>    over which to transfer the state.  The front-end sends an FD to the
> >>    back-end into/from which it can write/read its state, and the back-end
> >>    can decide to either use it, or reply with a different FD for the
> >>    front-end to override the front-end's choice.
> >>    The front-end creates a simple pipe to transfer the state, but maybe
> >>    the back-end already has an FD into/from which it has to write/read
> >>    its state, in which case it will want to override the simple pipe.
> >>    Conversely, maybe in the future we find a way to have the front-end
> >>    get an immediate FD for the migration stream (in some cases), in which
> >>    case we will want to send this to the back-end instead of creating a
> >>    pipe.
> >>    Hence the negotiation: If one side has a better idea than a plain
> >>    pipe, we will want to use that.
> >>
> >> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>    pipe (the end indicated by EOF), the front-end invokes this function
> >>    to verify success.  There is no in-band way (through the pipe) to
> >>    indicate failure, so we need to check explicitly.
> >>
> >> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >> (which includes establishing the direction of transfer and migration
> >> phase), the sending side writes its data into the pipe, and the reading
> >> side reads it until it sees an EOF.  Then, the front-end will check for
> >> success via CHECK_DEVICE_STATE, which on the destination side includes
> >> checking for integrity (i.e. errors during deserialization).
> >>
> >> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >> ---
> >>   include/hw/virtio/vhost-backend.h |  24 +++++
> >>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>   hw/virtio/vhost.c                 |  37 ++++++++
> >>   4 files changed, 287 insertions(+)
> >>
> >> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >> index ec3fbae58d..5935b32fe3 100644
> >> --- a/include/hw/virtio/vhost-backend.h
> >> +++ b/include/hw/virtio/vhost-backend.h
> >> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>   } VhostSetConfigType;
> >>
> >> +typedef enum VhostDeviceStateDirection {
> >> +    /* Transfer state from back-end (device) to front-end */
> >> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >> +    /* Transfer state from front-end to back-end (device) */
> >> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >> +} VhostDeviceStateDirection;
> >> +
> >> +typedef enum VhostDeviceStatePhase {
> >> +    /* The device (and all its vrings) is stopped */
> >> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >> +} VhostDeviceStatePhase;
> > vDPA has:
> >
> >    /* Suspend a device so it does not process virtqueue requests anymore
> >     *
> >     * After the return of ioctl the device must preserve all the necessary state
> >     * (the virtqueue vring base plus the possible device specific states) that is
> >     * required for restoring in the future. The device must not change its
> >     * configuration after that point.
> >     */
> >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >
> >    /* Resume a device so it can resume processing virtqueue requests
> >     *
> >     * After the return of this ioctl the device will have restored all the
> >     * necessary states and it is fully operational to continue processing the
> >     * virtqueue descriptors.
> >     */
> >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >
> > I wonder if it makes sense to import these into vhost-user so that the
> > difference between kernel vhost and vhost-user is minimized. It's okay
> > if one of them is ahead of the other, but it would be nice to avoid
> > overlapping/duplicated functionality.
> >
> > (And I hope vDPA will import the device state vhost-user messages
> > introduced in this series.)
>
> I don’t understand your suggestion.  (Like, I very simply don’t
> understand :))
>
> These are vhost messages, right?  What purpose do you have in mind for
> them in vhost-user for internal migration?  They’re different from the
> state transfer messages, because they don’t transfer state to/from the
> front-end.  Also, the state transfer stuff is supposed to be distinct
> from starting/stopping the device; right now, it just requires the
> device to be stopped beforehand (or started only afterwards).  And in
> the future, new VhostDeviceStatePhase values may allow the messages to
> be used on devices that aren’t stopped.
>
> So they seem to serve very different purposes.  I can imagine using the
> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> working on), but they don’t really help with internal migration
> implemented here.  If I were to add them, they’d just be sent in
> addition to the new messages added in this patch here, i.e. SUSPEND on
> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> after CHECK_DEVICE_STATE (we could use RESUME in place of
> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> source, so we still need CHECK_DEVICE_STATE).
Yes, they are complementary to the device state fd message. I want to
make sure pre-conditions about the device's state (running vs stopped)
already take into account the vDPA SUSPEND/RESUME model.
vDPA will need device state save/load in the future, for example for
virtiofs devices. This is why I think we should plan for vDPA and
vhost-user to share the same interface.
Also, I think the code path you're relying on (vhost_dev_stop())
doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
because stopping the backend resets the device and throws away its
state. SUSPEND/RESUME solve this. This looks like a more general
problem since vhost_dev_stop() is called any time the VM is paused.
Maybe it needs to use SUSPEND/RESUME whenever possible.
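
The ordering discussed above (SUSPEND before SET_DEVICE_STATE_FD, then
CHECK_DEVICE_STATE, then RESUME) could be captured as explicit
pre-conditions.  The following is a toy state machine under that assumption;
every name in it is hypothetical, and it does not model what any existing
front-end or back-end does today.

```c
enum toy_dev_state {
    TOY_RUNNING,        /* device processes virtqueue requests */
    TOY_SUSPENDED,      /* device stopped, state preserved */
    TOY_TRANSFERRING,   /* state is flowing through the pipe */
    TOY_TRANSFERRED,    /* transfer finished and verified */
};

struct toy_dev {
    enum toy_dev_state state;
};

/* Each step returns 0 on success and -1 when called out of order. */

static int toy_suspend(struct toy_dev *d)
{
    if (d->state != TOY_RUNNING) {
        return -1;
    }
    d->state = TOY_SUSPENDED;
    return 0;
}

static int toy_set_device_state_fd(struct toy_dev *d)
{
    /* Pre-condition: the device must already be suspended. */
    if (d->state != TOY_SUSPENDED) {
        return -1;
    }
    d->state = TOY_TRANSFERRING;
    return 0;
}

static int toy_check_device_state(struct toy_dev *d)
{
    /* Only meaningful once a transfer was started (and EOF was seen). */
    if (d->state != TOY_TRANSFERRING) {
        return -1;
    }
    d->state = TOY_TRANSFERRED;
    return 0;
}

static int toy_resume(struct toy_dev *d)
{
    /* Resuming in the middle of a transfer is rejected. */
    if (d->state != TOY_SUSPENDED && d->state != TOY_TRANSFERRED) {
        return -1;
    }
    d->state = TOY_RUNNING;
    return 0;
}
```

Writing the pre-conditions down this way makes it visible that the state
transfer messages and SUSPEND/RESUME are complementary rather than
overlapping.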
Stefan
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-11 15:05 ` [PATCH 1/4] vhost: Re-enable vrings after setting features Hanna Czenczek
                     ` (2 preceding siblings ...)
  2023-04-13 11:03   ` Stefan Hajnoczi
@ 2023-04-13 13:19   ` Michael S. Tsirkin
  3 siblings, 0 replies; 93+ messages in thread
From: Michael S. Tsirkin @ 2023-04-13 13:19 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Stefano Garzarella
On Tue, Apr 11, 2023 at 05:05:12PM +0200, Hanna Czenczek wrote:
> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
> setting the vhost features will set this feature, too.  Doing so
> disables all vrings, which may not be intended.
Hmm not sure I understand: why does it disable vrings?
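
For readers following along, the behaviour the commit message describes can
be modelled as below.  This is a toy model, assuming a back-end that disables
every ring whenever it receives SET_FEATURES with
VHOST_USER_F_PROTOCOL_FEATURES set; the `toy_*` names are illustrative and
are not taken from QEMU or from any real back-end.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES 30
#define TOY_NUM_RINGS 2

/* Toy back-end: only tracks whether each ring is enabled. */
struct toy_backend {
    bool ring_enabled[TOY_NUM_RINGS];
};

/* Toy front-end: mirrors the enable_vqs field added by the patch. */
struct toy_frontend {
    bool enable_vqs;
};

/* Assumed back-end behaviour: the protocol-features bit disables rings. */
static void toy_backend_set_features(struct toy_backend *be,
                                     uint64_t features)
{
    if (features & (UINT64_C(1) << VHOST_USER_F_PROTOCOL_FEATURES)) {
        for (int i = 0; i < TOY_NUM_RINGS; i++) {
            be->ring_enabled[i] = false;
        }
    }
}

/* Remember the intended ring state, then apply it to the back-end. */
static void toy_set_vring_enable(struct toy_frontend *fe,
                                 struct toy_backend *be, bool enable)
{
    fe->enable_vqs = enable;
    for (int i = 0; i < TOY_NUM_RINGS; i++) {
        be->ring_enabled[i] = enable;
    }
}

/* Set features, then re-enable the rings if they were meant to be on. */
static void toy_set_features(struct toy_frontend *fe,
                             struct toy_backend *be, uint64_t features)
{
    toy_backend_set_features(be, features);
    if (fe->enable_vqs) {
        toy_set_vring_enable(fe, be, true);
    }
}
```

In this model the rings are still briefly disabled between the two calls in
`toy_set_features()`, which is the window the commit message calls less than
ideal.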
> For example, enabling or disabling logging during migration requires
> setting those features (to set or unset VHOST_F_LOG_ALL), which will
> automatically disable all vrings.  In either case, the VM is running
> (disabling logging is done after a failed or cancelled migration, and
> only once the VM is running again, see comment in
> memory_global_dirty_log_stop()), so the vrings should really be enabled.
> As a result, the back-end seems to hang.
> 
> To fix this, we must remember whether the vrings are supposed to be
> enabled, and, if so, re-enable them after a SET_FEATURES call that set
> VHOST_USER_F_PROTOCOL_FEATURES.
> 
> It seems less than ideal that there is a short period in which the VM is
> running but the vrings will be stopped (between SET_FEATURES and
> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
> e.g. by introducing a new flag or vhost-user protocol feature to disable
> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
> new functions for setting/clearing singular feature bits (so that
> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
> 
> Even with such a potential addition to the protocol, we still need this
> fix here, because we cannot expect that back-ends will implement this
> addition.
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost.h | 10 ++++++++++
>  hw/virtio/vhost.c         | 13 +++++++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a52f273347..2fe02ed5d4 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -90,6 +90,16 @@ struct vhost_dev {
>      int vq_index_end;
>      /* if non-zero, minimum required value for max_queues */
>      int num_queues;
> +
> +    /*
> +     * Whether the virtqueues are supposed to be enabled (via
> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
> +     * enabling/disabling logging) will disable all virtqueues if
> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
> +     * re-enable them if this field is set.
> +     */
> +    bool enable_vqs;
> +
>      /**
>       * vhost feature handling requires matching the feature set
>       * offered by a backend which may be a subset of the total
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index a266396576..cbff589efa 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
>  static QLIST_HEAD(, vhost_dev) vhost_devices =
>      QLIST_HEAD_INITIALIZER(vhost_devices);
>  
> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
> +
>  bool vhost_has_free_slot(void)
>  {
>      unsigned int slots_limit = ~0U;
> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>          }
>      }
>  
> +    if (dev->enable_vqs) {
> +        /*
> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
> +         * virtqueues, even if that was not intended; re-enable them if
> +         * necessary.
> +         */
> +        vhost_dev_set_vring_enable(dev, true);
> +    }
> +
>  out:
>      return r;
>  }
> @@ -1896,6 +1907,8 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>  
>  static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
>  {
> +    hdev->enable_vqs = enable;
> +
>      if (!hdev->vhost_ops->vhost_set_vring_enable) {
>          return 0;
>      }
> -- 
> 2.39.1
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-13 11:03       ` Stefan Hajnoczi
@ 2023-04-13 14:24         ` Anton Kuchin
  2023-04-13 15:48           ` Michael S. Tsirkin
  0 siblings, 1 reply; 93+ messages in thread
From: Anton Kuchin @ 2023-04-13 14:24 UTC (permalink / raw)
  To: Stefan Hajnoczi, Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella,
	Roman Kagan, Maxime Coquelin, Marc-André Lureau
On 13/04/2023 14:03, Stefan Hajnoczi wrote:
> On Thu, 13 Apr 2023 at 04:20, Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 12.04.23 22:51, Stefan Hajnoczi wrote:
>>> On Tue, Apr 11, 2023 at 05:05:12PM +0200, Hanna Czenczek wrote:
>>>> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
>>>> setting the vhost features will set this feature, too.  Doing so
>>>> disables all vrings, which may not be intended.
>>>>
>>>> For example, enabling or disabling logging during migration requires
>>>> setting those features (to set or unset VHOST_F_LOG_ALL), which will
>>>> automatically disable all vrings.  In either case, the VM is running
>>>> (disabling logging is done after a failed or cancelled migration, and
>>>> only once the VM is running again, see comment in
>>>> memory_global_dirty_log_stop()), so the vrings should really be enabled.
>>>> As a result, the back-end seems to hang.
>>>>
>>>> To fix this, we must remember whether the vrings are supposed to be
>>>> enabled, and, if so, re-enable them after a SET_FEATURES call that set
>>>> VHOST_USER_F_PROTOCOL_FEATURES.
>>>>
>>>> It seems less than ideal that there is a short period in which the VM is
>>>> running but the vrings will be stopped (between SET_FEATURES and
>>>> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
>>>> e.g. by introducing a new flag or vhost-user protocol feature to disable
>>>> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
>>>> new functions for setting/clearing singular feature bits (so that
>>>> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
>>>>
>>>> Even with such a potential addition to the protocol, we still need this
>>>> fix here, because we cannot expect that back-ends will implement this
>>>> addition.
>>>>
>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>> ---
>>>>    include/hw/virtio/vhost.h | 10 ++++++++++
>>>>    hw/virtio/vhost.c         | 13 +++++++++++++
>>>>    2 files changed, 23 insertions(+)
>>>>
>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>> index a52f273347..2fe02ed5d4 100644
>>>> --- a/include/hw/virtio/vhost.h
>>>> +++ b/include/hw/virtio/vhost.h
>>>> @@ -90,6 +90,16 @@ struct vhost_dev {
>>>>        int vq_index_end;
>>>>        /* if non-zero, minimum required value for max_queues */
>>>>        int num_queues;
>>>> +
>>>> +    /*
>>>> +     * Whether the virtqueues are supposed to be enabled (via
>>>> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
>>>> +     * enabling/disabling logging) will disable all virtqueues if
>>>> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
>>>> +     * re-enable them if this field is set.
>>>> +     */
>>>> +    bool enable_vqs;
>>>> +
>>>>        /**
>>>>         * vhost feature handling requires matching the feature set
>>>>         * offered by a backend which may be a subset of the total
>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>> index a266396576..cbff589efa 100644
>>>> --- a/hw/virtio/vhost.c
>>>> +++ b/hw/virtio/vhost.c
>>>> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
>>>>    static QLIST_HEAD(, vhost_dev) vhost_devices =
>>>>        QLIST_HEAD_INITIALIZER(vhost_devices);
>>>>
>>>> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
>>>> +
>>>>    bool vhost_has_free_slot(void)
>>>>    {
>>>>        unsigned int slots_limit = ~0U;
>>>> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>>>>            }
>>>>        }
>>>>
>>>> +    if (dev->enable_vqs) {
>>>> +        /*
>>>> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
>>>> +         * virtqueues, even if that was not intended; re-enable them if
>>>> +         * necessary.
>>>> +         */
>>>> +        vhost_dev_set_vring_enable(dev, true);
>>>> +    }
>>>> +
>>>>    out:
>>>>        return r;
>>>>    }
>>>> @@ -1896,6 +1907,8 @@ int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>>>>
>>>>    static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
>>>>    {
>>>> +    hdev->enable_vqs = enable;
>>>> +
>>>>        if (!hdev->vhost_ops->vhost_set_vring_enable) {
>>>>            return 0;
>>>>        }
>>> The vhost-user spec doesn't say that VHOST_F_LOG_ALL needs to be toggled
>>> at runtime and I don't think VHOST_USER_SET_PROTOCOL_FEATURES is
>>> intended to be used like that. This issue shows why doing so is a bad
>>> idea.
>>>
>>> VHOST_F_LOG_ALL does not need to be toggled to control logging. Logging
>>> is controlled at runtime by the presence of the dirty log
>>> (VHOST_USER_SET_LOG_BASE) and the per-vring logging flag
>>> (VHOST_VRING_F_LOG).
>> Technically, the spec doesn’t say that SET_LOG_BASE is required.  It says:
>>
>> “To start/stop logging of data/used ring writes, the front-end may send
>> messages VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and
>> VHOST_USER_SET_VRING_ADDR with VHOST_VRING_F_LOG in ring’s flags set to
>> 1/0, respectively.”
>>
>> (So the spec also very much does imply that toggling F_LOG_ALL at
>> runtime is a valid way to enable/disable logging.  If we were to no
>> longer do that, we should clarify it there.)
> I missed that VHOST_VRING_F_LOG only controls logging used ring writes
> while writes to descriptors are always logged when VHOST_F_LOG_ALL is
> set. I agree that the spec does require VHOST_F_LOG_ALL to be toggled
> at runtime.
>
> What I suggested won't work.
But is there a valid use-case for logging some dirty memory but not all?
I can't tell whether this is a feature or just a flaw in the specification.
>
>> I mean, naturally, logging without a shared memory area to log in to
>> isn’t much fun, so we could clarify that SET_LOG_BASE is also a
>> requirement, but it looks to me as if we can’t use SET_LOG_BASE to
>> disable logging, because it’s supposed to always pass a valid FD (at
>> least libvhost-user expects this:
>> https://gitlab.com/qemu-project/qemu/-/blob/master/subprojects/libvhost-user/libvhost-user.c#L1044).
> As an aside: I don't understand how logging without an fd is supposed
> to work in QEMU's code or in the vhost-user spec. QEMU does not
> support that case even though it's written as if shmfd were optional.
>
>> So after a cancelled migration, the dirty bitmap SHM will stay around
>> indefinitely (which is already not great, but if we were to use the
>> presence of that bitmap as an indicator as to whether we should log or
>> not, it would be worse).
> Yes, continuing to log forever is worse.
>
>> So the VRING_F_LOG flag remains.
>>
>>> I suggest permanently enabling VHOST_F_LOG_ALL upon connection when the
>>> the backend supports it. No spec changes are required.
>>>
>>> libvhost-user looks like it will work. I didn't look at DPDK/SPDK, but
>>> checking that it works there is important too.
>> I can’t find VRING_F_LOG in libvhost-user, so what protocol do you mean
>> exactly that would work in libvhost-user?  Because SET_LOG_BASE won’t
>> work, as you can’t use it to disable logging.
> That's true. There is no way to disable logging.
>
>> (For DPDK, I’m not sure.  It looks like it sometimes takes VRING_F_LOG
>> into account, but I think only when it comes to logging accesses to the
>> vring specifically, i.e. not DMA to other areas of guest memory?  I
>> think only places that use vq->log_guest_addr implicitly check it,
>> others don’t.  So for example,
>> vhost_log_write_iova()/vhost_log_cache_write_iova() don’t seem to check
>> VRING_F_LOG, which seem to be the functions generally used for writes to
>> memory outside of the vrings.  So here, even if VRING_F_LOG is disabled
>> for all vrings, as long as a log SHM is set, all writes to memory
>> outside of the immediate vrings seem to cause logging (as long as
>> F_LOG_ALL is set).)
>>
>> Hanna
>>
>>> I have CCed people who may be interested in this issue. This is the
>>> first time I've looked at vhost-user logging, so this idea may not work.
>>>
>>> Stefan
>>
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-13 14:24         ` Anton Kuchin
@ 2023-04-13 15:48           ` Michael S. Tsirkin
  0 siblings, 0 replies; 93+ messages in thread
From: Michael S. Tsirkin @ 2023-04-13 15:48 UTC (permalink / raw)
  To: Anton Kuchin
  Cc: Stefan Hajnoczi, Hanna Czenczek, Stefan Hajnoczi, qemu-devel,
	virtio-fs, German Maglione, Juan Quintela, Stefano Garzarella,
	Roman Kagan, Maxime Coquelin, Marc-André Lureau
On Thu, Apr 13, 2023 at 05:24:36PM +0300, Anton Kuchin wrote:
> But is there a valid use-case for logging some dirty memory but not all?
> I can't understand if this is a feature or a just flaw in specification.
IIRC the use-case originally conceived was for shadow VQs.  If you use
shadow VQs, VQ accesses by the back-end do not change memory, since the
shadow VQ is not in memory.  Not a practical concern right now, but there
you have it.
-- 
MST
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-04-11 15:05 [PATCH 0/4] vhost-user-fs: Internal migration Hanna Czenczek
                   ` (4 preceding siblings ...)
  2023-04-12 21:00 ` [PATCH 0/4] vhost-user-fs: Internal migration Stefan Hajnoczi
@ 2023-04-13 16:11 ` Michael S. Tsirkin
  2023-04-13 17:53   ` [Virtio-fs] " Hanna Czenczek
  2023-05-04 16:05 ` Hanna Czenczek
  6 siblings, 1 reply; 93+ messages in thread
From: Michael S. Tsirkin @ 2023-04-13 16:11 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Stefano Garzarella
On Tue, Apr 11, 2023 at 05:05:11PM +0200, Hanna Czenczek wrote:
> RFC:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
> 
> Hi,
> 
> Patch 2 of this series adds new vhost methods (only for vhost-user at
> this point) for transferring the back-end’s internal state to/from qemu
> during migration, so that this state can be stored in the migration
> stream.  (This is what we call “internal migration”, because the state
> is internally available to qemu; this is in contrast to “external
> migration”, which Anton is working on, where the back-end’s state is
> handled by the back-end itself without involving qemu.)
> 
> For this, the state is handled as a binary blob by qemu, and it is
> transferred over a pipe that is established via a new vhost method.
> 
> Patch 3 adds two high-level helper functions to (A) fetch any vhost
> back-end’s internal state and store it in a migration stream (a
> `QEMUFile`), and (B) load such state from a migrations stream and send
> it to a vhost back-end.  These build on the low-level interface
> introduced in patch 2.
> 
> Patch 4 then uses these functions to implement internal migration for
> vhost-user-fs.  Note that this of course depends on support in the
> back-end (virtiofsd), which is not yet ready.
> 
> Finally, patch 1 fixes a bug around migrating vhost-user devices: To
> enable/disable logging[1], the VHOST_F_LOG_ALL feature must be
> set/cleared, via the SET_FEATURES call.  Another, technically unrelated,
> feature exists, VHOST_USER_F_PROTOCOL_FEATURES, which indicates support
> for vhost-user protocol features.  Naturally, qemu wants to keep that
> other feature enabled, so it will set it (when possible) in every
> SET_FEATURES call.  However, a side effect of setting
> VHOST_USER_F_PROTOCOL_FEATURES is that all vrings are disabled.
I didn't get this part.
Two questions:
	Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
	If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
	ring starts directly in the enabled state.
	If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
	initialized in a disabled state and is enabled by
	``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
So VHOST_USER_F_PROTOCOL_FEATURES only controls the initial state of the
rings; it does not disable rings.
>  This
> causes any enabling (done at the start of migration) or disabling (done
> on the source after a cancelled/failed migration) of logging to make the
> back-end hang.  Without patch 1, therefore, starting a migration will
> have any vhost-user back-end that supports both VHOST_F_LOG_ALL and
> VHOST_USER_F_PROTOCOL_FEATURES immediately hang completely, and unless
> execution is transferred to the destination, it will continue to hang.
> 
> 
> [1] Logging here means logging writes to guest memory pages in a dirty
> bitmap so that these dirty pages are flushed to the destination.  qemu
> cannot monitor the back-end’s writes to guest memory, so the back-end
> has to do so itself, and log its writes in a dirty bitmap shared with
> qemu.
> 
> 
> Changes in v1 compared to the RFC:
> - Patch 1 added
> 
> - Patch 2: Interface is different, now uses a pipe instead of shared
>   memory (as suggested by Stefan); also, this is now a generic
>   vhost-user interface, and not just for vhost-user-fs
> 
> - Patches 3 and 4: Because this is now supposed to be a generic
>   migration method for vhost-user back-ends, most of the migration code
>   has been moved from vhost-user-fs.c to vhost.c so it can be shared
>   between different back-ends.  The vhost-user-fs code is now a rather
>   thin wrapper around the common code.
>   - Note also (as suggested by Anton) that the back-end’s migration
>     state is now in a subsection, and that it is technically optional.
>     “Technically” means that with this series, it is always used (unless
>     the back-end doesn’t support migration, in which case migration is
>     just blocked), but Anton’s series for external migration would make
>     it optional.  (I.e., the subsection would be skipped for external
>     migration, and mandatorily included for internal migration.)
> 
> 
> Hanna Czenczek (4):
>   vhost: Re-enable vrings after setting features
>   vhost-user: Interface for migration state transfer
>   vhost: Add high-level state save/load functions
>   vhost-user-fs: Implement internal migration
> 
>  include/hw/virtio/vhost-backend.h |  24 +++
>  include/hw/virtio/vhost.h         | 124 +++++++++++++++
>  hw/virtio/vhost-user-fs.c         | 101 +++++++++++-
>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++
>  hw/virtio/vhost.c                 | 246 ++++++++++++++++++++++++++++++
>  5 files changed, 641 insertions(+), 1 deletion(-)
> 
> -- 
> 2.39.1
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 10:14     ` Eugenio Perez Martin
  2023-04-13 11:07       ` Stefan Hajnoczi
@ 2023-04-13 17:31       ` Hanna Czenczek
  2023-04-17 15:12         ` Stefan Hajnoczi
  2023-04-17 18:37         ` Eugenio Perez Martin
  2023-04-17 15:38       ` Stefan Hajnoczi
  2023-04-17 17:14       ` Stefan Hajnoczi
  3 siblings, 2 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13 17:31 UTC (permalink / raw)
  To: Eugenio Perez Martin, Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, German Maglione, Anton Kuchin,
	Juan Quintela, Michael S . Tsirkin, Stefano Garzarella
On 13.04.23 12:14, Eugenio Perez Martin wrote:
> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>> So-called "internal" virtio-fs migration refers to transporting the
>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>> this, we need to be able to transfer virtiofsd's internal state to and
>>> from virtiofsd.
>>>
>>> Because virtiofsd's internal state will not be too large, we believe it
>>> is best to transfer it as a single binary blob after the streaming
>>> phase.  Because this method should be useful to other vhost-user
>>> implementations, too, it is introduced as a general-purpose addition to
>>> the protocol, not limited to vhost-user-fs.
>>>
>>> These are the additions to the protocol:
>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>    This feature signals support for transferring state, and is added so
>>>    that migration can fail early when the back-end has no support.
>>>
>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>    over which to transfer the state.  The front-end sends an FD to the
>>>    back-end into/from which it can write/read its state, and the back-end
>>>    can decide to either use it, or reply with a different FD for the
>>>    front-end to override the front-end's choice.
>>>    The front-end creates a simple pipe to transfer the state, but maybe
>>>    the back-end already has an FD into/from which it has to write/read
>>>    its state, in which case it will want to override the simple pipe.
>>>    Conversely, maybe in the future we find a way to have the front-end
>>>    get an immediate FD for the migration stream (in some cases), in which
>>>    case we will want to send this to the back-end instead of creating a
>>>    pipe.
>>>    Hence the negotiation: If one side has a better idea than a plain
>>>    pipe, we will want to use that.
>>>
>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>    pipe (the end indicated by EOF), the front-end invokes this function
>>>    to verify success.  There is no in-band way (through the pipe) to
>>>    indicate failure, so we need to check explicitly.
>>>
>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>> (which includes establishing the direction of transfer and migration
>>> phase), the sending side writes its data into the pipe, and the reading
>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>> checking for integrity (i.e. errors during deserialization).
>>>
>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>> ---
>>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>   hw/virtio/vhost.c                 |  37 ++++++++
>>>   4 files changed, 287 insertions(+)
>>>
>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>> index ec3fbae58d..5935b32fe3 100644
>>> --- a/include/hw/virtio/vhost-backend.h
>>> +++ b/include/hw/virtio/vhost-backend.h
>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>   } VhostSetConfigType;
>>>
>>> +typedef enum VhostDeviceStateDirection {
>>> +    /* Transfer state from back-end (device) to front-end */
>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>> +    /* Transfer state from front-end to back-end (device) */
>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>> +} VhostDeviceStateDirection;
>>> +
>>> +typedef enum VhostDeviceStatePhase {
>>> +    /* The device (and all its vrings) is stopped */
>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>> +} VhostDeviceStatePhase;
>> vDPA has:
>>
>>    /* Suspend a device so it does not process virtqueue requests anymore
>>     *
>>     * After the return of ioctl the device must preserve all the necessary state
>>     * (the virtqueue vring base plus the possible device specific states) that is
>>     * required for restoring in the future. The device must not change its
>>     * configuration after that point.
>>     */
>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>
>>    /* Resume a device so it can resume processing virtqueue requests
>>     *
>>     * After the return of this ioctl the device will have restored all the
>>     * necessary states and it is fully operational to continue processing the
>>     * virtqueue descriptors.
>>     */
>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>
>> I wonder if it makes sense to import these into vhost-user so that the
>> difference between kernel vhost and vhost-user is minimized. It's okay
>> if one of them is ahead of the other, but it would be nice to avoid
>> overlapping/duplicated functionality.
>>
> That's what I had in mind in the first versions. I proposed VHOST_STOP
> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> to SUSPEND.
>
> Generally it is better if we make the interface less parametrized and
> we trust in the messages and its semantics in my opinion. In other
> words, instead of
> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
I.e. you mean that this should simply be stateful instead of
re-affirming the current state with a parameter?
The problem I see is that transferring states in different phases of
migration will require specialized implementations.  So running
SET_DEVICE_STATE_FD in a different phase will require support from the
back-end.  Same in the front-end, the exact protocol and thus
implementation will (probably, difficult to say at this point) depend on
the migration phase.  I would therefore prefer to have an explicit
distinction in the command itself that affirms the phase we’re
targeting.
On the other hand, I don’t see the parameter complicating anything. The
front-end must supply it, but it will know the phase anyway, so this is
easy.  The back-end can just choose to ignore it, if it doesn’t feel the
need to verify that the phase is what it thinks it is.
> Another way to apply this is with the "direction" parameter. Maybe it
> is better to split it into "set_state_fd" and "get_state_fd"?
Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`.
We always negotiate a pipe between front-end and back-end, the question
is just whether the back-end gets the receiving (load) or the sending
(save) end.
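As a rough illustration of the point above (a sketch, not code from this
series): the front-end creates one pipe either way, and the direction only
decides which end is handed to the back-end.  The enum values are the ones
proposed in patch 2; make_state_pipe() is a hypothetical helper name.

```c
#include <assert.h>
#include <unistd.h>

/* Direction values as proposed in patch 2 of this series. */
typedef enum VhostDeviceStateDirection {
    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
} VhostDeviceStateDirection;

/*
 * Hypothetical helper (not QEMU code): create a pipe and decide which end
 * each side gets.  Returns 0 on success, -1 if pipe() fails.
 */
static int make_state_pipe(VhostDeviceStateDirection dir,
                           int *frontend_fd, int *backend_fd)
{
    int fds[2];

    assert(frontend_fd && backend_fd);

    if (pipe(fds) < 0) {
        return -1;
    }

    if (dir == VHOST_TRANSFER_STATE_DIRECTION_SAVE) {
        /* Back-end writes its state (sending end), front-end reads it. */
        *frontend_fd = fds[0];
        *backend_fd = fds[1];
    } else {
        /* Front-end writes the state, back-end reads it (receiving end). */
        *frontend_fd = fds[1];
        *backend_fd = fds[0];
    }

    return 0;
}
```

In both directions it is the same pipe() call; only the assignment of the
two ends differs.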
Technically, one can make it fully stateful and say that if the device
hasn’t been started already, it’s always a LOAD, and otherwise always a
SAVE.  But as above, I’d prefer to keep the parameter because the
implementations are different, so I’d prefer there to be a
re-affirmation that front-end and back-end are in sync about what should
be done.
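A minimal sketch of what such a re-affirmation could look like on the
back-end side, assuming the back-end tracks whether the device has been
started.  backend_check_transfer() and its device_was_started argument are
made up for illustration; only the enum names come from patch 2.

```c
#include <assert.h>
#include <stdbool.h>

/* Enum values as proposed in patch 2 of this series. */
typedef enum VhostDeviceStateDirection {
    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
} VhostDeviceStateDirection;

typedef enum VhostDeviceStatePhase {
    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
} VhostDeviceStatePhase;

/*
 * Hypothetical back-end check: compare what the front-end asks for with
 * what the back-end would have inferred from its own state.
 * Returns 0 if both sides agree, -1 otherwise.
 */
static int backend_check_transfer(bool device_was_started,
                                  VhostDeviceStateDirection dir,
                                  VhostDeviceStatePhase phase)
{
    /* Fully stateful rule: a device that never ran can only LOAD. */
    VhostDeviceStateDirection expected = device_was_started
        ? VHOST_TRANSFER_STATE_DIRECTION_SAVE
        : VHOST_TRANSFER_STATE_DIRECTION_LOAD;

    if (phase != VHOST_TRANSFER_STATE_PHASE_STOPPED) {
        return -1;  /* a phase this back-end has no implementation for */
    }
    if (dir != expected) {
        return -1;  /* front-end and back-end are out of sync */
    }
    return 0;
}
```

A back-end that does not care can skip the check entirely and simply trust
the parameters.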
Personally, I don’t really see the advantage of having two functions
instead of one function with an enum with two values.  The thing about
SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of
whether we’re loading or saving, it just negotiates the pipe – the
difference is what happens after the pipe has been negotiated.  So if we
split the function into two, both implementations will share most of
their code anyway, which makes me think it should be a single function.
> In that case, reusing the ioctls as vhost-user messages would be ok.
> But that puts this proposal further from the VFIO code, which uses
> "migration_set_state(state)", and maybe it is better when the number
> of states is high.
I’m not sure what you mean (because I don’t know the VFIO code, I
assume).  Are you saying that using a more finely grained
migration_set_state() model would conflict with the rather coarse
suspend/resume?
> BTW, is there any usage for *reply_fd at this moment from the backend?
No, virtiofsd doesn’t plan to make use of it.
Hanna
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 1/4] vhost: Re-enable vrings after setting features
  2023-04-13 11:03   ` Stefan Hajnoczi
@ 2023-04-13 17:32     ` Hanna Czenczek
  0 siblings, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13 17:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 13.04.23 13:03, Stefan Hajnoczi wrote:
> On Tue, 11 Apr 2023 at 11:05, Hanna Czenczek <hreitz@redhat.com> wrote:
>> If the back-end supports the VHOST_USER_F_PROTOCOL_FEATURES feature,
>> setting the vhost features will set this feature, too.  Doing so
>> disables all vrings, which may not be intended.
>>
>> For example, enabling or disabling logging during migration requires
>> setting those features (to set or unset VHOST_F_LOG_ALL), which will
>> automatically disable all vrings.  In either case, the VM is running
>> (disabling logging is done after a failed or cancelled migration, and
>> only once the VM is running again, see comment in
>> memory_global_dirty_log_stop()), so the vrings should really be enabled.
>> As a result, the back-end seems to hang.
>>
>> To fix this, we must remember whether the vrings are supposed to be
>> enabled, and, if so, re-enable them after a SET_FEATURES call that set
>> VHOST_USER_F_PROTOCOL_FEATURES.
>>
>> It seems less than ideal that there is a short period in which the VM is
>> running but the vrings will be stopped (between SET_FEATURES and
>> SET_VRING_ENABLE).  To fix this, we would need to change the protocol,
>> e.g. by introducing a new flag or vhost-user protocol feature to disable
>> disabling vrings whenever VHOST_USER_F_PROTOCOL_FEATURES is set, or add
>> new functions for setting/clearing singular feature bits (so that
>> F_LOG_ALL can be set/cleared without touching F_PROTOCOL_FEATURES).
>>
>> Even with such a potential addition to the protocol, we still need this
>> fix here, because we cannot expect that back-ends will implement this
>> addition.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost.h | 10 ++++++++++
>>   hw/virtio/vhost.c         | 13 +++++++++++++
>>   2 files changed, 23 insertions(+)
>>
>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>> index a52f273347..2fe02ed5d4 100644
>> --- a/include/hw/virtio/vhost.h
>> +++ b/include/hw/virtio/vhost.h
>> @@ -90,6 +90,16 @@ struct vhost_dev {
>>       int vq_index_end;
>>       /* if non-zero, minimum required value for max_queues */
>>       int num_queues;
>> +
>> +    /*
>> +     * Whether the virtqueues are supposed to be enabled (via
>> +     * SET_VRING_ENABLE).  Setting the features (e.g. for
>> +     * enabling/disabling logging) will disable all virtqueues if
>> +     * VHOST_USER_F_PROTOCOL_FEATURES is set, so then we need to
>> +     * re-enable them if this field is set.
>> +     */
>> +    bool enable_vqs;
>> +
>>       /**
>>        * vhost feature handling requires matching the feature set
>>        * offered by a backend which may be a subset of the total
>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>> index a266396576..cbff589efa 100644
>> --- a/hw/virtio/vhost.c
>> +++ b/hw/virtio/vhost.c
>> @@ -50,6 +50,8 @@ static unsigned int used_memslots;
>>   static QLIST_HEAD(, vhost_dev) vhost_devices =
>>       QLIST_HEAD_INITIALIZER(vhost_devices);
>>
>> +static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable);
>> +
>>   bool vhost_has_free_slot(void)
>>   {
>>       unsigned int slots_limit = ~0U;
>> @@ -899,6 +901,15 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>>           }
>>       }
>>
>> +    if (dev->enable_vqs) {
>> +        /*
>> +         * Setting VHOST_USER_F_PROTOCOL_FEATURES would have disabled all
> Is there a reason to put this vhost-user-specific workaround in
> vhost.c instead of vhost-user.c?
My feeling was that this isn’t really vhost-user-specific.  It just so
happens that vhost-user is the only implementation that has a special
feature that disables all vrings, but I can’t see a reason why that
would be vhost-user-specific.
I mean, vhost_dev_set_vring_enable() looks like it can work for any
vhost implementation, but:
- .vhost_set_vring_enable() is indeed only implemented for vhost-user
- vhost_dev_set_vring_enable() won’t do anything if (for a vhost-user
   back-end) the F_PROTOCOL_FEATURES feature hasn’t been negotiated.
So this looked to me like if .vhost_set_vring_enable() were ever
implemented for anything but vhost-user, it’s quite likely that this too
would be gated behind some feature that auto-disables all vrings.
So this, plus the fact that vhost_dev_set_vring_enable() already has
vhost-user-specific code (while being a vhost.c function), I thought
it’d be best to put this into generic vhost.c code, simply because in
the worst case, it’ll be a no-op on other vhost implementations.
But as it is, functionally, of course I can just put it into
vhost_user_set_features().  (It would be a tiny bit more complicated,
because then I’d have to use vhost_user_set_vring_enable(), which
returns an error if F_PROTOCOL_FEATURES isn’t there, which we’d have to
ignore – vhost_dev_set_vring_enable() does that already for me.)
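To make the "no-op in the worst case" argument concrete, here is a
simplified, self-contained model of the generic-code approach.  It is only
a sketch: mock_dev, mock_ops and the two mock back-ends are invented
stand-ins, not QEMU's vhost_dev/vhost_ops, and the "quirky" back-end simply
models the behaviour the patch description assumes (rings get disabled on
SET_FEATURES).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mock_dev;

/* Hypothetical, heavily reduced callback table. */
struct mock_ops {
    int (*set_features)(struct mock_dev *dev, uint64_t features);
    /* Optional: NULL for implementations without SET_VRING_ENABLE. */
    int (*set_vring_enable)(struct mock_dev *dev, bool enable);
};

struct mock_dev {
    const struct mock_ops *ops;
    bool enable_vqs;     /* desired vring state, as in the patch */
    bool rings_enabled;  /* actual state inside the mock back-end */
};

/* Mock back-end that disables its rings whenever features are set. */
static int quirky_set_features(struct mock_dev *dev, uint64_t features)
{
    (void)features;
    dev->rings_enabled = false;
    return 0;
}

static int quirky_set_vring_enable(struct mock_dev *dev, bool enable)
{
    dev->rings_enabled = enable;
    return 0;
}

/* Mock back-end without any ring enable/disable support. */
static int plain_set_features(struct mock_dev *dev, uint64_t features)
{
    (void)dev;
    (void)features;
    return 0;
}

static const struct mock_ops quirky_ops = {
    .set_features = quirky_set_features,
    .set_vring_enable = quirky_set_vring_enable,
};

static const struct mock_ops plain_ops = {
    .set_features = plain_set_features,
    .set_vring_enable = NULL,
};

/* Generic code: remember the desired state, then call the optional hook. */
static int mock_dev_set_vring_enable(struct mock_dev *dev, bool enable)
{
    dev->enable_vqs = enable;
    if (!dev->ops->set_vring_enable) {
        return 0;  /* nothing to do for this implementation */
    }
    return dev->ops->set_vring_enable(dev, enable);
}

/* Generic code: set features, then restore the desired ring state. */
static int mock_dev_set_features(struct mock_dev *dev, uint64_t features)
{
    int r = dev->ops->set_features(dev, features);

    if (r < 0) {
        return r;
    }
    if (dev->enable_vqs) {
        /* Result ignored here, mirroring the patch. */
        mock_dev_set_vring_enable(dev, true);
    }
    return 0;
}
```

With quirky_ops the rings end up enabled again after
mock_dev_set_features(); with plain_ops the extra call does nothing.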
Hanna
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [Virtio-fs] [PATCH 0/4] vhost-user-fs: Internal migration
  2023-04-13 16:11 ` Michael S. Tsirkin
@ 2023-04-13 17:53   ` Hanna Czenczek
  0 siblings, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13 17:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Juan Quintela, qemu-devel, virtio-fs,
	Anton Kuchin
On 13.04.23 18:11, Michael S. Tsirkin wrote:
> On Tue, Apr 11, 2023 at 05:05:11PM +0200, Hanna Czenczek wrote:
>> RFC:
>> https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04263.html
>>
>> Hi,
>>
>> Patch 2 of this series adds new vhost methods (only for vhost-user at
>> this point) for transferring the back-end’s internal state to/from qemu
>> during migration, so that this state can be stored in the migration
>> stream.  (This is what we call “internal migration”, because the state
>> is internally available to qemu; this is in contrast to “external
>> migration”, which Anton is working on, where the back-end’s state is
>> handled by the back-end itself without involving qemu.)
>>
>> For this, the state is handled as a binary blob by qemu, and it is
>> transferred over a pipe that is established via a new vhost method.
>>
>> Patch 3 adds two high-level helper functions to (A) fetch any vhost
>> back-end’s internal state and store it in a migration stream (a
>> `QEMUFile`), and (B) load such state from a migrations stream and send
>> it to a vhost back-end.  These build on the low-level interface
>> introduced in patch 2.
>>
>> Patch 4 then uses these functions to implement internal migration for
>> vhost-user-fs.  Note that this of course depends on support in the
>> back-end (virtiofsd), which is not yet ready.
>>
>> Finally, patch 1 fixes a bug around migrating vhost-user devices: To
>> enable/disable logging[1], the VHOST_F_LOG_ALL feature must be
>> set/cleared, via the SET_FEATURES call.  Another, technically unrelated,
>> feature exists, VHOST_USER_F_PROTOCOL_FEATURES, which indicates support
>> for vhost-user protocol features.  Naturally, qemu wants to keep that
>> other feature enabled, so it will set it (when possible) in every
>> SET_FEATURES call.  However, a side effect of setting
>> VHOST_USER_F_PROTOCOL_FEATURES is that all vrings are disabled.
>
> I didn't get this part.
> Two questions:
> 	Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
>
> 	If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
> 	ring starts directly in the enabled state.
>
> 	If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
> 	initialized in a disabled state and is enabled by
> 	``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.
>
> so VHOST_USER_F_PROTOCOL_FEATURES only controls initial state of rings,
> it does not disable rings.
Oh.  Thanks. :)
That’s indeed a valid and more sensible interpretation.  I know that the
vhost-user-backend crate virtiofsd uses has interpreted it differently.
Looking into libvhost-user and DPDK, both have decided to instead have
all vrings be disabled at reset, and enable them only when a
SET_FEATURES without F_PROTOCOL_FEATURES comes in.  That doesn’t follow the
spec quite literally either, but it adheres to the interpretation of not
disabling any rings just because F_PROTOCOL_FEATURES appears.
(I thought of proposing this (“only disable vrings for a `false` -> 
`true` flag state transition”), but thought that’d be too complicated.  
Oh, well. :))
So, the fix will go to the vhost-user-backend crate instead of qemu.  
That’s good!
Still, I will also prepare a patch to vhost-user.rst for this, because I 
still don’t find the specification clear on this.  The thing is, nobody 
interprets it as “negotiating this feature will decide whether, when all 
rings are initialized, they will be initialized in disabled or enabled 
state”, which is how I think you’ve interpreted it.  The problem is that 
“initialization” isn’t well-defined here.
Even libvhost-user and DPDK initialize the rings always in disabled 
state, regardless of this feature, but will put them into an enabled 
state later on if the feature isn’t negotiated.  I think this exact 
behavior should be precisely described in the spec, like:
Between initialization and ``VHOST_USER_SET_FEATURES``, it is
implementation-defined whether each ring is enabled or disabled.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, each
ring, when started, will be enabled immediately.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, each ring
will remain in the disabled state until ``VHOST_USER_SET_VRING_ENABLE``
enables it with parameter 1.
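The wording proposed above could be modelled roughly as follows.  This is
only a sketch under assumptions: backend_model, ring_model and the model_*
functions are hypothetical and not taken from libvhost-user, DPDK or the
vhost-user-backend crate; the two bit numbers are the ones defined by the
vhost-user specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers from the vhost-user specification. */
#define VHOST_F_LOG_ALL                26
#define VHOST_USER_F_PROTOCOL_FEATURES 30

/* Hypothetical single-ring back-end model. */
struct ring_model {
    bool started;
    bool enabled;
};

struct backend_model {
    uint64_t features;
    struct ring_model ring;
};

static bool model_has_protocol_features(const struct backend_model *b)
{
    return (b->features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) != 0;
}

/* SET_FEATURES only records the feature set; it never disables a ring. */
static void model_set_features(struct backend_model *b, uint64_t features)
{
    b->features = features;
}

/*
 * Starting a ring: enabled immediately unless F_PROTOCOL_FEATURES has
 * been negotiated, in which case it stays in its current state.
 */
static void model_start_ring(struct backend_model *b)
{
    b->ring.started = true;
    if (!model_has_protocol_features(b)) {
        b->ring.enabled = true;
    }
}

/* SET_VRING_ENABLE: the only explicit way to change the ring state. */
static void model_set_vring_enable(struct backend_model *b, bool enable)
{
    b->ring.enabled = enable;
}
```

In this model, a later SET_FEATURES that only toggles VHOST_F_LOG_ALL
leaves an already enabled ring untouched, which is the behaviour migration
needs.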
Hanna
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 11:38       ` Stefan Hajnoczi
@ 2023-04-13 17:55         ` Hanna Czenczek
  2023-04-13 20:42           ` Stefan Hajnoczi
  2023-04-14 15:17           ` Eugenio Perez Martin
  0 siblings, 2 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-13 17:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella, Eugenio Pérez
On 13.04.23 13:38, Stefan Hajnoczi wrote:
> On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 12.04.23 23:06, Stefan Hajnoczi wrote:
>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>> from virtiofsd.
>>>>
>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>> is best to transfer it as a single binary blob after the streaming
>>>> phase.  Because this method should be useful to other vhost-user
>>>> implementations, too, it is introduced as a general-purpose addition to
>>>> the protocol, not limited to vhost-user-fs.
>>>>
>>>> These are the additions to the protocol:
>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>     This feature signals support for transferring state, and is added so
>>>>     that migration can fail early when the back-end has no support.
>>>>
>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>     over which to transfer the state.  The front-end sends an FD to the
>>>>     back-end into/from which it can write/read its state, and the back-end
>>>>     can decide to either use it, or reply with a different FD for the
>>>>     front-end to override the front-end's choice.
>>>>     The front-end creates a simple pipe to transfer the state, but maybe
>>>>     the back-end already has an FD into/from which it has to write/read
>>>>     its state, in which case it will want to override the simple pipe.
>>>>     Conversely, maybe in the future we find a way to have the front-end
>>>>     get an immediate FD for the migration stream (in some cases), in which
>>>>     case we will want to send this to the back-end instead of creating a
>>>>     pipe.
>>>>     Hence the negotiation: If one side has a better idea than a plain
>>>>     pipe, we will want to use that.
>>>>
>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>     pipe (the end indicated by EOF), the front-end invokes this function
>>>>     to verify success.  There is no in-band way (through the pipe) to
>>>>     indicate failure, so we need to check explicitly.
>>>>
>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>> (which includes establishing the direction of transfer and migration
>>>> phase), the sending side writes its data into the pipe, and the reading
>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>> checking for integrity (i.e. errors during deserialization).
>>>>
>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>> ---
>>>>    include/hw/virtio/vhost-backend.h |  24 +++++
>>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>    hw/virtio/vhost.c                 |  37 ++++++++
>>>>    4 files changed, 287 insertions(+)
>>>>
>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>> index ec3fbae58d..5935b32fe3 100644
>>>> --- a/include/hw/virtio/vhost-backend.h
>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>    } VhostSetConfigType;
>>>>
>>>> +typedef enum VhostDeviceStateDirection {
>>>> +    /* Transfer state from back-end (device) to front-end */
>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>> +    /* Transfer state from front-end to back-end (device) */
>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>> +} VhostDeviceStateDirection;
>>>> +
>>>> +typedef enum VhostDeviceStatePhase {
>>>> +    /* The device (and all its vrings) is stopped */
>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>> +} VhostDeviceStatePhase;
>>> vDPA has:
>>>
>>>     /* Suspend a device so it does not process virtqueue requests anymore
>>>      *
>>>      * After the return of ioctl the device must preserve all the necessary state
>>>      * (the virtqueue vring base plus the possible device specific states) that is
>>>      * required for restoring in the future. The device must not change its
>>>      * configuration after that point.
>>>      */
>>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>
>>>     /* Resume a device so it can resume processing virtqueue requests
>>>      *
>>>      * After the return of this ioctl the device will have restored all the
>>>      * necessary states and it is fully operational to continue processing the
>>>      * virtqueue descriptors.
>>>      */
>>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>
>>> I wonder if it makes sense to import these into vhost-user so that the
>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>> if one of them is ahead of the other, but it would be nice to avoid
>>> overlapping/duplicated functionality.
>>>
>>> (And I hope vDPA will import the device state vhost-user messages
>>> introduced in this series.)
>> I don’t understand your suggestion.  (Like, I very simply don’t
>> understand :))
>>
>> These are vhost messages, right?  What purpose do you have in mind for
>> them in vhost-user for internal migration?  They’re different from the
>> state transfer messages, because they don’t transfer state to/from the
>> front-end.  Also, the state transfer stuff is supposed to be distinct
>> from starting/stopping the device; right now, it just requires the
>> device to be stopped beforehand (or started only afterwards).  And in
>> the future, new VhostDeviceStatePhase values may allow the messages to
>> be used on devices that aren’t stopped.
>>
>> So they seem to serve very different purposes.  I can imagine using the
>> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
>> working on), but they don’t really help with internal migration
>> implemented here.  If I were to add them, they’d just be sent in
>> addition to the new messages added in this patch here, i.e. SUSPEND on
>> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
>> after CHECK_DEVICE_STATE (we could use RESUME in place of
>> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
>> source, so we still need CHECK_DEVICE_STATE).
> Yes, they are complementary to the device state fd message. I want to
> make sure pre-conditions about the device's state (running vs stopped)
> already take into account the vDPA SUSPEND/RESUME model.
>
> vDPA will need device state save/load in the future. For virtiofs
> devices, for example. This is why I think we should plan for vDPA and
> vhost-user to share the same interface.
While the paragraph below is more important, I don’t feel like this
would be important right now.  It’s clear that SUSPEND must come before
transferring any state, and that RESUME must come after transferring
state.  I don’t think we need to clarify this now; it’ll be obvious when
implementing SUSPEND/RESUME.
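Just to spell out the ordering I mean, here is a rough sketch.  The `SketchStep` enum and `sketch_order_is_valid()` are hypothetical names invented for this mail; only the message names are from the discussion.  It assumes SUSPEND is used on the source and RESUME on the destination, both optional.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step labels, declared in the order they must occur. */
typedef enum SketchStep {
    SKETCH_SUSPEND,              /* optional, source side */
    SKETCH_SET_DEVICE_STATE_FD,
    SKETCH_TRANSFER,             /* write/read the pipe until EOF */
    SKETCH_CHECK_DEVICE_STATE,
    SKETCH_RESUME,               /* optional, destination side */
} SketchStep;

/*
 * A sequence is valid if the steps strictly follow the enum order and the
 * three mandatory steps (set FD, transfer, check) are all present.
 */
static bool sketch_order_is_valid(const SketchStep *steps, size_t n)
{
    bool seen[SKETCH_RESUME + 1] = { false };
    int prev = -1;

    assert(steps != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int cur = (int)steps[i];

        if (cur < 0 || cur > SKETCH_RESUME || cur <= prev) {
            return false;
        }
        seen[cur] = true;
        prev = cur;
    }

    return seen[SKETCH_SET_DEVICE_STATE_FD] &&
           seen[SKETCH_TRANSFER] &&
           seen[SKETCH_CHECK_DEVICE_STATE];
}
```

So SUSPEND, if sent at all, can only precede SET_DEVICE_STATE_FD, and RESUME can only follow CHECK_DEVICE_STATE.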
> Also, I think the code path you're relying on (vhost_dev_stop()) on
> doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> because stopping the backend resets the device and throws away its
> state. SUSPEND/RESUME solve this. This looks like a more general
> problem since vhost_dev_stop() is called any time the VM is paused.
> Maybe it needs to use SUSPEND/RESUME whenever possible.
That’s a problem.  Quite a problem, to be honest, because this sounds
rather complicated with honestly absolutely no practical benefit right
now.
Would you require SUSPEND/RESUME for state transfer even if the back-end
does not implement GET/SET_STATUS?  Because then this would also lead to
more complexity in virtiofsd.
Basically, what I’m hearing is that I need to implement a different
feature that has no practical impact right now, and also fix bugs around
it along the way...
(Not that I have any better suggestion.)
Hanna
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 17:55         ` Hanna Czenczek
@ 2023-04-13 20:42           ` Stefan Hajnoczi
  2023-04-14 15:17           ` Eugenio Perez Martin
  1 sibling, 0 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-13 20:42 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella, Eugenio Pérez
On Thu, 13 Apr 2023 at 13:55, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>> from virtiofsd.
> >>>>
> >>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>> is best to transfer it as a single binary blob after the streaming
> >>>> phase.  Because this method should be useful to other vhost-user
> >>>> implementations, too, it is introduced as a general-purpose addition to
> >>>> the protocol, not limited to vhost-user-fs.
> >>>>
> >>>> These are the additions to the protocol:
> >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>     This feature signals support for transferring state, and is added so
> >>>>     that migration can fail early when the back-end has no support.
> >>>>
> >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>     over which to transfer the state.  The front-end sends an FD to the
> >>>>     back-end into/from which it can write/read its state, and the back-end
> >>>>     can decide to either use it, or reply with a different FD for the
> >>>>     front-end to override the front-end's choice.
> >>>>     The front-end creates a simple pipe to transfer the state, but maybe
> >>>>     the back-end already has an FD into/from which it has to write/read
> >>>>     its state, in which case it will want to override the simple pipe.
> >>>>     Conversely, maybe in the future we find a way to have the front-end
> >>>>     get an immediate FD for the migration stream (in some cases), in which
> >>>>     case we will want to send this to the back-end instead of creating a
> >>>>     pipe.
> >>>>     Hence the negotiation: If one side has a better idea than a plain
> >>>>     pipe, we will want to use that.
> >>>>
> >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>     pipe (the end indicated by EOF), the front-end invokes this function
> >>>>     to verify success.  There is no in-band way (through the pipe) to
> >>>>     indicate failure, so we need to check explicitly.
> >>>>
> >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>> (which includes establishing the direction of transfer and migration
> >>>> phase), the sending side writes its data into the pipe, and the reading
> >>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>> checking for integrity (i.e. errors during deserialization).
> >>>>
> >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>> ---
> >>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>    hw/virtio/vhost.c                 |  37 ++++++++
> >>>>    4 files changed, 287 insertions(+)
> >>>>
> >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>> index ec3fbae58d..5935b32fe3 100644
> >>>> --- a/include/hw/virtio/vhost-backend.h
> >>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>    } VhostSetConfigType;
> >>>>
> >>>> +typedef enum VhostDeviceStateDirection {
> >>>> +    /* Transfer state from back-end (device) to front-end */
> >>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>> +    /* Transfer state from front-end to back-end (device) */
> >>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>> +} VhostDeviceStateDirection;
> >>>> +
> >>>> +typedef enum VhostDeviceStatePhase {
> >>>> +    /* The device (and all its vrings) is stopped */
> >>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>> +} VhostDeviceStatePhase;
> >>> vDPA has:
> >>>
> >>>     /* Suspend a device so it does not process virtqueue requests anymore
> >>>      *
> >>>      * After the return of ioctl the device must preserve all the necessary state
> >>>      * (the virtqueue vring base plus the possible device specific states) that is
> >>>      * required for restoring in the future. The device must not change its
> >>>      * configuration after that point.
> >>>      */
> >>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>
> >>>     /* Resume a device so it can resume processing virtqueue requests
> >>>      *
> >>>      * After the return of this ioctl the device will have restored all the
> >>>      * necessary states and it is fully operational to continue processing the
> >>>      * virtqueue descriptors.
> >>>      */
> >>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>
> >>> I wonder if it makes sense to import these into vhost-user so that the
> >>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>> if one of them is ahead of the other, but it would be nice to avoid
> >>> overlapping/duplicated functionality.
> >>>
> >>> (And I hope vDPA will import the device state vhost-user messages
> >>> introduced in this series.)
> >> I don’t understand your suggestion.  (Like, I very simply don’t
> >> understand :))
> >>
> >> These are vhost messages, right?  What purpose do you have in mind for
> >> them in vhost-user for internal migration?  They’re different from the
> >> state transfer messages, because they don’t transfer state to/from the
> >> front-end.  Also, the state transfer stuff is supposed to be distinct
> >> from starting/stopping the device; right now, it just requires the
> >> device to be stopped beforehand (or started only afterwards).  And in
> >> the future, new VhostDeviceStatePhase values may allow the messages to
> >> be used on devices that aren’t stopped.
> >>
> >> So they seem to serve very different purposes.  I can imagine using the
> >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> >> working on), but they don’t really help with internal migration
> >> implemented here.  If I were to add them, they’d just be sent in
> >> addition to the new messages added in this patch here, i.e. SUSPEND on
> >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> >> after CHECK_DEVICE_STATE (we could use RESUME in place of
> >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> >> source, so we still need CHECK_DEVICE_STATE).
> > Yes, they are complementary to the device state fd message. I want to
> > make sure pre-conditions about the device's state (running vs stopped)
> > already take into account the vDPA SUSPEND/RESUME model.
> >
> > vDPA will need device state save/load in the future. For virtiofs
> > devices, for example. This is why I think we should plan for vDPA and
> > vhost-user to share the same interface.
>
> While the paragraph below is more important, I don’t feel like this
> would be important right now.  It’s clear that SUSPEND must come before
> transferring any state, and that RESUME must come after transferring
> state.  I don’t think we need to clarify this now, it’d be obvious when
> implementing SUSPEND/RESUME.
>
> > Also, I think the code path you're relying on (vhost_dev_stop()) on
> > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> > because stopping the backend resets the device and throws away its
> > state. SUSPEND/RESUME solve this. This looks like a more general
> > problem since vhost_dev_stop() is called any time the VM is paused.
> > Maybe it needs to use SUSPEND/RESUME whenever possible.
>
> That’s a problem.  Quite a problem, to be honest, because this sounds
> rather complicated with honestly absolutely no practical benefit right
> now.
>
> Would you require SUSPEND/RESUME for state transfer even if the back-end
> does not implement GET/SET_STATUS?  Because then this would also lead to
> more complexity in virtiofsd.
>
> Basically, what I’m hearing is that I need to implement a different
> feature that has no practical impact right now, and also fix bugs around
> it along the way...
Eugenio's input regarding the design of the vhost-user messages is
important. That way we know it can be ported to vDPA later.
There is some extra discussion and work here, but only on the design
of the interface. You shouldn't need to implement extra unused stuff.
Whoever needs it can do that later based on a design that left room to
eventually do iterative migration for vhost-user and vDPA (comparable
to VFIO's migration interface).
Since both vDPA (vhost kernel) and vhost-user are stable APIs, it will
be hard to make significant design changes later without breaking all
existing implementations. That's why I think we should think ahead.
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 17:55         ` Hanna Czenczek
  2023-04-13 20:42           ` Stefan Hajnoczi
@ 2023-04-14 15:17           ` Eugenio Perez Martin
  2023-04-17 15:18             ` Stefan Hajnoczi
  2023-04-19 10:45             ` Hanna Czenczek
  1 sibling, 2 replies; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-14 15:17 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>> from virtiofsd.
> >>>>
> >>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>> is best to transfer it as a single binary blob after the streaming
> >>>> phase.  Because this method should be useful to other vhost-user
> >>>> implementations, too, it is introduced as a general-purpose addition to
> >>>> the protocol, not limited to vhost-user-fs.
> >>>>
> >>>> These are the additions to the protocol:
> >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>     This feature signals support for transferring state, and is added so
> >>>>     that migration can fail early when the back-end has no support.
> >>>>
> >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>     over which to transfer the state.  The front-end sends an FD to the
> >>>>     back-end into/from which it can write/read its state, and the back-end
> >>>>     can decide to either use it, or reply with a different FD for the
> >>>>     front-end to override the front-end's choice.
> >>>>     The front-end creates a simple pipe to transfer the state, but maybe
> >>>>     the back-end already has an FD into/from which it has to write/read
> >>>>     its state, in which case it will want to override the simple pipe.
> >>>>     Conversely, maybe in the future we find a way to have the front-end
> >>>>     get an immediate FD for the migration stream (in some cases), in which
> >>>>     case we will want to send this to the back-end instead of creating a
> >>>>     pipe.
> >>>>     Hence the negotiation: If one side has a better idea than a plain
> >>>>     pipe, we will want to use that.
> >>>>
> >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>     pipe (the end indicated by EOF), the front-end invokes this function
> >>>>     to verify success.  There is no in-band way (through the pipe) to
> >>>>     indicate failure, so we need to check explicitly.
> >>>>
> >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>> (which includes establishing the direction of transfer and migration
> >>>> phase), the sending side writes its data into the pipe, and the reading
> >>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>> checking for integrity (i.e. errors during deserialization).
> >>>>
> >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>> ---
> >>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>    hw/virtio/vhost.c                 |  37 ++++++++
> >>>>    4 files changed, 287 insertions(+)
> >>>>
> >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>> index ec3fbae58d..5935b32fe3 100644
> >>>> --- a/include/hw/virtio/vhost-backend.h
> >>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>    } VhostSetConfigType;
> >>>>
> >>>> +typedef enum VhostDeviceStateDirection {
> >>>> +    /* Transfer state from back-end (device) to front-end */
> >>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>> +    /* Transfer state from front-end to back-end (device) */
> >>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>> +} VhostDeviceStateDirection;
> >>>> +
> >>>> +typedef enum VhostDeviceStatePhase {
> >>>> +    /* The device (and all its vrings) is stopped */
> >>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>> +} VhostDeviceStatePhase;
> >>> vDPA has:
> >>>
> >>>     /* Suspend a device so it does not process virtqueue requests anymore
> >>>      *
> >>>      * After the return of ioctl the device must preserve all the necessary state
> >>>      * (the virtqueue vring base plus the possible device specific states) that is
> >>>      * required for restoring in the future. The device must not change its
> >>>      * configuration after that point.
> >>>      */
> >>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>
> >>>     /* Resume a device so it can resume processing virtqueue requests
> >>>      *
> >>>      * After the return of this ioctl the device will have restored all the
> >>>      * necessary states and it is fully operational to continue processing the
> >>>      * virtqueue descriptors.
> >>>      */
> >>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>
> >>> I wonder if it makes sense to import these into vhost-user so that the
> >>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>> if one of them is ahead of the other, but it would be nice to avoid
> >>> overlapping/duplicated functionality.
> >>>
> >>> (And I hope vDPA will import the device state vhost-user messages
> >>> introduced in this series.)
> >> I don’t understand your suggestion.  (Like, I very simply don’t
> >> understand :))
> >>
> >> These are vhost messages, right?  What purpose do you have in mind for
> >> them in vhost-user for internal migration?  They’re different from the
> >> state transfer messages, because they don’t transfer state to/from the
> >> front-end.  Also, the state transfer stuff is supposed to be distinct
> >> from starting/stopping the device; right now, it just requires the
> >> device to be stopped beforehand (or started only afterwards).  And in
> >> the future, new VhostDeviceStatePhase values may allow the messages to
> >> be used on devices that aren’t stopped.
> >>
> >> So they seem to serve very different purposes.  I can imagine using the
> >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> >> working on), but they don’t really help with internal migration
> >> implemented here.  If I were to add them, they’d just be sent in
> >> addition to the new messages added in this patch here, i.e. SUSPEND on
> >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> >> after CHECK_DEVICE_STATE (we could use RESUME in place of
> >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> >> source, so we still need CHECK_DEVICE_STATE).
> > Yes, they are complementary to the device state fd message. I want to
> > make sure pre-conditions about the device's state (running vs stopped)
> > already take into account the vDPA SUSPEND/RESUME model.
> >
> > vDPA will need device state save/load in the future. For virtiofs
> > devices, for example. This is why I think we should plan for vDPA and
> > vhost-user to share the same interface.
>
> While the paragraph below is more important, I don’t feel like this
> would be important right now.  It’s clear that SUSPEND must come before
> transferring any state, and that RESUME must come after transferring
> state.  I don’t think we need to clarify this now, it’d be obvious when
> implementing SUSPEND/RESUME.
>
> > Also, I think the code path you're relying on (vhost_dev_stop()) on
> > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> > because stopping the backend resets the device and throws away its
> > state. SUSPEND/RESUME solve this. This looks like a more general
> > problem since vhost_dev_stop() is called any time the VM is paused.
> > Maybe it needs to use SUSPEND/RESUME whenever possible.
>
> That’s a problem.  Quite a problem, to be honest, because this sounds
> rather complicated with honestly absolutely no practical benefit right
> now.
>
> Would you require SUSPEND/RESUME for state transfer even if the back-end
> does not implement GET/SET_STATUS?  Because then this would also lead to
> more complexity in virtiofsd.
>
At the moment, vhost-user net in DPDK suspends at
VHOST_GET_VRING_BASE. That is not the same case, though, as only the vq
indexes / wrap bits are transferred there.
Vhost-vdpa implements the suspend call so that it does not need to trust
VHOST_GET_VRING_BASE to be the last vring call made. Since virtiofsd
uses vhost-user, it may not actually need to implement it.
> Basically, what I’m hearing is that I need to implement a different
> feature that has no practical impact right now, and also fix bugs around
> it along the way...
>
As far as I know, fixing this properly requires iterative device
migration in qemu, instead of using VMStates [1]. This way, the state is
requested from virtiofsd before the device is reset.
What does virtiofsd do once the state has been sent completely? Does it
keep processing requests and generating new state, or is it a one-shot
operation that suspends the daemon? If it is the latter, I think it can
still be done in one shot at the end, always indicating "no more state" at
save_live_pending and sending all the state at
save_live_complete_precopy.
Does that make sense to you?
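A rough sketch of what I mean by the one-shot variant follows.  The `Sketch*` types and `sketch_*` functions are hypothetical stand-ins and do not use qemu's real handler signatures; they only mirror the roles of save_live_pending and save_live_complete_precopy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the migration stream. */
typedef struct SketchStream {
    unsigned char buf[256];
    size_t len;
} SketchStream;

/* Hypothetical stand-in for the back-end's serialized state. */
typedef struct SketchDevice {
    const unsigned char *state;
    size_t state_len;
} SketchDevice;

/*
 * Role of save_live_pending: always report that nothing is left to send
 * iteratively, so the migration never waits on this device.
 */
static size_t sketch_save_pending(const SketchDevice *dev)
{
    assert(dev != NULL);
    return 0;
}

/*
 * Role of save_live_complete_precopy: send the whole state in one go.
 * Returns 0 on success, -1 if the state does not fit into the stream.
 */
static int sketch_save_complete(const SketchDevice *dev, SketchStream *out)
{
    assert(dev != NULL && out != NULL);

    if (dev->state_len > sizeof(out->buf) - out->len) {
        return -1;
    }
    memcpy(out->buf + out->len, dev->state, dev->state_len);
    out->len += dev->state_len;
    return 0;
}
```

The point is only that the state is fetched at completion time, before the device is reset, while the iterative phase reports nothing pending.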
Thanks!
[1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 17:31       ` Hanna Czenczek
@ 2023-04-17 15:12         ` Stefan Hajnoczi
  2023-04-19 10:47           ` Hanna Czenczek
  2023-04-17 18:37         ` Eugenio Perez Martin
  1 sibling, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-17 15:12 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Eugenio Perez Martin, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, Apr 13, 2023 at 07:31:57PM +0200, Hanna Czenczek wrote:
> On 13.04.23 12:14, Eugenio Perez Martin wrote:
> > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > from virtiofsd.
> > > > 
> > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > is best to transfer it as a single binary blob after the streaming
> > > > phase.  Because this method should be useful to other vhost-user
> > > > implementations, too, it is introduced as a general-purpose addition to
> > > > the protocol, not limited to vhost-user-fs.
> > > > 
> > > > These are the additions to the protocol:
> > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > >    This feature signals support for transferring state, and is added so
> > > >    that migration can fail early when the back-end has no support.
> > > > 
> > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > >    over which to transfer the state.  The front-end sends an FD to the
> > > >    back-end into/from which it can write/read its state, and the back-end
> > > >    can decide to either use it, or reply with a different FD for the
> > > >    front-end to override the front-end's choice.
> > > >    The front-end creates a simple pipe to transfer the state, but maybe
> > > >    the back-end already has an FD into/from which it has to write/read
> > > >    its state, in which case it will want to override the simple pipe.
> > > >    Conversely, maybe in the future we find a way to have the front-end
> > > >    get an immediate FD for the migration stream (in some cases), in which
> > > >    case we will want to send this to the back-end instead of creating a
> > > >    pipe.
> > > >    Hence the negotiation: If one side has a better idea than a plain
> > > >    pipe, we will want to use that.
> > > > 
> > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > >    pipe (the end indicated by EOF), the front-end invokes this function
> > > >    to verify success.  There is no in-band way (through the pipe) to
> > > >    indicate failure, so we need to check explicitly.
> > > > 
> > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > (which includes establishing the direction of transfer and migration
> > > > phase), the sending side writes its data into the pipe, and the reading
> > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > checking for integrity (i.e. errors during deserialization).
> > > > 
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > >   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > >   4 files changed, 287 insertions(+)
> > > > 
> > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > index ec3fbae58d..5935b32fe3 100644
> > > > --- a/include/hw/virtio/vhost-backend.h
> > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > >   } VhostSetConfigType;
> > > > 
> > > > +typedef enum VhostDeviceStateDirection {
> > > > +    /* Transfer state from back-end (device) to front-end */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > +    /* Transfer state from front-end to back-end (device) */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > +} VhostDeviceStateDirection;
> > > > +
> > > > +typedef enum VhostDeviceStatePhase {
> > > > +    /* The device (and all its vrings) is stopped */
> > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > +} VhostDeviceStatePhase;
> > > vDPA has:
> > > 
> > >    /* Suspend a device so it does not process virtqueue requests anymore
> > >     *
> > >     * After the return of ioctl the device must preserve all the necessary state
> > >     * (the virtqueue vring base plus the possible device specific states) that is
> > >     * required for restoring in the future. The device must not change its
> > >     * configuration after that point.
> > >     */
> > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > 
> > >    /* Resume a device so it can resume processing virtqueue requests
> > >     *
> > >     * After the return of this ioctl the device will have restored all the
> > >     * necessary states and it is fully operational to continue processing the
> > >     * virtqueue descriptors.
> > >     */
> > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > 
> > > I wonder if it makes sense to import these into vhost-user so that the
> > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > if one of them is ahead of the other, but it would be nice to avoid
> > > overlapping/duplicated functionality.
> > > 
> > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > to SUSPEND.
> > 
> > In my opinion, it is generally better to make the interface less
> > parametrized and to rely on the messages and their semantics. In other
> > words, instead of
> > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > the vhost-user equivalent of VHOST_VDPA_SUSPEND as a separate command.
> 
> I.e. you mean that this should simply be stateful instead of
> re-affirming the current state with a parameter?
> 
> The problem I see is that transferring states in different phases of
> migration will require specialized implementations.  So running
> SET_DEVICE_STATE_FD in a different phase will require support from the
> back-end.  Same in the front-end, the exact protocol and thus
> implementation will (probably, difficult to say at this point) depend on
> the migration phase.  I would therefore prefer to have an explicit
> distinction in the command itself that affirms the phase we’re
> targeting.
> 
> On the other hand, I don’t see the parameter complicating anything. The
> front-end must supply it, but it will know the phase anyway, so this is
> easy.  The back-end can just choose to ignore it, if it doesn’t feel the
> need to verify that the phase is what it thinks it is.
> 
> > Another way to apply this is with the "direction" parameter. Maybe it
> > is better to split it into "set_state_fd" and "get_state_fd"?
> 
> Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`.
> We always negotiate a pipe between front-end and back-end, the question
> is just whether the back-end gets the receiving (load) or the sending
> (save) end.
> 
> Technically, one can make it fully stateful and say that if the device
> hasn’t been started already, it’s always a LOAD, and otherwise always a
> SAVE.  But as above, I’d prefer to keep the parameter because the
> implementations are different, so I’d prefer there to be a
> re-affirmation that front-end and back-end are in sync about what should
> be done.
> 
> Personally, I don’t really see the advantage of having two functions
> instead of one function with an enum with two values.  The thing about
> SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of
> whether we’re loading or saving, it just negotiates the pipe – the
> difference is what happens after the pipe has been negotiated.  So if we
> split the function into two, both implementations will share most of
> their code anyway, which makes me think it should be a single function.
I also don't really see an advantage to defining separate messages as
long as SET_DEVICE_STATE_FD just sets up the pipe. If there are other
arguments that differ depending on the state/direction, then it's nicer
to have separate messages so that the argument type remains simple (not
a union).
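
To illustrate what happens once the pipe exists, here is a minimal C
sketch of the transfer itself, assuming a plain POSIX pipe: the sending
side writes its blob and closes its end, and the receiving side reads
until EOF. The helper names (write_all, read_until_eof,
transfer_state_demo) are made up for this example and are not part of
QEMU or virtiofsd. Success would still have to be confirmed out of band
(CHECK_DEVICE_STATE in this series), since EOF alone cannot signal
failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf to fd; returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Read from fd into buf until EOF.  Returns the number of bytes read,
 * or -1 on a read error or if the sender wrote more than cap bytes.
 */
static ssize_t read_until_eof(int fd, void *buf, size_t cap)
{
    char *p = buf;
    size_t total = 0;

    for (;;) {
        char overflow;
        /* Once the buffer is full, probe with a 1-byte read for EOF. */
        void *dst = total < cap ? (void *)(p + total) : (void *)&overflow;
        size_t want = total < cap ? cap - total : 1;
        ssize_t n = read(fd, dst, want);

        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        if (n == 0) {
            return (ssize_t)total;      /* EOF: sender closed its end */
        }
        if (total == cap) {
            return -1;                  /* more data than buffer space */
        }
        total += (size_t)n;
    }
}

/*
 * Hypothetical single-process demo: push a state blob through a pipe
 * and read it back.  This only works because the blob is smaller than
 * the pipe buffer; real sender and receiver are separate processes.
 */
static ssize_t transfer_state_demo(const void *state, size_t len,
                                   void *out, size_t cap)
{
    int fds[2];
    int ret;
    ssize_t got;

    if (pipe(fds) < 0) {
        return -1;
    }
    ret = write_all(fds[1], state, len);
    close(fds[1]);      /* closing the write end is what produces EOF */
    got = ret == 0 ? read_until_eof(fds[0], out, cap) : -1;
    close(fds[0]);
    return got;
}
```

Note that nothing in this code depends on the direction: save and load
differ only in which side holds which end of the pipe.
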
This brings to mind how iterative migration will work. The interface for
iterative migration is basically the same as non-iterative migration
plus a method to query the number of bytes remaining. When the number of
bytes falls below a threshold, the vCPUs are stopped and the remainder
of the data is read.
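
As a rough sketch of that flow in C (a toy model only: StateSource,
bytes_remaining, transfer_chunk, migrate_iteratively, the dirty rate and
the round limit are all hypothetical and not part of QEMU or the
vhost-user protocol), the loop could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a device whose state is drained in chunks. */
typedef struct StateSource {
    size_t remaining;   /* bytes of state not yet transferred */
    size_t dirty_rate;  /* bytes of new state per round while running */
    bool running;       /* whether the vCPUs (and device) still run */
} StateSource;

/* The "query the number of bytes remaining" method. */
static size_t bytes_remaining(const StateSource *s)
{
    return s->remaining;
}

/* Transfer up to max bytes; a running device generates new state. */
static size_t transfer_chunk(StateSource *s, size_t max)
{
    size_t n = s->remaining < max ? s->remaining : max;

    s->remaining -= n;
    if (s->running) {
        s->remaining += s->dirty_rate;
    }
    return n;
}

/* Returns the total number of bytes transferred. */
static size_t migrate_iteratively(StateSource *s, size_t chunk,
                                  size_t threshold, unsigned max_rounds)
{
    size_t total = 0;
    unsigned i;

    assert(chunk > 0);

    /* Iterative phase: transfer while the device keeps running.  The
     * round limit guarantees termination if state is produced faster
     * than it is transferred. */
    for (i = 0; i < max_rounds && bytes_remaining(s) >= threshold; i++) {
        total += transfer_chunk(s, chunk);
    }

    /* Below the threshold (or out of rounds): stop the vCPUs and read
     * the remainder, which can no longer grow. */
    s->running = false;
    while (bytes_remaining(s) > 0) {
        total += transfer_chunk(s, chunk);
    }
    return total;
}
```
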
Some details from VFIO migration:

- The VMM must explicitly change the state when transitioning from
  iterative to non-iterative migration, but the data transfer fd
  remains the same.
- The state of the device (running, stopped, resuming, etc) doesn't
  change asynchronously, it's always driven by the VMM. However, setting
  the state can fail and then the new state may be an error state.

Mapping this to SET_DEVICE_STATE_FD:
- VhostDeviceStatePhase is extended with
  VHOST_TRANSFER_STATE_PHASE_RUNNING = 1 for iterative migration. The
  frontend sends SET_DEVICE_STATE_FD again with
  VHOST_TRANSFER_STATE_PHASE_STOPPED when entering non-iterative
  migration, passing the iterative fd from the previous
  SET_DEVICE_STATE_FD call to the backend. The backend may reply with
  another fd, if necessary. If the backend changes the fd, then the
  contents of the previous fd must be fully read and transferred before
  the contents of the new fd are migrated. (Maybe this is too complex
  and we should forbid changing the fd when going from RUNNING ->
  STOPPED.)
- CHECK_DEVICE_STATE can be extended to report the number of bytes
  remaining. The semantics change so that CHECK_DEVICE_STATE can be
  called while the VMM is still reading from the fd. It becomes:
    enum CheckDeviceStateResult {
        Saving(bytes_remaining : usize),
        Failed(error_code : u64),
    }
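
Since QEMU is written in C, the same result type could be modelled as a
tagged union. This is only a sketch under my own naming
(CheckDeviceStateTag, check_saving, check_failed and should_stop_vcpus
are hypothetical), showing how a caller might combine the Saving case
with the threshold check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum CheckDeviceStateTag {
    CHECK_DEVICE_STATE_SAVING,  /* transfer in progress */
    CHECK_DEVICE_STATE_FAILED,  /* back-end reported an error */
} CheckDeviceStateTag;

typedef struct CheckDeviceStateResult {
    CheckDeviceStateTag tag;
    union {
        size_t bytes_remaining; /* valid if tag == ..._SAVING */
        uint64_t error_code;    /* valid if tag == ..._FAILED */
    } u;
} CheckDeviceStateResult;

static CheckDeviceStateResult check_saving(size_t bytes_remaining)
{
    CheckDeviceStateResult r = { .tag = CHECK_DEVICE_STATE_SAVING };

    r.u.bytes_remaining = bytes_remaining;
    return r;
}

static CheckDeviceStateResult check_failed(uint64_t error_code)
{
    CheckDeviceStateResult r = { .tag = CHECK_DEVICE_STATE_FAILED };

    r.u.error_code = error_code;
    return r;
}

/* True once the remaining state is small enough to stop the vCPUs.
 * A failed result never triggers the switch; the caller must handle
 * the error separately. */
static bool should_stop_vcpus(const CheckDeviceStateResult *r,
                              size_t threshold)
{
    return r->tag == CHECK_DEVICE_STATE_SAVING &&
           r->u.bytes_remaining < threshold;
}
```
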
> > In that case, reusing the ioctls as vhost-user messages would be ok.
> > But that puts this proposal further from the VFIO code, which uses
> > "migration_set_state(state)", and maybe it is better when the number
> > of states is high.
> 
> I’m not sure what you mean (because I don’t know the VFIO code, I
> assume).  Are you saying that using a more finely grained
> migration_set_state() model would conflict with the rather coarse
> suspend/resume?
I think VFIO is already different because vDPA has SUSPEND/RESUME,
whereas VFIO controls the state via the VFIO_DEVICE_FEATURE ioctl with
VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE (which is similar but not identical
to SET_DEVICE_STATE_FD in this patch series).
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-14 15:17           ` Eugenio Perez Martin
@ 2023-04-17 15:18             ` Stefan Hajnoczi
  2023-04-17 18:55               ` Eugenio Perez Martin
  2023-04-19 10:45             ` Hanna Czenczek
  1 sibling, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-17 15:18 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Hanna Czenczek, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> >
> > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > >>>> So-called "internal" virtio-fs migration refers to transporting the
> > >>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > >>>> this, we need to be able to transfer virtiofsd's internal state to and
> > >>>> from virtiofsd.
> > >>>>
> > >>>> Because virtiofsd's internal state will not be too large, we believe it
> > >>>> is best to transfer it as a single binary blob after the streaming
> > >>>> phase.  Because this method should be useful to other vhost-user
> > >>>> implementations, too, it is introduced as a general-purpose addition to
> > >>>> the protocol, not limited to vhost-user-fs.
> > >>>>
> > >>>> These are the additions to the protocol:
> > >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > >>>>     This feature signals support for transferring state, and is added so
> > >>>>     that migration can fail early when the back-end has no support.
> > >>>>
> > >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > >>>>     over which to transfer the state.  The front-end sends an FD to the
> > >>>>     back-end into/from which it can write/read its state, and the back-end
> > >>>>     can decide to either use it, or reply with a different FD for the
> > >>>>     front-end to override the front-end's choice.
> > >>>>     The front-end creates a simple pipe to transfer the state, but maybe
> > >>>>     the back-end already has an FD into/from which it has to write/read
> > >>>>     its state, in which case it will want to override the simple pipe.
> > >>>>     Conversely, maybe in the future we find a way to have the front-end
> > >>>>     get an immediate FD for the migration stream (in some cases), in which
> > >>>>     case we will want to send this to the back-end instead of creating a
> > >>>>     pipe.
> > >>>>     Hence the negotiation: If one side has a better idea than a plain
> > >>>>     pipe, we will want to use that.
> > >>>>
> > >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> > >>>>     pipe (the end indicated by EOF), the front-end invokes this function
> > >>>>     to verify success.  There is no in-band way (through the pipe) to
> > >>>>     indicate failure, so we need to check explicitly.
> > >>>>
> > >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > >>>> (which includes establishing the direction of transfer and migration
> > >>>> phase), the sending side writes its data into the pipe, and the reading
> > >>>> side reads it until it sees an EOF.  Then, the front-end will check for
> > >>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> > >>>> checking for integrity (i.e. errors during deserialization).
> > >>>>
> > >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > >>>> ---
> > >>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> > >>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > >>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > >>>>    hw/virtio/vhost.c                 |  37 ++++++++
> > >>>>    4 files changed, 287 insertions(+)
> > >>>>
> > >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > >>>> index ec3fbae58d..5935b32fe3 100644
> > >>>> --- a/include/hw/virtio/vhost-backend.h
> > >>>> +++ b/include/hw/virtio/vhost-backend.h
> > >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > >>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > >>>>    } VhostSetConfigType;
> > >>>>
> > >>>> +typedef enum VhostDeviceStateDirection {
> > >>>> +    /* Transfer state from back-end (device) to front-end */
> > >>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > >>>> +    /* Transfer state from front-end to back-end (device) */
> > >>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > >>>> +} VhostDeviceStateDirection;
> > >>>> +
> > >>>> +typedef enum VhostDeviceStatePhase {
> > >>>> +    /* The device (and all its vrings) is stopped */
> > >>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > >>>> +} VhostDeviceStatePhase;
> > >>> vDPA has:
> > >>>
> > >>>     /* Suspend a device so it does not process virtqueue requests anymore
> > >>>      *
> > >>>      * After the return of ioctl the device must preserve all the necessary state
> > >>>      * (the virtqueue vring base plus the possible device specific states) that is
> > >>>      * required for restoring in the future. The device must not change its
> > >>>      * configuration after that point.
> > >>>      */
> > >>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > >>>
> > >>>     /* Resume a device so it can resume processing virtqueue requests
> > >>>      *
> > >>>      * After the return of this ioctl the device will have restored all the
> > >>>      * necessary states and it is fully operational to continue processing the
> > >>>      * virtqueue descriptors.
> > >>>      */
> > >>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > >>>
> > >>> I wonder if it makes sense to import these into vhost-user so that the
> > >>> difference between kernel vhost and vhost-user is minimized. It's okay
> > >>> if one of them is ahead of the other, but it would be nice to avoid
> > >>> overlapping/duplicated functionality.
> > >>>
> > >>> (And I hope vDPA will import the device state vhost-user messages
> > >>> introduced in this series.)
> > >> I don’t understand your suggestion.  (Like, I very simply don’t
> > >> understand :))
> > >>
> > >> These are vhost messages, right?  What purpose do you have in mind for
> > >> them in vhost-user for internal migration?  They’re different from the
> > >> state transfer messages, because they don’t transfer state to/from the
> > >> front-end.  Also, the state transfer stuff is supposed to be distinct
> > >> from starting/stopping the device; right now, it just requires the
> > >> device to be stopped beforehand (or started only afterwards).  And in
> > >> the future, new VhostDeviceStatePhase values may allow the messages to
> > >> be used on devices that aren’t stopped.
> > >>
> > >> So they seem to serve very different purposes.  I can imagine using the
> > >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> > >> working on), but they don’t really help with internal migration
> > >> implemented here.  If I were to add them, they’d just be sent in
> > >> addition to the new messages added in this patch here, i.e. SUSPEND on
> > >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> > >> after CHECK_DEVICE_STATE (we could use RESUME in place of
> > >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> > >> source, so we still need CHECK_DEVICE_STATE).
> > > Yes, they are complementary to the device state fd message. I want to
> > > make sure pre-conditions about the device's state (running vs stopped)
> > > already take into account the vDPA SUSPEND/RESUME model.
> > >
> > > vDPA will need device state save/load in the future. For virtiofs
> > > devices, for example. This is why I think we should plan for vDPA and
> > > vhost-user to share the same interface.
> >
> > While the paragraph below is more important, I don’t feel like this
> > would be important right now.  It’s clear that SUSPEND must come before
> > transferring any state, and that RESUME must come after transferring
> > state.  I don’t think we need to clarify this now, it’d be obvious when
> > implementing SUSPEND/RESUME.
> >
> > > Also, I think the code path you're relying on (vhost_dev_stop())
> > > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> > > because stopping the backend resets the device and throws away its
> > > state. SUSPEND/RESUME solve this. This looks like a more general
> > > problem since vhost_dev_stop() is called any time the VM is paused.
> > > Maybe it needs to use SUSPEND/RESUME whenever possible.
> >
> > That’s a problem.  Quite a problem, to be honest, because this sounds
> > rather complicated with honestly absolutely no practical benefit right
> > now.
> >
> > Would you require SUSPEND/RESUME for state transfer even if the back-end
> > does not implement GET/SET_STATUS?  Because then this would also lead to
> > more complexity in virtiofsd.
> >
> 
> At this moment the vhost-user net in DPDK suspends at
> VHOST_GET_VRING_BASE. Not the same case though, as only the vq
> indexes / wrap bits are transferred here.
> 
> Vhost-vdpa implements the suspend call so it does not need to trust
> VHOST_GET_VRING_BASE to be the last vring call done. Since virtiofsd
> is using vhost-user, maybe it does not actually need to implement it.
Careful, if we deliberately make vhost-user and vDPA diverge, then it
will be hard to share the migration interface.
> > Basically, what I’m hearing is that I need to implement a different
> > feature that has no practical impact right now, and also fix bugs around
> > it along the way...
> >
> 
> As far as I know, fixing this properly requires iterative device
> migration in qemu instead of VMStates [1]. This way the state is
> requested from virtiofsd before the device reset.
I don't follow. Many devices are fine with non-iterative migration. They
shouldn't be forced to do iterative migration.
> What does virtiofsd do once the state has been sent completely? Does it
> keep processing requests and generating new state, or is it a one-shot
> transfer that suspends the daemon? If it is the latter, I think it can
> still be done in one shot at the end, always indicating "no more state"
> at save_live_pending and sending all the state at
> save_live_complete_precopy.
> 
> Does that make sense to you?
> 
> Thanks!
> 
> [1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration
> 
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 10:14     ` Eugenio Perez Martin
  2023-04-13 11:07       ` Stefan Hajnoczi
  2023-04-13 17:31       ` Hanna Czenczek
@ 2023-04-17 15:38       ` Stefan Hajnoczi
  2023-04-17 19:09         ` Eugenio Perez Martin
  2023-04-17 17:14       ` Stefan Hajnoczi
  3 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-17 15:38 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Hanna Czenczek, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > So-called "internal" virtio-fs migration refers to transporting the
> > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > this, we need to be able to transfer virtiofsd's internal state to and
> > > from virtiofsd.
> > >
> > > Because virtiofsd's internal state will not be too large, we believe it
> > > is best to transfer it as a single binary blob after the streaming
> > > phase.  Because this method should be useful to other vhost-user
> > > implementations, too, it is introduced as a general-purpose addition to
> > > the protocol, not limited to vhost-user-fs.
> > >
> > > These are the additions to the protocol:
> > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > >   This feature signals support for transferring state, and is added so
> > >   that migration can fail early when the back-end has no support.
> > >
> > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > >   over which to transfer the state.  The front-end sends an FD to the
> > >   back-end into/from which it can write/read its state, and the back-end
> > >   can decide to either use it, or reply with a different FD for the
> > >   front-end to override the front-end's choice.
> > >   The front-end creates a simple pipe to transfer the state, but maybe
> > >   the back-end already has an FD into/from which it has to write/read
> > >   its state, in which case it will want to override the simple pipe.
> > >   Conversely, maybe in the future we find a way to have the front-end
> > >   get an immediate FD for the migration stream (in some cases), in which
> > >   case we will want to send this to the back-end instead of creating a
> > >   pipe.
> > >   Hence the negotiation: If one side has a better idea than a plain
> > >   pipe, we will want to use that.
> > >
> > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > >   pipe (the end indicated by EOF), the front-end invokes this function
> > >   to verify success.  There is no in-band way (through the pipe) to
> > >   indicate failure, so we need to check explicitly.
> > >
> > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > (which includes establishing the direction of transfer and migration
> > > phase), the sending side writes its data into the pipe, and the reading
> > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > checking for integrity (i.e. errors during deserialization).
> > >
> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > >  hw/virtio/vhost.c                 |  37 ++++++++
> > >  4 files changed, 287 insertions(+)
> > >
> > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > index ec3fbae58d..5935b32fe3 100644
> > > --- a/include/hw/virtio/vhost-backend.h
> > > +++ b/include/hw/virtio/vhost-backend.h
> > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > >  } VhostSetConfigType;
> > >
> > > +typedef enum VhostDeviceStateDirection {
> > > +    /* Transfer state from back-end (device) to front-end */
> > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > +    /* Transfer state from front-end to back-end (device) */
> > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > +} VhostDeviceStateDirection;
> > > +
> > > +typedef enum VhostDeviceStatePhase {
> > > +    /* The device (and all its vrings) is stopped */
> > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > +} VhostDeviceStatePhase;
> >
> > vDPA has:
> >
> >   /* Suspend a device so it does not process virtqueue requests anymore
> >    *
> >    * After the return of ioctl the device must preserve all the necessary state
> >    * (the virtqueue vring base plus the possible device specific states) that is
> >    * required for restoring in the future. The device must not change its
> >    * configuration after that point.
> >    */
> >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >
> >   /* Resume a device so it can resume processing virtqueue requests
> >    *
> >    * After the return of this ioctl the device will have restored all the
> >    * necessary states and it is fully operational to continue processing the
> >    * virtqueue descriptors.
> >    */
> >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >
> > I wonder if it makes sense to import these into vhost-user so that the
> > difference between kernel vhost and vhost-user is minimized. It's okay
> > if one of them is ahead of the other, but it would be nice to avoid
> > overlapping/duplicated functionality.
> >
> 
> That's what I had in mind in the first versions. I proposed VHOST_STOP
> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> to SUSPEND.
I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
ioctl(VHOST_VDPA_RESUME).

The doc comments in <linux/vhost.h> don't explain how the device can
leave the suspended state. Can you clarify this?
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 10:14     ` Eugenio Perez Martin
                         ` (2 preceding siblings ...)
  2023-04-17 15:38       ` Stefan Hajnoczi
@ 2023-04-17 17:14       ` Stefan Hajnoczi
  2023-04-17 19:06         ` Eugenio Perez Martin
  3 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-17 17:14 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Hanna Czenczek, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > So-called "internal" virtio-fs migration refers to transporting the
> > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > this, we need to be able to transfer virtiofsd's internal state to and
> > > from virtiofsd.
> > >
> > > Because virtiofsd's internal state will not be too large, we believe it
> > > is best to transfer it as a single binary blob after the streaming
> > > phase.  Because this method should be useful to other vhost-user
> > > implementations, too, it is introduced as a general-purpose addition to
> > > the protocol, not limited to vhost-user-fs.
> > >
> > > These are the additions to the protocol:
> > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > >   This feature signals support for transferring state, and is added so
> > >   that migration can fail early when the back-end has no support.
> > >
> > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > >   over which to transfer the state.  The front-end sends an FD to the
> > >   back-end into/from which it can write/read its state, and the back-end
> > >   can decide to either use it, or reply with a different FD for the
> > >   front-end to override the front-end's choice.
> > >   The front-end creates a simple pipe to transfer the state, but maybe
> > >   the back-end already has an FD into/from which it has to write/read
> > >   its state, in which case it will want to override the simple pipe.
> > >   Conversely, maybe in the future we find a way to have the front-end
> > >   get an immediate FD for the migration stream (in some cases), in which
> > >   case we will want to send this to the back-end instead of creating a
> > >   pipe.
> > >   Hence the negotiation: If one side has a better idea than a plain
> > >   pipe, we will want to use that.
> > >
> > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > >   pipe (the end indicated by EOF), the front-end invokes this function
> > >   to verify success.  There is no in-band way (through the pipe) to
> > >   indicate failure, so we need to check explicitly.
> > >
> > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > (which includes establishing the direction of transfer and migration
> > > phase), the sending side writes its data into the pipe, and the reading
> > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > checking for integrity (i.e. errors during deserialization).
> > >
> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > >  hw/virtio/vhost.c                 |  37 ++++++++
> > >  4 files changed, 287 insertions(+)
> > >
> > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > index ec3fbae58d..5935b32fe3 100644
> > > --- a/include/hw/virtio/vhost-backend.h
> > > +++ b/include/hw/virtio/vhost-backend.h
> > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > >  } VhostSetConfigType;
> > >
> > > +typedef enum VhostDeviceStateDirection {
> > > +    /* Transfer state from back-end (device) to front-end */
> > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > +    /* Transfer state from front-end to back-end (device) */
> > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > +} VhostDeviceStateDirection;
> > > +
> > > +typedef enum VhostDeviceStatePhase {
> > > +    /* The device (and all its vrings) is stopped */
> > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > +} VhostDeviceStatePhase;
> >
> > vDPA has:
> >
> >   /* Suspend a device so it does not process virtqueue requests anymore
> >    *
> >    * After the return of ioctl the device must preserve all the necessary state
> >    * (the virtqueue vring base plus the possible device specific states) that is
> >    * required for restoring in the future. The device must not change its
> >    * configuration after that point.
> >    */
> >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >
> >   /* Resume a device so it can resume processing virtqueue requests
> >    *
> >    * After the return of this ioctl the device will have restored all the
> >    * necessary states and it is fully operational to continue processing the
> >    * virtqueue descriptors.
> >    */
> >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >
> > I wonder if it makes sense to import these into vhost-user so that the
> > difference between kernel vhost and vhost-user is minimized. It's okay
> > if one of them is ahead of the other, but it would be nice to avoid
> > overlapping/duplicated functionality.
> >
> 
> That's what I had in mind in the first versions. I proposed VHOST_STOP
> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> to SUSPEND.
> 
> In my opinion, it is generally better to make the interface less
> parametrized and to rely on the messages and their semantics. In other
> words, instead of
> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> the vhost-user equivalent of VHOST_VDPA_SUSPEND as a separate command.
> 
> Another way to apply this is with the "direction" parameter. Maybe it
> is better to split it into "set_state_fd" and "get_state_fd"?
> 
> In that case, reusing the ioctls as vhost-user messages would be ok.
> But that puts this proposal further from the VFIO code, which uses
> "migration_set_state(state)", and maybe it is better when the number
> of states is high.
Hi Eugenio,

Another question about vDPA suspend/resume:
  /* Host notifiers must be enabled at this point. */
  void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
  {
      int i;
  
      /* should only be called after backend is connected */
      assert(hdev->vhost_ops);
      event_notifier_test_and_clear(
          &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
      event_notifier_test_and_clear(&vdev->config_notifier);
  
      trace_vhost_dev_stop(hdev, vdev->name, vrings);
  
      if (hdev->vhost_ops->vhost_dev_start) {
          hdev->vhost_ops->vhost_dev_start(hdev, false);
          ^^^ SUSPEND ^^^
      }
      if (vrings) {
          vhost_dev_set_vring_enable(hdev, false);
      }
      for (i = 0; i < hdev->nvqs; ++i) {
          vhost_virtqueue_stop(hdev,
                               vdev,
                               hdev->vqs + i,
                               hdev->vq_index + i);
          ^^^ fetch virtqueue state from kernel ^^^
      }
      if (hdev->vhost_ops->vhost_reset_status) {
          hdev->vhost_ops->vhost_reset_status(hdev);
          ^^^ reset device ^^^
I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
vhost_reset_status(). The device's migration code runs after
vhost_dev_stop() and the state will have been lost.
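To make the ordering concern concrete, here is a minimal self-contained model (not QEMU code). The names toy_backend, toy_suspend, toy_reset and toy_save are hypothetical stand-ins, and the only assumption is the one stated above: a reset discards the device's internal state.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a stateful back-end; not a QEMU type. */
struct toy_backend {
    bool suspended;
    char state[16];   /* internal device state, discarded on reset */
};

static void toy_suspend(struct toy_backend *b)
{
    b->suspended = true;
}

static void toy_reset(struct toy_backend *b)
{
    b->suspended = false;
    memset(b->state, 0, sizeof(b->state));
}

/* Copies the state out; returns false if there is nothing left to save. */
static bool toy_save(const struct toy_backend *b, char *out, size_t len)
{
    if (b->state[0] == '\0') {
        return false;
    }
    strncpy(out, b->state, len - 1);
    out[len - 1] = '\0';
    return true;
}

/* Order described above: suspend, reset, and only then save. */
static bool stop_then_save(struct toy_backend *b, char *out, size_t len)
{
    toy_suspend(b);
    toy_reset(b);
    return toy_save(b, out, len);
}

/* Alternative order: save while suspended, reset afterwards. */
static bool save_then_stop(struct toy_backend *b, char *out, size_t len)
{
    bool ok;

    toy_suspend(b);
    ok = toy_save(b, out, len);
    toy_reset(b);
    return ok;
}
```

In this model stop_then_save() has nothing left to save, while save_then_stop() still sees the state, which is the difference the question below is about.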
It looks like vDPA changes are necessary in order to support stateful
devices even though QEMU already uses SUSPEND. Is my understanding
correct?
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-13 17:31       ` Hanna Czenczek
  2023-04-17 15:12         ` Stefan Hajnoczi
@ 2023-04-17 18:37         ` Eugenio Perez Martin
  1 sibling, 0 replies; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-17 18:37 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, Apr 13, 2023 at 7:32 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 13.04.23 12:14, Eugenio Perez Martin wrote:
> > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>> So-called "internal" virtio-fs migration refers to transporting the
> >>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>> this, we need to be able to transfer virtiofsd's internal state to and
> >>> from virtiofsd.
> >>>
> >>> Because virtiofsd's internal state will not be too large, we believe it
> >>> is best to transfer it as a single binary blob after the streaming
> >>> phase.  Because this method should be useful to other vhost-user
> >>> implementations, too, it is introduced as a general-purpose addition to
> >>> the protocol, not limited to vhost-user-fs.
> >>>
> >>> These are the additions to the protocol:
> >>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>    This feature signals support for transferring state, and is added so
> >>>    that migration can fail early when the back-end has no support.
> >>>
> >>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>    over which to transfer the state.  The front-end sends an FD to the
> >>>    back-end into/from which it can write/read its state, and the back-end
> >>>    can decide to either use it, or reply with a different FD for the
> >>>    front-end to override the front-end's choice.
> >>>    The front-end creates a simple pipe to transfer the state, but maybe
> >>>    the back-end already has an FD into/from which it has to write/read
> >>>    its state, in which case it will want to override the simple pipe.
> >>>    Conversely, maybe in the future we find a way to have the front-end
> >>>    get an immediate FD for the migration stream (in some cases), in which
> >>>    case we will want to send this to the back-end instead of creating a
> >>>    pipe.
> >>>    Hence the negotiation: If one side has a better idea than a plain
> >>>    pipe, we will want to use that.
> >>>
> >>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>    pipe (the end indicated by EOF), the front-end invokes this function
> >>>    to verify success.  There is no in-band way (through the pipe) to
> >>>    indicate failure, so we need to check explicitly.
> >>>
> >>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>> (which includes establishing the direction of transfer and migration
> >>> phase), the sending side writes its data into the pipe, and the reading
> >>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>> checking for integrity (i.e. errors during deserialization).
> >>>
> >>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>> ---
> >>>   include/hw/virtio/vhost-backend.h |  24 +++++
> >>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>   hw/virtio/vhost.c                 |  37 ++++++++
> >>>   4 files changed, 287 insertions(+)
> >>>
> >>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>> index ec3fbae58d..5935b32fe3 100644
> >>> --- a/include/hw/virtio/vhost-backend.h
> >>> +++ b/include/hw/virtio/vhost-backend.h
> >>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>   } VhostSetConfigType;
> >>>
> >>> +typedef enum VhostDeviceStateDirection {
> >>> +    /* Transfer state from back-end (device) to front-end */
> >>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>> +    /* Transfer state from front-end to back-end (device) */
> >>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>> +} VhostDeviceStateDirection;
> >>> +
> >>> +typedef enum VhostDeviceStatePhase {
> >>> +    /* The device (and all its vrings) is stopped */
> >>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>> +} VhostDeviceStatePhase;
> >> vDPA has:
> >>
> >>    /* Suspend a device so it does not process virtqueue requests anymore
> >>     *
> >>     * After the return of ioctl the device must preserve all the necessary state
> >>     * (the virtqueue vring base plus the possible device specific states) that is
> >>     * required for restoring in the future. The device must not change its
> >>     * configuration after that point.
> >>     */
> >>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>
> >>    /* Resume a device so it can resume processing virtqueue requests
> >>     *
> >>     * After the return of this ioctl the device will have restored all the
> >>     * necessary states and it is fully operational to continue processing the
> >>     * virtqueue descriptors.
> >>     */
> >>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>
> >> I wonder if it makes sense to import these into vhost-user so that the
> >> difference between kernel vhost and vhost-user is minimized. It's okay
> >> if one of them is ahead of the other, but it would be nice to avoid
> >> overlapping/duplicated functionality.
> >>
> > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > to SUSPEND.
> >
> > Generally it is better if we make the interface less parametrized and
> > we trust in the messages and its semantics in my opinion. In other
> > words, instead of
> > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
>
> I.e. you mean that this should simply be stateful instead of
> re-affirming the current state with a parameter?
>
> The problem I see is that transferring states in different phases of
> migration will require specialized implementations.  So running
> SET_DEVICE_STATE_FD in a different phase will require support from the
> back-end.  Same in the front-end, the exact protocol and thus
> implementation will (probably, difficult to say at this point) depend on
> the migration phase.  I would therefore prefer to have an explicit
> distinction in the command itself that affirms the phase we’re
> targeting.
>
I think we will have this same problem when more phases are added, as
the fd and direction arguments are always passed regardless of the phase
you set. Future phases may not require them, or may require different
arguments.
> On the other hand, I don’t see the parameter complicating anything. The
> front-end must supply it, but it will know the phase anyway, so this is
> easy.  The back-end can just choose to ignore it, if it doesn’t feel the
> need to verify that the phase is what it thinks it is.
>
> > Another way to apply this is with the "direction" parameter. Maybe it
> > is better to split it into "set_state_fd" and "get_state_fd"?
>
> Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`.
Right, thanks for the correction.
> We always negotiate a pipe between front-end and back-end, the question
> is just whether the back-end gets the receiving (load) or the sending
> (save) end.
>
> Technically, one can make it fully stateful and say that if the device
> hasn’t been started already, it’s always a LOAD, and otherwise always a
> SAVE.  But as above, I’d prefer to keep the parameter because the
> implementations are different, so I’d prefer there to be a
> re-affirmation that front-end and back-end are in sync about what should
> be done.
>
> Personally, I don’t really see the advantage of having two functions
> instead of one function with an enum with two values.  The thing about
> SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of
> whether we’re loading or saving, it just negotiates the pipe – the
> difference is what happens after the pipe has been negotiated.  So if we
> split the function into two, both implementations will share most of
> their code anyway, which makes me think it should be a single function.
>
Yes, all of that makes sense.
My proposal was along the lines of existing command pairs such as
VHOST_USER_SET_VRING_BASE / VHOST_USER_GET_VRING_BASE or
VHOST_USER_SET_INFLIGHT_FD / VHOST_USER_GET_INFLIGHT_FD. If that has
been considered and using the arguments is more convenient, I'm
totally fine with it.
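For comparison, a rough sketch of how the two shapes relate. This is only an illustration: toy_negotiate_state_pipe and the two wrappers are hypothetical names, the enum is copied from the patch, and the reply-FD override is left out.

```c
#include <assert.h>
#include <unistd.h>

/* Direction enum as defined in the patch. */
typedef enum VhostDeviceStateDirection {
    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
} VhostDeviceStateDirection;

/*
 * Hypothetical helper: create the pipe and decide which end each side
 * keeps. On SAVE the back-end writes and the front-end reads; on LOAD
 * it is the other way around. Returns 0 on success, -1 on error.
 */
static int toy_negotiate_state_pipe(VhostDeviceStateDirection dir,
                                    int *frontend_fd, int *backend_fd)
{
    int fds[2];

    if (pipe(fds) < 0) {
        return -1;
    }
    if (dir == VHOST_TRANSFER_STATE_DIRECTION_SAVE) {
        *frontend_fd = fds[0];  /* read end */
        *backend_fd = fds[1];   /* write end */
    } else {
        *frontend_fd = fds[1];  /* write end */
        *backend_fd = fds[0];   /* read end */
    }
    return 0;
}

/* A two-command split would reduce to thin wrappers over the helper. */
static int toy_set_state_send_fd(int *frontend_fd, int *backend_fd)
{
    return toy_negotiate_state_pipe(VHOST_TRANSFER_STATE_DIRECTION_SAVE,
                                    frontend_fd, backend_fd);
}

static int toy_set_state_receive_fd(int *frontend_fd, int *backend_fd)
{
    return toy_negotiate_state_pipe(VHOST_TRANSFER_STATE_DIRECTION_LOAD,
                                    frontend_fd, backend_fd);
}
```

Either way the negotiation code is shared, so the choice is mostly about how the protocol reads.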
> > In that case, reusing the ioctls as vhost-user messages would be ok.
> > But that puts this proposal further from the VFIO code, which uses
> > "migration_set_state(state)", and maybe it is better when the number
> > of states is high.
>
> I’m not sure what you mean (because I don’t know the VFIO code, I
> assume).  Are you saying that using a more finely grained
> migration_set_state() model would conflict with the rather coarse
> suspend/resume?
>
I don't think they exactly conflict, as we should be able to map both
to a given set_state. They may overlap if vhost-user decides to use
the suspend/resume messages, or if vdpa decides to use
SET_DEVICE_STATE_FD.
This already happens with the different vhost backends: each one has a
different way to suspend the device in the case of a migration anyway.
Thanks!
> > BTW, is there any usage for *reply_fd at this moment from the backend?
>
> No, virtiofsd doesn’t plan to make use of it.
>
> Hanna
>
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 15:18             ` Stefan Hajnoczi
@ 2023-04-17 18:55               ` Eugenio Perez Martin
  2023-04-17 19:08                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-17 18:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Hanna Czenczek, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > >
> > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > >>>> So-called "internal" virtio-fs migration refers to transporting the
> > > >>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > >>>> this, we need to be able to transfer virtiofsd's internal state to and
> > > >>>> from virtiofsd.
> > > >>>>
> > > >>>> Because virtiofsd's internal state will not be too large, we believe it
> > > >>>> is best to transfer it as a single binary blob after the streaming
> > > >>>> phase.  Because this method should be useful to other vhost-user
> > > >>>> implementations, too, it is introduced as a general-purpose addition to
> > > >>>> the protocol, not limited to vhost-user-fs.
> > > >>>>
> > > >>>> These are the additions to the protocol:
> > > >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > >>>>     This feature signals support for transferring state, and is added so
> > > >>>>     that migration can fail early when the back-end has no support.
> > > >>>>
> > > >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > >>>>     over which to transfer the state.  The front-end sends an FD to the
> > > >>>>     back-end into/from which it can write/read its state, and the back-end
> > > >>>>     can decide to either use it, or reply with a different FD for the
> > > >>>>     front-end to override the front-end's choice.
> > > >>>>     The front-end creates a simple pipe to transfer the state, but maybe
> > > >>>>     the back-end already has an FD into/from which it has to write/read
> > > >>>>     its state, in which case it will want to override the simple pipe.
> > > >>>>     Conversely, maybe in the future we find a way to have the front-end
> > > >>>>     get an immediate FD for the migration stream (in some cases), in which
> > > >>>>     case we will want to send this to the back-end instead of creating a
> > > >>>>     pipe.
> > > >>>>     Hence the negotiation: If one side has a better idea than a plain
> > > >>>>     pipe, we will want to use that.
> > > >>>>
> > > >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > >>>>     pipe (the end indicated by EOF), the front-end invokes this function
> > > >>>>     to verify success.  There is no in-band way (through the pipe) to
> > > >>>>     indicate failure, so we need to check explicitly.
> > > >>>>
> > > >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > >>>> (which includes establishing the direction of transfer and migration
> > > >>>> phase), the sending side writes its data into the pipe, and the reading
> > > >>>> side reads it until it sees an EOF.  Then, the front-end will check for
> > > >>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> > > >>>> checking for integrity (i.e. errors during deserialization).
> > > >>>>
> > > >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > >>>> ---
> > > >>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> > > >>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > >>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > >>>>    hw/virtio/vhost.c                 |  37 ++++++++
> > > >>>>    4 files changed, 287 insertions(+)
> > > >>>>
> > > >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > >>>> index ec3fbae58d..5935b32fe3 100644
> > > >>>> --- a/include/hw/virtio/vhost-backend.h
> > > >>>> +++ b/include/hw/virtio/vhost-backend.h
> > > >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > >>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > >>>>    } VhostSetConfigType;
> > > >>>>
> > > >>>> +typedef enum VhostDeviceStateDirection {
> > > >>>> +    /* Transfer state from back-end (device) to front-end */
> > > >>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > >>>> +    /* Transfer state from front-end to back-end (device) */
> > > >>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > >>>> +} VhostDeviceStateDirection;
> > > >>>> +
> > > >>>> +typedef enum VhostDeviceStatePhase {
> > > >>>> +    /* The device (and all its vrings) is stopped */
> > > >>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > >>>> +} VhostDeviceStatePhase;
> > > >>> vDPA has:
> > > >>>
> > > >>>     /* Suspend a device so it does not process virtqueue requests anymore
> > > >>>      *
> > > >>>      * After the return of ioctl the device must preserve all the necessary state
> > > >>>      * (the virtqueue vring base plus the possible device specific states) that is
> > > >>>      * required for restoring in the future. The device must not change its
> > > >>>      * configuration after that point.
> > > >>>      */
> > > >>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > >>>
> > > >>>     /* Resume a device so it can resume processing virtqueue requests
> > > >>>      *
> > > >>>      * After the return of this ioctl the device will have restored all the
> > > >>>      * necessary states and it is fully operational to continue processing the
> > > >>>      * virtqueue descriptors.
> > > >>>      */
> > > >>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > >>>
> > > >>> I wonder if it makes sense to import these into vhost-user so that the
> > > >>> difference between kernel vhost and vhost-user is minimized. It's okay
> > > >>> if one of them is ahead of the other, but it would be nice to avoid
> > > >>> overlapping/duplicated functionality.
> > > >>>
> > > >>> (And I hope vDPA will import the device state vhost-user messages
> > > >>> introduced in this series.)
> > > >> I don’t understand your suggestion.  (Like, I very simply don’t
> > > >> understand :))
> > > >>
> > > >> These are vhost messages, right?  What purpose do you have in mind for
> > > >> them in vhost-user for internal migration?  They’re different from the
> > > >> state transfer messages, because they don’t transfer state to/from the
> > > >> front-end.  Also, the state transfer stuff is supposed to be distinct
> > > >> from starting/stopping the device; right now, it just requires the
> > > >> device to be stopped beforehand (or started only afterwards).  And in
> > > >> the future, new VhostDeviceStatePhase values may allow the messages to
> > > >> be used on devices that aren’t stopped.
> > > >>
> > > >> So they seem to serve very different purposes.  I can imagine using the
> > > >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> > > >> working on), but they don’t really help with internal migration
> > > >> implemented here.  If I were to add them, they’d just be sent in
> > > >> addition to the new messages added in this patch here, i.e. SUSPEND on
> > > >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> > > >> after CHECK_DEVICE_STATE (we could use RESUME in place of
> > > >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> > > >> source, so we still need CHECK_DEVICE_STATE).
> > > > Yes, they are complementary to the device state fd message. I want to
> > > > make sure pre-conditions about the device's state (running vs stopped)
> > > > already take into account the vDPA SUSPEND/RESUME model.
> > > >
> > > > vDPA will need device state save/load in the future. For virtiofs
> > > > devices, for example. This is why I think we should plan for vDPA and
> > > > vhost-user to share the same interface.
> > >
> > > While the paragraph below is more important, I don’t feel like this
> > > would be important right now.  It’s clear that SUSPEND must come before
> > > transferring any state, and that RESUME must come after transferring
> > > state.  I don’t think we need to clarify this now, it’d be obvious when
> > > implementing SUSPEND/RESUME.
> > >
> > > > Also, I think the code path you're relying on (vhost_dev_stop()) on
> > > > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> > > > because stopping the backend resets the device and throws away its
> > > > state. SUSPEND/RESUME solve this. This looks like a more general
> > > > problem since vhost_dev_stop() is called any time the VM is paused.
> > > > Maybe it needs to use SUSPEND/RESUME whenever possible.
> > >
> > > That’s a problem.  Quite a problem, to be honest, because this sounds
> > > rather complicated with honestly absolutely no practical benefit right
> > > now.
> > >
> > > Would you require SUSPEND/RESUME for state transfer even if the back-end
> > > does not implement GET/SET_STATUS?  Because then this would also lead to
> > > more complexity in virtiofsd.
> > >
> >
> > At this moment the vhost-user net in DPDK suspends at
> > VHOST_GET_VRING_BASE. Not the same case though, as here only the vq
> > indexes / wrap bits are transferred here.
> >
> > Vhost-vdpa implements the suspend call so it does not need to trust
> > VHOST_GET_VRING_BASE to be the last vring call done. Since virtiofsd
> > is using vhost-user maybe it is not needed to implement it actually.
>
> Careful, if we deliberately make vhost-user and vDPA diverge, then it
> will be hard to share the migration interface.
>
I don't recall the exact reasons for not following the
GET_VRING_BASE == suspend approach for vDPA; IIRC it was the lack of a
proper definition back then. But vhost-kernel and vhost-user have
already diverged in that regard: for example, vhost-kernel sets a tap
backend of -1 to suspend the device.
> > > Basically, what I’m hearing is that I need to implement a different
> > > feature that has no practical impact right now, and also fix bugs around
> > > it along the way...
> > >
> >
> > To fix this properly requires iterative device migration in qemu as
> > far as I know, instead of using VMStates [1]. This way the state is
> > requested to virtiofsd before the device reset.
>
> I don't follow. Many devices are fine with non-iterative migration. They
> shouldn't be forced to do iterative migration.
>
Sorry, I think I didn't express myself well. I didn't mean to force
virtiofsd to support iterative migration, but to use the device
iterative migration API in QEMU to send the needed commands before
vhost_dev_stop. In that regard, neither the device nor the vhost-user
commands would require changes.
I think it is convenient in the long run for virtiofsd: if the state
grows so much that it's not feasible to fetch it in one shot, there is
no need to make changes in the qemu migration protocol. I think that is
not unlikely in virtiofs, but maybe I'm missing something obvious and
its state will never grow.
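As a rough model of what I mean, and not QEMU's actual handler signatures: toy_pending stands in for save_live_pending and toy_complete_precopy for save_live_complete_precopy. Whether the real hooks run before vhost_dev_stop is an assumption of this sketch, not something it shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a back-end with one-shot state; not QEMU's API. */
struct toy_dev {
    const char *state;  /* complete internal state as one blob */
    bool reset;         /* set once the device has been reset */
};

/*
 * Analogue of save_live_pending: report that nothing remains for the
 * iterative phase, so no state is sent while the guest is running.
 */
static size_t toy_pending(const struct toy_dev *d)
{
    (void)d;
    return 0;
}

/*
 * Analogue of save_live_complete_precopy: emit the whole blob in one
 * shot. Returns the number of bytes written, or 0 if the state is
 * already gone or does not fit into the stream buffer.
 */
static size_t toy_complete_precopy(const struct toy_dev *d,
                                   char *stream, size_t cap)
{
    size_t len = strlen(d->state);

    if (d->reset || len > cap) {
        return 0;
    }
    memcpy(stream, d->state, len);
    return len;
}
```

If the state later grows too large for one shot, only toy_pending and an iteration step would need to change in such a model, not the completion step.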
> > What does virtiofsd do when the state is totally sent? Does it keep
> > processing requests and generating new state or is only a one shot
> > that will suspend the daemon? If it is the second I think it still can
> > be done in one shot at the end, always indicating "no more state" at
> > save_live_pending and sending all the state at
> > save_live_complete_precopy.
> >
> > Does that make sense to you?
> >
> > Thanks!
> >
> > [1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration
> >
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 17:14       ` Stefan Hajnoczi
@ 2023-04-17 19:06         ` Eugenio Perez Martin
  2023-04-17 19:20           ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-17 19:06 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Hanna Czenczek, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > from virtiofsd.
> > > >
> > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > is best to transfer it as a single binary blob after the streaming
> > > > phase.  Because this method should be useful to other vhost-user
> > > > implementations, too, it is introduced as a general-purpose addition to
> > > > the protocol, not limited to vhost-user-fs.
> > > >
> > > > These are the additions to the protocol:
> > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > >   This feature signals support for transferring state, and is added so
> > > >   that migration can fail early when the back-end has no support.
> > > >
> > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > >   over which to transfer the state.  The front-end sends an FD to the
> > > >   back-end into/from which it can write/read its state, and the back-end
> > > >   can decide to either use it, or reply with a different FD for the
> > > >   front-end to override the front-end's choice.
> > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > >   the back-end already has an FD into/from which it has to write/read
> > > >   its state, in which case it will want to override the simple pipe.
> > > >   Conversely, maybe in the future we find a way to have the front-end
> > > >   get an immediate FD for the migration stream (in some cases), in which
> > > >   case we will want to send this to the back-end instead of creating a
> > > >   pipe.
> > > >   Hence the negotiation: If one side has a better idea than a plain
> > > >   pipe, we will want to use that.
> > > >
> > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > >   to verify success.  There is no in-band way (through the pipe) to
> > > >   indicate failure, so we need to check explicitly.
> > > >
> > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > (which includes establishing the direction of transfer and migration
> > > > phase), the sending side writes its data into the pipe, and the reading
> > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > checking for integrity (i.e. errors during deserialization).
> > > >
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > >  4 files changed, 287 insertions(+)
> > > >
> > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > index ec3fbae58d..5935b32fe3 100644
> > > > --- a/include/hw/virtio/vhost-backend.h
> > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > >  } VhostSetConfigType;
> > > >
> > > > +typedef enum VhostDeviceStateDirection {
> > > > +    /* Transfer state from back-end (device) to front-end */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > +    /* Transfer state from front-end to back-end (device) */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > +} VhostDeviceStateDirection;
> > > > +
> > > > +typedef enum VhostDeviceStatePhase {
> > > > +    /* The device (and all its vrings) is stopped */
> > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > +} VhostDeviceStatePhase;
> > >
> > > vDPA has:
> > >
> > >   /* Suspend a device so it does not process virtqueue requests anymore
> > >    *
> > >    * After the return of ioctl the device must preserve all the necessary state
> > >    * (the virtqueue vring base plus the possible device specific states) that is
> > >    * required for restoring in the future. The device must not change its
> > >    * configuration after that point.
> > >    */
> > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > >
> > >   /* Resume a device so it can resume processing virtqueue requests
> > >    *
> > >    * After the return of this ioctl the device will have restored all the
> > >    * necessary states and it is fully operational to continue processing the
> > >    * virtqueue descriptors.
> > >    */
> > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > >
> > > I wonder if it makes sense to import these into vhost-user so that the
> > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > if one of them is ahead of the other, but it would be nice to avoid
> > > overlapping/duplicated functionality.
> > >
> >
> > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > to SUSPEND.
> >
> > Generally it is better if we make the interface less parametrized and
> > we trust in the messages and its semantics in my opinion. In other
> > words, instead of
> > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> >
> > Another way to apply this is with the "direction" parameter. Maybe it
> > is better to split it into "set_state_fd" and "get_state_fd"?
> >
> > In that case, reusing the ioctls as vhost-user messages would be ok.
> > But that puts this proposal further from the VFIO code, which uses
> > "migration_set_state(state)", and maybe it is better when the number
> > of states is high.
>
> Hi Eugenio,
> Another question about vDPA suspend/resume:
>
>   /* Host notifiers must be enabled at this point. */
>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>   {
>       int i;
>
>       /* should only be called after backend is connected */
>       assert(hdev->vhost_ops);
>       event_notifier_test_and_clear(
>           &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
>       event_notifier_test_and_clear(&vdev->config_notifier);
>
>       trace_vhost_dev_stop(hdev, vdev->name, vrings);
>
>       if (hdev->vhost_ops->vhost_dev_start) {
>           hdev->vhost_ops->vhost_dev_start(hdev, false);
>           ^^^ SUSPEND ^^^
>       }
>       if (vrings) {
>           vhost_dev_set_vring_enable(hdev, false);
>       }
>       for (i = 0; i < hdev->nvqs; ++i) {
>           vhost_virtqueue_stop(hdev,
>                                vdev,
>                                hdev->vqs + i,
>                                hdev->vq_index + i);
>         ^^^ fetch virtqueue state from kernel ^^^
>       }
>       if (hdev->vhost_ops->vhost_reset_status) {
>           hdev->vhost_ops->vhost_reset_status(hdev);
>           ^^^ reset device^^^
>
> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> vhost_reset_status(). The device's migration code runs after
> vhost_dev_stop() and the state will have been lost.
>
vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
qemu VirtIONet device model. This is done for all vhost backends.
Regarding state like the mac or mq configuration, SVQ runs in the CVQ
for the whole VM run, so it can track all of that state in the device
model too.
When a migration effectively occurs, all the frontend state is
migrated as a regular emulated device. Routing all of the state in a
normalized way for qemu is what leaves open the possibility of
cross-backend migrations, etc.
Does that answer your question?
> It looks like vDPA changes are necessary in order to support stateful
> devices even though QEMU already uses SUSPEND. Is my understanding
> correct?
>
Changes are required elsewhere, as the code to restore the state
properly in the destination has not been merged.
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 18:55               ` Eugenio Perez Martin
@ 2023-04-17 19:08                 ` Stefan Hajnoczi
  2023-04-17 19:11                   ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-17 19:08 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > >
> > > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > Basically, what I’m hearing is that I need to implement a different
> > > > feature that has no practical impact right now, and also fix bugs around
> > > > it along the way...
> > > >
> > >
> > > To fix this properly requires iterative device migration in qemu as
> > > far as I know, instead of using VMStates [1]. This way the state is
> > > requested to virtiofsd before the device reset.
> >
> > I don't follow. Many devices are fine with non-iterative migration. They
> > shouldn't be forced to do iterative migration.
> >
>
> Sorry I think I didn't express myself well. I didn't mean to force
> virtiofsd to support the iterative migration, but to use the device
> iterative migration API in QEMU to send the needed commands before
> vhost_dev_stop. In that regard, the device or the vhost-user commands
> would not require changes.
>
> I think it is convenient in the long run for virtiofsd, as if the
> state grows so much that it's not feasible to fetch it in one shot
> there is no need to make changes in the qemu migration protocol. I
> think it is not unlikely in virtiofs, but maybe I'm missing something
> obvious and its state will never grow.
I don't understand. vCPUs are still running at that point and the
device state could change. It's not safe to save the full device state
until vCPUs have stopped (after vhost_dev_stop).
If you're suggesting somehow doing non-iterative migration, but during
the iterative phase, then I don't think that's possible?
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 15:38       ` Stefan Hajnoczi
@ 2023-04-17 19:09         ` Eugenio Perez Martin
  2023-04-17 19:33           ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-17 19:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Hanna Czenczek, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > from virtiofsd.
> > > >
> > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > is best to transfer it as a single binary blob after the streaming
> > > > phase.  Because this method should be useful to other vhost-user
> > > > implementations, too, it is introduced as a general-purpose addition to
> > > > the protocol, not limited to vhost-user-fs.
> > > >
> > > > These are the additions to the protocol:
> > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > >   This feature signals support for transferring state, and is added so
> > > >   that migration can fail early when the back-end has no support.
> > > >
> > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > >   over which to transfer the state.  The front-end sends an FD to the
> > > >   back-end into/from which it can write/read its state, and the back-end
> > > >   can decide to either use it, or reply with a different FD for the
> > > >   front-end to override the front-end's choice.
> > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > >   the back-end already has an FD into/from which it has to write/read
> > > >   its state, in which case it will want to override the simple pipe.
> > > >   Conversely, maybe in the future we find a way to have the front-end
> > > >   get an immediate FD for the migration stream (in some cases), in which
> > > >   case we will want to send this to the back-end instead of creating a
> > > >   pipe.
> > > >   Hence the negotiation: If one side has a better idea than a plain
> > > >   pipe, we will want to use that.
> > > >
> > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > >   to verify success.  There is no in-band way (through the pipe) to
> > > >   indicate failure, so we need to check explicitly.
> > > >
> > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > (which includes establishing the direction of transfer and migration
> > > > phase), the sending side writes its data into the pipe, and the reading
> > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > checking for integrity (i.e. errors during deserialization).
> > > >
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > >  4 files changed, 287 insertions(+)
> > > >
> > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > index ec3fbae58d..5935b32fe3 100644
> > > > --- a/include/hw/virtio/vhost-backend.h
> > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > >  } VhostSetConfigType;
> > > >
> > > > +typedef enum VhostDeviceStateDirection {
> > > > +    /* Transfer state from back-end (device) to front-end */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > +    /* Transfer state from front-end to back-end (device) */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > +} VhostDeviceStateDirection;
> > > > +
> > > > +typedef enum VhostDeviceStatePhase {
> > > > +    /* The device (and all its vrings) is stopped */
> > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > +} VhostDeviceStatePhase;
> > >
> > > vDPA has:
> > >
> > >   /* Suspend a device so it does not process virtqueue requests anymore
> > >    *
> > >    * After the return of ioctl the device must preserve all the necessary state
> > >    * (the virtqueue vring base plus the possible device specific states) that is
> > >    * required for restoring in the future. The device must not change its
> > >    * configuration after that point.
> > >    */
> > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > >
> > >   /* Resume a device so it can resume processing virtqueue requests
> > >    *
> > >    * After the return of this ioctl the device will have restored all the
> > >    * necessary states and it is fully operational to continue processing the
> > >    * virtqueue descriptors.
> > >    */
> > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > >
> > > I wonder if it makes sense to import these into vhost-user so that the
> > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > if one of them is ahead of the other, but it would be nice to avoid
> > > overlapping/duplicated functionality.
> > >
> >
> > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > to SUSPEND.
>
> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> ioctl(VHOST_VDPA_RESUME).
>
> The doc comments in <linux/vdpa.h> don't explain how the device can
> leave the suspended state. Can you clarify this?
>
Do you mean in what situations or regarding the semantics of _RESUME?
To me, resume is mainly an operation for resuming the device after a
VM suspension, not a migration. It could also serve as a fallback path
in some cases of migration failure, but it is not currently used in
qemu.
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 19:08                 ` Stefan Hajnoczi
@ 2023-04-17 19:11                   ` Eugenio Perez Martin
  2023-04-17 19:46                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-17 19:11 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > >
> > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > Basically, what I’m hearing is that I need to implement a different
> > > > > feature that has no practical impact right now, and also fix bugs around
> > > > > it along the way...
> > > > >
> > > >
> > > > To fix this properly requires iterative device migration in qemu as
> > > > far as I know, instead of using VMStates [1]. This way the state is
> > > > requested to virtiofsd before the device reset.
> > >
> > > I don't follow. Many devices are fine with non-iterative migration. They
> > > shouldn't be forced to do iterative migration.
> > >
> >
> > Sorry I think I didn't express myself well. I didn't mean to force
> > virtiofsd to support the iterative migration, but to use the device
> > iterative migration API in QEMU to send the needed commands before
> > vhost_dev_stop. In that regard, the device or the vhost-user commands
> > would not require changes.
> >
> > I think it is convenient in the long run for virtiofsd, as if the
> > state grows so much that it's not feasible to fetch it in one shot
> > there is no need to make changes in the qemu migration protocol. I
> > think it is not unlikely in virtiofs, but maybe I'm missing something
> > obvious and its state will never grow.
>
> I don't understand. vCPUs are still running at that point and the
> device state could change. It's not safe to save the full device state
> until vCPUs have stopped (after vhost_dev_stop).
>
I think the vCPUs are already stopped by the time the
save_live_complete_precopy callback runs. Maybe my understanding is wrong?
Thanks!
> If you're suggesting somehow doing non-iterative migration, but during
> the iterative phase, then I don't think that's possible?
>
> Stefan
>
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 19:06         ` Eugenio Perez Martin
@ 2023-04-17 19:20           ` Stefan Hajnoczi
  2023-04-18  7:54             ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-17 19:20 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > from virtiofsd.
> > > > >
> > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > is best to transfer it as a single binary blob after the streaming
> > > > > phase.  Because this method should be useful to other vhost-user
> > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > the protocol, not limited to vhost-user-fs.
> > > > >
> > > > > These are the additions to the protocol:
> > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > >   This feature signals support for transferring state, and is added so
> > > > >   that migration can fail early when the back-end has no support.
> > > > >
> > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > >   can decide to either use it, or reply with a different FD for the
> > > > >   front-end to override the front-end's choice.
> > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > >   the back-end already has an FD into/from which it has to write/read
> > > > >   its state, in which case it will want to override the simple pipe.
> > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > >   case we will want to send this to the back-end instead of creating a
> > > > >   pipe.
> > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > >   pipe, we will want to use that.
> > > > >
> > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > >   indicate failure, so we need to check explicitly.
> > > > >
> > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > (which includes establishing the direction of transfer and migration
> > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > checking for integrity (i.e. errors during deserialization).
> > > > >
> > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > ---
> > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > >  4 files changed, 287 insertions(+)
> > > > >
> > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > >  } VhostSetConfigType;
> > > > >
> > > > > +typedef enum VhostDeviceStateDirection {
> > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > +} VhostDeviceStateDirection;
> > > > > +
> > > > > +typedef enum VhostDeviceStatePhase {
> > > > > +    /* The device (and all its vrings) is stopped */
> > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > +} VhostDeviceStatePhase;
> > > >
> > > > vDPA has:
> > > >
> > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > >    *
> > > >    * After the return of ioctl the device must preserve all the necessary state
> > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > >    * required for restoring in the future. The device must not change its
> > > >    * configuration after that point.
> > > >    */
> > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > >
> > > >   /* Resume a device so it can resume processing virtqueue requests
> > > >    *
> > > >    * After the return of this ioctl the device will have restored all the
> > > >    * necessary states and it is fully operational to continue processing the
> > > >    * virtqueue descriptors.
> > > >    */
> > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > >
> > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > overlapping/duplicated functionality.
> > > >
> > >
> > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > to SUSPEND.
> > >
> > > Generally it is better if we make the interface less parametrized and
> > > we trust in the messages and its semantics in my opinion. In other
> > > words, instead of
> > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> > >
> > > Another way to apply this is with the "direction" parameter. Maybe it
> > > is better to split it into "set_state_fd" and "get_state_fd"?
> > >
> > > In that case, reusing the ioctls as vhost-user messages would be ok.
> > > But that puts this proposal further from the VFIO code, which uses
> > > "migration_set_state(state)", and maybe it is better when the number
> > > of states is high.
> >
> > Hi Eugenio,
> > Another question about vDPA suspend/resume:
> >
> >   /* Host notifiers must be enabled at this point. */
> >   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> >   {
> >       int i;
> >
> >       /* should only be called after backend is connected */
> >       assert(hdev->vhost_ops);
> >       event_notifier_test_and_clear(
> >           &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> >       event_notifier_test_and_clear(&vdev->config_notifier);
> >
> >       trace_vhost_dev_stop(hdev, vdev->name, vrings);
> >
> >       if (hdev->vhost_ops->vhost_dev_start) {
> >           hdev->vhost_ops->vhost_dev_start(hdev, false);
> >           ^^^ SUSPEND ^^^
> >       }
> >       if (vrings) {
> >           vhost_dev_set_vring_enable(hdev, false);
> >       }
> >       for (i = 0; i < hdev->nvqs; ++i) {
> >           vhost_virtqueue_stop(hdev,
> >                                vdev,
> >                                hdev->vqs + i,
> >                                hdev->vq_index + i);
> >         ^^^ fetch virtqueue state from kernel ^^^
> >       }
> >       if (hdev->vhost_ops->vhost_reset_status) {
> >           hdev->vhost_ops->vhost_reset_status(hdev);
> >           ^^^ reset device^^^
> >
> > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> > vhost_reset_status(). The device's migration code runs after
> > vhost_dev_stop() and the state will have been lost.
> >
>
> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> qemu VirtIONet device model. This is for all vhost backends.
>
> Regarding the state like mac or mq configuration, SVQ runs for all the
> VM run in the CVQ. So it can track all of that status in the device
> model too.
>
> When a migration effectively occurs, all the frontend state is
> migrated as a regular emulated device. To route all of the state in a
> normalized way for qemu is what leaves open the possibility to do
> cross-backends migrations, etc.
>
> Does that answer your question?
I think you're confirming that changes would be necessary in order for
vDPA to support the save/load operation that Hanna is introducing.
> > It looks like vDPA changes are necessary in order to support stateful
> > devices even though QEMU already uses SUSPEND. Is my understanding
> > correct?
> >
>
> Changes are required elsewhere, as the code to restore the state
> properly in the destination has not been merged.
I'm not sure what you mean by elsewhere?
I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
then VHOST_VDPA_SET_STATUS 0.
In order to save device state from the vDPA device in the future, it
will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
the device state can be saved before the device is reset.
Does that sound right?
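To make the ordering concern concrete, here is a minimal C sketch, assuming a toy in-memory device. All names (toy_vdpa_dev, toy_vdpa_reset(), toy_vdpa_save(), ...) are hypothetical and only model the sequence of calls; this is not QEMU or kernel code. toy_vdpa_reset() stands in for VHOST_VDPA_SET_STATUS 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_VDPA_STATE_LEN 16

/* Hypothetical toy device, not the real vDPA structures. */
struct toy_vdpa_dev {
    bool suspended;
    char state[TOY_VDPA_STATE_LEN]; /* stands in for device-internal state */
};

/* Stand-in for SUSPEND: the device stops processing requests. */
static void toy_vdpa_suspend(struct toy_vdpa_dev *d)
{
    d->suspended = true;
}

/* Stand-in for VHOST_VDPA_SET_STATUS 0: all device state is dropped. */
static void toy_vdpa_reset(struct toy_vdpa_dev *d)
{
    memset(d, 0, sizeof(*d));
}

/* Stand-in for a future "save device state" operation. */
static void toy_vdpa_save(const struct toy_vdpa_dev *d,
                          char out[TOY_VDPA_STATE_LEN])
{
    assert(d && out);
    memcpy(out, d->state, TOY_VDPA_STATE_LEN);
}

/* Today's ordering: SUSPEND, reset, and only then the migration code. */
static void toy_vdpa_stop_then_save(struct toy_vdpa_dev *d,
                                    char out[TOY_VDPA_STATE_LEN])
{
    toy_vdpa_suspend(d);
    toy_vdpa_reset(d);
    toy_vdpa_save(d, out);   /* too late: the state is already gone */
}

/* Deferred reset: SUSPEND, save, and only then reset. */
static void toy_vdpa_save_then_reset(struct toy_vdpa_dev *d,
                                     char out[TOY_VDPA_STATE_LEN])
{
    toy_vdpa_suspend(d);
    toy_vdpa_save(d, out);   /* state is still intact here */
    toy_vdpa_reset(d);
}
```

In this model the first ordering saves an all-zero blob, while the second one saves the state and still leaves the device reset afterwards.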
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 19:09         ` Eugenio Perez Martin
@ 2023-04-17 19:33           ` Stefan Hajnoczi
  2023-04-18  8:09             ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-17 19:33 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > from virtiofsd.
> > > > >
> > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > is best to transfer it as a single binary blob after the streaming
> > > > > phase.  Because this method should be useful to other vhost-user
> > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > the protocol, not limited to vhost-user-fs.
> > > > >
> > > > > These are the additions to the protocol:
> > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > >   This feature signals support for transferring state, and is added so
> > > > >   that migration can fail early when the back-end has no support.
> > > > >
> > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > >   can decide to either use it, or reply with a different FD for the
> > > > >   front-end to override the front-end's choice.
> > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > >   the back-end already has an FD into/from which it has to write/read
> > > > >   its state, in which case it will want to override the simple pipe.
> > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > >   case we will want to send this to the back-end instead of creating a
> > > > >   pipe.
> > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > >   pipe, we will want to use that.
> > > > >
> > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > >   indicate failure, so we need to check explicitly.
> > > > >
> > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > (which includes establishing the direction of transfer and migration
> > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > checking for integrity (i.e. errors during deserialization).
> > > > >
> > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > ---
> > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > >  4 files changed, 287 insertions(+)
> > > > >
> > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > >  } VhostSetConfigType;
> > > > >
> > > > > +typedef enum VhostDeviceStateDirection {
> > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > +} VhostDeviceStateDirection;
> > > > > +
> > > > > +typedef enum VhostDeviceStatePhase {
> > > > > +    /* The device (and all its vrings) is stopped */
> > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > +} VhostDeviceStatePhase;
> > > >
> > > > vDPA has:
> > > >
> > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > >    *
> > > >    * After the return of ioctl the device must preserve all the necessary state
> > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > >    * required for restoring in the future. The device must not change its
> > > >    * configuration after that point.
> > > >    */
> > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > >
> > > >   /* Resume a device so it can resume processing virtqueue requests
> > > >    *
> > > >    * After the return of this ioctl the device will have restored all the
> > > >    * necessary states and it is fully operational to continue processing the
> > > >    * virtqueue descriptors.
> > > >    */
> > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > >
> > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > overlapping/duplicated functionality.
> > > >
> > >
> > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > to SUSPEND.
> >
> > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > ioctl(VHOST_VDPA_RESUME).
> >
> > The doc comments in <linux/vdpa.h> don't explain how the device can
> > leave the suspended state. Can you clarify this?
> >
>
> Do you mean in what situations or regarding the semantics of _RESUME?
>
> To me resume is an operation mainly to resume the device in the event
> of a VM suspension, not a migration. It can be used as a fallback code
> in some cases of migration failure though, but it is not currently
> used in qemu.
Is a "VM suspension" the QEMU HMP 'stop' command?
I guess the reason why QEMU doesn't call RESUME anywhere is that it
resets the device in vhost_dev_stop()?
Does it make sense to combine SUSPEND and RESUME with Hanna's
SET_DEVICE_STATE_FD? For example, non-iterative migration works like
this:
- Saving the device's state is done by SUSPEND followed by
  SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
  savevm command or migration failed), then RESUME is called to
  continue.
- Loading the device's state is done by SUSPEND followed by
  SET_DEVICE_STATE_FD, followed by RESUME.
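The two flows above can be sketched in C as follows, assuming a toy in-memory back-end. Every name (toy_backend, toy_suspend(), toy_transfer_state(), toy_save(), toy_load()) is hypothetical and only models the ordering of the calls; none of this is QEMU or vhost-user code, and the real SET_DEVICE_STATE_FD would hand over an FD rather than copy a buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_STATE_LEN 16

/* Hypothetical toy model, not QEMU code: direction of a state transfer. */
enum toy_dir { TOY_DIR_SAVE, TOY_DIR_LOAD };

struct toy_backend {
    bool suspended;             /* set by SUSPEND, cleared by RESUME */
    char state[TOY_STATE_LEN];  /* stands in for the opaque state blob */
};

/* Stand-in for SUSPEND: the device stops processing requests. */
static void toy_suspend(struct toy_backend *b)
{
    b->suspended = true;
}

/* Stand-in for RESUME: the device continues processing requests. */
static void toy_resume(struct toy_backend *b)
{
    b->suspended = false;
}

/*
 * Stand-in for SET_DEVICE_STATE_FD plus the transfer itself.  The blob is
 * written for TOY_DIR_SAVE and read for TOY_DIR_LOAD.  Returns -1 if the
 * device is still running, because its state could change under us.
 */
static int toy_transfer_state(struct toy_backend *b, enum toy_dir dir,
                              char blob[TOY_STATE_LEN])
{
    assert(b && blob);
    if (!b->suspended) {
        return -1;
    }
    if (dir == TOY_DIR_SAVE) {
        memcpy(blob, b->state, TOY_STATE_LEN);
    } else {
        memcpy(b->state, blob, TOY_STATE_LEN);
    }
    return 0;
}

/* Save: SUSPEND, transfer, and RESUME only if the guest keeps running. */
static int toy_save(struct toy_backend *b, char blob[TOY_STATE_LEN],
                    bool guest_continues)
{
    int ret;

    toy_suspend(b);
    ret = toy_transfer_state(b, TOY_DIR_SAVE, blob);
    if (guest_continues) {
        toy_resume(b);
    }
    return ret;
}

/* Load: SUSPEND, transfer, then RESUME once the state is in place. */
static int toy_load(struct toy_backend *b, char blob[TOY_STATE_LEN])
{
    int ret;

    toy_suspend(b);
    ret = toy_transfer_state(b, TOY_DIR_LOAD, blob);
    if (ret == 0) {
        toy_resume(b);
    }
    return ret;
}
```

On the source side, toy_save(&src, blob, false) leaves the device suspended (the migration case), while passing true models savevm or a failed migration. On the destination side, toy_load(&dst, blob) ends with the device running again.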
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 19:11                   ` Eugenio Perez Martin
@ 2023-04-17 19:46                     ` Stefan Hajnoczi
  2023-04-18 10:09                       ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-17 19:46 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, 17 Apr 2023 at 15:12, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > >
> > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > Basically, what I’m hearing is that I need to implement a different
> > > > > > feature that has no practical impact right now, and also fix bugs around
> > > > > > it along the way...
> > > > > >
> > > > >
> > > > > To fix this properly requires iterative device migration in qemu as
> > > > > far as I know, instead of using VMStates [1]. This way the state is
> > > > > requested to virtiofsd before the device reset.
> > > >
> > > > I don't follow. Many devices are fine with non-iterative migration. They
> > > > shouldn't be forced to do iterative migration.
> > > >
> > >
> > > Sorry I think I didn't express myself well. I didn't mean to force
> > > virtiofsd to support the iterative migration, but to use the device
> > > iterative migration API in QEMU to send the needed commands before
> > > vhost_dev_stop. In that regard, the device or the vhost-user commands
> > > would not require changes.
> > >
> > > I think it is convenient in the long run for virtiofsd, as if the
> > > state grows so much that it's not feasible to fetch it in one shot
> > > there is no need to make changes in the qemu migration protocol. I
> > > think it is not unlikely in virtiofs, but maybe I'm missing something
> > > obvious and its state will never grow.
> >
> > I don't understand. vCPUs are still running at that point and the
> > device state could change. It's not safe to save the full device state
> > until vCPUs have stopped (after vhost_dev_stop).
> >
>
> I think the vCPU is already stopped at save_live_complete_precopy
> callback. Maybe my understanding is wrong?
Agreed, vCPUs are stopped in save_live_complete_precopy(). However,
you wrote "use the device iterative migration API in QEMU to send the
needed commands before vhost_dev_stop". save_live_complete_precopy()
runs after vhost_dev_stop() so it doesn't seem to solve the problem.
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 19:20           ` Stefan Hajnoczi
@ 2023-04-18  7:54             ` Eugenio Perez Martin
  2023-04-19 11:10               ` Hanna Czenczek
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-18  7:54 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > from virtiofsd.
> > > > > >
> > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > the protocol, not limited to vhost-user-fs.
> > > > > >
> > > > > > These are the additions to the protocol:
> > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > >   This feature signals support for transferring state, and is added so
> > > > > >   that migration can fail early when the back-end has no support.
> > > > > >
> > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > >   front-end to override the front-end's choice.
> > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > >   pipe.
> > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > >   pipe, we will want to use that.
> > > > > >
> > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > >   indicate failure, so we need to check explicitly.
> > > > > >
> > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > (which includes establishing the direction of transfer and migration
> > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > >
> > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > ---
> > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > >  4 files changed, 287 insertions(+)
> > > > > >
> > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > >  } VhostSetConfigType;
> > > > > >
> > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > +} VhostDeviceStateDirection;
> > > > > > +
> > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > +} VhostDeviceStatePhase;
> > > > >
> > > > > vDPA has:
> > > > >
> > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > >    *
> > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > >    * required for restoring in the future. The device must not change its
> > > > >    * configuration after that point.
> > > > >    */
> > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > >
> > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > >    *
> > > > >    * After the return of this ioctl the device will have restored all the
> > > > >    * necessary states and it is fully operational to continue processing the
> > > > >    * virtqueue descriptors.
> > > > >    */
> > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > >
> > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > overlapping/duplicated functionality.
> > > > >
> > > >
> > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > to SUSPEND.
> > > >
> > > > Generally it is better if we make the interface less parametrized and
> > > > we trust in the messages and its semantics in my opinion. In other
> > > > words, instead of
> > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> > > >
> > > > Another way to apply this is with the "direction" parameter. Maybe it
> > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > >
> > > > In that case, reusing the ioctls as vhost-user messages would be ok.
> > > > But that puts this proposal further from the VFIO code, which uses
> > > > "migration_set_state(state)", and maybe it is better when the number
> > > > of states is high.
> > >
> > > Hi Eugenio,
> > > Another question about vDPA suspend/resume:
> > >
> > >   /* Host notifiers must be enabled at this point. */
> > >   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> > >   {
> > >       int i;
> > >
> > >       /* should only be called after backend is connected */
> > >       assert(hdev->vhost_ops);
> > >       event_notifier_test_and_clear(
> > >           &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > >       event_notifier_test_and_clear(&vdev->config_notifier);
> > >
> > >       trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > >
> > >       if (hdev->vhost_ops->vhost_dev_start) {
> > >           hdev->vhost_ops->vhost_dev_start(hdev, false);
> > >           ^^^ SUSPEND ^^^
> > >       }
> > >       if (vrings) {
> > >           vhost_dev_set_vring_enable(hdev, false);
> > >       }
> > >       for (i = 0; i < hdev->nvqs; ++i) {
> > >           vhost_virtqueue_stop(hdev,
> > >                                vdev,
> > >                                hdev->vqs + i,
> > >                                hdev->vq_index + i);
> > >         ^^^ fetch virtqueue state from kernel ^^^
> > >       }
> > >       if (hdev->vhost_ops->vhost_reset_status) {
> > >           hdev->vhost_ops->vhost_reset_status(hdev);
> > >           ^^^ reset device^^^
> > >
> > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> > > vhost_reset_status(). The device's migration code runs after
> > > vhost_dev_stop() and the state will have been lost.
> > >
> >
> > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > qemu VirtIONet device model. This is for all vhost backends.
> >
> > Regarding the state like mac or mq configuration, SVQ runs for all the
> > VM run in the CVQ. So it can track all of that status in the device
> > model too.
> >
> > When a migration effectively occurs, all the frontend state is
> > migrated as a regular emulated device. To route all of the state in a
> > normalized way for qemu is what leaves open the possibility to do
> > cross-backends migrations, etc.
> >
> > Does that answer your question?
>
> I think you're confirming that changes would be necessary in order for
> vDPA to support the save/load operation that Hanna is introducing.
>
Yes, this first iteration was centered on net, with an eye on block,
where state can be routed through classical emulated devices. This is
how vhost-kernel and vhost-user have classically done it. It also
allows cross-backend migration, avoids modifying the qemu migration
state, etc.
Introducing this opaque state to qemu, which must be fetched after the
suspend and not before, requires changes in the vhost protocol, as
discussed previously.
> > > It looks like vDPA changes are necessary in order to support stateful
> > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > correct?
> > >
> >
> > Changes are required elsewhere, as the code to restore the state
> > properly in the destination has not been merged.
>
> I'm not sure what you mean by elsewhere?
>
I meant that for vdpa *net* devices the changes are not required in the
vdpa ioctls, but mostly in qemu.
If you meant stateful as "it must have a state blob that must be
opaque to qemu", then I think the straightforward action is to fetch
the state blob at about the same time as the vq indexes. But yes,
changes (at least a new ioctl) are needed for that.
> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> then VHOST_VDPA_SET_STATUS 0.
>
> In order to save device state from the vDPA device in the future, it
> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> the device state can be saved before the device is reset.
>
> Does that sound right?
>
The split between suspend and reset was added recently for that very
reason. In all the virtio devices, the frontend is initialized before
the backend, so I don't think it is a good idea to defer the backend
cleanup. Especially if we have already established that the state is
small enough not to need iterative migration from virtiofsd's point of
view.
If fetching that state at the same time as the vq indexes is not valid,
could it follow the same model as the "in-flight descriptors"?
vhost-user tracks them in a shared memory region [1]. This allows qemu
to survive crashes of a vhost-user SW backend, and it does not prevent
cross-backend live migration, as all the information needed to recover
them is there.
For hw devices this is not convenient, as it consumes PCI bandwidth. So
one possibility is to synchronize this memory region only after a
synchronization point, such as the SUSPEND call or GET_VRING_BASE. HW
devices are not going to crash in the software sense, so all use cases
remain the same for qemu. And that shared memory information is
recoverable after vhost_dev_stop.
Does that sound reasonable for virtiofsd, i.e. to offer a shared memory
region where it dumps the state, maybe only after
set_state(STATE_PHASE_STOPPED)?
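As a rough sketch of that idea, assuming a region layout and function
names that are purely hypothetical (none of this is part of the
vhost-user protocol): the back-end copies its private state into the
shared region only at the synchronization point, so the front-end can
still read it after the back-end has been reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_STATE_MAX 64

/* Hypothetical layout of the region shared by front-end and back-end. */
typedef struct ToySharedState {
    bool valid;                        /* set once a dump is complete */
    size_t len;                        /* number of valid bytes in data */
    unsigned char data[TOY_STATE_MAX];
} ToySharedState;

/* Hypothetical back-end with private state plus a pointer to the region. */
typedef struct ToyBackend {
    unsigned char private_state[TOY_STATE_MAX];
    size_t private_len;
    ToySharedState *shared;
} ToyBackend;

/* Synchronization point (e.g. SUSPEND): dump private state once. */
void toy_backend_suspend(ToyBackend *b)
{
    memcpy(b->shared->data, b->private_state, b->private_len);
    b->shared->len = b->private_len;
    b->shared->valid = true;
}

/* Reset wipes only the private copy; the shared dump survives. */
void toy_backend_reset(ToyBackend *b)
{
    memset(b->private_state, 0, sizeof(b->private_state));
    b->private_len = 0;
}

/* Front-end side: returns the dump length, or -1 if no valid dump fits. */
long toy_frontend_fetch(const ToySharedState *s, unsigned char *out,
                        size_t out_len)
{
    if (!s->valid || s->len > out_len) {
        return -1;
    }
    memcpy(out, s->data, s->len);
    return (long)s->len;
}
```

With this split, the fetch no longer has to happen before the reset in
vhost_dev_stop, at the cost of keeping a second copy of the state.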
Thanks!
[1] https://qemu.readthedocs.io/en/latest/interop/vhost-user.html#inflight-i-o-tracking
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 19:33           ` Stefan Hajnoczi
@ 2023-04-18  8:09             ` Eugenio Perez Martin
  2023-04-18 17:59               ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-18  8:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > from virtiofsd.
> > > > > >
> > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > the protocol, not limited to vhost-user-fs.
> > > > > >
> > > > > > These are the additions to the protocol:
> > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > >   This feature signals support for transferring state, and is added so
> > > > > >   that migration can fail early when the back-end has no support.
> > > > > >
> > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > >   front-end to override the front-end's choice.
> > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > >   pipe.
> > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > >   pipe, we will want to use that.
> > > > > >
> > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > >   indicate failure, so we need to check explicitly.
> > > > > >
> > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > (which includes establishing the direction of transfer and migration
> > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > >
> > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > ---
> > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > >  4 files changed, 287 insertions(+)
> > > > > >
> > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > >  } VhostSetConfigType;
> > > > > >
> > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > +} VhostDeviceStateDirection;
> > > > > > +
> > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > +} VhostDeviceStatePhase;
> > > > >
> > > > > vDPA has:
> > > > >
> > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > >    *
> > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > >    * required for restoring in the future. The device must not change its
> > > > >    * configuration after that point.
> > > > >    */
> > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > >
> > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > >    *
> > > > >    * After the return of this ioctl the device will have restored all the
> > > > >    * necessary states and it is fully operational to continue processing the
> > > > >    * virtqueue descriptors.
> > > > >    */
> > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > >
> > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > overlapping/duplicated functionality.
> > > > >
> > > >
> > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > to SUSPEND.
> > >
> > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > ioctl(VHOST_VDPA_RESUME).
> > >
> > > The doc comments in <linux/vdpa.h> don't explain how the device can
> > > leave the suspended state. Can you clarify this?
> > >
> >
> > Do you mean in what situations or regarding the semantics of _RESUME?
> >
> > To me resume is an operation mainly to resume the device in the event
> > of a VM suspension, not a migration. It can be used as a fallback code
> > in some cases of migration failure though, but it is not currently
> > used in qemu.
>
> Is a "VM suspension" the QEMU HMP 'stop' command?
>
> I guess the reason why QEMU doesn't call RESUME anywhere is that it
> resets the device in vhost_dev_stop()?
>
The actual reason for not using RESUME is that the ioctl was added
after the SUSPEND design in qemu. Same as this proposal, it was not
needed at the time.
In the case of vhost-vdpa net, the only use of suspend is to fetch
the vq indexes, and in case of error vhost already fetched them from
the guest's used ring well before vDPA, so it has little use.
> Does it make sense to combine SUSPEND and RESUME with Hanna's
> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> this:
> - Saving the device's state is done by SUSPEND followed by
> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> savevm command or migration failed), then RESUME is called to
> continue.
I think the previous steps make sense in vhost_dev_stop, not in the
virtio savevm handlers. Spreading this logic to more places in qemu
could cause confusion.
> - Loading the device's state is done by SUSPEND followed by
> SET_DEVICE_STATE_FD, followed by RESUME.
>
I think the restore makes more sense after reset and before driver_ok;
SUSPEND does not seem like the right call there. SUSPEND implies that
other operations may have come before it, so the device may have
processed some requests incorrectly because it was not yet in the right
state.
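A toy model of that ordering, assuming hypothetical names
(ToyRestoreDev, toy_restore_*) that do not exist in QEMU: state can be
loaded only between reset and DRIVER_OK, so no request is ever
processed against stale state.

```c
#include <assert.h>

/* Hypothetical device life cycle; not the virtio status bits. */
typedef enum ToyRestoreStatus {
    TOY_RESTORE_RESET,
    TOY_RESTORE_LOADED,
    TOY_RESTORE_DRIVER_OK,
} ToyRestoreStatus;

typedef struct ToyRestoreDev {
    ToyRestoreStatus status;
    int state;               /* stands in for the opaque state blob */
    int requests_processed;
} ToyRestoreDev;

void toy_restore_reset(ToyRestoreDev *d)
{
    d->status = TOY_RESTORE_RESET;
    d->state = 0;
    d->requests_processed = 0;
}

/* Loading is rejected once the device may already process requests. */
int toy_restore_load(ToyRestoreDev *d, int state)
{
    if (d->status == TOY_RESTORE_DRIVER_OK) {
        return -1;
    }
    d->state = state;
    d->status = TOY_RESTORE_LOADED;
    return 0;
}

void toy_restore_driver_ok(ToyRestoreDev *d)
{
    d->status = TOY_RESTORE_DRIVER_OK;
}

/* Requests are only processed after DRIVER_OK; returns the state used. */
int toy_restore_process(ToyRestoreDev *d)
{
    if (d->status != TOY_RESTORE_DRIVER_OK) {
        return -1;
    }
    d->requests_processed++;
    return d->state;
}
```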
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 19:46                     ` Stefan Hajnoczi
@ 2023-04-18 10:09                       ` Eugenio Perez Martin
  0 siblings, 0 replies; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-18 10:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, Apr 17, 2023 at 9:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 17 Apr 2023 at 15:12, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > > > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > >
> > > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > Basically, what I’m hearing is that I need to implement a different
> > > > > > > feature that has no practical impact right now, and also fix bugs around
> > > > > > > it along the way...
> > > > > > >
> > > > > >
> > > > > > To fix this properly requires iterative device migration in qemu as
> > > > > > far as I know, instead of using VMStates [1]. This way the state is
> > > > > > requested to virtiofsd before the device reset.
> > > > >
> > > > > I don't follow. Many devices are fine with non-iterative migration. They
> > > > > shouldn't be forced to do iterative migration.
> > > > >
> > > >
> > > > Sorry I think I didn't express myself well. I didn't mean to force
> > > > virtiofsd to support the iterative migration, but to use the device
> > > > iterative migration API in QEMU to send the needed commands before
> > > > vhost_dev_stop. In that regard, the device or the vhost-user commands
> > > > would not require changes.
> > > >
> > > > I think it is convenient in the long run for virtiofsd, as if the
> > > > state grows so much that it's not feasible to fetch it in one shot
> > > > there is no need to make changes in the qemu migration protocol. I
> > > > think it is not unlikely in virtiofs, but maybe I'm missing something
> > > > obvious and its state will never grow.
> > >
> > > I don't understand. vCPUs are still running at that point and the
> > > device state could change. It's not safe to save the full device state
> > > until vCPUs have stopped (after vhost_dev_stop).
> > >
> >
> > I think the vCPU is already stopped at save_live_complete_precopy
> > callback. Maybe my understanding is wrong?
>
> Agreed, vCPUs are stopped in save_live_complete_precopy(). However,
> you wrote "use the device iterative migration API in QEMU to send the
> needed commands before vhost_dev_stop". save_live_complete_precopy()
> runs after vhost_dev_stop() so it doesn't seem to solve the problem.
>
You're right, and it actually makes the most sense.
So I guess this converges with the other thread, let's follow the
discussion there.
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-18  8:09             ` Eugenio Perez Martin
@ 2023-04-18 17:59               ` Stefan Hajnoczi
  2023-04-18 18:31                 ` Eugenio Perez Martin
  2023-04-19 10:57                 ` [Virtio-fs] " Hanna Czenczek
  0 siblings, 2 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-18 17:59 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > > from virtiofsd.
> > > > > > >
> > > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > >
> > > > > > > These are the additions to the protocol:
> > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > >   This feature signals support for transferring state, and is added so
> > > > > > >   that migration can fail early when the back-end has no support.
> > > > > > >
> > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > > >   front-end to override the front-end's choice.
> > > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > > >   pipe.
> > > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > > >   pipe, we will want to use that.
> > > > > > >
> > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > >
> > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > > (which includes establishing the direction of transfer and migration
> > > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > >
> > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > ---
> > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > >  4 files changed, 287 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > >  } VhostSetConfigType;
> > > > > > >
> > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > +} VhostDeviceStateDirection;
> > > > > > > +
> > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > +} VhostDeviceStatePhase;
> > > > > >
> > > > > > vDPA has:
> > > > > >
> > > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > > >    *
> > > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > > >    * required for restoring in the future. The device must not change its
> > > > > >    * configuration after that point.
> > > > > >    */
> > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > >
> > > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > > >    *
> > > > > >    * After the return of this ioctl the device will have restored all the
> > > > > >    * necessary states and it is fully operational to continue processing the
> > > > > >    * virtqueue descriptors.
> > > > > >    */
> > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > >
> > > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > > overlapping/duplicated functionality.
> > > > > >
> > > > >
> > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > > to SUSPEND.
> > > >
> > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > ioctl(VHOST_VDPA_RESUME).
> > > >
> > > > The doc comments in <linux/vdpa.h> don't explain how the device can
> > > > leave the suspended state. Can you clarify this?
> > > >
> > >
> > > Do you mean in what situations or regarding the semantics of _RESUME?
> > >
> > > To me resume is an operation mainly to resume the device in the event
> > > of a VM suspension, not a migration. It can be used as a fallback code
> > > in some cases of migration failure though, but it is not currently
> > > used in qemu.
> >
> > Is a "VM suspension" the QEMU HMP 'stop' command?
> >
> > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > resets the device in vhost_dev_stop()?
> >
> 
> The actual reason for not using RESUME is that the ioctl was added
> after the SUSPEND design in qemu. Same as this proposal, it was not
> needed at the time.
> 
> In the case of vhost-vdpa net, the only usage of suspend is to fetch
> the vq indexes, and in case of error vhost already fetches them from
> guest's used ring way before vDPA, so it has little usage.
> 
> > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > this:
> > - Saving the device's state is done by SUSPEND followed by
> > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > savevm command or migration failed), then RESUME is called to
> > continue.
> 
> I think the previous steps make sense at vhost_dev_stop, not virtio
> savevm handlers. To start spreading this logic to more places of qemu
> can bring confusion.

I don't think there is a way around extending the QEMU vhost's code
model. The current model in QEMU's vhost code is that the backend is
reset when the VM stops. This model worked fine for stateless devices
but it doesn't work for stateful devices.

Imagine a vdpa-gpu device: you cannot reset the device in
vhost_dev_stop() and expect the GPU to continue working when
vhost_dev_start() is called again because all its state has been lost.
The guest driver will send requests that reference virtio-gpu
resources that no longer exist.

One solution is to save the device's state in vhost_dev_stop(). I think
this is what you're suggesting. It requires keeping a copy of the state
and then loading the state again in vhost_dev_start(). I don't think
this approach should be used because it requires all stateful devices to
support live migration (otherwise they break across HMP 'stop'/'cont').
Also, the device state for some devices may be large and it would also
become more complicated when iterative migration is added.

Instead, I think the QEMU vhost code needs to be structured so that
struct vhost_dev has a suspended state:

        ,---------.
        v         |
  started ------> stopped
    \   ^
     \  |
      -> suspended

The device doesn't lose state when it enters the suspended state. It can
be resumed again.

This is why I think SUSPEND/RESUME need to be part of the solution.

(It's also an argument for not including the phase argument in
SET_DEVICE_STATE_FD because the SUSPEND message is sent during
vhost_dev_stop() separately from saving the device's state.)
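
To make the diagram above concrete, here is a minimal sketch of such a
state machine. All names (SketchDev, sketch_dev_transition, the resets
counter) are hypothetical and do not exist in QEMU; the sketch only
allows the edges drawn in the diagram and assumes, as in the current
model, that stopping resets the back-end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; none of this exists in QEMU's struct vhost_dev. */
typedef enum {
    SKETCH_DEV_STOPPED,
    SKETCH_DEV_STARTED,
    SKETCH_DEV_SUSPENDED,
} SketchDevState;

typedef struct {
    SketchDevState state;
    unsigned resets; /* how often the back-end's internal state was discarded */
} SketchDev;

/* Apply a transition if the diagram has an edge for it; return false otherwise. */
static bool sketch_dev_transition(SketchDev *dev, SketchDevState next)
{
    bool allowed = false;

    assert(dev != NULL);
    switch (dev->state) {
    case SKETCH_DEV_STOPPED:
        allowed = (next == SKETCH_DEV_STARTED);
        break;
    case SKETCH_DEV_STARTED:
        allowed = (next == SKETCH_DEV_STOPPED || next == SKETCH_DEV_SUSPENDED);
        break;
    case SKETCH_DEV_SUSPENDED:
        allowed = (next == SKETCH_DEV_STARTED); /* RESUME */
        break;
    }
    if (!allowed) {
        return false;
    }
    if (next == SKETCH_DEV_STOPPED) {
        dev->resets++; /* stopping resets the back-end, so its state is lost */
    }
    dev->state = next;
    return true;
}
```

A started -> suspended -> started round trip leaves the resets counter
untouched, whereas started -> stopped increments it; that difference is
the property a stateful back-end depends on.
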
> > - Loading the device's state is done by SUSPEND followed by
> > SET_DEVICE_STATE_FD, followed by RESUME.
> >
> 
> I think the restore makes more sense after reset and before driver_ok,
> suspend does not seem a right call there. SUSPEND implies there may be
> other operations before, so the device may have processed some
> requests wrong, as it is not in the right state.

I find it more elegant to allow SUSPEND -> load -> RESUME if the device
state is saved using SUSPEND -> save -> RESUME since the operations are
symmetrical, but requiring the device to be reset works too. Here is my
understanding of your idea in more detail:

The VIRTIO Device Status Field value must be ACKNOWLEDGE | DRIVER |
FEATURES_OK, any device initialization configuration space writes must
be done, and virtqueues must be configured (Step 7 of 3.1.1 Driver
Requirements in VIRTIO 1.2).

At that point the device is able to parse the device state and set up
its internal state. Doing it any earlier (before feature negotiation or
virtqueue configuration) places the device in the awkward situation of
having to keep the device state in a buffer and defer loading it until
later, which is complex.

After device state loading is complete, the DRIVER_OK bit is set to
resume device operation.

Saving device state is only allowed when the DRIVER_OK bit has been set.

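
As a rough sketch of these rules (the helper names are made up for
illustration and are not QEMU or vhost-user API; the status bit values
are the ones from the VIRTIO specification), the checks could look like
this. Whether the configuration space writes and virtqueue setup have
been done cannot be derived from the status field, so the sketch does
not cover that part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device status bit values from the VIRTIO specification. */
#define SKETCH_S_ACKNOWLEDGE 1u
#define SKETCH_S_DRIVER      2u
#define SKETCH_S_DRIVER_OK   4u
#define SKETCH_S_FEATURES_OK 8u

/* Loading is only allowed after feature negotiation and before DRIVER_OK. */
static bool sketch_can_load_state(uint8_t status)
{
    return status == (SKETCH_S_ACKNOWLEDGE | SKETCH_S_DRIVER |
                      SKETCH_S_FEATURES_OK);
}

/* Saving is only allowed once DRIVER_OK has been set. */
static bool sketch_can_save_state(uint8_t status)
{
    return (status & SKETCH_S_DRIVER_OK) != 0;
}

/* After a completed load, DRIVER_OK is set to resume device operation. */
static uint8_t sketch_finish_load(uint8_t status)
{
    assert(sketch_can_load_state(status));
    return (uint8_t)(status | SKETCH_S_DRIVER_OK);
}
```
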
Does this sound right?

Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-18 17:59               ` Stefan Hajnoczi
@ 2023-04-18 18:31                 ` Eugenio Perez Martin
  2023-04-18 20:40                   ` Stefan Hajnoczi
  2023-04-19 10:57                 ` [Virtio-fs] " Hanna Czenczek
  1 sibling, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-04-18 18:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > > > from virtiofsd.
> > > > > > > >
> > > > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > >
> > > > > > > > These are the additions to the protocol:
> > > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > >   This feature signals support for transferring state, and is added so
> > > > > > > >   that migration can fail early when the back-end has no support.
> > > > > > > >
> > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > > > >   front-end to override the front-end's choice.
> > > > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > > > >   pipe.
> > > > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > > > >   pipe, we will want to use that.
> > > > > > > >
> > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > >
> > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > > > (which includes establishing the direction of transfer and migration
> > > > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > >
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > ---
> > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > >  } VhostSetConfigType;
> > > > > > > >
> > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > +
> > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > +} VhostDeviceStatePhase;
> > > > > > >
> > > > > > > vDPA has:
> > > > > > >
> > > > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > > > >    *
> > > > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > > > >    * required for restoring in the future. The device must not change its
> > > > > > >    * configuration after that point.
> > > > > > >    */
> > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > >
> > > > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > > > >    *
> > > > > > >    * After the return of this ioctl the device will have restored all the
> > > > > > >    * necessary states and it is fully operational to continue processing the
> > > > > > >    * virtqueue descriptors.
> > > > > > >    */
> > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > >
> > > > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > > > overlapping/duplicated functionality.
> > > > > > >
> > > > > >
> > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > > > to SUSPEND.
> > > > >
> > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > ioctl(VHOST_VDPA_RESUME).
> > > > >
> > > > > The doc comments in <linux/vdpa.h> don't explain how the device can
> > > > > leave the suspended state. Can you clarify this?
> > > > >
> > > >
> > > > Do you mean in what situations or regarding the semantics of _RESUME?
> > > >
> > > > To me resume is an operation mainly to resume the device in the event
> > > > of a VM suspension, not a migration. It can be used as a fallback code
> > > > in some cases of migration failure though, but it is not currently
> > > > used in qemu.
> > >
> > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > >
> > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > resets the device in vhost_dev_stop()?
> > >
> >
> > The actual reason for not using RESUME is that the ioctl was added
> > after the SUSPEND design in qemu. Same as this proposal, it was not
> > needed at the time.
> >
> > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > the vq indexes, and in case of error vhost already fetches them from
> > guest's used ring way before vDPA, so it has little usage.
> >
> > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > this:
> > > - Saving the device's state is done by SUSPEND followed by
> > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > savevm command or migration failed), then RESUME is called to
> > > continue.
> >
> > I think the previous steps make sense at vhost_dev_stop, not virtio
> > savevm handlers. To start spreading this logic to more places of qemu
> > can bring confusion.
>
> I don't think there is a way around extending the QEMU vhost's code
> model. The current model in QEMU's vhost code is that the backend is
> reset when the VM stops. This model worked fine for stateless devices
> but it doesn't work for stateful devices.
>
> Imagine a vdpa-gpu device: you cannot reset the device in
> vhost_dev_stop() and expect the GPU to continue working when
> vhost_dev_start() is called again because all its state has been lost.
> The guest driver will send requests that reference virtio-gpu
> resources that no longer exist.
>
> One solution is to save the device's state in vhost_dev_stop(). I think
> this is what you're suggesting. It requires keeping a copy of the state
> and then loading the state again in vhost_dev_start(). I don't think
> this approach should be used because it requires all stateful devices to
> support live migration (otherwise they break across HMP 'stop'/'cont').
> Also, the device state for some devices may be large and it would also
> become more complicated when iterative migration is added.
>
> Instead, I think the QEMU vhost code needs to be structured so that
> struct vhost_dev has a suspended state:
>
>         ,---------.
>         v         |
>   started ------> stopped
>     \   ^
>      \  |
>       -> suspended
>
> The device doesn't lose state when it enters the suspended state. It can
> be resumed again.
>
> This is why I think SUSPEND/RESUME need to be part of the solution.

I agree with all of this, especially after realizing vhost_dev_stop is
called before the last request of the state in the iterative
migration.

However I think we can move faster with the virtiofsd migration code,
as long as we agree on the vhost-user messages it will receive. This
is because we already agree that the state will be sent in one shot
and not iteratively, so it will be small.

I understand this may change in the future; that's why I proposed to
start using iterative migration right now. However it may make little
sense if it is not used in the vhost-user device. I also understand
that other devices may have a bigger state so it will be needed for
them.
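
For illustration, the reading side of such a one-shot transfer (read
from the pipe until EOF, as described in the patch) could be sketched
as below. The function name and the fixed-size buffer are assumptions
made for this example and are not part of the patch; since EOF alone
does not signal success, the front-end would still have to follow up
with CHECK_DEVICE_STATE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a state blob from fd until EOF.
 * Returns the number of bytes stored in buf, or -1 on a read error or
 * if the blob does not fit into cap bytes.
 */
static ssize_t sketch_read_until_eof(int fd, void *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL);
    for (;;) {
        char extra; /* probe byte used once buf is full */
        void *dst = total < cap ? (void *)((char *)buf + total) : (void *)&extra;
        size_t want = total < cap ? cap - total : 1;
        ssize_t n = read(fd, dst, want);

        if (n == 0) {
            return (ssize_t)total; /* EOF: the sender closed its end */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted, retry */
            }
            return -1;
        }
        if (total >= cap) {
            return -1; /* more data than the buffer can hold */
        }
        total += (size_t)n;
    }
}
```
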
> (It's also an argument for not including the phase argument in
> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> vhost_dev_stop() separately from saving the device's state.)
>
> > > - Loading the device's state is done by SUSPEND followed by
> > > SET_DEVICE_STATE_FD, followed by RESUME.
> > >
> >
> > I think the restore makes more sense after reset and before driver_ok,
> > suspend does not seem a right call there. SUSPEND implies there may be
> > other operations before, so the device may have processed some
> > requests wrong, as it is not in the right state.
>
> I find it more elegant to allow SUSPEND -> load -> RESUME if the device
> state is saved using SUSPEND -> save -> RESUME since the operations are
> symmetrical, but requiring the device to be reset works too. Here is my
> understanding of your idea in more detail:
>
> The VIRTIO Device Status Field value must be ACKNOWLEDGE | DRIVER |
> FEATURES_OK, any device initialization configuration space writes must
> be done, and virtqueues must be configured (Step 7 of 3.1.1 Driver
> Requirements in VIRTIO 1.2).
>
> At that point the device is able to parse the device state and set up
> its internal state. Doing it any earlier (before feature negotiation or
> virtqueue configuration) places the device in the awkward situation of
> having to keep the device state in a buffer and defer loading it until
> later, which is complex.
>
> After device state loading is complete, the DRIVER_OK bit is set to
> resume device operation.
>
> Saving device state is only allowed when the DRIVER_OK bit has been set.
>
> Does this sound right?
>

Yes, that is accurate. If you agree that SUSPEND only makes sense
after DRIVER_OK, then restoring the state while suspended complicates
the state machine considerably. The device spec is simpler with these
restrictions in my opinion.

Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-18 18:31                 ` Eugenio Perez Martin
@ 2023-04-18 20:40                   ` Stefan Hajnoczi
  2023-04-20 13:27                     ` Eugenio Pérez
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-18 20:40 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > > > > from virtiofsd.
> > > > > > > > >
> > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > >
> > > > > > > > > These are the additions to the protocol:
> > > > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > >   This feature signals support for transferring state, and is added so
> > > > > > > > >   that migration can fail early when the back-end has no support.
> > > > > > > > >
> > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > > > > >   front-end to override the front-end's choice.
> > > > > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > > > > >   pipe.
> > > > > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > > > > >   pipe, we will want to use that.
> > > > > > > > >
> > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > > >
> > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > > > > (which includes establishing the direction of transfer and migration
> > > > > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > >
> > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > ---
> > > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > >  } VhostSetConfigType;
> > > > > > > > >
> > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > +
> > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > >
> > > > > > > > vDPA has:
> > > > > > > >
> > > > > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > > > > >    *
> > > > > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > > > > >    * required for restoring in the future. The device must not change its
> > > > > > > >    * configuration after that point.
> > > > > > > >    */
> > > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > >
> > > > > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > > > > >    *
> > > > > > > >    * After the return of this ioctl the device will have restored all the
> > > > > > > >    * necessary states and it is fully operational to continue processing the
> > > > > > > >    * virtqueue descriptors.
> > > > > > > >    */
> > > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > >
> > > > > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > > > > overlapping/duplicated functionality.
> > > > > > > >
> > > > > > >
> > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > > > > to SUSPEND.
> > > > > >
> > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > > ioctl(VHOST_VDPA_RESUME).
> > > > > >
> > > > > > The doc comments in <linux/vdpa.h> don't explain how the device can
> > > > > > leave the suspended state. Can you clarify this?
> > > > > >
> > > > >
> > > > > Do you mean in what situations or regarding the semantics of _RESUME?
> > > > >
> > > > > To me resume is an operation mainly to resume the device in the event
> > > > > of a VM suspension, not a migration. It can be used as a fallback code
> > > > > in some cases of migration failure though, but it is not currently
> > > > > used in qemu.
> > > >
> > > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > > >
> > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > > resets the device in vhost_dev_stop()?
> > > >
> > >
> > > The actual reason for not using RESUME is that the ioctl was added
> > > > after the SUSPEND design in qemu. Same as this proposal, it was not
> > > needed at the time.
> > >
> > > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > > the vq indexes, and in case of error vhost already fetches them from
> > > guest's used ring way before vDPA, so it has little usage.
> > >
> > > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > > this:
> > > > - Saving the device's state is done by SUSPEND followed by
> > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > > savevm command or migration failed), then RESUME is called to
> > > > continue.
> > >
> > > I think the previous steps make sense at vhost_dev_stop, not virtio
> > > savevm handlers. To start spreading this logic to more places of qemu
> > > can bring confusion.
> >
> > I don't think there is a way around extending the QEMU vhost's code
> > model. The current model in QEMU's vhost code is that the backend is
> > reset when the VM stops. This model worked fine for stateless devices
> > but it doesn't work for stateful devices.
> >
> > Imagine a vdpa-gpu device: you cannot reset the device in
> > vhost_dev_stop() and expect the GPU to continue working when
> > vhost_dev_start() is called again because all its state has been lost.
> > The guest driver will send requests that reference virtio-gpu
> > resources that no longer exist.
> >
> > One solution is to save the device's state in vhost_dev_stop(). I think
> > this is what you're suggesting. It requires keeping a copy of the state
> > and then loading the state again in vhost_dev_start(). I don't think
> > this approach should be used because it requires all stateful devices to
> > support live migration (otherwise they break across HMP 'stop'/'cont').
> > Also, the device state for some devices may be large and it would also
> > become more complicated when iterative migration is added.
> >
> > Instead, I think the QEMU vhost code needs to be structured so that
> > struct vhost_dev has a suspended state:
> >
> >         ,---------.
> >         v         |
> >   started ------> stopped
> >     \   ^
> >      \  |
> >       -> suspended
> >
> > The device doesn't lose state when it enters the suspended state. It can
> > be resumed again.
> >
> > This is why I think SUSPEND/RESUME need to be part of the solution.
>
> I agree with all of this, especially after realizing vhost_dev_stop is
> called before the last request of the state in the iterative
> migration.
>
> However I think we can move faster with the virtiofsd migration code,
> as long as we agree on the vhost-user messages it will receive. This
> is because we already agree that the state will be sent in one shot
> and not iteratively, so it will be small.
>
> I understand this may change in the future, that's why I proposed to
> start using iterative right now. However it may make little sense if
> it is not used in the vhost-user device. I also understand that other
> devices may have a bigger state so it will be needed for them.

Can you summarize how you'd like save to work today? I'm not sure what
you have in mind.

Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-14 15:17           ` Eugenio Perez Martin
  2023-04-17 15:18             ` Stefan Hajnoczi
@ 2023-04-19 10:45             ` Hanna Czenczek
  2023-04-19 10:57               ` Stefan Hajnoczi
  1 sibling, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-19 10:45 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 14.04.23 17:17, Eugenio Perez Martin wrote:
> On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
[...]
>> Basically, what I’m hearing is that I need to implement a different
>> feature that has no practical impact right now, and also fix bugs around
>> it along the way...
>>
> To fix this properly requires iterative device migration in qemu as
> far as I know, instead of using VMStates [1]. This way the state is
> requested to virtiofsd before the device reset.
>
> What does virtiofsd do when the state is totally sent? Does it keep
> processing requests and generating new state or is only a one shot
> that will suspend the daemon? If it is the second I think it still can
> be done in one shot at the end, always indicating "no more state" at
> save_live_pending and sending all the state at
> save_live_complete_precopy.
This sounds to me as if we should reset all devices during migration, 
and I don’t understand that.  virtiofsd will not immediately process 
requests when the state is sent, because the device is still stopped, 
but when it is re-enabled (e.g. because of a failed migration), it will 
have retained its state and continue processing requests as if nothing 
happened.  A reset would break this and other stateful back-ends, as I 
think Stefan has mentioned somewhere else.
It seems to me as if there are devices that need a reset, and so need 
suspend+resume around it, but I also think there are back-ends that 
don’t, where this would only unnecessarily complicate the back-end 
implementation.
Hanna
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-17 15:12         ` Stefan Hajnoczi
@ 2023-04-19 10:47           ` Hanna Czenczek
  0 siblings, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-19 10:47 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eugenio Perez Martin, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 17.04.23 17:12, Stefan Hajnoczi wrote:
[...]
> This brings to mind how iterative migration will work. The interface for
> iterative migration is basically the same as non-iterative migration
> plus a method to query the number of bytes remaining. When the number of
> bytes falls below a threshold, the vCPUs are stopped and the remainder
> of the data is read.
>
> Some details from VFIO migration:
> - The VMM must explicitly change the state when transitioning from
>    iterative to non-iterative migration, but the data transfer fd
>    remains the same.
> - The state of the device (running, stopped, resuming, etc) doesn't
>    change asynchronously, it's always driven by the VMM. However, setting
>    the state can fail and then the new state may be an error state.
>
> Mapping this to SET_DEVICE_STATE_FD:
> - VhostDeviceStatePhase is extended with
>    VHOST_TRANSFER_STATE_PHASE_RUNNING = 1 for iterative migration. The
>    frontend sends SET_DEVICE_STATE_FD again with
>    VHOST_TRANSFER_STATE_PHASE_STOPPED when entering non-iterative
>    migration and the frontend sends the iterative fd from the previous
>    SET_DEVICE_STATE_FD call to the backend. The backend may reply with
>    another fd, if necessary. If the backend changes the fd, then the
>    contents of the previous fd must be fully read and transferred before
>    the contents of the new fd are migrated. (Maybe this is too complex
>    and we should forbid changing the fd when going from RUNNING ->
>    STOPPED.)
> - CHECK_DEVICE_STATE can be extended to report the number of bytes
>    remaining. The semantics change so that CHECK_DEVICE_STATE can be
>    called while the VMM is still reading from the fd. It becomes:
>
>      enum CheckDeviceStateResult {
>          Saving(bytes_remaining : usize),
>          Failed(error_code : u64),
>      }
Sounds good.  Personally, I’d forbid changing the FD when just changing 
state, which raises the question of whether there should then be a 
separate command for just changing the state (like VFIO_DEVICE_FEATURE 
..._MIG_DEVICE_STATE?), but that would be a question for then.
Changing the CHECK_DEVICE_STATE interface sounds good to me.
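For illustration only, here is a minimal C sketch of how the extended
CHECK_DEVICE_STATE result quoted above might look on the qemu side. All
names here (CheckDeviceStateResult as a C struct, the tag constants,
should_stop_vcpus()) are hypothetical and not part of this series; the
threshold check follows the description that the vCPUs are stopped once
the number of remaining bytes falls below a threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C rendering of the quoted CheckDeviceStateResult enum */
typedef enum {
    CHECK_DEVICE_STATE_SAVING,
    CHECK_DEVICE_STATE_FAILED,
} CheckDeviceStateTag;

typedef struct {
    CheckDeviceStateTag tag;
    union {
        size_t bytes_remaining; /* valid when tag is ..._SAVING */
        uint64_t error_code;    /* valid when tag is ..._FAILED */
    } u;
} CheckDeviceStateResult;

/*
 * Returns true when the back-end is still saving and the amount of
 * remaining state has dropped below the threshold, i.e. when the VMM
 * could stop the vCPUs and read the remainder.
 */
static bool should_stop_vcpus(const CheckDeviceStateResult *res,
                              size_t threshold)
{
    assert(res != NULL);
    return res->tag == CHECK_DEVICE_STATE_SAVING &&
           res->u.bytes_remaining < threshold;
}
```

A failed result never triggers the stop, so the caller would have to
handle the error code separately.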
Hanna
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-19 10:45             ` Hanna Czenczek
@ 2023-04-19 10:57               ` Stefan Hajnoczi
  0 siblings, 0 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-19 10:57 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Eugenio Perez Martin, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Wed, 19 Apr 2023 at 06:45, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 14.04.23 17:17, Eugenio Perez Martin wrote:
> > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> [...]
>
> >> Basically, what I’m hearing is that I need to implement a different
> >> feature that has no practical impact right now, and also fix bugs around
> >> it along the way...
> >>
> > To fix this properly requires iterative device migration in qemu as
> > far as I know, instead of using VMStates [1]. This way the state is
> > requested to virtiofsd before the device reset.
> >
> > What does virtiofsd do when the state is totally sent? Does it keep
> > processing requests and generating new state or is only a one shot
> > that will suspend the daemon? If it is the second I think it still can
> > be done in one shot at the end, always indicating "no more state" at
> > save_live_pending and sending all the state at
> > save_live_complete_precopy.
>
> This sounds to me as if we should reset all devices during migration,
> and I don’t understand that.  virtiofsd will not immediately process
> requests when the state is sent, because the device is still stopped,
> but when it is re-enabled (e.g. because of a failed migration), it will
> have retained its state and continue processing requests as if nothing
> happened.  A reset would break this and other stateful back-ends, as I
> think Stefan has mentioned somewhere else.
>
> It seems to me as if there are devices that need a reset, and so need
> suspend+resume around it, but I also think there are back-ends that
> don’t, where this would only unnecessarily complicate the back-end
> implementation.
Existing vhost-user backends must continue working, so I think having
two code paths is (almost) unavoidable.
One approach is to add SUSPEND/RESUME to the vhost-user protocol with
a corresponding VHOST_USER_PROTOCOL_F_SUSPEND feature bit. vhost-user
frontends can identify backends that support SUSPEND/RESUME instead of
device reset. Old vhost-user backends will continue to use device
reset.
I said having two code paths is almost unavoidable. It may be
possible to rely on existing VHOST_USER_GET_VRING_BASE's semantics (it
stops a single virtqueue) instead of SUSPEND. RESUME is replaced by
VHOST_USER_SET_VRING_* and gets the device going again. However, I'm
not 100% sure if this will work (even for all existing devices). It
would require carefully studying both the spec and various
implementations to see if it's viable. There's a chance of losing the
performance optimization that VHOST_USER_SET_STATUS provided to DPDK
if the device is not reset.
In my opinion SUSPEND/RESUME is the cleanest way to do this.
Stefan
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [Virtio-fs] [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-18 17:59               ` Stefan Hajnoczi
  2023-04-18 18:31                 ` Eugenio Perez Martin
@ 2023-04-19 10:57                 ` Hanna Czenczek
  2023-04-19 11:10                   ` Stefan Hajnoczi
  1 sibling, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-19 10:57 UTC (permalink / raw)
  To: Stefan Hajnoczi, Eugenio Perez Martin
  Cc: Juan Quintela, Stefan Hajnoczi, Michael S . Tsirkin, qemu-devel,
	virtio-fs, Anton Kuchin
On 18.04.23 19:59, Stefan Hajnoczi wrote:
> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
>> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>>>>>> from virtiofsd.
>>>>>>>>
>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>>>>>> is best to transfer it as a single binary blob after the streaming
>>>>>>>> phase.  Because this method should be useful to other vhost-user
>>>>>>>> implementations, too, it is introduced as a general-purpose addition to
>>>>>>>> the protocol, not limited to vhost-user-fs.
>>>>>>>>
>>>>>>>> These are the additions to the protocol:
>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>>>>>    This feature signals support for transferring state, and is added so
>>>>>>>>    that migration can fail early when the back-end has no support.
>>>>>>>>
>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>>>>>    over which to transfer the state.  The front-end sends an FD to the
>>>>>>>>    back-end into/from which it can write/read its state, and the back-end
>>>>>>>>    can decide to either use it, or reply with a different FD for the
>>>>>>>>    front-end to override the front-end's choice.
>>>>>>>>    The front-end creates a simple pipe to transfer the state, but maybe
>>>>>>>>    the back-end already has an FD into/from which it has to write/read
>>>>>>>>    its state, in which case it will want to override the simple pipe.
>>>>>>>>    Conversely, maybe in the future we find a way to have the front-end
>>>>>>>>    get an immediate FD for the migration stream (in some cases), in which
>>>>>>>>    case we will want to send this to the back-end instead of creating a
>>>>>>>>    pipe.
>>>>>>>>    Hence the negotiation: If one side has a better idea than a plain
>>>>>>>>    pipe, we will want to use that.
>>>>>>>>
>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>>>>>    pipe (the end indicated by EOF), the front-end invokes this function
>>>>>>>>    to verify success.  There is no in-band way (through the pipe) to
>>>>>>>>    indicate failure, so we need to check explicitly.
>>>>>>>>
>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>>>>>> (which includes establishing the direction of transfer and migration
>>>>>>>> phase), the sending side writes its data into the pipe, and the reading
>>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>>>>>> checking for integrity (i.e. errors during deserialization).
>>>>>>>>
>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>> ---
>>>>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>>   4 files changed, 287 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>>   } VhostSetConfigType;
>>>>>>>>
>>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>>> +} VhostDeviceStateDirection;
>>>>>>>> +
>>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>>> +} VhostDeviceStatePhase;
>>>>>>> vDPA has:
>>>>>>>
>>>>>>>    /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>>     *
>>>>>>>     * After the return of ioctl the device must preserve all the necessary state
>>>>>>>     * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>>     * required for restoring in the future. The device must not change its
>>>>>>>     * configuration after that point.
>>>>>>>     */
>>>>>>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>>>>>
>>>>>>>    /* Resume a device so it can resume processing virtqueue requests
>>>>>>>     *
>>>>>>>     * After the return of this ioctl the device will have restored all the
>>>>>>>     * necessary states and it is fully operational to continue processing the
>>>>>>>     * virtqueue descriptors.
>>>>>>>     */
>>>>>>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>>>>>
>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>>> overlapping/duplicated functionality.
>>>>>>>
>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>>> to SUSPEND.
>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
>>>>> ioctl(VHOST_VDPA_RESUME).
>>>>>
>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
>>>>> leave the suspended state. Can you clarify this?
>>>>>
>>>> Do you mean in what situations or regarding the semantics of _RESUME?
>>>>
>>>> To me resume is an operation mainly to resume the device in the event
>>>> of a VM suspension, not a migration. It can be used as a fallback code
>>>> in some cases of migration failure though, but it is not currently
>>>> used in qemu.
>>> Is a "VM suspension" the QEMU HMP 'stop' command?
>>>
>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
>>> resets the device in vhost_dev_stop()?
>>>
>> The actual reason for not using RESUME is that the ioctl was added
>> after the SUSPEND design in qemu. Same as this proposal, it was not
>> needed at the time.
>>
>> In the case of vhost-vdpa net, the only usage of suspend is to fetch
>> the vq indexes, and in case of error vhost already fetches them from
>> guest's used ring way before vDPA, so it has little usage.
>>
>>> Does it make sense to combine SUSPEND and RESUME with Hanna's
>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
>>> this:
>>> - Saving the device's state is done by SUSPEND followed by
>>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
>>> savevm command or migration failed), then RESUME is called to
>>> continue.
>> I think the previous steps make sense at vhost_dev_stop, not virtio
>> savevm handlers. To start spreading this logic to more places of qemu
>> can bring confusion.
> I don't think there is a way around extending the QEMU vhost's code
> model. The current model in QEMU's vhost code is that the backend is
> reset when the VM stops. This model worked fine for stateless devices
> but it doesn't work for stateful devices.
>
> Imagine a vdpa-gpu device: you cannot reset the device in
> vhost_dev_stop() and expect the GPU to continue working when
> vhost_dev_start() is called again because all its state has been lost.
> The guest driver will send requests that reference virtio-gpu
> resources that no longer exist.
>
> One solution is to save the device's state in vhost_dev_stop(). I think
> this is what you're suggesting. It requires keeping a copy of the state
> and then loading the state again in vhost_dev_start(). I don't think
> this approach should be used because it requires all stateful devices to
> support live migration (otherwise they break across HMP 'stop'/'cont').
> Also, the device state for some devices may be large and it would also
> become more complicated when iterative migration is added.
>
> Instead, I think the QEMU vhost code needs to be structured so that
> struct vhost_dev has a suspended state:
>
>          ,---------.
>          v         |
>    started ------> stopped
>      \   ^
>       \  |
>        -> suspended
>
> The device doesn't lose state when it enters the suspended state. It can
> be resumed again.
>
> This is why I think SUSPEND/RESUME need to be part of the solution.
> (It's also an argument for not including the phase argument in
> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> vhost_dev_stop() separately from saving the device's state.)
So let me ask if I understand this protocol correctly: Basically, 
SUSPEND would ask the device to fully serialize its internal state, 
retain it in some buffer, and RESUME would then deserialize the state 
from the buffer, right?
While this state needn’t necessarily be immediately migratable, I 
suppose (e.g. one could retain file descriptors there, and it doesn’t 
need to be a serialized byte buffer, but could still be structured), it 
would basically be a live migration implementation already.  As far as I 
understand, that’s why you suggest not running a SUSPEND+RESUME cycle on 
anything but live migration, right?
I wonder how that model would then work with iterative migration, 
though.  Basically, for non-iterative migration, the back-end would 
expect SUSPEND first to flush its state out to a buffer, and then the 
state transfer would just copy from that buffer.  For iterative 
migration, though, there is no SUSPEND first, so the back-end must 
implicitly begin to serialize its state and send it over.  I find that a 
bit strange.
Also, how would this work with currently migratable stateless 
back-ends?  Do they already implement SUSPEND+RESUME as no-ops?  If not, 
I think we should detect stateless back-ends and skip the operations in 
qemu lest we have to update those back-ends for no real reason.
Hanna
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [Virtio-fs] [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-19 10:57                 ` [Virtio-fs] " Hanna Czenczek
@ 2023-04-19 11:10                   ` Stefan Hajnoczi
  2023-04-19 11:15                     ` Hanna Czenczek
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-19 11:10 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, Eugenio Perez Martin, Juan Quintela,
	Michael S . Tsirkin, qemu-devel, virtio-fs, Anton Kuchin
On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 18.04.23 19:59, Stefan Hajnoczi wrote:
> > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> >> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> >>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>>>>>> from virtiofsd.
> >>>>>>>>
> >>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>>>>>> is best to transfer it as a single binary blob after the streaming
> >>>>>>>> phase.  Because this method should be useful to other vhost-user
> >>>>>>>> implementations, too, it is introduced as a general-purpose addition to
> >>>>>>>> the protocol, not limited to vhost-user-fs.
> >>>>>>>>
> >>>>>>>> These are the additions to the protocol:
> >>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>>>>>    This feature signals support for transferring state, and is added so
> >>>>>>>>    that migration can fail early when the back-end has no support.
> >>>>>>>>
> >>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>>>>>    over which to transfer the state.  The front-end sends an FD to the
> >>>>>>>>    back-end into/from which it can write/read its state, and the back-end
> >>>>>>>>    can decide to either use it, or reply with a different FD for the
> >>>>>>>>    front-end to override the front-end's choice.
> >>>>>>>>    The front-end creates a simple pipe to transfer the state, but maybe
> >>>>>>>>    the back-end already has an FD into/from which it has to write/read
> >>>>>>>>    its state, in which case it will want to override the simple pipe.
> >>>>>>>>    Conversely, maybe in the future we find a way to have the front-end
> >>>>>>>>    get an immediate FD for the migration stream (in some cases), in which
> >>>>>>>>    case we will want to send this to the back-end instead of creating a
> >>>>>>>>    pipe.
> >>>>>>>>    Hence the negotiation: If one side has a better idea than a plain
> >>>>>>>>    pipe, we will want to use that.
> >>>>>>>>
> >>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>>>>>    pipe (the end indicated by EOF), the front-end invokes this function
> >>>>>>>>    to verify success.  There is no in-band way (through the pipe) to
> >>>>>>>>    indicate failure, so we need to check explicitly.
> >>>>>>>>
> >>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>>>>>> (which includes establishing the direction of transfer and migration
> >>>>>>>> phase), the sending side writes its data into the pipe, and the reading
> >>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>>>>>> checking for integrity (i.e. errors during deserialization).
> >>>>>>>>
> >>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>>> ---
> >>>>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
> >>>>>>>>   4 files changed, 287 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>>>>>> index ec3fbae58d..5935b32fe3 100644
> >>>>>>>> --- a/include/hw/virtio/vhost-backend.h
> >>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>>>>>   } VhostSetConfigType;
> >>>>>>>>
> >>>>>>>> +typedef enum VhostDeviceStateDirection {
> >>>>>>>> +    /* Transfer state from back-end (device) to front-end */
> >>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>>>>>> +    /* Transfer state from front-end to back-end (device) */
> >>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>>>>>> +} VhostDeviceStateDirection;
> >>>>>>>> +
> >>>>>>>> +typedef enum VhostDeviceStatePhase {
> >>>>>>>> +    /* The device (and all its vrings) is stopped */
> >>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>>>>>> +} VhostDeviceStatePhase;
> >>>>>>> vDPA has:
> >>>>>>>
> >>>>>>>    /* Suspend a device so it does not process virtqueue requests anymore
> >>>>>>>     *
> >>>>>>>     * After the return of ioctl the device must preserve all the necessary state
> >>>>>>>     * (the virtqueue vring base plus the possible device specific states) that is
> >>>>>>>     * required for restoring in the future. The device must not change its
> >>>>>>>     * configuration after that point.
> >>>>>>>     */
> >>>>>>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>>>>>
> >>>>>>>    /* Resume a device so it can resume processing virtqueue requests
> >>>>>>>     *
> >>>>>>>     * After the return of this ioctl the device will have restored all the
> >>>>>>>     * necessary states and it is fully operational to continue processing the
> >>>>>>>     * virtqueue descriptors.
> >>>>>>>     */
> >>>>>>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>>>>>
> >>>>>>> I wonder if it makes sense to import these into vhost-user so that the
> >>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>>>>>> if one of them is ahead of the other, but it would be nice to avoid
> >>>>>>> overlapping/duplicated functionality.
> >>>>>>>
> >>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
> >>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> >>>>>> to SUSPEND.
> >>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> >>>>> ioctl(VHOST_VDPA_RESUME).
> >>>>>
> >>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
> >>>>> leave the suspended state. Can you clarify this?
> >>>>>
> >>>> Do you mean in what situations or regarding the semantics of _RESUME?
> >>>>
> >>>> To me resume is an operation mainly to resume the device in the event
> >>>> of a VM suspension, not a migration. It can be used as a fallback code
> >>>> in some cases of migration failure though, but it is not currently
> >>>> used in qemu.
> >>> Is a "VM suspension" the QEMU HMP 'stop' command?
> >>>
> >>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
> >>> resets the device in vhost_dev_stop()?
> >>>
> >> The actual reason for not using RESUME is that the ioctl was added
> >> after the SUSPEND design in qemu. Same as this proposal, it was not
> >> needed at the time.
> >>
> >> In the case of vhost-vdpa net, the only usage of suspend is to fetch
> >> the vq indexes, and in case of error vhost already fetches them from
> >> guest's used ring way before vDPA, so it has little usage.
> >>
> >>> Does it make sense to combine SUSPEND and RESUME with Hanna's
> >>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> >>> this:
> >>> - Saving the device's state is done by SUSPEND followed by
> >>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> >>> savevm command or migration failed), then RESUME is called to
> >>> continue.
> >> I think the previous steps make sense at vhost_dev_stop, not virtio
> >> savevm handlers. To start spreading this logic to more places of qemu
> >> can bring confusion.
> > I don't think there is a way around extending the QEMU vhost's code
> > model. The current model in QEMU's vhost code is that the backend is
> > reset when the VM stops. This model worked fine for stateless devices
> > but it doesn't work for stateful devices.
> >
> > Imagine a vdpa-gpu device: you cannot reset the device in
> > vhost_dev_stop() and expect the GPU to continue working when
> > vhost_dev_start() is called again because all its state has been lost.
> > The guest driver will send requests that reference virtio-gpu
> > resources that no longer exist.
> >
> > One solution is to save the device's state in vhost_dev_stop(). I think
> > this is what you're suggesting. It requires keeping a copy of the state
> > and then loading the state again in vhost_dev_start(). I don't think
> > this approach should be used because it requires all stateful devices to
> > support live migration (otherwise they break across HMP 'stop'/'cont').
> > Also, the device state for some devices may be large and it would also
> > become more complicated when iterative migration is added.
> >
> > Instead, I think the QEMU vhost code needs to be structured so that
> > struct vhost_dev has a suspended state:
> >
> >          ,---------.
> >          v         |
> >    started ------> stopped
> >      \   ^
> >       \  |
> >        -> suspended
> >
> > The device doesn't lose state when it enters the suspended state. It can
> > be resumed again.
> >
> > This is why I think SUSPEND/RESUME need to be part of the solution.
> > (It's also an argument for not including the phase argument in
> > SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> > vhost_dev_stop() separately from saving the device's state.)
>
> So let me ask if I understand this protocol correctly: Basically,
> SUSPEND would ask the device to fully serialize its internal state,
> retain it in some buffer, and RESUME would then deserialize the state
> from the buffer, right?
That's not how I understand SUSPEND/RESUME. I was thinking that
SUSPEND pauses device operation so that virtqueues are no longer
processed and no other events occur (e.g. VIRTIO Configuration Change
Notifications). RESUME continues device operation. Neither command is
directly related to device state serialization but SUSPEND freezes the
device state, while RESUME allows the device state to change again.
> While this state needn’t necessarily be immediately migratable, I
> suppose (e.g. one could retain file descriptors there, and it doesn’t
> need to be a serialized byte buffer, but could still be structured), it
> would basically be a live migration implementation already.  As far as I
> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
> anything but live migration, right?
No, SUSPEND/RESUME would also be used across vm_stop()/vm_start().
That way stateful devices are no longer reset across HMP 'stop'/'cont'
(we're lucky it even works for most existing vhost-user backends today
and that's just because they don't yet implement
VHOST_USER_SET_STATUS).
> I wonder how that model would then work with iterative migration,
> though.  Basically, for non-iterative migration, the back-end would
> expect SUSPEND first to flush its state out to a buffer, and then the
> state transfer would just copy from that buffer.  For iterative
> migration, though, there is no SUSPEND first, so the back-end must
> implicitly begin to serialize its state and send it over.  I find that a
> bit strange.
I expected SET_DEVICE_STATE_FD to be sent while the device is still
running for iterative migration. Device state chunks are saved while
the device is still operating.
When the VMM decides to stop the guest, it sends SUSPEND to freeze the
device. The remainder of the device state can then be read from the fd
in the knowledge that the size is now finite.
After migration completes, the device is still suspended on the
source. If migration failed, RESUME is sent to continue running on the
source.
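To make the "read the remainder from the fd" step concrete, here is a
minimal sketch of draining a state pipe until EOF using plain POSIX
read(). The helper name read_until_eof() and the fixed-size buffer are
assumptions for illustration; the actual code in the series works on a
QEMUFile and is structured differently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reads from fd into buf until the writer closes its end (EOF).
 * Returns the number of bytes read, or -1 on a read error or when the
 * buffer fills up before EOF is seen (errno is set to ENOBUFS then).
 * Note the buffer must be strictly larger than the state for EOF to
 * be observed.
 */
static ssize_t read_until_eof(int fd, unsigned char *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL);
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n == 0) {
            return (ssize_t)total; /* EOF: the state is complete */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal, retry */
            }
            return -1;
        }
        total += (size_t)n;
    }

    errno = ENOBUFS; /* ran out of space before seeing EOF */
    return -1;
}
```

EOF only says that the sender closed the pipe, not that the transfer
succeeded, which is why CHECK_DEVICE_STATE is still needed afterwards.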
> Also, how would this work with currently migratable stateless
> back-ends?  Do they already implement SUSPEND+RESUME as no-ops?  If not,
> I think we should detect stateless back-ends and skip the operations in
> qemu lest we have to update those back-ends for no real reason.
Yes, I think backwards compatibility is a requirement, too. The
vhost-user frontend checks the SUSPEND vhost-user protocol feature
bit. If the bit is cleared, then it must assume this device is
stateless and use device reset operations. Otherwise it can use
SUSPEND/RESUME.
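As a rough sketch in C, the front-end check could look like the following. The bit number and all `DEMO_*`/`demo_*` names are assumptions made up for this example; no SUSPEND protocol feature bit has been assigned anywhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit number, chosen only for this example. */
#define DEMO_PROTOCOL_F_SUSPEND 63

typedef enum {
    DEMO_STOP_RESET,          /* legacy path: assume stateless, reset on stop */
    DEMO_STOP_SUSPEND_RESUME  /* stateful path: SUSPEND now, RESUME later */
} DemoStopStrategy;

/* Decide how to stop the device from the negotiated protocol features. */
static DemoStopStrategy demo_pick_stop_strategy(uint64_t protocol_features)
{
    bool has_suspend = (protocol_features >> DEMO_PROTOCOL_F_SUSPEND) & 1;
    return has_suspend ? DEMO_STOP_SUSPEND_RESUME : DEMO_STOP_RESET;
}
```

Existing back-ends never offer the bit, so they keep taking the reset path without any change on their side.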
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-18  7:54             ` Eugenio Perez Martin
@ 2023-04-19 11:10               ` Hanna Czenczek
  2023-04-19 11:21                 ` Stefan Hajnoczi
  2023-04-20 10:44                 ` Eugenio Pérez
  0 siblings, 2 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-19 11:10 UTC (permalink / raw)
  To: Eugenio Perez Martin, Stefan Hajnoczi
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 18.04.23 09:54, Eugenio Perez Martin wrote:
> On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>>>>> from virtiofsd.
>>>>>>>
>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>>>>> is best to transfer it as a single binary blob after the streaming
>>>>>>> phase.  Because this method should be useful to other vhost-user
>>>>>>> implementations, too, it is introduced as a general-purpose addition to
>>>>>>> the protocol, not limited to vhost-user-fs.
>>>>>>>
>>>>>>> These are the additions to the protocol:
>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>>>>    This feature signals support for transferring state, and is added so
>>>>>>>    that migration can fail early when the back-end has no support.
>>>>>>>
>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>>>>    over which to transfer the state.  The front-end sends an FD to the
>>>>>>>    back-end into/from which it can write/read its state, and the back-end
>>>>>>>    can decide to either use it, or reply with a different FD for the
>>>>>>>    front-end to override the front-end's choice.
>>>>>>>    The front-end creates a simple pipe to transfer the state, but maybe
>>>>>>>    the back-end already has an FD into/from which it has to write/read
>>>>>>>    its state, in which case it will want to override the simple pipe.
>>>>>>>    Conversely, maybe in the future we find a way to have the front-end
>>>>>>>    get an immediate FD for the migration stream (in some cases), in which
>>>>>>>    case we will want to send this to the back-end instead of creating a
>>>>>>>    pipe.
>>>>>>>    Hence the negotiation: If one side has a better idea than a plain
>>>>>>>    pipe, we will want to use that.
>>>>>>>
>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>>>>    pipe (the end indicated by EOF), the front-end invokes this function
>>>>>>>    to verify success.  There is no in-band way (through the pipe) to
>>>>>>>    indicate failure, so we need to check explicitly.
>>>>>>>
>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>>>>> (which includes establishing the direction of transfer and migration
>>>>>>> phase), the sending side writes its data into the pipe, and the reading
>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>>>>> checking for integrity (i.e. errors during deserialization).
>>>>>>>
>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>> ---
>>>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>   4 files changed, 287 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>   } VhostSetConfigType;
>>>>>>>
>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>> +} VhostDeviceStateDirection;
>>>>>>> +
>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>> +} VhostDeviceStatePhase;
>>>>>> vDPA has:
>>>>>>
>>>>>>    /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>     *
>>>>>>     * After the return of ioctl the device must preserve all the necessary state
>>>>>>     * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>     * required for restoring in the future. The device must not change its
>>>>>>     * configuration after that point.
>>>>>>     */
>>>>>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>>>>
>>>>>>    /* Resume a device so it can resume processing virtqueue requests
>>>>>>     *
>>>>>>     * After the return of this ioctl the device will have restored all the
>>>>>>     * necessary states and it is fully operational to continue processing the
>>>>>>     * virtqueue descriptors.
>>>>>>     */
>>>>>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>>>>
>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>> overlapping/duplicated functionality.
>>>>>>
>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>> to SUSPEND.
>>>>>
>>>>> Generally it is better if we make the interface less parametrized and
>>>>> we trust in the messages and its semantics in my opinion. In other
>>>>> words, instead of
>>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
>>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
>>>>>
>>>>> Another way to apply this is with the "direction" parameter. Maybe it
>>>>> is better to split it into "set_state_fd" and "get_state_fd"?
>>>>>
>>>>> In that case, reusing the ioctls as vhost-user messages would be ok.
>>>>> But that puts this proposal further from the VFIO code, which uses
>>>>> "migration_set_state(state)", and maybe it is better when the number
>>>>> of states is high.
>>>> Hi Eugenio,
>>>> Another question about vDPA suspend/resume:
>>>>
>>>>    /* Host notifiers must be enabled at this point. */
>>>>    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>>>>    {
>>>>        int i;
>>>>
>>>>        /* should only be called after backend is connected */
>>>>        assert(hdev->vhost_ops);
>>>>        event_notifier_test_and_clear(
>>>>            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
>>>>        event_notifier_test_and_clear(&vdev->config_notifier);
>>>>
>>>>        trace_vhost_dev_stop(hdev, vdev->name, vrings);
>>>>
>>>>        if (hdev->vhost_ops->vhost_dev_start) {
>>>>            hdev->vhost_ops->vhost_dev_start(hdev, false);
>>>>            ^^^ SUSPEND ^^^
>>>>        }
>>>>        if (vrings) {
>>>>            vhost_dev_set_vring_enable(hdev, false);
>>>>        }
>>>>        for (i = 0; i < hdev->nvqs; ++i) {
>>>>            vhost_virtqueue_stop(hdev,
>>>>                                 vdev,
>>>>                                 hdev->vqs + i,
>>>>                                 hdev->vq_index + i);
>>>>          ^^^ fetch virtqueue state from kernel ^^^
>>>>        }
>>>>        if (hdev->vhost_ops->vhost_reset_status) {
>>>>            hdev->vhost_ops->vhost_reset_status(hdev);
>>>>            ^^^ reset device^^^
>>>>
>>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
>>>> vhost_reset_status(). The device's migration code runs after
>>>> vhost_dev_stop() and the state will have been lost.
>>>>
>>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
>>> qemu VirtIONet device model. This is for all vhost backends.
>>>
>>> Regarding the state like mac or mq configuration, SVQ runs for all the
>>> VM run in the CVQ. So it can track all of that status in the device
>>> model too.
>>>
>>> When a migration effectively occurs, all the frontend state is
>>> migrated as a regular emulated device. To route all of the state in a
>>> normalized way for qemu is what leaves open the possibility to do
>>> cross-backends migrations, etc.
>>>
>>> Does that answer your question?
>> I think you're confirming that changes would be necessary in order for
>> vDPA to support the save/load operation that Hanna is introducing.
>>
> Yes, this first iteration was centered on net, with an eye on block,
> where state can be routed through classical emulated devices. This is
> how vhost-kernel and vhost-user do classically. And it allows
> cross-backend, to not modify qemu migration state, etc.
>
> To introduce this opaque state to qemu, that must be fetched after the
> suspend and not before, requires changes in vhost protocol, as
> discussed previously.
>
>>>> It looks like vDPA changes are necessary in order to support stateful
>>>> devices even though QEMU already uses SUSPEND. Is my understanding
>>>> correct?
>>>>
>>> Changes are required elsewhere, as the code to restore the state
>>> properly in the destination has not been merged.
>> I'm not sure what you mean by elsewhere?
>>
> I meant for vdpa *net* devices the changes are not required in vdpa
> ioctls, but mostly in qemu.
>
> If you meant stateful as "it must have a state blob that must be
> opaque to qemu", then I think the straightforward action is to fetch
> the state blob at about the same time as the vq indexes. But yes, changes
> (at least a new ioctl) are needed for that.
>
>> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
>> then VHOST_VDPA_SET_STATUS 0.
>>
>> In order to save device state from the vDPA device in the future, it
>> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
>> the device state can be saved before the device is reset.
>>
>> Does that sound right?
>>
> The split between suspend and reset was added recently for that very
> reason. In all the virtio devices, the frontend is initialized before
> the backend, so I don't think it is a good idea to defer the backend
> cleanup. Especially if we have already established that the state is
> small enough not to need iterative migration from virtiofsd's point of view.
>
> If fetching that state at the same time as vq indexes is not valid,
> could it follow the same model as the "in-flight descriptors"?
> vhost-user follows them by using a shared memory region where their
> state is tracked [1]. This allows qemu to survive vhost-user SW
> backend crashes, and does not forbid the cross-backends live migration
> as all the information is there to recover them.
>
> For hw devices this is not convenient as it occupies PCI bandwidth. So
> a possibility is to synchronize this memory region after a
> synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> devices are not going to crash in the software sense, so all use cases
> remain the same to qemu. And that shared memory information is
> recoverable after vhost_dev_stop.
>
> Does that sound reasonable to virtiofsd? To offer a shared memory
> region where it dumps the state, maybe only after the
> set_state(STATE_PHASE_STOPPED)?
I don’t think we need the set_state() call, necessarily, if SUSPEND is 
mandatory anyway.
As for the shared memory, the RFC before this series used shared memory, 
so it’s possible, yes.  But “shared memory region” can mean a lot of 
things – it sounds like you’re saying the back-end (virtiofsd) should 
provide it to the front-end, is that right?  That could work like this:
On the source side:
S1. SUSPEND goes to virtiofsd
S2. virtiofsd maybe double-checks that the device is stopped, then 
serializes its state into a newly allocated shared memory area[1]
S3. virtiofsd responds to SUSPEND
S4. front-end requests shared memory, virtiofsd responds with a handle, 
maybe already closes its reference
S5. front-end saves state, closes its handle, freeing the SHM
[1] Maybe virtiofsd can determine the serialized state’s size up front, in 
which case it can immediately allocate this area and serialize directly 
into it; if it can’t, we’ll need a bounce buffer.  Not really a 
fundamental problem, but there are limitations around what you can do 
with serde implementations in Rust…
On the destination side:
D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; 
virtiofsd would serialize its empty state into an SHM area, and respond 
to SUSPEND
D2. front-end reads state from migration stream into an SHM it has allocated
D3. front-end supplies this SHM to virtiofsd, which discards its 
previous area, and now uses this one
D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
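For illustration, the source-side hand-off (S2, S4, S5) could be sketched in C roughly as follows. This is only a single-process sketch under stated assumptions: an anonymous temporary file stands in for the SHM area, the `demo_*` functions are hypothetical, and passing the handle over the vhost-user socket is not modelled.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno(), ftruncate() */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* S2: the "back-end" serializes into a freshly sized area, returns its fd. */
static int demo_backend_export_state(const void *state, size_t len)
{
    FILE *f = tmpfile();            /* anonymous file standing in for SHM */
    if (!f) {
        return -1;
    }
    int fd = dup(fileno(f));        /* keep an fd that outlives the FILE */
    fclose(f);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map, state, len);
    munmap(map, len);               /* S4: back-end drops its own mapping */
    return fd;
}

/* S5: the "front-end" maps the handle, copies the state out, frees the area. */
static ssize_t demo_frontend_save_state(int fd, void *out, size_t cap)
{
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0 || (size_t)st.st_size > cap) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;  /* the size travels with the handle */
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(out, map, len);
    munmap(map, len);
    close(fd);                      /* last reference gone: area is freed */
    return (ssize_t)len;
}
```

On the source, the front-end can learn the size from the handle itself; on the destination the roles are reversed, which is where the sizing question arises.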
Couple of questions:
A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND 
would imply serializing the state, and the state is to be transferred 
through SHM, this is what would need to be done.  So maybe we should 
skip SUSPEND on the destination?
B. You described that the back-end should supply the SHM, which works 
well on the source.  On the destination, only the front-end knows how 
big the state is, so I’ve decided above that it should allocate the SHM 
(D2) and provide it to the back-end.  Is that feasible or is it 
important (e.g. for real hardware) that the back-end supplies the SHM?  
(In which case the front-end would need to tell the back-end how big the 
state SHM needs to be.)
Hanna
* Re: [Virtio-fs] [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-19 11:10                   ` Stefan Hajnoczi
@ 2023-04-19 11:15                     ` Hanna Czenczek
  2023-04-19 11:24                       ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-19 11:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Eugenio Perez Martin, Juan Quintela,
	Michael S . Tsirkin, qemu-devel, virtio-fs, Anton Kuchin
On 19.04.23 13:10, Stefan Hajnoczi wrote:
> On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 18.04.23 19:59, Stefan Hajnoczi wrote:
>>> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
>>>> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
>>>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>>>>>>>> from virtiofsd.
>>>>>>>>>>
>>>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>>>>>>>> is best to transfer it as a single binary blob after the streaming
>>>>>>>>>> phase.  Because this method should be useful to other vhost-user
>>>>>>>>>> implementations, too, it is introduced as a general-purpose addition to
>>>>>>>>>> the protocol, not limited to vhost-user-fs.
>>>>>>>>>>
>>>>>>>>>> These are the additions to the protocol:
>>>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>>>>>>>     This feature signals support for transferring state, and is added so
>>>>>>>>>>     that migration can fail early when the back-end has no support.
>>>>>>>>>>
>>>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>>>>>>>     over which to transfer the state.  The front-end sends an FD to the
>>>>>>>>>>     back-end into/from which it can write/read its state, and the back-end
>>>>>>>>>>     can decide to either use it, or reply with a different FD for the
>>>>>>>>>>     front-end to override the front-end's choice.
>>>>>>>>>>     The front-end creates a simple pipe to transfer the state, but maybe
>>>>>>>>>>     the back-end already has an FD into/from which it has to write/read
>>>>>>>>>>     its state, in which case it will want to override the simple pipe.
>>>>>>>>>>     Conversely, maybe in the future we find a way to have the front-end
>>>>>>>>>>     get an immediate FD for the migration stream (in some cases), in which
>>>>>>>>>>     case we will want to send this to the back-end instead of creating a
>>>>>>>>>>     pipe.
>>>>>>>>>>     Hence the negotiation: If one side has a better idea than a plain
>>>>>>>>>>     pipe, we will want to use that.
>>>>>>>>>>
>>>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>>>>>>>     pipe (the end indicated by EOF), the front-end invokes this function
>>>>>>>>>>     to verify success.  There is no in-band way (through the pipe) to
>>>>>>>>>>     indicate failure, so we need to check explicitly.
>>>>>>>>>>
>>>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>>>>>>>> (which includes establishing the direction of transfer and migration
>>>>>>>>>> phase), the sending side writes its data into the pipe, and the reading
>>>>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>>>>>>>> checking for integrity (i.e. errors during deserialization).
>>>>>>>>>>
>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>    include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>>>>    hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>>>>    4 files changed, 287 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>>>>    } VhostSetConfigType;
>>>>>>>>>>
>>>>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>>>>> +} VhostDeviceStateDirection;
>>>>>>>>>> +
>>>>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>>>>> +} VhostDeviceStatePhase;
>>>>>>>>> vDPA has:
>>>>>>>>>
>>>>>>>>>     /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>>>>      *
>>>>>>>>>      * After the return of ioctl the device must preserve all the necessary state
>>>>>>>>>      * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>>>>      * required for restoring in the future. The device must not change its
>>>>>>>>>      * configuration after that point.
>>>>>>>>>      */
>>>>>>>>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>>>>>>>
>>>>>>>>>     /* Resume a device so it can resume processing virtqueue requests
>>>>>>>>>      *
>>>>>>>>>      * After the return of this ioctl the device will have restored all the
>>>>>>>>>      * necessary states and it is fully operational to continue processing the
>>>>>>>>>      * virtqueue descriptors.
>>>>>>>>>      */
>>>>>>>>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>>>>>>>
>>>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>>>>> overlapping/duplicated functionality.
>>>>>>>>>
>>>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>>>>> to SUSPEND.
>>>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
>>>>>>> ioctl(VHOST_VDPA_RESUME).
>>>>>>>
>>>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
>>>>>>> leave the suspended state. Can you clarify this?
>>>>>>>
>>>>>> Do you mean in what situations or regarding the semantics of _RESUME?
>>>>>>
>>>>>> To me resume is an operation mainly to resume the device in the event
>>>>>> of a VM suspension, not a migration. It can be used as a fallback code
>>>>>> in some cases of migration failure though, but it is not currently
>>>>>> used in qemu.
>>>>> Is a "VM suspension" the QEMU HMP 'stop' command?
>>>>>
>>>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
>>>>> resets the device in vhost_dev_stop()?
>>>>>
>>>> The actual reason for not using RESUME is that the ioctl was added
>>>> after the SUSPEND design in qemu. Same as this proposal, it was not
>>>> needed at the time.
>>>>
>>>> In the case of vhost-vdpa net, the only usage of suspend is to fetch
>>>> the vq indexes, and in case of error vhost already fetches them from
>>>> guest's used ring way before vDPA, so it has little usage.
>>>>
>>>>> Does it make sense to combine SUSPEND and RESUME with Hanna's
>>>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
>>>>> this:
>>>>> - Saving the device's state is done by SUSPEND followed by
>>>>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
>>>>> savevm command or migration failed), then RESUME is called to
>>>>> continue.
>>>> I think the previous steps make sense at vhost_dev_stop, not virtio
>>>> savevm handlers. To start spreading this logic to more places of qemu
>>>> can bring confusion.
>>> I don't think there is a way around extending the QEMU vhost's code
>>> model. The current model in QEMU's vhost code is that the backend is
>>> reset when the VM stops. This model worked fine for stateless devices
>>> but it doesn't work for stateful devices.
>>>
>>> Imagine a vdpa-gpu device: you cannot reset the device in
>>> vhost_dev_stop() and expect the GPU to continue working when
>>> vhost_dev_start() is called again because all its state has been lost.
>>> The guest driver will send requests that reference virtio-gpu
>>> resources that no longer exist.
>>>
>>> One solution is to save the device's state in vhost_dev_stop(). I think
>>> this is what you're suggesting. It requires keeping a copy of the state
>>> and then loading the state again in vhost_dev_start(). I don't think
>>> this approach should be used because it requires all stateful devices to
>>> support live migration (otherwise they break across HMP 'stop'/'cont').
>>> Also, the device state for some devices may be large and it would also
>>> become more complicated when iterative migration is added.
>>>
>>> Instead, I think the QEMU vhost code needs to be structured so that
>>> struct vhost_dev has a suspended state:
>>>
>>>           ,---------.
>>>           v         |
>>>     started ------> stopped
>>>       \   ^
>>>        \  |
>>>         -> suspended
>>>
>>> The device doesn't lose state when it enters the suspended state. It can
>>> be resumed again.
>>>
>>> This is why I think SUSPEND/RESUME need to be part of the solution.
>>> (It's also an argument for not including the phase argument in
>>> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
>>> vhost_dev_stop() separately from saving the device's state.)
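The three-state model quoted above can be written down as a small transition table in C. This is only a sketch: the `DemoDevState` enum and both helpers are hypothetical and do not exist in QEMU's `struct vhost_dev`; only the arrows drawn in the diagram are treated as allowed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states mirroring the diagram; not a QEMU type. */
typedef enum {
    DEMO_DEV_STOPPED,    /* device was reset, internal state is gone */
    DEMO_DEV_STARTED,    /* device is processing virtqueues */
    DEMO_DEV_SUSPENDED   /* device is frozen, internal state is retained */
} DemoDevState;

/* True only for the arrows that appear in the diagram. */
static bool demo_transition_allowed(DemoDevState from, DemoDevState to)
{
    switch (from) {
    case DEMO_DEV_STARTED:
        return to == DEMO_DEV_STOPPED || to == DEMO_DEV_SUSPENDED;
    case DEMO_DEV_STOPPED:
    case DEMO_DEV_SUSPENDED:
        return to == DEMO_DEV_STARTED;
    }
    return false;
}

/* Whether the device still holds its internal state in a given state. */
static bool demo_state_retained(DemoDevState s)
{
    return s != DEMO_DEV_STOPPED;
}
```

A direct suspended-to-stopped arrow is not drawn in the diagram, so the sketch rejects it; whether it should exist is left open here.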
>> So let me ask if I understand this protocol correctly: Basically,
>> SUSPEND would ask the device to fully serialize its internal state,
>> retain it in some buffer, and RESUME would then deserialize the state
>> from the buffer, right?
> That's not how I understand SUSPEND/RESUME. I was thinking that
> SUSPEND pauses device operation so that virtqueues are no longer
> processed and no other events occur (e.g. VIRTIO Configuration Change
> Notifications). RESUME continues device operation. Neither command is
> directly related to device state serialization but SUSPEND freezes the
> device state, while RESUME allows the device state to change again.
I understood that a reset would basically reset all internal state, 
which is why SUSPEND+RESUME were required around it, to retain the state.
>> While this state needn’t necessarily be immediately migratable, I
>> suppose (e.g. one could retain file descriptors there, and it doesn’t
>> need to be a serialized byte buffer, but could still be structured), it
>> would basically be a live migration implementation already.  As far as I
>> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
>> anything but live migration, right?
> No, SUSPEND/RESUME would also be used across vm_stop()/vm_start().
> That way stateful devices are no longer reset across HMP 'stop'/'cont'
> (we're lucky it even works for most existing vhost-user backends today
> and that's just because they don't yet implement
> VHOST_USER_SET_STATUS).
So that’s what I seem to misunderstand: If stateful devices are reset, 
how does SUSPEND+RESUME prevent that?
>> I wonder how that model would then work with iterative migration,
>> though.  Basically, for non-iterative migration, the back-end would
>> expect SUSPEND first to flush its state out to a buffer, and then the
>> state transfer would just copy from that buffer.  For iterative
>> migration, though, there is no SUSPEND first, so the back-end must
>> implicitly begin to serialize its state and send it over.  I find that a
>> bit strange.
> I expected SET_DEVICE_STATE_FD to be sent while the device is still
> running for iterative migration. Device state chunks are saved while
> the device is still operating.
>
> When the VMM decides to stop the guest, it sends SUSPEND to freeze the
> device. The remainder of the device state can then be read from the fd
> in the knowledge that the size is now finite.
>
> After migration completes, the device is still suspended on the
> source. If migration failed, RESUME is sent to continue running on the
> source.
Sure, that makes perfect sense as long as SUSPEND/RESUME are unrelated 
to device state serialization.
>> Also, how would this work with currently migratable stateless
>> back-ends?  Do they already implement SUSPEND+RESUME as no-ops?  If not,
>> I think we should detect stateless back-ends and skip the operations in
>> qemu lest we have to update those back-ends for no real reason.
> Yes, I think backwards compatibility is a requirement, too. The
> vhost-user frontend checks the SUSPEND vhost-user protocol feature
> bit. If the bit is cleared, then it must assume this device is
> stateless and use device reset operations. Otherwise it can use
> SUSPEND/RESUME.
Yes, all stateful devices should currently block migration, so we could 
require them to implement SUSPEND/RESUME, and assume that any that don’t 
are stateless.
Hanna
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-19 11:10               ` Hanna Czenczek
@ 2023-04-19 11:21                 ` Stefan Hajnoczi
  2023-04-19 11:24                   ` Hanna Czenczek
  2023-04-20 13:29                   ` Eugenio Pérez
  2023-04-20 10:44                 ` Eugenio Pérez
  1 sibling, 2 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-19 11:21 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Eugenio Perez Martin, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> >>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>>>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>>>>> from virtiofsd.
> >>>>>>>
> >>>>>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>>>>> is best to transfer it as a single binary blob after the streaming
> >>>>>>> phase.  Because this method should be useful to other vhost-user
> >>>>>>> implementations, too, it is introduced as a general-purpose addition to
> >>>>>>> the protocol, not limited to vhost-user-fs.
> >>>>>>>
> >>>>>>> These are the additions to the protocol:
> >>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>>>>    This feature signals support for transferring state, and is added so
> >>>>>>>    that migration can fail early when the back-end has no support.
> >>>>>>>
> >>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>>>>    over which to transfer the state.  The front-end sends an FD to the
> >>>>>>>    back-end into/from which it can write/read its state, and the back-end
> >>>>>>>    can decide to either use it, or reply with a different FD for the
> >>>>>>>    front-end to override the front-end's choice.
> >>>>>>>    The front-end creates a simple pipe to transfer the state, but maybe
> >>>>>>>    the back-end already has an FD into/from which it has to write/read
> >>>>>>>    its state, in which case it will want to override the simple pipe.
> >>>>>>>    Conversely, maybe in the future we find a way to have the front-end
> >>>>>>>    get an immediate FD for the migration stream (in some cases), in which
> >>>>>>>    case we will want to send this to the back-end instead of creating a
> >>>>>>>    pipe.
> >>>>>>>    Hence the negotiation: If one side has a better idea than a plain
> >>>>>>>    pipe, we will want to use that.
> >>>>>>>
> >>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>>>>    pipe (the end indicated by EOF), the front-end invokes this function
> >>>>>>>    to verify success.  There is no in-band way (through the pipe) to
> >>>>>>>    indicate failure, so we need to check explicitly.
> >>>>>>>
> >>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>>>>> (which includes establishing the direction of transfer and migration
> >>>>>>> phase), the sending side writes its data into the pipe, and the reading
> >>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>>>>> checking for integrity (i.e. errors during deserialization).
> >>>>>>>
> >>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>> ---
> >>>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
> >>>>>>>   4 files changed, 287 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>>>>> index ec3fbae58d..5935b32fe3 100644
> >>>>>>> --- a/include/hw/virtio/vhost-backend.h
> >>>>>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>>>>   } VhostSetConfigType;
> >>>>>>>
> >>>>>>> +typedef enum VhostDeviceStateDirection {
> >>>>>>> +    /* Transfer state from back-end (device) to front-end */
> >>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>>>>> +    /* Transfer state from front-end to back-end (device) */
> >>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>>>>> +} VhostDeviceStateDirection;
> >>>>>>> +
> >>>>>>> +typedef enum VhostDeviceStatePhase {
> >>>>>>> +    /* The device (and all its vrings) is stopped */
> >>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>>>>> +} VhostDeviceStatePhase;
> >>>>>> vDPA has:
> >>>>>>
> >>>>>>    /* Suspend a device so it does not process virtqueue requests anymore
> >>>>>>     *
> >>>>>>     * After the return of ioctl the device must preserve all the necessary state
> >>>>>>     * (the virtqueue vring base plus the possible device specific states) that is
> >>>>>>     * required for restoring in the future. The device must not change its
> >>>>>>     * configuration after that point.
> >>>>>>     */
> >>>>>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>>>>
> >>>>>>    /* Resume a device so it can resume processing virtqueue requests
> >>>>>>     *
> >>>>>>     * After the return of this ioctl the device will have restored all the
> >>>>>>     * necessary states and it is fully operational to continue processing the
> >>>>>>     * virtqueue descriptors.
> >>>>>>     */
> >>>>>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>>>>
> >>>>>> I wonder if it makes sense to import these into vhost-user so that the
> >>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>>>>> if one of them is ahead of the other, but it would be nice to avoid
> >>>>>> overlapping/duplicated functionality.
> >>>>>>
> >>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
> >>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> >>>>> to SUSPEND.
> >>>>>
> >>>>> Generally it is better if we make the interface less parametrized and
> >>>>> we trust in the messages and its semantics in my opinion. In other
> >>>>> words, instead of
> >>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> >>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> >>>>>
> >>>>> Another way to apply this is with the "direction" parameter. Maybe it
> >>>>> is better to split it into "set_state_fd" and "get_state_fd"?
> >>>>>
> >>>>> In that case, reusing the ioctls as vhost-user messages would be ok.
> >>>>> But that puts this proposal further from the VFIO code, which uses
> >>>>> "migration_set_state(state)", and maybe it is better when the number
> >>>>> of states is high.
> >>>> Hi Eugenio,
> >>>> Another question about vDPA suspend/resume:
> >>>>
> >>>>    /* Host notifiers must be enabled at this point. */
> >>>>    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> >>>>    {
> >>>>        int i;
> >>>>
> >>>>        /* should only be called after backend is connected */
> >>>>        assert(hdev->vhost_ops);
> >>>>        event_notifier_test_and_clear(
> >>>>            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> >>>>        event_notifier_test_and_clear(&vdev->config_notifier);
> >>>>
> >>>>        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> >>>>
> >>>>        if (hdev->vhost_ops->vhost_dev_start) {
> >>>>            hdev->vhost_ops->vhost_dev_start(hdev, false);
> >>>>            ^^^ SUSPEND ^^^
> >>>>        }
> >>>>        if (vrings) {
> >>>>            vhost_dev_set_vring_enable(hdev, false);
> >>>>        }
> >>>>        for (i = 0; i < hdev->nvqs; ++i) {
> >>>>            vhost_virtqueue_stop(hdev,
> >>>>                                 vdev,
> >>>>                                 hdev->vqs + i,
> >>>>                                 hdev->vq_index + i);
> >>>>          ^^^ fetch virtqueue state from kernel ^^^
> >>>>        }
> >>>>        if (hdev->vhost_ops->vhost_reset_status) {
> >>>>            hdev->vhost_ops->vhost_reset_status(hdev);
> >>>>            ^^^ reset device^^^
> >>>>
> >>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> >>>> vhost_reset_status(). The device's migration code runs after
> >>>> vhost_dev_stop() and the state will have been lost.
> >>>>
> >>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> >>> qemu VirtIONet device model. This is for all vhost backends.
> >>>
> >>> Regarding the state like mac or mq configuration, SVQ runs for all the
> >>> VM run in the CVQ. So it can track all of that status in the device
> >>> model too.
> >>>
> >>> When a migration effectively occurs, all the frontend state is
> >>> migrated as a regular emulated device. To route all of the state in a
> >>> normalized way for qemu is what leaves open the possibility to do
> >>> cross-backends migrations, etc.
> >>>
> >>> Does that answer your question?
> >> I think you're confirming that changes would be necessary in order for
> >> vDPA to support the save/load operation that Hanna is introducing.
> >>
> > Yes, this first iteration was centered on net, with an eye on block,
> > where state can be routed through classical emulated devices. This is
> > how vhost-kernel and vhost-user classically do it. It also allows
> > cross-backend migration, leaves the qemu migration state unmodified, etc.
> >
> > To introduce this opaque state to qemu, that must be fetched after the
> > suspend and not before, requires changes in vhost protocol, as
> > discussed previously.
> >
> >>>> It looks like vDPA changes are necessary in order to support stateful
> >>>> devices even though QEMU already uses SUSPEND. Is my understanding
> >>>> correct?
> >>>>
> >>> Changes are required elsewhere, as the code to restore the state
> >>> properly in the destination has not been merged.
> >> I'm not sure what you mean by elsewhere?
> >>
> > I meant for vdpa *net* devices the changes are not required in vdpa
> > ioctls, but mostly in qemu.
> >
> > If you meant stateful as "it must have a state blob that must be
> > opaque to qemu", then I think the straightforward action is to fetch
> > the state blob at about the same time as the vq indexes. But yes,
> > changes (at least a new ioctl) are needed for that.
> >
> >> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> >> then VHOST_VDPA_SET_STATUS 0.
> >>
> >> In order to save device state from the vDPA device in the future, it
> >> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> >> the device state can be saved before the device is reset.
> >>
> >> Does that sound right?
> >>
> > The split between suspend and reset was added recently for that very
> > reason. In all the virtio devices, the frontend is initialized before
> > the backend, so I don't think it is a good idea to defer the backend
> > cleanup. Especially since we have already established that the state is
> > small enough not to need iterative migration from virtiofsd's point of view.
> >
> > If fetching that state at the same time as vq indexes is not valid,
> > could it follow the same model as the "in-flight descriptors"?
> > vhost-user follows them by using a shared memory region where their
> > state is tracked [1]. This allows qemu to survive vhost-user SW
> > backend crashes, and does not forbid the cross-backends live migration
> > as all the information is there to recover them.
> >
> > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > a possibility is to synchronize this memory region after a
> > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > devices are not going to crash in the software sense, so all use cases
> > remain the same to qemu. And that shared memory information is
> > recoverable after vhost_dev_stop.
> >
> > Does that sound reasonable to virtiofsd? To offer a shared memory
> > region where it dumps the state, maybe only after the
> > set_state(STATE_PHASE_STOPPED)?
>
> I don’t think we need the set_state() call, necessarily, if SUSPEND is
> mandatory anyway.
>
> As for the shared memory, the RFC before this series used shared memory,
> so it’s possible, yes.  But “shared memory region” can mean a lot of
> things – it sounds like you’re saying the back-end (virtiofsd) should
> provide it to the front-end, is that right?  That could work like this:
>
> On the source side:
>
> S1. SUSPEND goes to virtiofsd
> S2. virtiofsd maybe double-checks that the device is stopped, then
> serializes its state into a newly allocated shared memory area[1]
> S3. virtiofsd responds to SUSPEND
> S4. front-end requests shared memory, virtiofsd responds with a handle,
> maybe already closes its reference
> S5. front-end saves state, closes its handle, freeing the SHM
>
> [1] Maybe virtiofsd can determine the serialized state’s size up front,
> in which case it can immediately allocate this area and serialize directly
> into it; if it can’t, we’ll need a bounce buffer.  Not really a
> fundamental problem, but there are limitations around what you can do
> with serde implementations in Rust…
>
> On the destination side:
>
> D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> virtiofsd would serialize its empty state into an SHM area, and respond
> to SUSPEND
> D2. front-end reads state from migration stream into an SHM it has allocated
> D3. front-end supplies this SHM to virtiofsd, which discards its
> previous area, and now uses this one
> D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
>
> Couple of questions:
>
> A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> would imply deserializing a state, and the state is to be transferred
> through SHM, this is what would need to be done.  So maybe we should
> skip SUSPEND on the destination?
> B. You described that the back-end should supply the SHM, which works
> well on the source.  On the destination, only the front-end knows how
> big the state is, so I’ve decided above that it should allocate the SHM
> (D2) and provide it to the back-end.  Is that feasible or is it
> important (e.g. for real hardware) that the back-end supplies the SHM?
> (In which case the front-end would need to tell the back-end how big the
> state SHM needs to be.)
How does this work for iterative live migration?
Stefan
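
For readers following the SHM variant quoted above, here is a minimal sketch of the source-side steps S2, S4 and S5 in a single process.  It is only an illustration under stated assumptions: the function names are hypothetical, an unnamed temporary file from tmpfile() stands in for a real shared memory area, and none of this is taken from qemu or virtiofsd.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * S2 (hypothetical): the back-end serializes its state into a newly
 * allocated area and returns an FD as the handle, or -1 on error.
 * Assumption: a tmpfile()-backed mapping stands in for real SHM.
 */
static int backend_serialize_to_shm(const void *state, size_t len)
{
    FILE *f;
    int fd;
    void *map;

    assert(state != NULL && len > 0);
    f = tmpfile();
    if (!f) {
        return -1;
    }
    fd = dup(fileno(f));            /* keep the area alive past fclose() */
    fclose(f);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map, state, len);        /* the "serialization" is a plain copy */
    munmap(map, len);               /* S4: back-end drops its own mapping */
    return fd;
}

/*
 * S5 (hypothetical): the front-end maps the handle, copies the state out
 * (standing in for writing it to the migration stream), and closes the
 * handle, which frees the area.  Returns the state size, or -1 on error.
 */
static ssize_t frontend_save_from_shm(int fd, void *out, size_t out_cap)
{
    off_t size = lseek(fd, 0, SEEK_END);
    void *map;

    if (size <= 0 || (size_t)size > out_cap) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, (size_t)size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(out, map, (size_t)size);
    munmap(map, (size_t)size);
    close(fd);
    return (ssize_t)size;
}

/* Runs S2 through S5 back to back; returns 0 if the state round-trips. */
static int shm_roundtrip_demo(const char *state, size_t len)
{
    char copy[256];
    int fd;
    ssize_t n;

    if (len == 0 || len > sizeof(copy)) {
        return -1;
    }
    fd = backend_serialize_to_shm(state, len);
    if (fd < 0) {
        return -1;
    }
    n = frontend_save_from_shm(fd, copy, sizeof(copy));
    if (n < 0 || (size_t)n != len) {
        return -1;
    }
    return memcmp(copy, state, len) == 0 ? 0 : -1;
}
```

Note that the area is sized once, before the front-end ever sees it, so this sketch covers only a single, complete state snapshot.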
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-19 11:21                 ` Stefan Hajnoczi
@ 2023-04-19 11:24                   ` Hanna Czenczek
  2023-04-20 13:29                   ` Eugenio Pérez
  1 sibling, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-04-19 11:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eugenio Perez Martin, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 19.04.23 13:21, Stefan Hajnoczi wrote:
> On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 18.04.23 09:54, Eugenio Perez Martin wrote:
>>> On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
>>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>>>>>>> from virtiofsd.
>>>>>>>>>
>>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>>>>>>> is best to transfer it as a single binary blob after the streaming
>>>>>>>>> phase.  Because this method should be useful to other vhost-user
>>>>>>>>> implementations, too, it is introduced as a general-purpose addition to
>>>>>>>>> the protocol, not limited to vhost-user-fs.
>>>>>>>>>
>>>>>>>>> These are the additions to the protocol:
>>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>>>>>>     This feature signals support for transferring state, and is added so
>>>>>>>>>     that migration can fail early when the back-end has no support.
>>>>>>>>>
>>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>>>>>>     over which to transfer the state.  The front-end sends an FD to the
>>>>>>>>>     back-end into/from which it can write/read its state, and the back-end
>>>>>>>>>     can decide to either use it, or reply with a different FD for the
>>>>>>>>>     front-end to override the front-end's choice.
>>>>>>>>>     The front-end creates a simple pipe to transfer the state, but maybe
>>>>>>>>>     the back-end already has an FD into/from which it has to write/read
>>>>>>>>>     its state, in which case it will want to override the simple pipe.
>>>>>>>>>     Conversely, maybe in the future we find a way to have the front-end
>>>>>>>>>     get an immediate FD for the migration stream (in some cases), in which
>>>>>>>>>     case we will want to send this to the back-end instead of creating a
>>>>>>>>>     pipe.
>>>>>>>>>     Hence the negotiation: If one side has a better idea than a plain
>>>>>>>>>     pipe, we will want to use that.
>>>>>>>>>
>>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>>>>>>     pipe (the end indicated by EOF), the front-end invokes this function
>>>>>>>>>     to verify success.  There is no in-band way (through the pipe) to
>>>>>>>>>     indicate failure, so we need to check explicitly.
>>>>>>>>>
>>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>>>>>>> (which includes establishing the direction of transfer and migration
>>>>>>>>> phase), the sending side writes its data into the pipe, and the reading
>>>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>>>>>>> checking for integrity (i.e. errors during deserialization).
>>>>>>>>>
>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>> ---
>>>>>>>>>    include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>>>    hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>>>    4 files changed, 287 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>>>    } VhostSetConfigType;
>>>>>>>>>
>>>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>>>> +} VhostDeviceStateDirection;
>>>>>>>>> +
>>>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>>>> +} VhostDeviceStatePhase;
>>>>>>>> vDPA has:
>>>>>>>>
>>>>>>>>     /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>>>      *
>>>>>>>>      * After the return of ioctl the device must preserve all the necessary state
>>>>>>>>      * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>>>      * required for restoring in the future. The device must not change its
>>>>>>>>      * configuration after that point.
>>>>>>>>      */
>>>>>>>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>>>>>>
>>>>>>>>     /* Resume a device so it can resume processing virtqueue requests
>>>>>>>>      *
>>>>>>>>      * After the return of this ioctl the device will have restored all the
>>>>>>>>      * necessary states and it is fully operational to continue processing the
>>>>>>>>      * virtqueue descriptors.
>>>>>>>>      */
>>>>>>>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>>>>>>
>>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>>>> overlapping/duplicated functionality.
>>>>>>>>
>>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>>>> to SUSPEND.
>>>>>>>
>>>>>>> Generally it is better if we make the interface less parametrized and
>>>>>>> we trust in the messages and its semantics in my opinion. In other
>>>>>>> words, instead of
>>>>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
>>>>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
>>>>>>>
>>>>>>> Another way to apply this is with the "direction" parameter. Maybe it
>>>>>>> is better to split it into "set_state_fd" and "get_state_fd"?
>>>>>>>
>>>>>>> In that case, reusing the ioctls as vhost-user messages would be ok.
>>>>>>> But that puts this proposal further from the VFIO code, which uses
>>>>>>> "migration_set_state(state)", and maybe it is better when the number
>>>>>>> of states is high.
>>>>>> Hi Eugenio,
>>>>>> Another question about vDPA suspend/resume:
>>>>>>
>>>>>>     /* Host notifiers must be enabled at this point. */
>>>>>>     void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>>>>>>     {
>>>>>>         int i;
>>>>>>
>>>>>>         /* should only be called after backend is connected */
>>>>>>         assert(hdev->vhost_ops);
>>>>>>         event_notifier_test_and_clear(
>>>>>>             &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
>>>>>>         event_notifier_test_and_clear(&vdev->config_notifier);
>>>>>>
>>>>>>         trace_vhost_dev_stop(hdev, vdev->name, vrings);
>>>>>>
>>>>>>         if (hdev->vhost_ops->vhost_dev_start) {
>>>>>>             hdev->vhost_ops->vhost_dev_start(hdev, false);
>>>>>>             ^^^ SUSPEND ^^^
>>>>>>         }
>>>>>>         if (vrings) {
>>>>>>             vhost_dev_set_vring_enable(hdev, false);
>>>>>>         }
>>>>>>         for (i = 0; i < hdev->nvqs; ++i) {
>>>>>>             vhost_virtqueue_stop(hdev,
>>>>>>                                  vdev,
>>>>>>                                  hdev->vqs + i,
>>>>>>                                  hdev->vq_index + i);
>>>>>>           ^^^ fetch virtqueue state from kernel ^^^
>>>>>>         }
>>>>>>         if (hdev->vhost_ops->vhost_reset_status) {
>>>>>>             hdev->vhost_ops->vhost_reset_status(hdev);
>>>>>>             ^^^ reset device^^^
>>>>>>
>>>>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
>>>>>> vhost_reset_status(). The device's migration code runs after
>>>>>> vhost_dev_stop() and the state will have been lost.
>>>>>>
>>>>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
>>>>> qemu VirtIONet device model. This is for all vhost backends.
>>>>>
>>>>> Regarding the state like mac or mq configuration, SVQ runs for all the
>>>>> VM run in the CVQ. So it can track all of that status in the device
>>>>> model too.
>>>>>
>>>>> When a migration effectively occurs, all the frontend state is
>>>>> migrated as a regular emulated device. To route all of the state in a
>>>>> normalized way for qemu is what leaves open the possibility to do
>>>>> cross-backends migrations, etc.
>>>>>
>>>>> Does that answer your question?
>>>> I think you're confirming that changes would be necessary in order for
>>>> vDPA to support the save/load operation that Hanna is introducing.
>>>>
>>> Yes, this first iteration was centered on net, with an eye on block,
>>> where state can be routed through classical emulated devices. This is
>>> how vhost-kernel and vhost-user classically do it. It also allows
>>> cross-backend migration, leaves the qemu migration state unmodified, etc.
>>>
>>> To introduce this opaque state to qemu, that must be fetched after the
>>> suspend and not before, requires changes in vhost protocol, as
>>> discussed previously.
>>>
>>>>>> It looks like vDPA changes are necessary in order to support stateful
>>>>>> devices even though QEMU already uses SUSPEND. Is my understanding
>>>>>> correct?
>>>>>>
>>>>> Changes are required elsewhere, as the code to restore the state
>>>>> properly in the destination has not been merged.
>>>> I'm not sure what you mean by elsewhere?
>>>>
>>> I meant for vdpa *net* devices the changes are not required in vdpa
>>> ioctls, but mostly in qemu.
>>>
>>> If you meant stateful as "it must have a state blob that must be
>>> opaque to qemu", then I think the straightforward action is to fetch
>>> the state blob at about the same time as the vq indexes. But yes,
>>> changes (at least a new ioctl) are needed for that.
>>>
>>>> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
>>>> then VHOST_VDPA_SET_STATUS 0.
>>>>
>>>> In order to save device state from the vDPA device in the future, it
>>>> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
>>>> the device state can be saved before the device is reset.
>>>>
>>>> Does that sound right?
>>>>
>>> The split between suspend and reset was added recently for that very
>>> reason. In all the virtio devices, the frontend is initialized before
>>> the backend, so I don't think it is a good idea to defer the backend
>>> cleanup. Especially since we have already established that the state is
>>> small enough not to need iterative migration from virtiofsd's point of view.
>>>
>>> If fetching that state at the same time as vq indexes is not valid,
>>> could it follow the same model as the "in-flight descriptors"?
>>> vhost-user follows them by using a shared memory region where their
>>> state is tracked [1]. This allows qemu to survive vhost-user SW
>>> backend crashes, and does not forbid the cross-backends live migration
>>> as all the information is there to recover them.
>>>
>>> For hw devices this is not convenient as it occupies PCI bandwidth. So
>>> a possibility is to synchronize this memory region after a
>>> synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
>>> devices are not going to crash in the software sense, so all use cases
>>> remain the same to qemu. And that shared memory information is
>>> recoverable after vhost_dev_stop.
>>>
>>> Does that sound reasonable to virtiofsd? To offer a shared memory
>>> region where it dumps the state, maybe only after the
>>> set_state(STATE_PHASE_STOPPED)?
>> I don’t think we need the set_state() call, necessarily, if SUSPEND is
>> mandatory anyway.
>>
>> As for the shared memory, the RFC before this series used shared memory,
>> so it’s possible, yes.  But “shared memory region” can mean a lot of
>> things – it sounds like you’re saying the back-end (virtiofsd) should
>> provide it to the front-end, is that right?  That could work like this:
>>
>> On the source side:
>>
>> S1. SUSPEND goes to virtiofsd
>> S2. virtiofsd maybe double-checks that the device is stopped, then
>> serializes its state into a newly allocated shared memory area[1]
>> S3. virtiofsd responds to SUSPEND
>> S4. front-end requests shared memory, virtiofsd responds with a handle,
>> maybe already closes its reference
>> S5. front-end saves state, closes its handle, freeing the SHM
>>
>> [1] Maybe virtiofsd can determine the serialized state’s size up front,
>> in which case it can immediately allocate this area and serialize directly
>> into it; if it can’t, we’ll need a bounce buffer.  Not really a
>> fundamental problem, but there are limitations around what you can do
>> with serde implementations in Rust…
>>
>> On the destination side:
>>
>> D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
>> virtiofsd would serialize its empty state into an SHM area, and respond
>> to SUSPEND
>> D2. front-end reads state from migration stream into an SHM it has allocated
>> D3. front-end supplies this SHM to virtiofsd, which discards its
>> previous area, and now uses this one
>> D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
>>
>> Couple of questions:
>>
>> A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
>> would imply deserializing a state, and the state is to be transferred
>> through SHM, this is what would need to be done.  So maybe we should
>> skip SUSPEND on the destination?
>> B. You described that the back-end should supply the SHM, which works
>> well on the source.  On the destination, only the front-end knows how
>> big the state is, so I’ve decided above that it should allocate the SHM
>> (D2) and provide it to the back-end.  Is that feasible or is it
>> important (e.g. for real hardware) that the back-end supplies the SHM?
>> (In which case the front-end would need to tell the back-end how big the
>> state SHM needs to be.)
> How does this work for iterative live migration?
Right, probably not at all. :)
Hanna
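
For comparison, the pipe-based transfer from the quoted commit message (the sender writes its state, EOF marks the end, and success is then checked explicitly) can be sketched roughly as below.  This is a single-process illustration under stated assumptions: all helper names are hypothetical, the final comparison merely stands in for CHECK_DEVICE_STATE, and the blob must fit into the pipe buffer because writer and reader run sequentially here.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write the whole blob, retrying on short writes. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: read until EOF into a malloc()ed buffer.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_until_eof(int fd, char **out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);

    if (!buf) {
        return -1;
    }
    for (;;) {
        ssize_t n;

        if (len == cap) {
            char *nbuf = realloc(buf, cap * 2);
            if (!nbuf) {
                free(buf);
                return -1;
            }
            buf = nbuf;
            cap *= 2;
        }
        n = read(fd, buf + len, cap - len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            free(buf);
            return -1;
        }
        if (n == 0) {
            break;                  /* EOF: the sender closed its end */
        }
        len += (size_t)n;
    }
    *out = buf;
    return (ssize_t)len;
}

/* Sends a small blob through a pipe; returns 0 if it arrives intact. */
static int transfer_state_demo(const char *state, size_t state_len)
{
    int fds[2];
    char *received = NULL;
    ssize_t n;
    int ok;

    assert(state != NULL);
    if (pipe(fds) < 0) {
        return -1;
    }
    /* Sending side: write the state, then close to signal EOF. */
    if (write_all(fds[1], state, state_len) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);

    /* Receiving side: read until EOF. */
    n = read_until_eof(fds[0], &received);
    close(fds[0]);
    if (n < 0) {
        return -1;
    }

    /* Stand-in for CHECK_DEVICE_STATE: there is no in-band failure signal. */
    ok = (size_t)n == state_len && memcmp(received, state, state_len) == 0;
    free(received);
    return ok ? 0 : -1;
}
```

In a real setup the two ends would live in different processes, and the explicit check would be a separate vhost-user message rather than a memcmp().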
* Re: [Virtio-fs] [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-19 11:15                     ` Hanna Czenczek
@ 2023-04-19 11:24                       ` Stefan Hajnoczi
  0 siblings, 0 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-04-19 11:24 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, Eugenio Perez Martin, Juan Quintela,
	Michael S . Tsirkin, qemu-devel, virtio-fs, Anton Kuchin
On Wed, 19 Apr 2023 at 07:16, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 19.04.23 13:10, Stefan Hajnoczi wrote:
> > On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote:
> >> On 18.04.23 19:59, Stefan Hajnoczi wrote:
> >>> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> >>>> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >>>>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> >>>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>>>>>>>> from virtiofsd.
> >>>>>>>>>>
> >>>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>>>>>>>> is best to transfer it as a single binary blob after the streaming
> >>>>>>>>>> phase.  Because this method should be useful to other vhost-user
> >>>>>>>>>> implementations, too, it is introduced as a general-purpose addition to
> >>>>>>>>>> the protocol, not limited to vhost-user-fs.
> >>>>>>>>>>
> >>>>>>>>>> These are the additions to the protocol:
> >>>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>>>>>>>     This feature signals support for transferring state, and is added so
> >>>>>>>>>>     that migration can fail early when the back-end has no support.
> >>>>>>>>>>
> >>>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>>>>>>>     over which to transfer the state.  The front-end sends an FD to the
> >>>>>>>>>>     back-end into/from which it can write/read its state, and the back-end
> >>>>>>>>>>     can decide to either use it, or reply with a different FD for the
> >>>>>>>>>>     front-end to override the front-end's choice.
> >>>>>>>>>>     The front-end creates a simple pipe to transfer the state, but maybe
> >>>>>>>>>>     the back-end already has an FD into/from which it has to write/read
> >>>>>>>>>>     its state, in which case it will want to override the simple pipe.
> >>>>>>>>>>     Conversely, maybe in the future we find a way to have the front-end
> >>>>>>>>>>     get an immediate FD for the migration stream (in some cases), in which
> >>>>>>>>>>     case we will want to send this to the back-end instead of creating a
> >>>>>>>>>>     pipe.
> >>>>>>>>>>     Hence the negotiation: If one side has a better idea than a plain
> >>>>>>>>>>     pipe, we will want to use that.
> >>>>>>>>>>
> >>>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>>>>>>>     pipe (the end indicated by EOF), the front-end invokes this function
> >>>>>>>>>>     to verify success.  There is no in-band way (through the pipe) to
> >>>>>>>>>>     indicate failure, so we need to check explicitly.
> >>>>>>>>>>
> >>>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>>>>>>>> (which includes establishing the direction of transfer and migration
> >>>>>>>>>> phase), the sending side writes its data into the pipe, and the reading
> >>>>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>>>>>>>> checking for integrity (i.e. errors during deserialization).
> >>>>>>>>>>
> >>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>>>>> ---
> >>>>>>>>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>>>>>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>>>>>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>>>>>>>    hw/virtio/vhost.c                 |  37 ++++++++
> >>>>>>>>>>    4 files changed, 287 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>>>>>>>> index ec3fbae58d..5935b32fe3 100644
> >>>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
> >>>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>>>>>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>>>>>>>    } VhostSetConfigType;
> >>>>>>>>>>
> >>>>>>>>>> +typedef enum VhostDeviceStateDirection {
> >>>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
> >>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
> >>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>>>>>>>> +} VhostDeviceStateDirection;
> >>>>>>>>>> +
> >>>>>>>>>> +typedef enum VhostDeviceStatePhase {
> >>>>>>>>>> +    /* The device (and all its vrings) is stopped */
> >>>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>>>>>>>> +} VhostDeviceStatePhase;
> >>>>>>>>> vDPA has:
> >>>>>>>>>
> >>>>>>>>>     /* Suspend a device so it does not process virtqueue requests anymore
> >>>>>>>>>      *
> >>>>>>>>>      * After the return of ioctl the device must preserve all the necessary state
> >>>>>>>>>      * (the virtqueue vring base plus the possible device specific states) that is
> >>>>>>>>>      * required for restoring in the future. The device must not change its
> >>>>>>>>>      * configuration after that point.
> >>>>>>>>>      */
> >>>>>>>>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>>>>>>>
> >>>>>>>>>     /* Resume a device so it can resume processing virtqueue requests
> >>>>>>>>>      *
> >>>>>>>>>      * After the return of this ioctl the device will have restored all the
> >>>>>>>>>      * necessary states and it is fully operational to continue processing the
> >>>>>>>>>      * virtqueue descriptors.
> >>>>>>>>>      */
> >>>>>>>>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>>>>>>>
> >>>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
> >>>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
> >>>>>>>>> overlapping/duplicated functionality.
> >>>>>>>>>
> >>>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
> >>>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> >>>>>>>> to SUSPEND.
> >>>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> >>>>>>> ioctl(VHOST_VDPA_RESUME).
> >>>>>>>
> >>>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
> >>>>>>> leave the suspended state. Can you clarify this?
> >>>>>>>
> >>>>>> Do you mean in what situations or regarding the semantics of _RESUME?
> >>>>>>
> >>>>>> To me resume is an operation mainly to resume the device in the event
> >>>>>> of a VM suspension, not a migration. It can be used as a fallback code
> >>>>>> in some cases of migration failure though, but it is not currently
> >>>>>> used in qemu.
> >>>>> Is a "VM suspension" the QEMU HMP 'stop' command?
> >>>>>
> >>>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
> >>>>> resets the device in vhost_dev_stop()?
> >>>>>
> >>>> The actual reason for not using RESUME is that the ioctl was added
> >>>> after the SUSPEND design in qemu. Same as this proposal, it was not
> >>>> needed at the time.
> >>>>
> >>>> In the case of vhost-vdpa net, the only usage of suspend is to fetch
> >>>> the vq indexes, and in case of error vhost already fetches them from
> >>>> guest's used ring way before vDPA, so it has little usage.
> >>>>
> >>>>> Does it make sense to combine SUSPEND and RESUME with Hanna's
> >>>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> >>>>> this:
> >>>>> - Saving the device's state is done by SUSPEND followed by
> >>>>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> >>>>> savevm command or migration failed), then RESUME is called to
> >>>>> continue.
> >>>> I think the previous steps make sense at vhost_dev_stop, not virtio
> >>>> savevm handlers. To start spreading this logic to more places of qemu
> >>>> can bring confusion.
> >>> I don't think there is a way around extending the QEMU vhost's code
> >>> model. The current model in QEMU's vhost code is that the backend is
> >>> reset when the VM stops. This model worked fine for stateless devices
> >>> but it doesn't work for stateful devices.
> >>>
> >>> Imagine a vdpa-gpu device: you cannot reset the device in
> >>> vhost_dev_stop() and expect the GPU to continue working when
> >>> vhost_dev_start() is called again because all its state has been lost.
> >>> The guest driver will send requests that reference virtio-gpu
> >>> resources that no longer exist.
> >>>
> >>> One solution is to save the device's state in vhost_dev_stop(). I think
> >>> this is what you're suggesting. It requires keeping a copy of the state
> >>> and then loading the state again in vhost_dev_start(). I don't think
> >>> this approach should be used because it requires all stateful devices to
> >>> support live migration (otherwise they break across HMP 'stop'/'cont').
> >>> Also, the device state for some devices may be large and it would also
> >>> become more complicated when iterative migration is added.
> >>>
> >>> Instead, I think the QEMU vhost code needs to be structured so that
> >>> struct vhost_dev has a suspended state:
> >>>
> >>>        ,---------.
> >>>        v         |
> >>>     started ------> stopped
> >>>       \   ^
> >>>        \  |
> >>>         -> suspended
> >>>
> >>> The device doesn't lose state when it enters the suspended state. It can
> >>> be resumed again.
> >>>
> >>> This is why I think SUSPEND/RESUME need to be part of the solution.
> >>> (It's also an argument for not including the phase argument in
> >>> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> >>> vhost_dev_stop() separately from saving the device's state.)
> >> So let me ask if I understand this protocol correctly: Basically,
> >> SUSPEND would ask the device to fully serialize its internal state,
> >> retain it in some buffer, and RESUME would then deserialize the state
> >> from the buffer, right?
> > That's not how I understand SUSPEND/RESUME. I was thinking that
> > SUSPEND pauses device operation so that virtqueues are no longer
> > processed and no other events occur (e.g. VIRTIO Configuration Change
> > Notifications). RESUME continues device operation. Neither command is
> > directly related to device state serialization but SUSPEND freezes the
> > device state, while RESUME allows the device state to change again.
>
> I understood that a reset would basically reset all internal state,
> which is why SUSPEND+RESUME were required around it, to retain the state.
The SUSPEND/RESUME operations I'm thinking of come directly from
<linux/vhost.h>:
/* Suspend a device so it does not process virtqueue requests anymore
 *
 * After the return of ioctl the device must preserve all the necessary state
 * (the virtqueue vring base plus the possible device specific states) that is
 * required for restoring in the future. The device must not change its
 * configuration after that point.
 */
#define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
/* Resume a device so it can resume processing virtqueue requests
 *
 * After the return of this ioctl the device will have restored all the
 * necessary states and it is fully operational to continue processing the
 * virtqueue descriptors.
 */
#define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >> While this state needn’t necessarily be immediately migratable, I
> >> suppose (e.g. one could retain file descriptors there, and it doesn’t
> >> need to be a serialized byte buffer, but could still be structured), it
> >> would basically be a live migration implementation already.  As far as I
> >> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
> >> anything but live migration, right?
> > No, SUSPEND/RESUME would also be used across vm_stop()/vm_start().
> > That way stateful devices are no longer reset across HMP 'stop'/'cont'
> > (we're lucky it even works for most existing vhost-user backends today
> > and that's just because they don't yet implement
> > VHOST_USER_SET_STATUS).
>
> So that’s what I seem to misunderstand: If stateful devices are reset,
> how does SUSPEND+RESUME prevent that?
The vhost-user frontend can check the VHOST_USER_PROTOCOL_F_SUSPEND
feature bit to determine that the backend supports SUSPEND/RESUME and
that mechanism should be used instead of resetting the device.
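Roughly like this; it is only a sketch, since VHOST_USER_PROTOCOL_F_SUSPEND
does not exist in the protocol yet, so the bit number and the helper names
below are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit number: the feature is only being proposed here. */
#define VHOST_USER_PROTOCOL_F_SUSPEND 63

typedef enum {
    STOP_BY_RESET,    /* current behaviour: device is reset on stop */
    STOP_BY_SUSPEND,  /* device is paused and keeps its state */
} StopMethod;

/* Test a single bit in the negotiated protocol feature mask. */
static bool has_protocol_feature(uint64_t features, unsigned int bit)
{
    return (features >> bit) & 1;
}

/* Pick how the front-end stops the back-end when the VM stops. */
static StopMethod choose_stop_method(uint64_t protocol_features)
{
    if (has_protocol_feature(protocol_features,
                             VHOST_USER_PROTOCOL_F_SUSPEND)) {
        return STOP_BY_SUSPEND;
    }
    return STOP_BY_RESET;
}
```

Back-ends that do not offer the bit keep today's reset behaviour, so nothing
changes for them.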
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-19 11:10               ` Hanna Czenczek
  2023-04-19 11:21                 ` Stefan Hajnoczi
@ 2023-04-20 10:44                 ` Eugenio Pérez
  1 sibling, 0 replies; 93+ messages in thread
From: Eugenio Pérez @ 2023-04-20 10:44 UTC (permalink / raw)
  To: Hanna Czenczek, Stefan Hajnoczi
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Wed, 2023-04-19 at 13:10 +0200, Hanna Czenczek wrote:
> On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > wrote:
> > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > wrote:
> > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > stefanha@redhat.com> wrote:
> > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > So-called "internal" virtio-fs migration refers to transporting
> > > > > > > > the
> > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > stream.  To do
> > > > > > > > this, we need to be able to transfer virtiofsd's internal state
> > > > > > > > to and
> > > > > > > > from virtiofsd.
> > > > > > > > 
> > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > believe it
> > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > streaming
> > > > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > addition to
> > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > 
> > > > > > > > These are the additions to the protocol:
> > > > > > > > - New vhost-user protocol feature
> > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > added so
> > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > support.
> > > > > > > > 
> > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate
> > > > > > > > a pipe
> > > > > > > >    over which to transfer the state.  The front-end sends an FD
> > > > > > > > to the
> > > > > > > >    back-end into/from which it can write/read its state, and the
> > > > > > > > back-end
> > > > > > > >    can decide to either use it, or reply with a different FD for
> > > > > > > > the
> > > > > > > >    front-end to override the front-end's choice.
> > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > but maybe
> > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > write/read
> > > > > > > >    its state, in which case it will want to override the simple
> > > > > > > > pipe.
> > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > front-end
> > > > > > > >    get an immediate FD for the migration stream (in some cases),
> > > > > > > > in which
> > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > creating a
> > > > > > > >    pipe.
> > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > plain
> > > > > > > >    pipe, we will want to use that.
> > > > > > > > 
> > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > through the
> > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > function
> > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > pipe) to
> > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > 
> > > > > > > > Once the transfer pipe has been established via
> > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > migration
> > > > > > > > phase), the sending side writes its data into the pipe, and the
> > > > > > > > reading
> > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > check for
> > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > includes
> > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > 
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > ---
> > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > >   } VhostSetConfigType;
> > > > > > > > 
> > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > +
> > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > vDPA has:
> > > > > > > 
> > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > anymore
> > > > > > >     *
> > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > necessary state
> > > > > > >     * (the virtqueue vring base plus the possible device specific
> > > > > > > states) that is
> > > > > > >     * required for restoring in the future. The device must not
> > > > > > > change its
> > > > > > >     * configuration after that point.
> > > > > > >     */
> > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > 
> > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > requests
> > > > > > >     *
> > > > > > >     * After the return of this ioctl the device will have restored
> > > > > > > all the
> > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > processing the
> > > > > > >     * virtqueue descriptors.
> > > > > > >     */
> > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > 
> > > > > > > I wonder if it makes sense to import these into vhost-user so that
> > > > > > > the
> > > > > > > difference between kernel vhost and vhost-user is minimized. It's
> > > > > > > okay
> > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > avoid
> > > > > > > overlapping/duplicated functionality.
> > > > > > > 
> > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > VHOST_STOP
> > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > > > to SUSPEND.
> > > > > > 
> > > > > > Generally it is better if we make the interface less parametrized
> > > > > > and
> > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > words, instead of
> > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > send
> > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > command.
> > > > > > 
> > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > it
> > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > 
> > > > > > In that case, reusing the ioctls as vhost-user messages would be ok.
> > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > "migration_set_state(state)", and maybe it is better when the number
> > > > > > of states is high.
> > > > > Hi Eugenio,
> > > > > Another question about vDPA suspend/resume:
> > > > > 
> > > > >    /* Host notifiers must be enabled at this point. */
> > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > >                        bool vrings)
> > > > >    {
> > > > >        int i;
> > > > > 
> > > > >        /* should only be called after backend is connected */
> > > > >        assert(hdev->vhost_ops);
> > > > >        event_notifier_test_and_clear(
> > > > > >            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > 
> > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > 
> > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > >            ^^^ SUSPEND ^^^
> > > > >        }
> > > > >        if (vrings) {
> > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > >        }
> > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > >            vhost_virtqueue_stop(hdev,
> > > > >                                 vdev,
> > > > >                                 hdev->vqs + i,
> > > > >                                 hdev->vq_index + i);
> > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > >        }
> > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > >            ^^^ reset device^^^
> > > > > 
> > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> > > > > vhost_reset_status(). The device's migration code runs after
> > > > > vhost_dev_stop() and the state will have been lost.
> > > > > 
> > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > 
> > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > VM run in the CVQ. So it can track all of that status in the device
> > > > model too.
> > > > 
> > > > When a migration effectively occurs, all the frontend state is
> > > > migrated as a regular emulated device. To route all of the state in a
> > > > normalized way for qemu is what leaves open the possibility to do
> > > > cross-backends migrations, etc.
> > > > 
> > > > Does that answer your question?
> > > I think you're confirming that changes would be necessary in order for
> > > vDPA to support the save/load operation that Hanna is introducing.
> > > 
> > Yes, this first iteration was centered on net, with an eye on block,
> > where state can be routed through classical emulated devices. This is
> > how vhost-kernel and vhost-user have classically done it. It also allows
> > cross-backend migration, avoids modifying the qemu migration state, etc.
> > 
> > To introduce this opaque state to qemu, that must be fetched after the
> > suspend and not before, requires changes in vhost protocol, as
> > discussed previously.
> > 
> > > > > It looks like vDPA changes are necessary in order to support stateful
> > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > correct?
> > > > > 
> > > > Changes are required elsewhere, as the code to restore the state
> > > > properly in the destination has not been merged.
> > > I'm not sure what you mean by elsewhere?
> > > 
> > I meant for vdpa *net* devices the changes are not required in vdpa
> > ioctls, but mostly in qemu.
> > 
> > If you meant stateful as "it must have a state blob that must be
> > opaque to qemu", then I think the straightforward action is to fetch
> > the state blob at about the same time as the vq indexes. But yes,
> > changes (at least a new ioctl) are needed for that.
> > 
> > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > then VHOST_VDPA_SET_STATUS 0.
> > > 
> > > In order to save device state from the vDPA device in the future, it
> > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > the device state can be saved before the device is reset.
> > > 
> > > Does that sound right?
> > > 
> > The split between suspend and reset was added recently for that very
> > reason. In all the virtio devices, the frontend is initialized before
> > the backend, so I don't think it is a good idea to defer the backend
> > cleanup. Especially if we have already established that the state is
> > small enough not to need iterative migration from virtiofsd's point
> > of view.
> > 
> > If fetching that state at the same time as vq indexes is not valid,
> > could it follow the same model as the "in-flight descriptors"?
> > vhost-user follows them by using a shared memory region where their
> > state is tracked [1]. This allows qemu to survive vhost-user SW
> > backend crashes, and does not forbid the cross-backends live migration
> > as all the information is there to recover them.
> > 
> > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > a possibility is to synchronize this memory region after a
> > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > devices are not going to crash in the software sense, so all use cases
> > remain the same to qemu. And that shared memory information is
> > recoverable after vhost_dev_stop.
> > 
> > Does that sound reasonable to virtiofsd? To offer a shared memory
> > region where it dumps the state, maybe only after the
> > set_state(STATE_PHASE_STOPPED)?
> 
> I don’t think we need the set_state() call, necessarily, if SUSPEND is 
> mandatory anyway.
> 
Right, I was taking them as interchangeable.
Note that I just put this on the table because it solves another use case
(transfer stateful devices + virtiofsd crash) with an interface that mimics
an existing one.  I don't want to block the pipe proposal at all.
> As for the shared memory, the RFC before this series used shared memory, 
> so it’s possible, yes.  But “shared memory region” can mean a lot of 
> things – it sounds like you’re saying the back-end (virtiofsd) should 
> provide it to the front-end, is that right?  That could work like this:
> 
inflight_fd provides both calls: VHOST_USER_SET_INFLIGHT_FD and
VHOST_USER_GET_INFLIGHT_FD.
> On the source side:
> 
> S1. SUSPEND goes to virtiofsd
> S2. virtiofsd maybe double-checks that the device is stopped, then 
> serializes its state into a newly allocated shared memory area[1]
> S3. virtiofsd responds to SUSPEND
> S4. front-end requests shared memory, virtiofsd responds with a handle, 
> maybe already closes its reference
> S5. front-end saves state, closes its handle, freeing the SHM
> 
> [1] Maybe virtiofsd can correctly size the serialized state’s size, then 
> it can immediately allocate this area and serialize directly into it; 
> maybe it can’t, then we’ll need a bounce buffer.  Not really a 
> fundamental problem, but there are limitations around what you can do 
> with serde implementations in Rust…
> 
I think shared memory regions can grow and shrink with ftruncate, but it
complicates the solution for sure.  I was under the impression it would be a
fixed amount of state, probably based on some actual limits in the vfs.  Now I
see it is not.
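To illustrate the ftruncate point, here is a minimal sketch of appending
serialized state to a region that grows on demand.  state_append() is a
made-up helper, not anything in qemu or virtiofsd, and the fd is assumed to
come from whatever backs the shared memory (memfd, shm_open, a temp file):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append len bytes to the region behind fd.
 * *used is the number of bytes already stored; it is updated on success.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
static int state_append(int fd, size_t *used, const void *data, size_t len)
{
    size_t new_size = *used + len;
    void *map;

    if (len == 0) {
        return 0;
    }
    /* Grow the backing object to make room for the new bytes. */
    if (ftruncate(fd, (off_t)new_size) < 0) {
        return -1;
    }
    /* Map the whole region shared, so the other side sees the data. */
    map = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        return -1;
    }
    memcpy((char *)map + *used, data, len);
    munmap(map, new_size);
    *used = new_size;
    return 0;
}
```

Remapping on every append is wasteful; it only shows that the size does not
need to be known up front, at the cost of the extra complexity mentioned above.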
> On the destination side:
> 
> D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; 
> virtiofsd would serialize its empty state into an SHM area, and respond 
> to SUSPEND
> D2. front-end reads state from migration stream into an SHM it has allocated
> D3. front-end supplies this SHM to virtiofsd, which discards its 
> previous area, and now uses this one
> D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> 
> Couple of questions:
> 
> A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND 
> would imply to deserialize a state, and the state is to be transferred 
> through SHM, this is what would need to be done.  So maybe we should 
> skip SUSPEND on the destination?
I think skipping SUSPEND is the best call, yes.
> B. You described that the back-end should supply the SHM, which works 
> well on the source.  On the destination, only the front-end knows how 
> big the state is, so I’ve decided above that it should allocate the SHM 
> (D2) and provide it to the back-end.  Is that feasible or is it 
> important (e.g. for real hardware) that the back-end supplies the SHM?  
> (In which case the front-end would need to tell the back-end how big the 
> state SHM needs to be.)
> 
It is feasible for sure.  I think that the best scenario is when the data has a
fixed size forever, like QueueRegionSplit and QueueRegionPacked.  If that is not
possible, I think the best option is to indicate the length of the data so the
device can fetch it in strides as large as possible.  Maybe other HW guys can
answer better here though.
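As a rough sketch of what "indicate the length" could look like: an 8-byte
little-endian length in front of the state, plus a helper to compute the next
fetch size.  The layout and all names here are invented for illustration;
nothing like this is defined in vhost-user or vDPA today.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 8-byte little-endian length, then the state blob. */
#define STATE_HDR_SIZE 8

/* Store len into hdr in little-endian byte order. */
static void state_hdr_encode(uint8_t hdr[STATE_HDR_SIZE], uint64_t len)
{
    for (int i = 0; i < STATE_HDR_SIZE; i++) {
        hdr[i] = (uint8_t)(len >> (8 * i));
    }
}

/* Read the little-endian length back from hdr. */
static uint64_t state_hdr_decode(const uint8_t hdr[STATE_HDR_SIZE])
{
    uint64_t len = 0;

    for (int i = 0; i < STATE_HDR_SIZE; i++) {
        len |= (uint64_t)hdr[i] << (8 * i);
    }
    return len;
}

/*
 * Size of the next fetch: as much as is left, capped at max_stride.
 * Returns 0 once everything has been fetched.
 */
static uint64_t next_stride(uint64_t total, uint64_t done, uint64_t max_stride)
{
    uint64_t left;

    if (done >= total) {
        return 0;
    }
    left = total - done;
    return left < max_stride ? left : max_stride;
}
```

With the length known up front, the reader can size every fetch itself instead
of probing for the end of the data.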
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-18 20:40                   ` Stefan Hajnoczi
@ 2023-04-20 13:27                     ` Eugenio Pérez
  2023-05-08 19:12                       ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Pérez @ 2023-04-20 13:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote:
> On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com>
> wrote:
> > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > wrote:
> > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <
> > > > > eperezma@redhat.com> wrote:
> > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com
> > > > > > > wrote:
> > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > wrote:
> > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek
> > > > > > > > > wrote:
> > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > transporting the
> > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > stream.  To do
> > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > state to and
> > > > > > > > > > from virtiofsd.
> > > > > > > > > > 
> > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > believe it
> > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > streaming
> > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > user
> > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > addition to
> > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > 
> > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > >   This feature signals support for transferring state, and
> > > > > > > > > > is added so
> > > > > > > > > >   that migration can fail early when the back-end has no
> > > > > > > > > > support.
> > > > > > > > > > 
> > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > negotiate a pipe
> > > > > > > > > >   over which to transfer the state.  The front-end sends an
> > > > > > > > > > FD to the
> > > > > > > > > >   back-end into/from which it can write/read its state, and
> > > > > > > > > > the back-end
> > > > > > > > > >   can decide to either use it, or reply with a different FD
> > > > > > > > > > for the
> > > > > > > > > >   front-end to override the front-end's choice.
> > > > > > > > > >   The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > but maybe
> > > > > > > > > >   the back-end already has an FD into/from which it has to
> > > > > > > > > > write/read
> > > > > > > > > >   its state, in which case it will want to override the
> > > > > > > > > > simple pipe.
> > > > > > > > > >   Conversely, maybe in the future we find a way to have the
> > > > > > > > > > front-end
> > > > > > > > > >   get an immediate FD for the migration stream (in some
> > > > > > > > > > cases), in which
> > > > > > > > > >   case we will want to send this to the back-end instead of
> > > > > > > > > > creating a
> > > > > > > > > >   pipe.
> > > > > > > > > >   Hence the negotiation: If one side has a better idea than
> > > > > > > > > > a plain
> > > > > > > > > >   pipe, we will want to use that.
> > > > > > > > > > 
> > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > through the
> > > > > > > > > >   pipe (the end indicated by EOF), the front-end invokes
> > > > > > > > > > this function
> > > > > > > > > >   to verify success.  There is no in-band way (through the
> > > > > > > > > > pipe) to
> > > > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > > > > 
> > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > migration
> > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > the reading
> > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end
> > > > > > > > > > will check for
> > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination
> > > > > > > > > > side includes
> > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > 
> > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > >  hw/virtio/vhost-user.c            | 147
> > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > >  } VhostSetConfigType;
> > > > > > > > > > 
> > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > +    /* Transfer state from back-end (device) to front-end
> > > > > > > > > > */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > +    /* Transfer state from front-end to back-end (device)
> > > > > > > > > > */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > +
> > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > 
> > > > > > > > > vDPA has:
> > > > > > > > > 
> > > > > > > > >   /* Suspend a device so it does not process virtqueue
> > > > > > > > > requests anymore
> > > > > > > > >    *
> > > > > > > > >    * After the return of ioctl the device must preserve all
> > > > > > > > > the necessary state
> > > > > > > > >    * (the virtqueue vring base plus the possible device
> > > > > > > > > specific states) that is
> > > > > > > > >    * required for restoring in the future. The device must not
> > > > > > > > > change its
> > > > > > > > >    * configuration after that point.
> > > > > > > > >    */
> > > > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > 
> > > > > > > > >   /* Resume a device so it can resume processing virtqueue
> > > > > > > > > requests
> > > > > > > > >    *
> > > > > > > > >    * After the return of this ioctl the device will have
> > > > > > > > > restored all the
> > > > > > > > >    * necessary states and it is fully operational to continue
> > > > > > > > > processing the
> > > > > > > > >    * virtqueue descriptors.
> > > > > > > > >    */
> > > > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > 
> > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > that the
> > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > It's okay
> > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > avoid
> > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > VHOST_STOP
> > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > change
> > > > > > > > to SUSPEND.
> > > > > > > 
> > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > > > ioctl(VHOST_VDPA_RESUME).
> > > > > > > 
> > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device
> > > > > > > can
> > > > > > > leave the suspended state. Can you clarify this?
> > > > > > > 
> > > > > > 
> > > > > > Do you mean in what situations or regarding the semantics of
> > > > > > _RESUME?
> > > > > > 
> > > > > > To me resume is an operation mainly to resume the device in the
> > > > > > event
> > > > > > of a VM suspension, not a migration. It can be used as a fallback
> > > > > > code
> > > > > > in some cases of migration failure though, but it is not currently
> > > > > > used in qemu.
> > > > > 
> > > > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > > > > 
> > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > > > resets the device in vhost_dev_stop()?
> > > > > 
> > > > 
> > > > The actual reason for not using RESUME is that the ioctl was added
> > > > after the SUSPEND design in qemu. Same as this proposal, it was not
> > > > needed at the time.
> > > > 
> > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > > > the vq indexes, and in case of error vhost already fetches them from
> > > > guest's used ring way before vDPA, so it has little usage.
> > > > 
> > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > > > this:
> > > > > - Saving the device's state is done by SUSPEND followed by
> > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > > > savevm command or migration failed), then RESUME is called to
> > > > > continue.
> > > > 
> > > > I think the previous steps make sense at vhost_dev_stop, not virtio
> > > > savevm handlers. To start spreading this logic to more places of qemu
> > > > can bring confusion.
> > > 
> > > I don't think there is a way around extending the QEMU vhost's code
> > > model. The current model in QEMU's vhost code is that the backend is
> > > reset when the VM stops. This model worked fine for stateless devices
> > > but it doesn't work for stateful devices.
> > > 
> > > Imagine a vdpa-gpu device: you cannot reset the device in
> > > vhost_dev_stop() and expect the GPU to continue working when
> > > vhost_dev_start() is called again because all its state has been lost.
> > > The guest driver will send requests that reference virtio-gpu
> > > resources that no longer exist.
> > > 
> > > One solution is to save the device's state in vhost_dev_stop(). I think
> > > this is what you're suggesting. It requires keeping a copy of the state
> > > and then loading the state again in vhost_dev_start(). I don't think
> > > this approach should be used because it requires all stateful devices to
> > > support live migration (otherwise they break across HMP 'stop'/'cont').
> > > Also, the device state for some devices may be large and it would also
> > > become more complicated when iterative migration is added.
> > > 
> > > Instead, I think the QEMU vhost code needs to be structured so that
> > > struct vhost_dev has a suspended state:
> > > 
> > >         ,---------.
> > >         v         |
> > >   started ------> stopped
> > >     \   ^
> > >      \  |
> > >       -> suspended
> > > 
> > > The device doesn't lose state when it enters the suspended state. It can
> > > be resumed again.
> > > 
> > > This is why I think SUSPEND/RESUME need to be part of the solution.
I just realized that we can add an arrow from suspended to stopped, can't we?
Having to go through "started" first seems to imply the device may process
descriptors after suspend.
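
As a rough illustration (not QEMU code), the diagram with the extra
suspended -> stopped arrow can be written as a small transition check. All
type and function names below are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lifecycle states; not actual struct vhost_dev fields. */
typedef enum {
    DEV_STOPPED,
    DEV_STARTED,
    DEV_SUSPENDED,
} DevState;

/* Return true if the diagram, plus the extra arrow, allows from -> to. */
static bool dev_transition_allowed(DevState from, DevState to)
{
    switch (from) {
    case DEV_STOPPED:
        return to == DEV_STARTED;
    case DEV_STARTED:
        return to == DEV_STOPPED || to == DEV_SUSPENDED;
    case DEV_SUSPENDED:
        /* RESUME leads back to started; the extra arrow allows a direct stop. */
        return to == DEV_STARTED || to == DEV_STOPPED;
    }
    return false;
}

/* Apply a transition if it is allowed; leave *cur untouched otherwise. */
static bool dev_set_state(DevState *cur, DevState next)
{
    assert(cur != NULL);
    if (!dev_transition_allowed(*cur, next)) {
        return false;
    }
    *cur = next;
    return true;
}
```

With the extra arrow, a suspended device can be stopped without passing
through "started" again, so it never gets a window in which to process
descriptors.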
> > 
> > I agree with all of this, especially after realizing vhost_dev_stop is
> > called before the last request of the state in the iterative
> > migration.
> > 
> > However I think we can move faster with the virtiofsd migration code,
> > as long as we agree on the vhost-user messages it will receive. This
> > is because we already agree that the state will be sent in one shot
> > and not iteratively, so it will be small.
> > 
> > I understand this may change in the future, that's why I proposed to
> > start using iterative right now. However it may make little sense if
> > it is not used in the vhost-user device. I also understand that other
> > devices may have a bigger state so it will be needed for them.
> 
> Can you summarize how you'd like save to work today? I'm not sure what
> you have in mind.
> 
I think we're trying to find a solution that satisfies several constraints.  On
one side, we're assuming that the virtiofsd state will be small enough that it
will not require iterative migration in the short term.  However, we also want
to support iterative migration, for the sake of *other* future vhost devices
that may need it.
I also think we should prioritize the protocol's stability, in the sense of not
adding calls that we will not reuse for iterative LM, since the vhost-user
protocol is more important to maintain than the qemu migration.
The changes you mention will need to be implemented in the future.  But we have
already established that the virtiofsd state is small, so we can just fetch it
at the same time as we send the VHOST_USER_GET_VRING_BASE message, and send the
state with the proposed non-iterative approach.
If we agree on that, now the question is how to fetch it from the device.  The
answers are a little bit scattered in the mail threads, but I think we agree on:
a) We need to signal that the device must stop processing requests.
b) We need a way for the device to dump the state.
At this moment I think any proposal satisfies a), and the pipe satisfies b)
better.  With proper backend feature flags, the device may support starting to
write to the pipe before SUSPEND, so we can implement iterative migration on top.
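
A minimal sketch of b) from the front-end's side, assuming the state arrives
as an opaque blob on a pipe whose end is marked by EOF.  The function name is
hypothetical and this is not QEMU's implementation; a real front-end would
still need an out-of-band success check afterwards, because EOF alone cannot
signal failure:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read an opaque state blob from fd until EOF.
 * Returns 0 on success and -1 on error; on success the caller frees *out.
 */
static int read_state_until_eof(int fd, unsigned char **out, size_t *out_len)
{
    size_t cap = 4096, len = 0;
    unsigned char *buf;

    assert(out != NULL && out_len != NULL);

    buf = malloc(cap);
    if (!buf) {
        return -1;
    }
    for (;;) {
        ssize_t n;

        if (len == cap) {
            /* Buffer full: double it before reading more. */
            unsigned char *tmp = realloc(buf, cap * 2);
            if (!tmp) {
                free(buf);
                return -1;
            }
            buf = tmp;
            cap *= 2;
        }
        n = read(fd, buf + len, cap - len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;   /* interrupted by a signal, retry */
            }
            free(buf);
            return -1;
        }
        if (n == 0) {
            break;          /* EOF: the sender closed its end */
        }
        len += (size_t)n;
    }
    *out = buf;
    *out_len = len;
    return 0;
}
```

Because the reader does not need to know the size in advance, the same loop
would also work if the sender started writing earlier, which is what makes the
pipe attractive as a base for iterative migration.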
Does that make sense?
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-19 11:21                 ` Stefan Hajnoczi
  2023-04-19 11:24                   ` Hanna Czenczek
@ 2023-04-20 13:29                   ` Eugenio Pérez
  2023-05-08 20:10                     ` Stefan Hajnoczi
  1 sibling, 1 reply; 93+ messages in thread
From: Eugenio Pérez @ 2023-04-20 13:29 UTC (permalink / raw)
  To: Stefan Hajnoczi, Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > wrote:
> > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > wrote:
> > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > wrote:
> > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > wrote:
> > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > transporting the
> > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > stream.  To do
> > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > state to and
> > > > > > > > > from virtiofsd.
> > > > > > > > > 
> > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > believe it
> > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > streaming
> > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > user
> > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > addition to
> > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > 
> > > > > > > > > These are the additions to the protocol:
> > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > added so
> > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > support.
> > > > > > > > > 
> > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > negotiate a pipe
> > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > FD to the
> > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > the back-end
> > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > for the
> > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > but maybe
> > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > write/read
> > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > simple pipe.
> > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > front-end
> > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > cases), in which
> > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > creating a
> > > > > > > > >    pipe.
> > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > plain
> > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > 
> > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > through the
> > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > function
> > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > pipe) to
> > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > 
> > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > migration
> > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > the reading
> > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > check for
> > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > includes
> > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > 
> > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > ---
> > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > 
> > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > +
> > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > vDPA has:
> > > > > > > > 
> > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > anymore
> > > > > > > >     *
> > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > necessary state
> > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > specific states) that is
> > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > change its
> > > > > > > >     * configuration after that point.
> > > > > > > >     */
> > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > 
> > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > requests
> > > > > > > >     *
> > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > restored all the
> > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > processing the
> > > > > > > >     * virtqueue descriptors.
> > > > > > > >     */
> > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > 
> > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > that the
> > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > It's okay
> > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > avoid
> > > > > > > > overlapping/duplicated functionality.
> > > > > > > > 
> > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > VHOST_STOP
> > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > change
> > > > > > > to SUSPEND.
> > > > > > > 
> > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > and
> > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > words, instead of
> > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > send
> > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > command.
> > > > > > > 
> > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > it
> > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > 
> > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > ok.
> > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > number
> > > > > > > of states is high.
> > > > > > Hi Eugenio,
> > > > > > Another question about vDPA suspend/resume:
> > > > > > 
> > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > bool vrings)
> > > > > >    {
> > > > > >        int i;
> > > > > > 
> > > > > >        /* should only be called after backend is connected */
> > > > > >        assert(hdev->vhost_ops);
> > > > > >        event_notifier_test_and_clear(
> > > > > >            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > 
> > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > 
> > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > >            ^^^ SUSPEND ^^^
> > > > > >        }
> > > > > >        if (vrings) {
> > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > >        }
> > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > >            vhost_virtqueue_stop(hdev,
> > > > > >                                 vdev,
> > > > > >                                 hdev->vqs + i,
> > > > > >                                 hdev->vq_index + i);
> > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > >        }
> > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > >            ^^^ reset device^^^
> > > > > > 
> > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > ->
> > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > 
> > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > 
> > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > model too.
> > > > > 
> > > > > When a migration effectively occurs, all the frontend state is
> > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > cross-backends migrations, etc.
> > > > > 
> > > > > Does that answer your question?
> > > > I think you're confirming that changes would be necessary in order for
> > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > 
> > > Yes, this first iteration was centered on net, with an eye on block,
> > > where state can be routed through classical emulated devices. This is
> > > how vhost-kernel and vhost-user do classically. And it allows
> > > cross-backend, to not modify qemu migration state, etc.
> > > 
> > > To introduce this opaque state to qemu, that must be fetched after the
> > > suspend and not before, requires changes in vhost protocol, as
> > > discussed previously.
> > > 
> > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > stateful
> > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > correct?
> > > > > > 
> > > > > Changes are required elsewhere, as the code to restore the state
> > > > > properly in the destination has not been merged.
> > > > I'm not sure what you mean by elsewhere?
> > > > 
> > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > ioctls, but mostly in qemu.
> > > 
> > > If you meant stateful as "it must have a state blob that it must be
> > > opaque to qemu", then I think the straightforward action is to fetch
> > > state blob about the same time as vq indexes. But yes, changes (at
> > > least a new ioctl) is needed for that.
> > > 
> > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > then VHOST_VDPA_SET_STATUS 0.
> > > > 
> > > > In order to save device state from the vDPA device in the future, it
> > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > the device state can be saved before the device is reset.
> > > > 
> > > > Does that sound right?
> > > > 
> > > The split between suspend and reset was added recently for that very
> > > reason. In all the virtio devices, the frontend is initialized before
> > > the backend, so I don't think it is a good idea to defer the backend
> > > cleanup. Especially if we have already set the state is small enough
> > > to not needing iterative migration from virtiofsd point of view.
> > > 
> > > If fetching that state at the same time as vq indexes is not valid,
> > > could it follow the same model as the "in-flight descriptors"?
> > > vhost-user follows them by using a shared memory region where their
> > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > backend crashes, and does not forbid the cross-backends live migration
> > > as all the information is there to recover them.
> > > 
> > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > a possibility is to synchronize this memory region after a
> > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > devices are not going to crash in the software sense, so all use cases
> > > remain the same to qemu. And that shared memory information is
> > > recoverable after vhost_dev_stop.
> > > 
> > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > region where it dumps the state, maybe only after the
> > > set_state(STATE_PHASE_STOPPED)?
> > 
> > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > mandatory anyway.
> > 
> > As for the shared memory, the RFC before this series used shared memory,
> > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > things – it sounds like you’re saying the back-end (virtiofsd) should
> > provide it to the front-end, is that right?  That could work like this:
> > 
> > On the source side:
> > 
> > S1. SUSPEND goes to virtiofsd
> > S2. virtiofsd maybe double-checks that the device is stopped, then
> > serializes its state into a newly allocated shared memory area[1]
> > S3. virtiofsd responds to SUSPEND
> > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > maybe already closes its reference
> > S5. front-end saves state, closes its handle, freeing the SHM
> > 
> > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > it can immediately allocate this area and serialize directly into it;
> > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > fundamental problem, but there are limitations around what you can do
> > with serde implementations in Rust…
> > 
> > On the destination side:
> > 
> > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > virtiofsd would serialize its empty state into an SHM area, and respond
> > to SUSPEND
> > D2. front-end reads state from migration stream into an SHM it has allocated
> > D3. front-end supplies this SHM to virtiofsd, which discards its
> > previous area, and now uses this one
> > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > 
> > Couple of questions:
> > 
> > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > would imply to deserialize a state, and the state is to be transferred
> > through SHM, this is what would need to be done.  So maybe we should
> > skip SUSPEND on the destination?
> > B. You described that the back-end should supply the SHM, which works
> > well on the source.  On the destination, only the front-end knows how
> > big the state is, so I’ve decided above that it should allocate the SHM
> > (D2) and provide it to the back-end.  Is that feasible or is it
> > important (e.g. for real hardware) that the back-end supplies the SHM?
> > (In which case the front-end would need to tell the back-end how big the
> > state SHM needs to be.)
> 
> How does this work for iterative live migration?
> 
A pipe will always fit iterative migration better from qemu's POV, that's for
sure, especially if we want to keep that opaqueness.
But we will need to communicate with the HW device using shared memory sooner
or later for big states.  If we don't transform it in qemu, we will need to do
it in the kernel.  Also, the pipe approach will not cope with daemon crashes.
Again, I'm just putting this on the table in case it fits better or is more
convenient.  I missed the previous patch where SHM was proposed too, so maybe I
missed some useful feedback there.  I think the pipe is a better solution in the
long run because of the iterative part.
Thanks!
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-04-11 15:05 [PATCH 0/4] vhost-user-fs: Internal migration Hanna Czenczek
                   ` (5 preceding siblings ...)
  2023-04-13 16:11 ` Michael S. Tsirkin
@ 2023-05-04 16:05 ` Hanna Czenczek
  2023-05-04 21:14   ` Stefan Hajnoczi
  6 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-04 16:05 UTC (permalink / raw)
  To: qemu-devel, virtio-fs
  Cc: Stefan Hajnoczi, German Maglione, Anton Kuchin, Juan Quintela,
	Michael S . Tsirkin, Stefano Garzarella, Eugenio Perez Martin
On 11.04.23 17:05, Hanna Czenczek wrote:
[...]
> Hanna Czenczek (4):
>    vhost: Re-enable vrings after setting features
>    vhost-user: Interface for migration state transfer
>    vhost: Add high-level state save/load functions
>    vhost-user-fs: Implement internal migration
I’m trying to write v2, and my intention was to keep the code 
conceptually largely the same, but include in the documentation change 
thoughts and notes on how this interface is to be used in the future, 
when e.g. vDPA “extensions” come over to vhost-user.  My plan was to, 
based on that documentation, discuss further.
But now I’m struggling to even write that documentation because it’s not 
clear to me what exactly the result of the discussion was, so I need to 
stop even before that.
So as far as I understand, we need/want SUSPEND/RESUME for two reasons:
1. As a signal to the back-end when virt queues are no longer to be 
processed, so that it is clear that it will not do that when asked for 
migration state.
2. Stateful devices that support SET_STATUS receive a status of 0 when 
the VM is stopped, which supposedly resets the internal state. While 
suspended, device state is frozen, so as far as I understand, SUSPEND 
before SET_STATUS would have the status change be deferred until RESUME.
I don’t want to get hung up on 2 because it doesn’t really seem 
important to this series, but: Why does a status of 0 reset the internal 
state?  [Note: This is all virtio_reset() seems to do, set the status to 
0.]  The vhost-user specification only points to the virtio 
specification, which doesn’t say anything to that effect. Instead, an 
explicit device reset is mentioned, which would be 
VHOST_USER_RESET_DEVICE, i.e. something completely different. Because 
RESET_DEVICE directly contradicts SUSPEND’s description, I would like to 
think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.
Is it that a status 0 won’t explicitly reset the internal state, but 
because it does mean that the driver is unbound, the state should 
implicitly be reset?
Anyway.  1 seems to be the relevant point for migration.  As far as I 
understand, currently, a vhost-user back-end has no way of knowing when 
to stop processing virt queues.  Basically, rings can only transition 
from stopped to started, but not vice versa.  The vhost-user 
specification has this bit: “Once the source has finished migration, 
rings will be stopped by the source. No further update must be done 
before rings are restarted.”  It just doesn’t say how the front-end lets 
the back-end know that the rings are (to be) stopped.  So this seems 
like a pre-existing problem for stateless migration.  Unless this is 
communicated precisely by setting the device status to 0?
Naturally, what I want to know most of all is whether you believe I can 
get away without SUSPEND/RESUME for now.  To me, it seems like honestly 
not really, only when turning two blind eyes, because otherwise we can’t 
ensure that virtiofsd isn’t still processing pending virt queue requests 
when the state transfer is begun, even when the guest CPUs are already 
stopped.  Of course, virtiofsd could stop queue processing right there 
and then, but…  That feels like a hack that in the grand scheme of 
things just isn’t necessary when we could “just” introduce 
SUSPEND/RESUME into vhost-user for exactly this.
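
To illustrate point 1, here is a minimal sketch of the guard a back-end could
apply, assuming it tracks a suspended flag and a count of in-flight requests.
The struct and function names are hypothetical and do not reflect virtiofsd's
code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical back-end bookkeeping; not virtiofsd's data structures. */
typedef struct {
    bool suspended;     /* set once a SUSPEND-like message has been handled */
    unsigned inflight;  /* virtqueue requests popped but not yet completed */
} BackendState;

/* New virtqueue requests may only be popped while the device is running. */
static bool backend_may_pop_request(const BackendState *b)
{
    assert(b != NULL);
    return !b->suspended;
}

/* State may only be serialized once the device is suspended and drained. */
static bool backend_may_send_state(const BackendState *b)
{
    assert(b != NULL);
    return b->suspended && b->inflight == 0;
}
```

Without an explicit suspend signal, the back-end has nothing to base the first
check on, which is the gap described above.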
Beyond the SUSPEND/RESUME question, I understand everything can stay 
as-is for now, as the design doesn’t seem to conflict too badly with 
possible future extensions for other migration phases or more finely 
grained migration phase control between front-end and back-end.
Did I at least roughly get the gist?
Hanna
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-04 16:05 ` Hanna Czenczek
@ 2023-05-04 21:14   ` Stefan Hajnoczi
  2023-05-05  9:03     ` Hanna Czenczek
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-04 21:14 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella, Eugenio Perez Martin
On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 11.04.23 17:05, Hanna Czenczek wrote:
>
> [...]
>
> > Hanna Czenczek (4):
> >    vhost: Re-enable vrings after setting features
> >    vhost-user: Interface for migration state transfer
> >    vhost: Add high-level state save/load functions
> >    vhost-user-fs: Implement internal migration
>
> I’m trying to write v2, and my intention was to keep the code
> conceptually largely the same, but include in the documentation change
> thoughts and notes on how this interface is to be used in the future,
> when e.g. vDPA “extensions” come over to vhost-user.  My plan was to,
> based on that documentation, discuss further.
>
> But now I’m struggling to even write that documentation because it’s not
> clear to me what exactly the result of the discussion was, so I need to
> stop even before that.
>
> So as far as I understand, we need/want SUSPEND/RESUME for two reasons:
> 1. As a signal to the back-end when virt queues are no longer to be
> processed, so that it is clear that it will not do that when asked for
> migration state.
> 2. Stateful devices that support SET_STATUS receive a status of 0 when
> the VM is stopped, which supposedly resets the internal state. While
> suspended, device state is frozen, so as far as I understand, SUSPEND
> before SET_STATUS would have the status change be deferred until RESUME.
I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the
device would be reset right away and it would either remain suspended
or be resumed as part of reset :).
Unfortunately the concepts of SUSPEND/RESUME and the Device Status
Field are orthogonal and there is no spec that explains how they
interact.
>
> I don’t want to hang myself up on 2 because it doesn’t really seem
> important to this series, but: Why does a status of 0 reset the internal
> state?  [Note: This is all virtio_reset() seems to do, set the status to
> 0.]  The vhost-user specification only points to the virtio
> specification, which doesn’t say anything to that effect. Instead, an
> explicit device reset is mentioned, which would be
> VHOST_USER_RESET_DEVICE, i.e. something completely different. Because
> RESET_DEVICE directly contradicts SUSPEND’s description, I would like to
> think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.
The vhost-user protocol didn't have the concept of the VIRTIO Device
Status Field until SET_STATUS was added.
In order to participate in the VIRTIO device lifecycle to some extent,
the pre-SET_STATUS vhost-user protocol relied on vhost-user-specific
messages like RESET_DEVICE.
At the VIRTIO level, devices are reset by setting the Device Status
Field to 0. All state is lost and the Device Initialization process
must be followed to make the device operational again.
Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
It's messy and not your fault. I think QEMU should solve this by
treating stateful devices differently from non-stateful devices. That
way existing vhost-user backends continue to work and new stateful
devices can also be supported.
>
> Is it that a status 0 won’t explicitly reset the internal state, but
> because it does mean that the driver is unbound, the state should
> implicitly be reset?
I think the fundamental problem is that transports like virtio-pci put
registers back in their initialization state upon reset, so internal
state is lost.
The VIRTIO spec does not go into detail on device state across reset
though, so I don't think much more can be said about the semantics.
> Anyway.  1 seems to be the relevant point for migration.  As far as I
> understand, currently, a vhost-user back-end has no way of knowing when
> to stop processing virt queues.  Basically, rings can only transition
> from stopped to started, but not vice versa.  The vhost-user
> specification has this bit: “Once the source has finished migration,
> rings will be stopped by the source. No further update must be done
> before rings are restarted.”  It just doesn’t say how the front-end lets
> the back-end know that the rings are (to be) stopped.  So this seems
> like a pre-existing problem for stateless migration.  Unless this is
> communicated precisely by setting the device status to 0?
No, my understanding is different. The vhost-user spec says the
backend must "stop [the] ring upon receiving
``VHOST_USER_GET_VRING_BASE``". The "Ring states" section goes into
more detail and adds the concept of enabled/disabled too.
SUSPEND is stronger than GET_VRING_BASE though. GET_VRING_BASE only
applies to a single virtqueue, whereas SUSPEND acts upon the entire
device, including non-virtqueue aspects like Configuration Change
Notifications (VHOST_USER_BACKEND_CONFIG_CHANGE_MSG).
You can approximate SUSPEND today by sending GET_VRING_BASE for all
virtqueues. I think in practice this does fully stop the device even
if the spec doesn't require it.
If we want minimal changes to vhost-user, then we could rely on
GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And
SET_STATUS 0 must not be sent so that the device's state is not lost.
However, this approach means this effort needs to be redone when it's
time to add stateful device support to vDPA and the QEMU vhost code
will become more complex. I think it's better to agree on a proper
model that works for both vhost-user and vhost-vdpa now to avoid more
hacks/special cases.
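A minimal sketch of the "approximate SUSPEND" idea described above, assuming a toy in-memory model of a back-end. Every name here (`mock_backend`, `approx_suspend`, and so on) is hypothetical and only mirrors the vhost-user message names; it is not QEMU or virtiofsd code.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_VRINGS 8

/* Toy model of a back-end; all names here are hypothetical. */
struct mock_backend {
    size_t nvrings;
    bool started[MAX_VRINGS];         /* is the ring being processed? */
    unsigned last_avail[MAX_VRINGS];  /* index a GET_VRING_BASE would report */
    bool internal_state_valid;        /* cleared if the device is reset */
};

/* Models GET_VRING_BASE: stops one ring and reports its base index. */
static unsigned mock_get_vring_base(struct mock_backend *be, size_t idx)
{
    be->started[idx] = false;
    return be->last_avail[idx];
}

/* Models SET_STATUS 0: a reset, after which internal state is gone. */
static void mock_set_status_zero(struct mock_backend *be)
{
    for (size_t i = 0; i < be->nvrings; i++) {
        be->started[i] = false;
        be->last_avail[i] = 0;
    }
    be->internal_state_valid = false;
}

/*
 * Approximate SUSPEND: stop every ring individually and remember its base.
 * Deliberately never calls mock_set_status_zero().
 */
static void approx_suspend(struct mock_backend *be, unsigned *bases)
{
    for (size_t i = 0; i < be->nvrings; i++) {
        bases[i] = mock_get_vring_base(be, i);
    }
}

/* True once no ring is being processed any more. */
static bool all_rings_stopped(const struct mock_backend *be)
{
    for (size_t i = 0; i < be->nvrings; i++) {
        if (be->started[i]) {
            return false;
        }
    }
    return true;
}
```

In this model, `approx_suspend()` leaves `internal_state_valid` untouched, whereas `mock_set_status_zero()` clears it, which is the reason SET_STATUS 0 must be avoided for stateful devices.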
> Naturally, what I want to know most of all is whether you believe I can
> get away without SUSPEND/RESUME for now.  To me, it seems like honestly
> not really, only when turning two blind eyes, because otherwise we can’t
> ensure that virtiofsd isn’t still processing pending virt queue requests
> when the state transfer is begun, even when the guest CPUs are already
> stopped.  Of course, virtiofsd could stop queue processing right there
> and then, but…  That feels like a hack that in the grand scheme of
> things just isn’t necessary when we could “just” introduce
> SUSPEND/RESUME into vhost-user for exactly this.
>
> Beyond the SUSPEND/RESUME question, I understand everything can stay
> as-is for now, as the design doesn’t seem to conflict too badly with
> possible future extensions for other migration phases or more finely
> grained migration phase control between front-end and back-end.
>
> Did I at least roughly get the gist?
One part we haven't discussed much: I'm not sure how much trouble
you'll face due to the fact that QEMU assumes vhost devices can be
reset across vhost_dev_stop() -> vhost_dev_start(). I don't think we
should keep a copy of the state in-memory just so it can be restored
in vhost_dev_start(). I think it's better to change QEMU's vhost code
to leave stateful devices suspended (but not reset) across
vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
this aspect?
Stefan
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-04 21:14   ` Stefan Hajnoczi
@ 2023-05-05  9:03     ` Hanna Czenczek
  2023-05-05  9:51       ` Hanna Czenczek
                         ` (2 more replies)
  0 siblings, 3 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-05  9:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella, Eugenio Perez Martin
On 04.05.23 23:14, Stefan Hajnoczi wrote:
> On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 11.04.23 17:05, Hanna Czenczek wrote:
>>
>> [...]
>>
>>> Hanna Czenczek (4):
>>>     vhost: Re-enable vrings after setting features
>>>     vhost-user: Interface for migration state transfer
>>>     vhost: Add high-level state save/load functions
>>>     vhost-user-fs: Implement internal migration
>> I’m trying to write v2, and my intention was to keep the code
>> conceptually largely the same, but include in the documentation change
>> thoughts and notes on how this interface is to be used in the future,
>> when e.g. vDPA “extensions” come over to vhost-user.  My plan was to,
>> based on that documentation, discuss further.
>>
>> But now I’m struggling to even write that documentation because it’s not
>> clear to me what exactly the result of the discussion was, so I need to
>> stop even before that.
>>
>> So as far as I understand, we need/want SUSPEND/RESUME for two reasons:
>> 1. As a signal to the back-end when virt queues are no longer to be
>> processed, so that it is clear that it will not do that when asked for
>> migration state.
>> 2. Stateful devices that support SET_STATUS receive a status of 0 when
>> the VM is stopped, which supposedly resets the internal state. While
>> suspended, device state is frozen, so as far as I understand, SUSPEND
>> before SET_STATUS would have the status change be deferred until RESUME.
> I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the
> device would be reset right away and it would either remain suspended
> or be resumed as part of reset :).
>
> Unfortunately the concepts of SUSPEND/RESUME and the Device Status
> Field are orthogonal and there is no spec that explains how they
> interact.
Ah, OK.  So I guess it’s up to the implementation to decide whether the 
virtio device status counts as part of the “configuration” that “[it] 
must not change”.
>> I don’t want to hang myself up on 2 because it doesn’t really seem
>> important to this series, but: Why does a status of 0 reset the internal
>> state?  [Note: This is all virtio_reset() seems to do, set the status to
>> 0.]  The vhost-user specification only points to the virtio
>> specification, which doesn’t say anything to that effect. Instead, an
>> explicit device reset is mentioned, which would be
>> VHOST_USER_RESET_DEVICE, i.e. something completely different. Because
>> RESET_DEVICE directly contradicts SUSPEND’s description, I would like to
>> think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.
> The vhost-user protocol didn't have the concept of the VIRTIO Device
> Status Field until SET_STATUS was added.
>
> In order to participate in the VIRTIO device lifecycle to some extent,
> the pre-SET_STATUS vhost-user protocol relied on vhost-user-specific
> messages like RESET_DEVICE.
>
> At the VIRTIO level, devices are reset by setting the Device Status
> Field to 0.
(I didn’t find this in the virtio specification until today, turns out 
it’s under 4.1.4.3 “Common configuration structure layout”, not under 
2.1 “Device Status Field”, where I was looking.)
> All state is lost and the Device Initialization process
> must be followed to make the device operational again.
>
> Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
>
> It's messy and not your fault. I think QEMU should solve this by
> treating stateful devices differently from non-stateful devices. That
> way existing vhost-user backends continue to work and new stateful
> devices can also be supported.
It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for 
stateful devices.  In a previous email, you wrote that these should 
implement SUSPEND+RESUME so qemu can use those instead.  But those are 
separate things, so I assume we just use SET_STATUS 0 when stopping the 
VM because this happens to also stop processing vrings as a side effect?
I.e. I understand “treating stateful devices differently” to mean that 
qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end 
supports it, and stateful back-ends should support it.
>> Is it that a status 0 won’t explicitly reset the internal state, but
>> because it does mean that the driver is unbound, the state should
>> implicitly be reset?
> I think the fundamental problem is that transports like virtio-pci put
> registers back in their initialization state upon reset, so internal
> state is lost.
>
> The VIRTIO spec does not go into detail on device state across reset
> though, so I don't think much more can be said about the semantics.
>
>> Anyway.  1 seems to be the relevant point for migration.  As far as I
>> understand, currently, a vhost-user back-end has no way of knowing when
>> to stop processing virt queues.  Basically, rings can only transition
>> from stopped to started, but not vice versa.  The vhost-user
>> specification has this bit: “Once the source has finished migration,
>> rings will be stopped by the source. No further update must be done
>> before rings are restarted.”  It just doesn’t say how the front-end lets
>> the back-end know that the rings are (to be) stopped.  So this seems
>> like a pre-existing problem for stateless migration.  Unless this is
>> communicated precisely by setting the device status to 0?
> No, my understanding is different. The vhost-user spec says the
> backend must "stop [the] ring upon receiving
> ``VHOST_USER_GET_VRING_BASE``".
Yes, I missed that part!
> The "Ring states" section goes into
> more detail and adds the concept of enabled/disabled too.
>
> SUSPEND is stronger than GET_VRING_BASE though. GET_VRING_BASE only
> applies to a single virtqueue, whereas SUSPEND acts upon the entire
> device, including non-virtqueue aspects like Configuration Change
> Notifications (VHOST_USER_BACKEND_CONFIG_CHANGE_MSG).
>
> You can approximate SUSPEND today by sending GET_VRING_BASE for all
> virtqueues. I think in practice this does fully stop the device even
> if the spec doesn't require it.
>
> If we want minimal changes to vhost-user, then we could rely on
> GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And
> SET_STATUS 0 must not be sent so that the device's state is not lost.
So you mean that we’d use SUSPEND instead of SET_STATUS 0, but because 
we have no SUSPEND, we’d ensure that GET_VRING_BASE is/was called on all 
vrings?
> However, this approach means this effort needs to be redone when it's
> time to add stateful device support to vDPA and the QEMU vhost code
> will become more complex. I think it's better to agree on a proper
> model that works for both vhost-user and vhost-vdpa now to avoid more
> hacks/special cases.
Agreeing is easy, actually adding SUSPEND+RESUME to the vhost-user 
protocol is what I’d prefer to avoid. :)
The question is whether it’s really effort if we were (in qemu) to just 
implement SUSPEND as a GET_VRING_BASE to all vrings for vhost-user.  I 
don’t think there is a direct equivalent to RESUME, because the back-end 
is supposed to start rings automatically when it receives a kick, but 
that will only happen when the vCPUs run, so that should be fine.
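As a rough illustration of the ring life cycle referred to here (a ring starts on a kick and stops on GET_VRING_BASE), the following is a minimal sketch of that two-state machine. The type and function names are hypothetical, and the enabled/disabled dimension from the "Ring states" section is left out on purpose.

```c
#include <stdbool.h>

/* Only the started/stopped dimension is modelled here. */
enum ring_state { RING_STOPPED, RING_STARTED };

struct ring {
    enum ring_state state;
    unsigned next_avail;   /* next available-ring index to process */
};

/* A kick from the guest starts the ring; there is no explicit RESUME. */
static void ring_on_kick(struct ring *r)
{
    r->state = RING_STARTED;
}

/* GET_VRING_BASE stops the ring and reports where processing left off. */
static unsigned ring_on_get_vring_base(struct ring *r)
{
    r->state = RING_STOPPED;
    return r->next_avail;
}

/* Processes one request; refuses to do so while the ring is stopped. */
static bool ring_process_one(struct ring *r)
{
    if (r->state != RING_STARTED) {
        return false;
    }
    r->next_avail++;
    return true;
}
```

With the vCPUs stopped, nothing calls `ring_on_kick()`, so a ring stopped via `ring_on_get_vring_base()` stays stopped until the guest runs again.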
>> Naturally, what I want to know most of all is whether you believe I can
>> get away without SUSPEND/RESUME for now.  To me, it seems like honestly
>> not really, only when turning two blind eyes, because otherwise we can’t
>> ensure that virtiofsd isn’t still processing pending virt queue requests
>> when the state transfer is begun, even when the guest CPUs are already
>> stopped.  Of course, virtiofsd could stop queue processing right there
>> and then, but…  That feels like a hack that in the grand scheme of
>> things just isn’t necessary when we could “just” introduce
>> SUSPEND/RESUME into vhost-user for exactly this.
>>
>> Beyond the SUSPEND/RESUME question, I understand everything can stay
>> as-is for now, as the design doesn’t seem to conflict too badly with
>> possible future extensions for other migration phases or more finely
>> grained migration phase control between front-end and back-end.
>>
>> Did I at least roughly get the gist?
> One part we haven't discussed much: I'm not sure how much trouble
> you'll face due to the fact that QEMU assumes vhost devices can be
> reset across vhost_dev_stop() -> vhost_dev_start(). I don't think we
> should keep a copy of the state in-memory just so it can be restored
> in vhost_dev_start().
All I can report is that virtiofsd continues to work fine after a 
cancelled/failed migration.
> I think it's better to change QEMU's vhost code
> to leave stateful devices suspended (but not reset) across
> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
> this aspect?
Yes and no; I mean, I haven’t in detail, but I thought this is what’s 
meant by suspending instead of resetting when the VM is stopped.
Hanna
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05  9:03     ` Hanna Czenczek
@ 2023-05-05  9:51       ` Hanna Czenczek
  2023-05-05 14:26         ` Eugenio Perez Martin
  2023-05-05  9:53       ` Eugenio Perez Martin
  2023-05-09 15:41       ` Stefan Hajnoczi
  2 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-05  9:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, virtio-fs, Stefan Hajnoczi, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella, Eugenio Perez Martin
(By the way, thanks for the explanations :))
On 05.05.23 11:03, Hanna Czenczek wrote:
> On 04.05.23 23:14, Stefan Hajnoczi wrote:
[...]
>> I think it's better to change QEMU's vhost code
>> to leave stateful devices suspended (but not reset) across
>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
>> this aspect?
>
> Yes and no; I mean, I haven’t in detail, but I thought this is what’s 
> meant by suspending instead of resetting when the VM is stopped.
So, now looking at vhost_dev_stop(), one problem I can see is that, 
depending on the back-end, the operations it performs have different 
effects.
It tries to stop the whole device via vhost_ops->vhost_dev_start() 
(invoked here to stop rather than start), which for vDPA will suspend 
the device, but for vhost-user will reset it (if F_STATUS is there).
It disables all vrings, which doesn’t mean stopping, but may be 
necessary, too.  (I haven’t yet really understood the use of disabled 
vrings, I heard that virtio-net would have a need for it.)
It then also stops all vrings, though, so that’s OK.  And because this 
will always do GET_VRING_BASE, this is actually always the same 
regardless of transport.
Finally (for this purpose), it resets the device status via 
vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and 
this is what resets the device there.
So vhost-user resets the device in .vhost_dev_start, but vDPA only does 
so in .vhost_reset_status.  It would seem better to me if vhost-user 
would also reset the device only in .vhost_reset_status, not in 
.vhost_dev_start.  .vhost_dev_start seems precisely like the place to 
run SUSPEND/RESUME.
Another question I have (but this is basically what I wrote in my last 
email) is why we even call .vhost_reset_status here.  If the device 
and/or all of the vrings are already stopped, why do we need to reset 
it?  Naïvely, I had assumed we only really need to reset the device if 
the guest changes, so that a new guest driver sees a freshly initialized 
device.
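To make the asymmetry easier to see, here is a minimal sketch that only encodes the behaviour described above as a lookup. It is not QEMU code, and all identifiers (`backend_kind`, `dev_start_stop_effect`, and so on) are made up for illustration.

```c
#include <stdbool.h>

enum backend_kind { BACKEND_VHOST_USER, BACKEND_VDPA };
enum effect { EFFECT_NONE, EFFECT_SUSPEND, EFFECT_RESET };

/* Effect of the .vhost_dev_start hook when it is invoked to stop. */
static enum effect dev_start_stop_effect(enum backend_kind kind,
                                         bool has_f_status)
{
    if (kind == BACKEND_VDPA) {
        return EFFECT_SUSPEND;
    }
    /* vhost-user: resets the device, but only if F_STATUS is there. */
    return has_f_status ? EFFECT_RESET : EFFECT_NONE;
}

/* Effect of the .vhost_reset_status hook (only implemented for vDPA). */
static enum effect reset_status_effect(enum backend_kind kind)
{
    return kind == BACKEND_VDPA ? EFFECT_RESET : EFFECT_NONE;
}

/*
 * .vhost_dev_start runs before the rings are stopped via GET_VRING_BASE,
 * so a reset there happens while rings may still be running.
 */
static bool reset_precedes_ring_stop(enum backend_kind kind,
                                     bool has_f_status)
{
    return dev_start_stop_effect(kind, has_f_status) == EFFECT_RESET;
}
```

Under these assumptions, only vhost-user with F_STATUS is reset before its rings are stopped; vDPA is suspended first and reset last.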
Hanna
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05  9:03     ` Hanna Czenczek
  2023-05-05  9:51       ` Hanna Czenczek
@ 2023-05-05  9:53       ` Eugenio Perez Martin
  2023-05-05 12:51         ` Hanna Czenczek
  2023-05-09 15:41       ` Stefan Hajnoczi
  2 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-05  9:53 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Stefan Hajnoczi,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 04.05.23 23:14, Stefan Hajnoczi wrote:
> > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:
> >> On 11.04.23 17:05, Hanna Czenczek wrote:
> >>
> >> [...]
> >>
> >>> Hanna Czenczek (4):
> >>>     vhost: Re-enable vrings after setting features
> >>>     vhost-user: Interface for migration state transfer
> >>>     vhost: Add high-level state save/load functions
> >>>     vhost-user-fs: Implement internal migration
> >> I’m trying to write v2, and my intention was to keep the code
> >> conceptually largely the same, but include in the documentation change
> >> thoughts and notes on how this interface is to be used in the future,
> >> when e.g. vDPA “extensions” come over to vhost-user.  My plan was to,
> >> based on that documentation, discuss further.
> >>
> >> But now I’m struggling to even write that documentation because it’s not
> >> clear to me what exactly the result of the discussion was, so I need to
> >> stop even before that.
> >>
> >> So as far as I understand, we need/want SUSPEND/RESUME for two reasons:
> >> 1. As a signal to the back-end when virt queues are no longer to be
> >> processed, so that it is clear that it will not do that when asked for
> >> migration state.
> >> 2. Stateful devices that support SET_STATUS receive a status of 0 when
> >> the VM is stopped, which supposedly resets the internal state. While
> >> suspended, device state is frozen, so as far as I understand, SUSPEND
> >> before SET_STATUS would have the status change be deferred until RESUME.
> > I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the
> > device would be reset right away and it would either remain suspended
> > or be resumed as part of reset :).
> >
> > Unfortunately the concepts of SUSPEND/RESUME and the Device Status
> > Field are orthogonal and there is no spec that explains how they
> > interact.
>
> Ah, OK.  So I guess it’s up to the implementation to decide whether the
> virtio device status counts as part of the “configuration” that “[it]
> must not change”.
>
That's a very good point indeed. I think the easiest way to think about
it is that reset must always be able to recover the device, so it must
take precedence over suspension. But I think it is good to make that
explicit, at least in the current vDPA headers.
> >> I don’t want to hang myself up on 2 because it doesn’t really seem
> >> important to this series, but: Why does a status of 0 reset the internal
> >> state?  [Note: This is all virtio_reset() seems to do, set the status to
> >> 0.]  The vhost-user specification only points to the virtio
> >> specification, which doesn’t say anything to that effect. Instead, an
> >> explicit device reset is mentioned, which would be
> >> VHOST_USER_RESET_DEVICE, i.e. something completely different. Because
> >> RESET_DEVICE directly contradicts SUSPEND’s description, I would like to
> >> think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.
> > The vhost-user protocol didn't have the concept of the VIRTIO Device
> > Status Field until SET_STATUS was added.
> >
> > In order to participate in the VIRTIO device lifecycle to some extent,
> > the pre-SET_STATUS vhost-user protocol relied on vhost-user-specific
> > messages like RESET_DEVICE.
> >
> > At the VIRTIO level, devices are reset by setting the Device Status
> > Field to 0.
>
> (I didn’t find this in the virtio specification until today, turns out
> it’s under 4.1.4.3 “Common configuration structure layout”, not under
> 2.1 “Device Status Field”, where I was looking.)
>
Yes, but you had a point. That section applies only to the PCI
transport; it is not a generic way to reset the device. Channel I/O
uses an explicit CCW_CMD_VDEV_RESET command for reset, more similar to
VHOST_USER_RESET_DEVICE.
> > All state is lost and the Device Initialization process
> > must be followed to make the device operational again.
> >
> > Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
> >
> > It's messy and not your fault. I think QEMU should solve this by
> > treating stateful devices differently from non-stateful devices. That
> > way existing vhost-user backends continue to work and new stateful
> > devices can also be supported.
>
> It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for
> stateful devices.  In a previous email, you wrote that these should
> implement SUSPEND+RESUME so qemu can use those instead.  But those are
> separate things, so I assume we just use SET_STATUS 0 when stopping the
> VM because this happens to also stop processing vrings as a side effect?
>
> I.e. I understand “treating stateful devices differently” to mean that
> qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end
> supports it, and stateful back-ends should support it.
>
Honestly I cannot think of any use case where the vhost-user backend
did not ignore set_status(0) and had to retrieve vq states. So maybe
we can totally remove that call from qemu?
> >> Is it that a status 0 won’t explicitly reset the internal state, but
> >> because it does mean that the driver is unbound, the state should
> >> implicitly be reset?
> > I think the fundamental problem is that transports like virtio-pci put
> > registers back in their initialization state upon reset, so internal
> > state is lost.
> >
> > The VIRTIO spec does not go into detail on device state across reset
> > though, so I don't think much more can be said about the semantics.
> >
> >> Anyway.  1 seems to be the relevant point for migration.  As far as I
> >> understand, currently, a vhost-user back-end has no way of knowing when
> >> to stop processing virt queues.  Basically, rings can only transition
> >> from stopped to started, but not vice versa.  The vhost-user
> >> specification has this bit: “Once the source has finished migration,
> >> rings will be stopped by the source. No further update must be done
> >> before rings are restarted.”  It just doesn’t say how the front-end lets
> >> the back-end know that the rings are (to be) stopped.  So this seems
> >> like a pre-existing problem for stateless migration.  Unless this is
> >> communicated precisely by setting the device status to 0?
> > No, my understanding is different. The vhost-user spec says the
> > backend must "stop [the] ring upon receiving
> > ``VHOST_USER_GET_VRING_BASE``".
>
> Yes, I missed that part!
>
> > The "Ring states" section goes into
> > more detail and adds the concept of enabled/disabled too.
> >
> > SUSPEND is stronger than GET_VRING_BASE though. GET_VRING_BASE only
> > applies to a single virtqueue, whereas SUSPEND acts upon the entire
> > device, including non-virtqueue aspects like Configuration Change
> > Notifications (VHOST_USER_BACKEND_CONFIG_CHANGE_MSG).
> >
> > You can approximate SUSPEND today by sending GET_VRING_BASE for all
> > virtqueues. I think in practice this does fully stop the device even
> > if the spec doesn't require it.
> >
> > If we want minimal changes to vhost-user, then we could rely on
> > GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And
> > SET_STATUS 0 must not be sent so that the device's state is not lost.
>
> So you mean that we’d use SUSPEND instead of SET_STATUS 0, but because
> we have no SUSPEND, we’d ensure that GET_VRING_BASE is/was called on all
> vrings?
>
> > However, this approach means this effort needs to be redone when it's
> > time to add stateful device support to vDPA and the QEMU vhost code
> > will become more complex. I think it's better to agree on a proper
> > model that works for both vhost-user and vhost-vdpa now to avoid more
> > hacks/special cases.
>
> Agreeing is easy, actually adding SUSPEND+RESUME to the vhost-user
> protocol is what I’d prefer to avoid. :)
>
> The question is whether it’s really effort if we were (in qemu) to just
> implement SUSPEND as a GET_VRING_BASE to all vrings for vhost-user.  I
> don’t think there is a direct equivalent to RESUME, because the back-end
> is supposed to start rings automatically when it receives a kick, but
> that will only happen when the vCPUs run, so that should be fine.
>
> >> Naturally, what I want to know most of all is whether you believe I can
> >> get away without SUSPEND/RESUME for now.  To me, it seems like honestly
> >> not really, only when turning two blind eyes, because otherwise we can’t
> >> ensure that virtiofsd isn’t still processing pending virt queue requests
> >> when the state transfer is begun, even when the guest CPUs are already
> >> stopped.  Of course, virtiofsd could stop queue processing right there
> >> and then, but…  That feels like a hack that in the grand scheme of
> >> things just isn’t necessary when we could “just” introduce
> >> SUSPEND/RESUME into vhost-user for exactly this.
> >>
> >> Beyond the SUSPEND/RESUME question, I understand everything can stay
> >> as-is for now, as the design doesn’t seem to conflict too badly with
> >> possible future extensions for other migration phases or more finely
> >> grained migration phase control between front-end and back-end.
> >>
> >> Did I at least roughly get the gist?
> > One part we haven't discussed much: I'm not sure how much trouble
> > you'll face due to the fact that QEMU assumes vhost devices can be
> > reset across vhost_dev_stop() -> vhost_dev_start(). I don't think we
> > should keep a copy of the state in-memory just so it can be restored
> > in vhost_dev_start().
>
> All I can report is that virtiofsd continues to work fine after a
> cancelled/failed migration.
>
Isn't the device reset after a failed migration? At least net devices
are reset before sending VMState. If it cannot be applied at the
destination, the device is already reset...
> > I think it's better to change QEMU's vhost code
> > to leave stateful devices suspended (but not reset) across
> > vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
> > vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
> > this aspect?
>
> Yes and no; I mean, I haven’t in detail, but I thought this is what’s
> meant by suspending instead of resetting when the VM is stopped.
>
... unless we perform these changes of course :).
Thanks!
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05  9:53       ` Eugenio Perez Martin
@ 2023-05-05 12:51         ` Hanna Czenczek
  2023-05-08 21:10           ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-05 12:51 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Stefan Hajnoczi,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 05.05.23 11:53, Eugenio Perez Martin wrote:
> On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 04.05.23 23:14, Stefan Hajnoczi wrote:
>>> On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:
[...]
>>> All state is lost and the Device Initialization process
>>> must be followed to make the device operational again.
>>>
>>> Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
>>>
>>> It's messy and not your fault. I think QEMU should solve this by
>>> treating stateful devices differently from non-stateful devices. That
>>> way existing vhost-user backends continue to work and new stateful
>>> devices can also be supported.
>> It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for
>> stateful devices.  In a previous email, you wrote that these should
>> implement SUSPEND+RESUME so qemu can use those instead.  But those are
>> separate things, so I assume we just use SET_STATUS 0 when stopping the
>> VM because this happens to also stop processing vrings as a side effect?
>>
>> I.e. I understand “treating stateful devices differently” to mean that
>> qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end
>> supports it, and stateful back-ends should support it.
>>
> Honestly I cannot think of any use case where the vhost-user backend
> did not ignore set_status(0) and had to retrieve vq states. So maybe
> we can totally remove that call from qemu?
I don’t know so I can’t really say; but I don’t quite understand why 
qemu would reset a device at any point but perhaps VM reset (and even 
then I’d expect the post-reset guest to just reset the device on boot by 
itself, too).
[...]
>>>> Naturally, what I want to know most of all is whether you believe I can
>>>> get away without SUSPEND/RESUME for now.  To me, it seems like honestly
>>>> not really, only when turning two blind eyes, because otherwise we can’t
>>>> ensure that virtiofsd isn’t still processing pending virt queue requests
>>>> when the state transfer is begun, even when the guest CPUs are already
>>>> stopped.  Of course, virtiofsd could stop queue processing right there
>>>> and then, but…  That feels like a hack that in the grand scheme of
>>>> things just isn’t necessary when we could “just” introduce
>>>> SUSPEND/RESUME into vhost-user for exactly this.
>>>>
>>>> Beyond the SUSPEND/RESUME question, I understand everything can stay
>>>> as-is for now, as the design doesn’t seem to conflict too badly with
>>>> possible future extensions for other migration phases or more finely
>>>> grained migration phase control between front-end and back-end.
>>>>
>>>> Did I at least roughly get the gist?
>>> One part we haven't discussed much: I'm not sure how much trouble
>>> you'll face due to the fact that QEMU assumes vhost devices can be
>>> reset across vhost_dev_stop() -> vhost_dev_start(). I don't think we
>>> should keep a copy of the state in-memory just so it can be restored
>>> in vhost_dev_start().
>> All I can report is that virtiofsd continues to work fine after a
>> cancelled/failed migration.
>>
> Isn't the device reset after a failed migration? At least net devices
> are reset before sending VMState. If it cannot be applied at the
> destination, the device is already reset...
It doesn’t look like the Rust crate virtiofsd uses for vhost-user 
supports either F_STATUS or F_RESET_DEVICE, so I think this just doesn’t 
affect virtiofsd.
Hanna
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05  9:51       ` Hanna Czenczek
@ 2023-05-05 14:26         ` Eugenio Perez Martin
  2023-05-05 14:37           ` Hanna Czenczek
  2023-05-09 15:30           ` Stefan Hajnoczi
  0 siblings, 2 replies; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-05 14:26 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Stefan Hajnoczi,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> (By the way, thanks for the explanations :))
>
> On 05.05.23 11:03, Hanna Czenczek wrote:
> > On 04.05.23 23:14, Stefan Hajnoczi wrote:
>
> [...]
>
> >> I think it's better to change QEMU's vhost code
> >> to leave stateful devices suspended (but not reset) across
> >> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
> >> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
> >> this aspect?
> >
> > Yes and no; I mean, I haven’t in detail, but I thought this is what’s
> > meant by suspending instead of resetting when the VM is stopped.
>
> So, now looking at vhost_dev_stop(), one problem I can see is that
> depending on the back-end, different operations it does will do
> different things.
>
> It tries to stop the whole device via vhost_ops->vhost_dev_start(),
> which for vDPA will suspend the device, but for vhost-user will reset it
> (if F_STATUS is there).
>
> It disables all vrings, which doesn’t mean stopping, but may be
> necessary, too.  (I haven’t yet really understood the use of disabled
> vrings, I heard that virtio-net would have a need for it.)
>
> It then also stops all vrings, though, so that’s OK.  And because this
> will always do GET_VRING_BASE, this is actually always the same
> regardless of transport.
>
> Finally (for this purpose), it resets the device status via
> vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and
> this is what resets the device there.
>
>
> So vhost-user resets the device in .vhost_dev_start, but vDPA only does
> so in .vhost_reset_status.  It would seem better to me if vhost-user
> would also reset the device only in .vhost_reset_status, not in
> .vhost_dev_start.  .vhost_dev_start seems precisely like the place to
> run SUSPEND/RESUME.
>
I think the same. I just saw it's been proposed at [1].
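To make the proposed split concrete, here is a minimal sketch in C. It is not QEMU code: MockBackend, mock_dev_start() and mock_reset_status() are hypothetical stand-ins for a back-end and for the .vhost_dev_start and .vhost_reset_status callbacks. It only illustrates the idea that starting/stopping suspends or resumes the device, while dropping state happens exclusively in the reset callback.

```c
#include <stdbool.h>

/* Hypothetical mock of a stateful back-end; not a real QEMU structure. */
typedef struct {
    bool running;    /* back-end is processing vrings */
    bool has_state;  /* internal device state is still present */
} MockBackend;

/*
 * Stand-in for .vhost_dev_start: only suspends (started == false) or
 * resumes (started == true) the device.  It never touches the state.
 */
static void mock_dev_start(MockBackend *be, bool started)
{
    be->running = started;
}

/*
 * Stand-in for .vhost_reset_status: the only place where the device
 * is actually reset and its internal state is lost.
 */
static void mock_reset_status(MockBackend *be)
{
    be->running = false;
    be->has_state = false;
}
```

With this split, a stop/start cycle keeps the internal state intact, and only an explicit reset clears it.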
> Another question I have (but this is basically what I wrote in my last
> email) is why we even call .vhost_reset_status here.  If the device
> and/or all of the vrings are already stopped, why do we need to reset
> it?  Naïvely, I had assumed we only really need to reset the device if
> the guest changes, so that a new guest driver sees a freshly initialized
> device.
>
I don't know why we didn't need to call it :). I'm assuming the
previous vhost-user net did fine resetting vq indexes, using
VHOST_USER_SET_VRING_BASE. But I don't know about more complex
devices.
The guest can reset the device, or write 0 to the PCI config status,
at any time. How does virtiofs handle it, being stateful?
Thanks!
[1] https://lore.kernel.org/qemu-devel/20230501230409.274178-1-stefanha@redhat.com/
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05 14:26         ` Eugenio Perez Martin
@ 2023-05-05 14:37           ` Hanna Czenczek
  2023-05-08 17:00             ` Hanna Czenczek
  2023-05-09 15:30           ` Stefan Hajnoczi
  1 sibling, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-05 14:37 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Stefan Hajnoczi,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 05.05.23 16:26, Eugenio Perez Martin wrote:
> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote:
>> (By the way, thanks for the explanations :))
>>
>> On 05.05.23 11:03, Hanna Czenczek wrote:
>>> On 04.05.23 23:14, Stefan Hajnoczi wrote:
>> [...]
>>
>>>> I think it's better to change QEMU's vhost code
>>>> to leave stateful devices suspended (but not reset) across
>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
>>>> this aspect?
>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s
>>> meant by suspending instead of resetting when the VM is stopped.
>> So, now looking at vhost_dev_stop(), one problem I can see is that
>> depending on the back-end, different operations it does will do
>> different things.
>>
>> It tries to stop the whole device via vhost_ops->vhost_dev_start(),
>> which for vDPA will suspend the device, but for vhost-user will reset it
>> (if F_STATUS is there).
>>
>> It disables all vrings, which doesn’t mean stopping, but may be
>> necessary, too.  (I haven’t yet really understood the use of disabled
>> vrings, I heard that virtio-net would have a need for it.)
>>
>> It then also stops all vrings, though, so that’s OK.  And because this
>> will always do GET_VRING_BASE, this is actually always the same
>> regardless of transport.
>>
>> Finally (for this purpose), it resets the device status via
>> vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and
>> this is what resets the device there.
>>
>>
>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does
>> so in .vhost_reset_status.  It would seem better to me if vhost-user
>> would also reset the device only in .vhost_reset_status, not in
>> .vhost_dev_start.  .vhost_dev_start seems precisely like the place to
>> run SUSPEND/RESUME.
>>
> I think the same. I just saw it's been proposed at [1].
>
>> Another question I have (but this is basically what I wrote in my last
>> email) is why we even call .vhost_reset_status here.  If the device
>> and/or all of the vrings are already stopped, why do we need to reset
>> it?  Naïvely, I had assumed we only really need to reset the device if
>> the guest changes, so that a new guest driver sees a freshly initialized
>> device.
>>
> I don't know why we didn't need to call it :). I'm assuming the
> previous vhost-user net did fine resetting vq indexes, using
> VHOST_USER_SET_VRING_BASE. But I don't know about more complex
> devices.
>
> The guest can reset the device, or write 0 to the PCI config status,
> at any time. How does virtiofs handle it, being stateful?
Honestly a good question because virtiofsd implements neither SET_STATUS 
nor RESET_DEVICE.  I’ll have to investigate that.
I think when the guest resets the device, SET_VRING_BASE always comes 
along some way or another, so that’s how the vrings are reset.  Maybe 
the internal state is reset only following more high-level FUSE commands 
like INIT.
Hanna
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05 14:37           ` Hanna Czenczek
@ 2023-05-08 17:00             ` Hanna Czenczek
  2023-05-08 17:51               ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-08 17:00 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Stefan Hajnoczi,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 05.05.23 16:37, Hanna Czenczek wrote:
> On 05.05.23 16:26, Eugenio Perez Martin wrote:
>> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> 
>> wrote:
>>> (By the way, thanks for the explanations :))
>>>
>>> On 05.05.23 11:03, Hanna Czenczek wrote:
>>>> On 04.05.23 23:14, Stefan Hajnoczi wrote:
>>> [...]
>>>
>>>>> I think it's better to change QEMU's vhost code
>>>>> to leave stateful devices suspended (but not reset) across
>>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
>>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
>>>>> this aspect?
>>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s
>>>> meant by suspending instead of resetting when the VM is stopped.
>>> So, now looking at vhost_dev_stop(), one problem I can see is that
>>> depending on the back-end, different operations it does will do
>>> different things.
>>>
>>> It tries to stop the whole device via vhost_ops->vhost_dev_start(),
>>> which for vDPA will suspend the device, but for vhost-user will 
>>> reset it
>>> (if F_STATUS is there).
>>>
>>> It disables all vrings, which doesn’t mean stopping, but may be
>>> necessary, too.  (I haven’t yet really understood the use of disabled
>>> vrings, I heard that virtio-net would have a need for it.)
>>>
>>> It then also stops all vrings, though, so that’s OK.  And because this
>>> will always do GET_VRING_BASE, this is actually always the same
>>> regardless of transport.
>>>
>>> Finally (for this purpose), it resets the device status via
>>> vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and
>>> this is what resets the device there.
>>>
>>>
>>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does
>>> so in .vhost_reset_status.  It would seem better to me if vhost-user
>>> would also reset the device only in .vhost_reset_status, not in
>>> .vhost_dev_start.  .vhost_dev_start seems precisely like the place to
>>> run SUSPEND/RESUME.
>>>
>> I think the same. I just saw it's been proposed at [1].
>>
>>> Another question I have (but this is basically what I wrote in my last
>>> email) is why we even call .vhost_reset_status here.  If the device
>>> and/or all of the vrings are already stopped, why do we need to reset
>>> it?  Naïvely, I had assumed we only really need to reset the device if
>>> the guest changes, so that a new guest driver sees a freshly 
>>> initialized
>>> device.
>>>
>> I don't know why we didn't need to call it :). I'm assuming the
>> previous vhost-user net did fine resetting vq indexes, using
>> VHOST_USER_SET_VRING_BASE. But I don't know about more complex
>> devices.
>>
>> The guest can reset the device, or write 0 to the PCI config status,
>> at any time. How does virtiofs handle it, being stateful?
>
> Honestly a good question because virtiofsd implements neither 
> SET_STATUS nor RESET_DEVICE.  I’ll have to investigate that.
>
> I think when the guest resets the device, SET_VRING_BASE always comes 
> along some way or another, so that’s how the vrings are reset.  Maybe 
> the internal state is reset only following more high-level FUSE 
> commands like INIT.
So a meeting and one session of looking-into-the-code later:
We reset every virt queue on GET_VRING_BASE, which is wrong, but happens 
to serve the purpose.  (German is currently on that.)
In our meeting, German said the reset would occur when the memory 
regions are changed, but I can’t see that in the code.  I think it only 
happens implicitly through the SET_VRING_BASE call, which resets the 
internal avail/used pointers.
[This doesn’t seem different from libvhost-user, though, which 
implements neither SET_STATUS nor RESET_DEVICE, and which pretends to 
reset the device on RESET_OWNER, but really doesn’t (its 
vu_reset_device_exec() function just disables all vrings, doesn’t reset 
or even stop them).]
Consequently, the internal state is never reset.  It would be cleared on 
a FUSE Destroy message, but if you just force-reset the system, the 
state remains into the next reboot.  Not even FUSE Init clears it, which 
seems weird.  It happens to work because it’s still the same filesystem, 
so the existing state fits, but it kind of seems dangerous to keep e.g. 
files open.  I don’t think it’s really exploitable because everything 
still goes through the guest kernel, but, well.  We should clear the 
state on Init, and probably also implement SET_STATUS and clear the 
state there.
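As a rough illustration of that last point, here is a sketch in C (virtiofsd itself is written in Rust, so this is not its code). All names are hypothetical: FsState stands in for the back-end's internal state, and the two handlers stand in for FUSE Init processing and for a SET_STATUS implementation. A real back-end would also have to close the underlying files when clearing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FS_MAX_HANDLES 4

/* Hypothetical internal state: the set of handles the guest has open. */
typedef struct {
    uint64_t handles[FS_MAX_HANDLES];
    size_t n_handles;
} FsState;

/* Record a newly opened handle; returns false if the table is full. */
static bool fs_state_open(FsState *s, uint64_t handle)
{
    if (s->n_handles == FS_MAX_HANDLES) {
        return false;
    }
    s->handles[s->n_handles++] = handle;
    return true;
}

/* Drop all internal state (a real back-end would close the files here). */
static void fs_state_clear(FsState *s)
{
    s->n_handles = 0;
}

/* FUSE Init: a new guest session starts, so stale state is discarded. */
static void handle_fuse_init(FsState *s)
{
    fs_state_clear(s);
}

/* SET_STATUS: writing 0 is a device reset, which also clears the state. */
static void handle_set_status(FsState *s, uint8_t status)
{
    if (status == 0) {
        fs_state_clear(s);
    }
}
```

Either path then guarantees that a force-reset guest does not inherit open files from before the reset.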
Hanna
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-08 17:00             ` Hanna Czenczek
@ 2023-05-08 17:51               ` Eugenio Perez Martin
  2023-05-08 19:31                 ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-08 17:51 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Stefan Hajnoczi,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 05.05.23 16:37, Hanna Czenczek wrote:
> > On 05.05.23 16:26, Eugenio Perez Martin wrote:
> >> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com>
> >> wrote:
> >>> (By the way, thanks for the explanations :))
> >>>
> >>> On 05.05.23 11:03, Hanna Czenczek wrote:
> >>>> On 04.05.23 23:14, Stefan Hajnoczi wrote:
> >>> [...]
> >>>
> >>>>> I think it's better to change QEMU's vhost code
> >>>>> to leave stateful devices suspended (but not reset) across
> >>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
> >>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
> >>>>> this aspect?
> >>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s
> >>>> meant by suspending instead of resetting when the VM is stopped.
> >>> So, now looking at vhost_dev_stop(), one problem I can see is that
> >>> depending on the back-end, different operations it does will do
> >>> different things.
> >>>
> >>> It tries to stop the whole device via vhost_ops->vhost_dev_start(),
> >>> which for vDPA will suspend the device, but for vhost-user will
> >>> reset it
> >>> (if F_STATUS is there).
> >>>
> >>> It disables all vrings, which doesn’t mean stopping, but may be
> >>> necessary, too.  (I haven’t yet really understood the use of disabled
> >>> vrings, I heard that virtio-net would have a need for it.)
> >>>
> >>> It then also stops all vrings, though, so that’s OK.  And because this
> >>> will always do GET_VRING_BASE, this is actually always the same
> >>> regardless of transport.
> >>>
> >>> Finally (for this purpose), it resets the device status via
> >>> vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and
> >>> this is what resets the device there.
> >>>
> >>>
> >>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does
> >>> so in .vhost_reset_status.  It would seem better to me if vhost-user
> >>> would also reset the device only in .vhost_reset_status, not in
> >>> .vhost_dev_start.  .vhost_dev_start seems precisely like the place to
> >>> run SUSPEND/RESUME.
> >>>
> >> I think the same. I just saw it's been proposed at [1].
> >>
> >>> Another question I have (but this is basically what I wrote in my last
> >>> email) is why we even call .vhost_reset_status here.  If the device
> >>> and/or all of the vrings are already stopped, why do we need to reset
> >>> it?  Naïvely, I had assumed we only really need to reset the device if
> >>> the guest changes, so that a new guest driver sees a freshly
> >>> initialized
> >>> device.
> >>>
> >> I don't know why we didn't need to call it :). I'm assuming the
> >> previous vhost-user net did fine resetting vq indexes, using
> >> VHOST_USER_SET_VRING_BASE. But I don't know about more complex
> >> devices.
> >>
> >> The guest can reset the device, or write 0 to the PCI config status,
> >> at any time. How does virtiofs handle it, being stateful?
> >
> > Honestly a good question because virtiofsd implements neither
> > SET_STATUS nor RESET_DEVICE.  I’ll have to investigate that.
> >
> > I think when the guest resets the device, SET_VRING_BASE always comes
> > along some way or another, so that’s how the vrings are reset.  Maybe
> > the internal state is reset only following more high-level FUSE
> > commands like INIT.
>
> So a meeting and one session of looking-into-the-code later:
>
> We reset every virt queue on GET_VRING_BASE, which is wrong, but happens
> to serve the purpose.  (German is currently on that.)
>
> In our meeting, German said the reset would occur when the memory
> regions are changed, but I can’t see that in the code.
That would imply that the status is reset when the guest's memory is
added or removed?
> I think it only
> happens implicitly through the SET_VRING_BASE call, which resets the
> internal avail/used pointers.
>
> [This doesn’t seem different from libvhost-user, though, which
> implements neither SET_STATUS nor RESET_DEVICE, and which pretends to
> reset the device on RESET_OWNER, but really doesn’t (its
> vu_reset_device_exec() function just disables all vrings, doesn’t reset
> or even stop them).]
>
> Consequently, the internal state is never reset.  It would be cleared on
> a FUSE Destroy message, but if you just force-reset the system, the
> state remains into the next reboot.  Not even FUSE Init clears it, which
> seems weird.  It happens to work because it’s still the same filesystem,
> so the existing state fits, but it kind of seems dangerous to keep e.g.
> files open.  I don’t think it’s really exploitable because everything
> still goes through the guest kernel, but, well.  We should clear the
> state on Init, and probably also implement SET_STATUS and clear the
> state there.
>
I see. That's in line with assuming that GET_VRING_BASE is the last
message received from qemu.
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-20 13:27                     ` Eugenio Pérez
@ 2023-05-08 19:12                       ` Stefan Hajnoczi
  2023-05-09  6:31                         ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-08 19:12 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, Apr 20, 2023 at 03:27:51PM +0200, Eugenio Pérez wrote:
> On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote:
> > On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com>
> > wrote:
> > > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > wrote:
> > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <
> > > > > > eperezma@redhat.com> wrote:
> > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com
> > > > > > > > wrote:
> > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > wrote:
> > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek
> > > > > > > > > > wrote:
> > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > transporting the
> > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > stream.  To do
> > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > state to and
> > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > 
> > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > believe it
> > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > streaming
> > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > user
> > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > addition to
> > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > 
> > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > >   This feature signals support for transferring state, and
> > > > > > > > > > > is added so
> > > > > > > > > > >   that migration can fail early when the back-end has no
> > > > > > > > > > > support.
> > > > > > > > > > > 
> > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > negotiate a pipe
> > > > > > > > > > >   over which to transfer the state.  The front-end sends an
> > > > > > > > > > > FD to the
> > > > > > > > > > >   back-end into/from which it can write/read its state, and
> > > > > > > > > > > the back-end
> > > > > > > > > > >   can decide to either use it, or reply with a different FD
> > > > > > > > > > > for the
> > > > > > > > > > >   front-end to override the front-end's choice.
> > > > > > > > > > >   The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > but maybe
> > > > > > > > > > >   the back-end already has an FD into/from which it has to
> > > > > > > > > > > write/read
> > > > > > > > > > >   its state, in which case it will want to override the
> > > > > > > > > > > simple pipe.
> > > > > > > > > > >   Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > front-end
> > > > > > > > > > >   get an immediate FD for the migration stream (in some
> > > > > > > > > > > cases), in which
> > > > > > > > > > >   case we will want to send this to the back-end instead of
> > > > > > > > > > > creating a
> > > > > > > > > > >   pipe.
> > > > > > > > > > >   Hence the negotiation: If one side has a better idea than
> > > > > > > > > > > a plain
> > > > > > > > > > >   pipe, we will want to use that.
> > > > > > > > > > > 
> > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > through the
> > > > > > > > > > >   pipe (the end indicated by EOF), the front-end invokes
> > > > > > > > > > > this function
> > > > > > > > > > >   to verify success.  There is no in-band way (through the
> > > > > > > > > > > pipe) to
> > > > > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > > > > > 
> > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > migration
> > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > the reading
> > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end
> > > > > > > > > > > will check for
> > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination
> > > > > > > > > > > side includes
> > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > 
> > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > >  hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > >  } VhostSetConfigType;
> > > > > > > > > > > 
> > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end
> > > > > > > > > > > */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > +    /* Transfer state from front-end to back-end (device)
> > > > > > > > > > > */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > +
> > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > 
> > > > > > > > > > vDPA has:
> > > > > > > > > > 
> > > > > > > > > >   /* Suspend a device so it does not process virtqueue
> > > > > > > > > > requests anymore
> > > > > > > > > >    *
> > > > > > > > > >    * After the return of ioctl the device must preserve all
> > > > > > > > > > the necessary state
> > > > > > > > > >    * (the virtqueue vring base plus the possible device
> > > > > > > > > > specific states) that is
> > > > > > > > > >    * required for restoring in the future. The device must not
> > > > > > > > > > change its
> > > > > > > > > >    * configuration after that point.
> > > > > > > > > >    */
> > > > > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > 
> > > > > > > > > >   /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > requests
> > > > > > > > > >    *
> > > > > > > > > >    * After the return of this ioctl the device will have
> > > > > > > > > > restored all the
> > > > > > > > > >    * necessary states and it is fully operational to continue
> > > > > > > > > > processing the
> > > > > > > > > >    * virtqueue descriptors.
> > > > > > > > > >    */
> > > > > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > 
> > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > that the
> > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > It's okay
> > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > avoid
> > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > VHOST_STOP
> > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > change
> > > > > > > > > to SUSPEND.
> > > > > > > > 
> > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > > > > ioctl(VHOST_VDPA_RESUME).
> > > > > > > > 
> > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device
> > > > > > > > can
> > > > > > > > leave the suspended state. Can you clarify this?
> > > > > > > > 
> > > > > > > 
> > > > > > > Do you mean in what situations or regarding the semantics of
> > > > > > > _RESUME?
> > > > > > > 
> > > > > > > To me resume is an operation mainly to resume the device in the
> > > > > > > event
> > > > > > > of a VM suspension, not a migration. It can be used as a fallback
> > > > > > > code
> > > > > > > in some cases of migration failure though, but it is not currently
> > > > > > > used in qemu.
> > > > > > 
> > > > > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > > > > > 
> > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > > > > resets the device in vhost_dev_stop()?
> > > > > > 
> > > > > 
> > > > > The actual reason for not using RESUME is that the ioctl was added
> > > > > > after the SUSPEND design in qemu. Same as this proposal, it was not
> > > > > needed at the time.
> > > > > 
> > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > > > > the vq indexes, and in case of error vhost already fetches them from
> > > > > guest's used ring way before vDPA, so it has little usage.
> > > > > 
> > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > > > > this:
> > > > > > - Saving the device's state is done by SUSPEND followed by
> > > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > > > > savevm command or migration failed), then RESUME is called to
> > > > > > continue.
> > > > > 
> > > > > I think the previous steps make sense at vhost_dev_stop, not virtio
> > > > > savevm handlers. To start spreading this logic to more places of qemu
> > > > > can bring confusion.
> > > > 
> > > > I don't think there is a way around extending the QEMU vhost's code
> > > > model. The current model in QEMU's vhost code is that the backend is
> > > > reset when the VM stops. This model worked fine for stateless devices
> > > > but it doesn't work for stateful devices.
> > > > 
> > > > Imagine a vdpa-gpu device: you cannot reset the device in
> > > > vhost_dev_stop() and expect the GPU to continue working when
> > > > vhost_dev_start() is called again because all its state has been lost.
> > > > The guest driver will send requests that reference virtio-gpu
> > > > resources that no longer exist.
> > > > 
> > > > One solution is to save the device's state in vhost_dev_stop(). I think
> > > > this is what you're suggesting. It requires keeping a copy of the state
> > > > and then loading the state again in vhost_dev_start(). I don't think
> > > > this approach should be used because it requires all stateful devices to
> > > > support live migration (otherwise they break across HMP 'stop'/'cont').
> > > > Also, the device state for some devices may be large and it would also
> > > > become more complicated when iterative migration is added.
> > > > 
> > > > Instead, I think the QEMU vhost code needs to be structured so that
> > > > struct vhost_dev has a suspended state:
> > > > 
> > > >         ,---------.
> > > >         v         |
> > > >   started ------> stopped
> > > >     \   ^
> > > >      \  |
> > > >       -> suspended
> > > > 
> > > > The device doesn't lose state when it enters the suspended state. It can
> > > > be resumed again.
> > > > 
> > > > This is why I think SUSPEND/RESUME need to be part of the solution.
> 
> I just realized that we can add an arrow from suspended to stopped, can't we?
Yes, it could be used in the case of a successful live migration:
[started] -> vhost_dev_suspend() [suspended] -> vhost_dev_stop() [stopped]
> "Started" before seems to imply the device may process descriptors after
> suspend.
Yes, in the case of a failed live migration:
[started] -> vhost_dev_suspend() [suspended] -> vhost_dev_resume() [started]
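To make the transitions concrete, here is a minimal sketch of the state
machine from the diagram, including the extra suspended -> stopped arrow.
All names in it (DevState, DevOp, dev_state_apply) are made up for this
mail and are not QEMU API; vhost_dev_suspend() and vhost_dev_resume() are
only proposals at this point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device states, mirroring the diagram in this thread. */
typedef enum {
    DEV_STOPPED,
    DEV_STARTED,
    DEV_SUSPENDED,
} DevState;

/* Hypothetical operations; suspend/resume are proposals, not existing API. */
typedef enum {
    OP_START,    /* vhost_dev_start() */
    OP_STOP,     /* vhost_dev_stop(): device state is lost */
    OP_SUSPEND,  /* proposed vhost_dev_suspend(): device state is kept */
    OP_RESUME,   /* proposed vhost_dev_resume() */
} DevOp;

/* Apply op to *state. Returns false and leaves *state alone if not allowed. */
bool dev_state_apply(DevState *state, DevOp op)
{
    switch (op) {
    case OP_START:
        if (*state != DEV_STOPPED) {
            return false;
        }
        *state = DEV_STARTED;
        return true;
    case OP_STOP:
        /* Allowed from started and, via the extra arrow, from suspended. */
        if (*state == DEV_STOPPED) {
            return false;
        }
        *state = DEV_STOPPED;
        return true;
    case OP_SUSPEND:
        if (*state != DEV_STARTED) {
            return false;
        }
        *state = DEV_SUSPENDED;
        return true;
    case OP_RESUME:
        if (*state != DEV_SUSPENDED) {
            return false;
        }
        *state = DEV_STARTED;
        return true;
    }
    return false;
}
```

A successful migration would walk start, suspend, stop; a failed one would
walk start, suspend, resume.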
> > > 
> > > I agree with all of this, especially after realizing vhost_dev_stop is
> > > called before the last request of the state in the iterative
> > > migration.
> > > 
> > > However I think we can move faster with the virtiofsd migration code,
> > > as long as we agree on the vhost-user messages it will receive. This
> > > is because we already agree that the state will be sent in one shot
> > > and not iteratively, so it will be small.
> > > 
> > > I understand this may change in the future, that's why I proposed to
> > > start using iterative right now. However it may make little sense if
> > > it is not used in the vhost-user device. I also understand that other
> > > devices may have a bigger state so it will be needed for them.
> > 
> > Can you summarize how you'd like save to work today? I'm not sure what
> > you have in mind.
> > 
> 
> I think we're trying to find a solution that satisfies several goals.  On one
> side, we're assuming that the virtiofsd state will be small enough that it
> will not require iterative migration in the short term.  However, we also want
> to support iterative migration, for the sake of *other* future vhost devices
> that may need it.
> 
> I also think we should prioritize the protocol's stability, in the sense of not
> adding calls that we will not reuse for iterative LM, since the vhost-user
> protocol is more important to maintain than the qemu migration.
> 
> Implementing the changes you mention will be needed in the future.  But we have
> already established that the virtiofsd state is small, so we can just fetch it
> at the same time as we send the VHOST_USER_GET_VRING_BASE message, and send the
> state with the proposed non-iterative approach.
VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a
specific virtqueue but not the whole device. Unfortunately stopping all
virtqueues is not the same as SUSPEND since spontaneous device activity
is possible independent of any virtqueue (e.g. virtio-scsi events and
maybe virtio-net link status).
That's why I think SUSPEND is necessary for a solution that's generic
enough to cover all device types.
> If we agree on that, now the question is how to fetch them from the device.  The
> answers are a little bit scattered in the mail threads, but I think we agree on:
> a) We need to signal that the device must stop processing requests.
> b) We need a way for the device to dump the state.
> 
> At this moment I think any proposal satisfies a), and the pipe is the better
> fit for b).  With proper backend feature flags, the device may support writing
> to the pipe before SUSPEND, so we can implement iterative migration on top.
> 
> Does that make sense?
Yes, and that sounds like what Hanna is proposing for b) plus our
discussion about SUSPEND/RESUME in order to achieve a).
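Roughly, the non-iterative save path I have in mind looks like the sketch
below. It is only an illustration under assumptions: the StateOps callbacks
and save_device_state() are hypothetical names, not existing QEMU or
vhost-user API, and the FakeDev back-end exists only to exercise the
sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical back-end operations; none of these names are QEMU API. */
typedef struct {
    int (*suspend)(void *opaque);        /* a) stop all device activity */
    int (*transfer_state)(void *opaque); /* b) e.g. SET_DEVICE_STATE_FD,
                                          *    then read the pipe to EOF */
    int (*check_state)(void *opaque);    /* e.g. CHECK_DEVICE_STATE */
    int (*resume)(void *opaque);         /* continue after a failure */
} StateOps;

/*
 * Save the device state. Returns 0 on success and leaves the device
 * suspended; on failure the device is resumed and a negative value is
 * returned.
 */
int save_device_state(const StateOps *ops, void *opaque)
{
    int ret = ops->suspend(opaque);

    if (ret < 0) {
        return ret; /* never suspended, nothing to undo */
    }
    ret = ops->transfer_state(opaque);
    if (ret == 0) {
        ret = ops->check_state(opaque);
    }
    if (ret < 0) {
        ops->resume(opaque); /* failed save: let the guest continue */
    }
    return ret;
}

/* Tiny fake back-end, used only to exercise the sequence above. */
typedef struct {
    bool suspended;
    bool fail_transfer;
} FakeDev;

int fake_suspend(void *o) { ((FakeDev *)o)->suspended = true; return 0; }
int fake_transfer(void *o) { return ((FakeDev *)o)->fail_transfer ? -1 : 0; }
int fake_check(void *o) { (void)o; return 0; }
int fake_resume(void *o) { ((FakeDev *)o)->suspended = false; return 0; }

const StateOps fake_ops = {
    fake_suspend, fake_transfer, fake_check, fake_resume,
};
```

Whether the caller then stops or resumes the device after a successful save
(migration vs. savevm) is left outside this helper on purpose.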
Stefan
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-08 17:51               ` Eugenio Perez Martin
@ 2023-05-08 19:31                 ` Eugenio Perez Martin
  2023-05-09  8:59                   ` Hanna Czenczek
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-08 19:31 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Stefan Hajnoczi,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, May 8, 2023 at 7:51 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> >
> > On 05.05.23 16:37, Hanna Czenczek wrote:
> > > On 05.05.23 16:26, Eugenio Perez Martin wrote:
> > >> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com>
> > >> wrote:
> > >>> (By the way, thanks for the explanations :))
> > >>>
> > >>> On 05.05.23 11:03, Hanna Czenczek wrote:
> > >>>> On 04.05.23 23:14, Stefan Hajnoczi wrote:
> > >>> [...]
> > >>>
> > >>>>> I think it's better to change QEMU's vhost code
> > >>>>> to leave stateful devices suspended (but not reset) across
> > >>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
> > >>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
> > >>>>> this aspect?
> > >>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s
> > >>>> meant by suspending instead of resetting when the VM is stopped.
> > >>> So, now looking at vhost_dev_stop(), one problem I can see is that
> > >>> depending on the back-end, different operations it does will do
> > >>> different things.
> > >>>
> > >>> It tries to stop the whole device via vhost_ops->vhost_dev_start(),
> > >>> which for vDPA will suspend the device, but for vhost-user will
> > >>> reset it
> > >>> (if F_STATUS is there).
> > >>>
> > >>> It disables all vrings, which doesn’t mean stopping, but may be
> > >>> necessary, too.  (I haven’t yet really understood the use of disabled
> > >>> vrings, I heard that virtio-net would have a need for it.)
> > >>>
> > >>> It then also stops all vrings, though, so that’s OK.  And because this
> > >>> will always do GET_VRING_BASE, this is actually always the same
> > >>> regardless of transport.
> > >>>
> > >>> Finally (for this purpose), it resets the device status via
> > >>> vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and
> > >>> this is what resets the device there.
> > >>>
> > >>>
> > >>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does
> > >>> so in .vhost_reset_status.  It would seem better to me if vhost-user
> > >>> would also reset the device only in .vhost_reset_status, not in
> > >>> .vhost_dev_start.  .vhost_dev_start seems precisely like the place to
> > >>> run SUSPEND/RESUME.
> > >>>
> > >> I think the same. I just saw it's been proposed at [1].
> > >>
> > >>> Another question I have (but this is basically what I wrote in my last
> > >>> email) is why we even call .vhost_reset_status here.  If the device
> > >>> and/or all of the vrings are already stopped, why do we need to reset
> > >>> it?  Naïvely, I had assumed we only really need to reset the device if
> > >>> the guest changes, so that a new guest driver sees a freshly
> > >>> initialized
> > >>> device.
> > >>>
> > >> I don't know why we didn't need to call it :). I'm assuming the
> > >> previous vhost-user net did fine resetting vq indexes, using
> > >> VHOST_USER_SET_VRING_BASE. But I don't know about more complex
> > >> devices.
> > >>
> > >> The guest can reset the device, or write 0 to the PCI config status,
> > >> at any time. How does virtiofs handle it, being stateful?
> > >
> > > Honestly a good question because virtiofsd implements neither
> > > SET_STATUS nor RESET_DEVICE.  I’ll have to investigate that.
> > >
> > > I think when the guest resets the device, SET_VRING_BASE always comes
> > > along some way or another, so that’s how the vrings are reset.  Maybe
> > > the internal state is reset only following more high-level FUSE
> > > commands like INIT.
> >
> > So a meeting and one session of looking-into-the-code later:
> >
> > We reset every virt queue on GET_VRING_BASE, which is wrong, but happens
> > to serve the purpose.  (German is currently on that.)
> >
> > In our meeting, German said the reset would occur when the memory
> > regions are changed, but I can’t see that in the code.
>
> That would imply that the status is reset when the guest's memory is
> added or removed?
>
> > I think it only
> > happens implicitly through the SET_VRING_BASE call, which resets the
> > internal avail/used pointers.
> >
> > [This doesn’t seem different from libvhost-user, though, which
> > implements neither SET_STATUS nor RESET_DEVICE, and which pretends to
> > reset the device on RESET_OWNER, but really doesn’t (its
> > vu_reset_device_exec() function just disables all vrings, doesn’t reset
> > or even stop them).]
> >
> > Consequently, the internal state is never reset.  It would be cleared on
> > a FUSE Destroy message, but if you just force-reset the system, the
> > state remains into the next reboot.  Not even FUSE Init clears it, which
> > seems weird.  It happens to work because it’s still the same filesystem,
> > so the existing state fits, but it kind of seems dangerous to keep e.g.
> > files open.  I don’t think it’s really exploitable because everything
> > still goes through the guest kernel, but, well.  We should clear the
> > state on Init, and probably also implement SET_STATUS and clear the
> > state there.
> >
>
> I see. That's in line with assuming GET_VRING_BASE is the last
> message received from qemu.
>
Actually, does it prevent device recovery after a failure in
migration? Is the same state set for the device?
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-04-20 13:29                   ` Eugenio Pérez
@ 2023-05-08 20:10                     ` Stefan Hajnoczi
  2023-05-09  6:45                       ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-08 20:10 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > wrote:
> > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > wrote:
> > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > wrote:
> > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > wrote:
> > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > transporting the
> > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > stream.  To do
> > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > state to and
> > > > > > > > > > from virtiofsd.
> > > > > > > > > > 
> > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > believe it
> > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > streaming
> > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > user
> > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > addition to
> > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > 
> > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > added so
> > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > support.
> > > > > > > > > > 
> > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > negotiate a pipe
> > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > FD to the
> > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > the back-end
> > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > for the
> > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > but maybe
> > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > write/read
> > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > simple pipe.
> > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > front-end
> > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > cases), in which
> > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > creating a
> > > > > > > > > >    pipe.
> > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > plain
> > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > 
> > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > through the
> > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > function
> > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > pipe) to
> > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > 
> > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > migration
> > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > the reading
> > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > check for
> > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > includes
> > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > 
> > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > 
> > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > +
> > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > vDPA has:
> > > > > > > > > 
> > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > anymore
> > > > > > > > >     *
> > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > necessary state
> > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > specific states) that is
> > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > change its
> > > > > > > > >     * configuration after that point.
> > > > > > > > >     */
> > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > 
> > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > requests
> > > > > > > > >     *
> > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > restored all the
> > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > processing the
> > > > > > > > >     * virtqueue descriptors.
> > > > > > > > >     */
> > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > 
> > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > that the
> > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > It's okay
> > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > avoid
> > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > 
> > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > VHOST_STOP
> > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > change
> > > > > > > > to SUSPEND.
> > > > > > > > 
> > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > and
> > > > > > > > > we trust in the messages and their semantics in my opinion. In other
> > > > > > > > words, instead of
> > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > send
> > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > command.
> > > > > > > > 
> > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > it
> > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > 
> > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > ok.
> > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > number
> > > > > > > > of states is high.
> > > > > > > Hi Eugenio,
> > > > > > > Another question about vDPA suspend/resume:
> > > > > > > 
> > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > >                        bool vrings)
> > > > > > >    {
> > > > > > >        int i;
> > > > > > > 
> > > > > > >        /* should only be called after backend is connected */
> > > > > > >        assert(hdev->vhost_ops);
> > > > > > > >        event_notifier_test_and_clear(
> > > > > > > >            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > 
> > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > 
> > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > >            ^^^ SUSPEND ^^^
> > > > > > >        }
> > > > > > >        if (vrings) {
> > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > >        }
> > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > >                                 vdev,
> > > > > > >                                 hdev->vqs + i,
> > > > > > >                                 hdev->vq_index + i);
> > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > >        }
> > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > >            ^^^ reset device^^^
> > > > > > > 
> > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > ->
> > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > 
> > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > 
> > > > > > > Regarding state like the MAC or MQ configuration, SVQ is active on
> > > > > > > the CVQ for the whole VM run, so it can track all of that state in
> > > > > > > the device model too.
> > > > > > 
> > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > cross-backends migrations, etc.
> > > > > > 
> > > > > > Does that answer your question?
> > > > > I think you're confirming that changes would be necessary in order for
> > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > 
> > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > where state can be routed through classical emulated devices. This is
> > > > > how vhost-kernel and vhost-user have classically done it. It also
> > > > > allows cross-backend migration, leaves the qemu migration state
> > > > > unmodified, etc.
> > > > > 
> > > > > Introducing this opaque state to qemu, which must be fetched after the
> > > > > suspend and not before, requires changes in the vhost protocol, as
> > > > > discussed previously.
> > > > 
> > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > stateful
> > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > correct?
> > > > > > > 
> > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > properly in the destination has not been merged.
> > > > > I'm not sure what you mean by elsewhere?
> > > > > 
> > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > ioctls, but mostly in qemu.
> > > > 
> > > > > If you meant stateful as "it must have a state blob that must be
> > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > the state blob at about the same time as the vq indexes. But yes,
> > > > > changes (at least a new ioctl) are needed for that.
> > > > 
> > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > 
> > > > > In order to save device state from the vDPA device in the future, it
> > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > the device state can be saved before the device is reset.
> > > > > 
> > > > > Does that sound right?
> > > > > 
> > > > The split between suspend and reset was added recently for that very
> > > > reason. In all the virtio devices, the frontend is initialized before
> > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > cleanup. Especially if we have already established that the state is
> > > > > small enough to not need iterative migration from virtiofsd's point
> > > > > of view.
> > > > 
> > > > If fetching that state at the same time as vq indexes is not valid,
> > > > could it follow the same model as the "in-flight descriptors"?
> > > > vhost-user follows them by using a shared memory region where their
> > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > backend crashes, and does not forbid the cross-backends live migration
> > > > as all the information is there to recover them.
> > > > 
> > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > a possibility is to synchronize this memory region after a
> > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > devices are not going to crash in the software sense, so all use cases
> > > > remain the same to qemu. And that shared memory information is
> > > > recoverable after vhost_dev_stop.
> > > > 
> > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > region where it dumps the state, maybe only after the
> > > > set_state(STATE_PHASE_STOPPED)?
> > > 
> > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > mandatory anyway.
> > > 
> > > As for the shared memory, the RFC before this series used shared memory,
> > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > provide it to the front-end, is that right?  That could work like this:
> > > 
> > > On the source side:
> > > 
> > > S1. SUSPEND goes to virtiofsd
> > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > serializes its state into a newly allocated shared memory area[1]
> > > S3. virtiofsd responds to SUSPEND
> > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > maybe already closes its reference
> > > S5. front-end saves state, closes its handle, freeing the SHM
> > > 
> > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > it can immediately allocate this area and serialize directly into it;
> > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > fundamental problem, but there are limitations around what you can do
> > > with serde implementations in Rust…
> > > 
> > > On the destination side:
> > > 
> > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > to SUSPEND
> > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > previous area, and now uses this one
> > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > 
> > > Couple of questions:
> > > 
> > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > would imply to deserialize a state, and the state is to be transferred
> > > through SHM, this is what would need to be done.  So maybe we should
> > > skip SUSPEND on the destination?
> > > B. You described that the back-end should supply the SHM, which works
> > > well on the source.  On the destination, only the front-end knows how
> > > big the state is, so I’ve decided above that it should allocate the SHM
> > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > (In which case the front-end would need to tell the back-end how big the
> > > state SHM needs to be.)
> > 
> > How does this work for iterative live migration?
> > 
> 
> A pipe will always fit better for iterative migration from qemu's POV, that's
> for sure, especially if we want to keep that opaqueness.
> 
> But we will need to communicate with the HW device using shared memory sooner
> or later for big states.  If we don't transform it in qemu, we will need to do
> it in the kernel.  Also, the pipe will not allow recovering from daemon
> crashes.
>
> Again I'm just putting this on the table, just in case it fits better or it is
> convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> missed some feedback useful here.  I think the pipe is a better solution in the
> long run because of the iterative part.
Pipes and shared memory are conceptually equivalent for building
streaming interfaces. It's just more complex to design a shared memory
interface and it reinvents what file descriptors already offer.
I have no doubt we could design iterative migration over a shared memory
interface if we needed to, but I'm not sure why we would. When you mention
hardware, are you suggesting defining a standard memory/register layout
that hardware implements and mapping it to userspace (QEMU)? Is there a
big advantage to exposing memory versus a file descriptor?
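As an example of what file descriptors already offer: the reader gets
end-of-stream signalling for free by reading until EOF, whereas a shared
memory design would need its own length field or completion marker. A
minimal sketch using only POSIX calls follows; read_until_eof() is a name
made up for this mail, not a QEMU function.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Read from fd until EOF. On success returns 0, stores a malloc()ed buffer
 * in *out (the caller frees it) and its length in *len. On error returns
 * -errno and leaves *out and *len untouched.
 */
int read_until_eof(int fd, uint8_t **out, size_t *len)
{
    size_t cap = 4096, used = 0;
    uint8_t *buf = malloc(cap);

    if (!buf) {
        return -ENOMEM;
    }
    for (;;) {
        ssize_t n;

        if (used == cap) {
            /* Buffer is full: double it before the next read. */
            uint8_t *tmp = realloc(buf, cap * 2);

            if (!tmp) {
                free(buf);
                return -ENOMEM;
            }
            buf = tmp;
            cap *= 2;
        }
        n = read(fd, buf + used, cap - used);
        if (n < 0) {
            int err = errno;

            if (err == EINTR) {
                continue; /* interrupted by a signal, just retry */
            }
            free(buf);
            return -err;
        }
        if (n == 0) {
            break; /* EOF: the writer closed its end */
        }
        used += (size_t)n;
    }
    *out = buf;
    *len = used;
    return 0;
}
```

Note that EOF only marks the end of the data. As the patch description says,
there is no in-band way to signal failure, so success still has to be
confirmed separately (CHECK_DEVICE_STATE in Hanna's proposal).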
Stefan
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05 12:51         ` Hanna Czenczek
@ 2023-05-08 21:10           ` Stefan Hajnoczi
  2023-05-09  8:53             ` Hanna Czenczek
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-08 21:10 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Eugenio Perez Martin, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote:
> On 05.05.23 11:53, Eugenio Perez Martin wrote:
> > On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > On 04.05.23 23:14, Stefan Hajnoczi wrote:
> > > > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:
> 
> [...]
> 
> > > > All state is lost and the Device Initialization process
> > > > must be followed to make the device operational again.
> > > > 
> > > > Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
> > > > 
> > > > It's messy and not your fault. I think QEMU should solve this by
> > > > treating stateful devices differently from non-stateful devices. That
> > > > way existing vhost-user backends continue to work and new stateful
> > > > devices can also be supported.
> > > It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for
> > > stateful devices.  In a previous email, you wrote that these should
> > > implement SUSPEND+RESUME so qemu can use those instead.  But those are
> > > separate things, so I assume we just use SET_STATUS 0 when stopping the
> > > VM because this happens to also stop processing vrings as a side effect?
> > > 
> > > I.e. I understand “treating stateful devices differently” to mean that
> > > qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end
> > > supports it, and stateful back-ends should support it.
> > > 
> > Honestly I cannot think of any use case where the vhost-user backend
> > did not ignore set_status(0) and had to retrieve vq states. So maybe
> > we can totally remove that call from qemu?
> 
> I don’t know so I can’t really say; but I don’t quite understand why qemu
> would reset a device at any point but perhaps VM reset (and even then I’d
> expect the post-reset guest to just reset the device on boot by itself,
> too).
DPDK stores the Device Status field value and uses it later:
https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791
While DPDK performs no immediate action upon SET_STATUS 0, omitting the
message will change the behavior of other DPDK code like
virtio_is_ready().
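As a rough illustration of why the message still matters for a back-end
that merely caches the value, here is a minimal sketch.  It is not DPDK's
code: struct backend, backend_set_status() and backend_is_ready() are
hypothetical names, and only the status bit values come from the VIRTIO
specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device Status bits as defined by the VIRTIO specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE 0x01
#define VIRTIO_STATUS_DRIVER      0x02
#define VIRTIO_STATUS_DRIVER_OK   0x04
#define VIRTIO_STATUS_FEATURES_OK 0x08

/* Hypothetical back-end state; this is not DPDK's actual structure. */
struct backend {
    uint8_t status; /* last value received via SET_STATUS */
};

/* SET_STATUS handler: only records the value, takes no immediate action. */
static void backend_set_status(struct backend *b, uint8_t status)
{
    b->status = status;
}

/* Later code consults the cached value to decide whether to run. */
static bool backend_is_ready(const struct backend *b)
{
    return (b->status & VIRTIO_STATUS_DRIVER_OK) != 0;
}
```

If the front-end stopped sending SET_STATUS 0, the cached value in such a
back-end would keep DRIVER_OK set, and the readiness check would keep
returning true after the device was meant to be reset.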
Changing the semantics of the vhost-user protocol in a way that's not
backwards compatible is something we should avoid unless there is no
other way.
The fundamental problem is that QEMU's vhost code is designed to reset
vhost devices because it assumes they are stateless. If an F_SUSPEND
protocol feature bit is added, then it becomes possible to detect new
backends and suspend/resume them rather than reset them.
That's the solution that I favor because it's backwards compatible and
the same model can be applied to stateful vDPA devices in the future.
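A minimal sketch of that detection, assuming a protocol feature bit for
suspend exists.  The bit number and all names here (HYP_PROTOCOL_F_SUSPEND,
choose_stop_action()) are hypothetical; no bit has been assigned in this
thread.

```c
#include <stdint.h>

/* Hypothetical bit position; nothing in the protocol assigns it yet. */
#define HYP_PROTOCOL_F_SUSPEND 62

enum stop_action {
    STOP_BY_RESET,   /* legacy back-end: keep today's reset behavior */
    STOP_BY_SUSPEND, /* new back-end: suspend now, resume later */
};

/* Pick the stop strategy from the negotiated protocol feature bits. */
static enum stop_action choose_stop_action(uint64_t protocol_features)
{
    if (protocol_features & (UINT64_C(1) << HYP_PROTOCOL_F_SUSPEND)) {
        return STOP_BY_SUSPEND;
    }
    return STOP_BY_RESET;
}
```

Back-ends that never advertise the bit see exactly the messages they see
today, which is what keeps this approach backwards compatible.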
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-05-08 19:12                       ` Stefan Hajnoczi
@ 2023-05-09  6:31                         ` Eugenio Perez Martin
  2023-05-09  9:01                           ` Hanna Czenczek
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-09  6:31 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 20, 2023 at 03:27:51PM +0200, Eugenio Pérez wrote:
> > On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote:
> > > On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com>
> > > wrote:
> > > > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > > > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > > wrote:
> > > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <
> > > > > > > eperezma@redhat.com> wrote:
> > > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com
> > > > > > > > > wrote:
> > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek
> > > > > > > > > > > wrote:
> > > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > > transporting the
> > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > > stream.  To do
> > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > > state to and
> > > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > >
> > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > > believe it
> > > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > > streaming
> > > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > > user
> > > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > > addition to
> > > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > >
> > > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > > >   This feature signals support for transferring state, and
> > > > > > > > > > > > is added so
> > > > > > > > > > > >   that migration can fail early when the back-end has no
> > > > > > > > > > > > support.
> > > > > > > > > > > >
> > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > > negotiate a pipe
> > > > > > > > > > > >   over which to transfer the state.  The front-end sends an
> > > > > > > > > > > > FD to the
> > > > > > > > > > > >   back-end into/from which it can write/read its state, and
> > > > > > > > > > > > the back-end
> > > > > > > > > > > >   can decide to either use it, or reply with a different FD
> > > > > > > > > > > > for the
> > > > > > > > > > > >   front-end to override the front-end's choice.
> > > > > > > > > > > >   The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > > but maybe
> > > > > > > > > > > >   the back-end already has an FD into/from which it has to
> > > > > > > > > > > > write/read
> > > > > > > > > > > >   its state, in which case it will want to override the
> > > > > > > > > > > > simple pipe.
> > > > > > > > > > > >   Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > > front-end
> > > > > > > > > > > >   get an immediate FD for the migration stream (in some
> > > > > > > > > > > > cases), in which
> > > > > > > > > > > >   case we will want to send this to the back-end instead of
> > > > > > > > > > > > creating a
> > > > > > > > > > > >   pipe.
> > > > > > > > > > > >   Hence the negotiation: If one side has a better idea than
> > > > > > > > > > > > a plain
> > > > > > > > > > > >   pipe, we will want to use that.
> > > > > > > > > > > >
> > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > > through the
> > > > > > > > > > > >   pipe (the end indicated by EOF), the front-end invokes
> > > > > > > > > > > > this function
> > > > > > > > > > > >   to verify success.  There is no in-band way (through the
> > > > > > > > > > > > pipe) to
> > > > > > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > > > > > >
> > > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > > migration
> > > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > > the reading
> > > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end
> > > > > > > > > > > > will check for
> > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination
> > > > > > > > > > > > side includes
> > > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > >
> > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > > >  hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > > >  } VhostSetConfigType;
> > > > > > > > > > > >
> > > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end
> > > > > > > > > > > > */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > > +    /* Transfer state from front-end to back-end (device)
> > > > > > > > > > > > */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > > +
> > > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > >
> > > > > > > > > > > vDPA has:
> > > > > > > > > > >
> > > > > > > > > > >   /* Suspend a device so it does not process virtqueue
> > > > > > > > > > > requests anymore
> > > > > > > > > > >    *
> > > > > > > > > > >    * After the return of ioctl the device must preserve all
> > > > > > > > > > > the necessary state
> > > > > > > > > > >    * (the virtqueue vring base plus the possible device
> > > > > > > > > > > specific states) that is
> > > > > > > > > > >    * required for restoring in the future. The device must not
> > > > > > > > > > > change its
> > > > > > > > > > >    * configuration after that point.
> > > > > > > > > > >    */
> > > > > > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > >
> > > > > > > > > > >   /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > > requests
> > > > > > > > > > >    *
> > > > > > > > > > >    * After the return of this ioctl the device will have
> > > > > > > > > > > restored all the
> > > > > > > > > > >    * necessary states and it is fully operational to continue
> > > > > > > > > > > processing the
> > > > > > > > > > >    * virtqueue descriptors.
> > > > > > > > > > >    */
> > > > > > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > >
> > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > > that the
> > > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > > It's okay
> > > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > > avoid
> > > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > > VHOST_STOP
> > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > > change
> > > > > > > > > > to SUSPEND.
> > > > > > > > >
> > > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > > > > > ioctl(VHOST_VDPA_RESUME).
> > > > > > > > >
> > > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device
> > > > > > > > > can
> > > > > > > > > leave the suspended state. Can you clarify this?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Do you mean in what situations or regarding the semantics of
> > > > > > > > _RESUME?
> > > > > > > >
> > > > > > > > To me resume is an operation mainly to resume the device in the
> > > > > > > > event
> > > > > > > > of a VM suspension, not a migration. It can be used as a fallback
> > > > > > > > code
> > > > > > > > in some cases of migration failure though, but it is not currently
> > > > > > > > used in qemu.
> > > > > > >
> > > > > > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > > > > > >
> > > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > > > > > resets the device in vhost_dev_stop()?
> > > > > > >
> > > > > >
> > > > > > The actual reason for not using RESUME is that the ioctl was added
> > > > > > after the SUSPEND design in qemu. Same as this proposal, it was not
> > > > > > needed at the time.
> > > > > >
> > > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > > > > > the vq indexes, and in case of error vhost already fetches them from
> > > > > > guest's used ring way before vDPA, so it has little usage.
> > > > > >
> > > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > > > > > this:
> > > > > > > - Saving the device's state is done by SUSPEND followed by
> > > > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > > > > > savevm command or migration failed), then RESUME is called to
> > > > > > > continue.
> > > > > >
> > > > > > I think the previous steps make sense at vhost_dev_stop, not virtio
> > > > > > savevm handlers. To start spreading this logic to more places of qemu
> > > > > > can bring confusion.
> > > > >
> > > > > I don't think there is a way around extending the QEMU vhost's code
> > > > > model. The current model in QEMU's vhost code is that the backend is
> > > > > reset when the VM stops. This model worked fine for stateless devices
> > > > > but it doesn't work for stateful devices.
> > > > >
> > > > > Imagine a vdpa-gpu device: you cannot reset the device in
> > > > > vhost_dev_stop() and expect the GPU to continue working when
> > > > > vhost_dev_start() is called again because all its state has been lost.
> > > > > The guest driver will send requests that reference virtio-gpu
> > > > > resources that no longer exist.
> > > > >
> > > > > One solution is to save the device's state in vhost_dev_stop(). I think
> > > > > this is what you're suggesting. It requires keeping a copy of the state
> > > > > and then loading the state again in vhost_dev_start(). I don't think
> > > > > this approach should be used because it requires all stateful devices to
> > > > > support live migration (otherwise they break across HMP 'stop'/'cont').
> > > > > Also, the device state for some devices may be large and it would also
> > > > > become more complicated when iterative migration is added.
> > > > >
> > > > > Instead, I think the QEMU vhost code needs to be structured so that
> > > > > struct vhost_dev has a suspended state:
> > > > >
> > > > >         ,---------.
> > > > >         v         |
> > > > >   started ------> stopped
> > > > >     \   ^
> > > > >      \  |
> > > > >       -> suspended
> > > > >
> > > > > The device doesn't lose state when it enters the suspended state. It can
> > > > > be resumed again.
> > > > >
> > > > > This is why I think SUSPEND/RESUME need to be part of the solution.
> >
> > I just realized that we can add an arrow from suspended to stopped, can't we?
>
> Yes, it could be used in the case of a successful live migration:
> [started] -> vhost_dev_suspend() [suspended] -> vhost_dev_stop() [stopped]
>
> > "Started" before seems to imply the device may process descriptors after
> > suspend.
>
> Yes, in the case of a failed live migration:
> [started] -> vhost_dev_suspend() [suspended] -> vhost_dev_resume() [started]
>
I meant "the device may (i.e. is allowed to) process descriptors after
suspend and before stopped". I think we have the same view here; I am just
trying to specify the semantics as completely as possible :).
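To pin the transitions down, the diagram plus the extra suspended ->
stopped arrow can be written as a small table of allowed moves.  This is
only a sketch: enum dev_state and transition_allowed() are hypothetical
names and do not exist in qemu.

```c
#include <stdbool.h>

/* Hypothetical device states, mirroring the diagram quoted above. */
enum dev_state { DEV_STOPPED, DEV_STARTED, DEV_SUSPENDED };

/* Return true if the diagram (with the added arrow) allows from -> to. */
static bool transition_allowed(enum dev_state from, enum dev_state to)
{
    switch (from) {
    case DEV_STOPPED:
        return to == DEV_STARTED;
    case DEV_STARTED:
        /* plain stop, or suspend (state is preserved) */
        return to == DEV_STOPPED || to == DEV_SUSPENDED;
    case DEV_SUSPENDED:
        /* failed migration: resume; successful migration: stop */
        return to == DEV_STARTED || to == DEV_STOPPED;
    }
    return false;
}
```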
> > > >
> > > > I agree with all of this, especially after realizing vhost_dev_stop is
> > > > called before the last request of the state in the iterative
> > > > migration.
> > > >
> > > > However I think we can move faster with the virtiofsd migration code,
> > > > as long as we agree on the vhost-user messages it will receive. This
> > > > is because we already agree that the state will be sent in one shot
> > > > and not iteratively, so it will be small.
> > > >
> > > > I understand this may change in the future, that's why I proposed to
> > > > start using iterative right now. However it may make little sense if
> > > > it is not used in the vhost-user device. I also understand that other
> > > > devices may have a bigger state so it will be needed for them.
> > >
> > > Can you summarize how you'd like save to work today? I'm not sure what
> > > you have in mind.
> > >
> >
> > I think we're trying to find a solution that satisfies many things.  On the
> > one hand, we're assuming that the virtiofsd state will be small enough that
> > it will not require iterative migration in the short term.  However, we
> > also want to support iterative migration, for the sake of *other* future
> > vhost devices that may need it.
> >
> > I also think we should prioritize the protocol's stability, in the sense of not
> > adding calls that we will not reuse for iterative LM.  The vhost-user protocol
> > is more important to maintain than the qemu migration.
> >
> > The changes you mention will need to be implemented in the future.  But we have
> > already established that the virtiofsd state is small, so we can just fetch it
> > at the same time as we send the VHOST_USER_GET_VRING_BASE message and send the
> > status with the proposed non-iterative approach.
>
> VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a
> specific virtqueue but not the whole device. Unfortunately stopping all
> virtqueues is not the same as SUSPEND since spontaneous device activity
> is possible independent of any virtqueue (e.g. virtio-scsi events and
> maybe virtio-net link status).
>
> That's why I think SUSPEND is necessary for a solution that's generic
> enough to cover all device types.
>
I agree.
In particular, virtiofsd already resets the whole device at
VHOST_USER_GET_VRING_BASE, if I'm not wrong, so that's even more of a
reason to implement a suspend call.
Thanks!
> > If we agree on that, now the question is how to fetch them from the device.  The
> > answers are a little bit scattered in the mail threads, but I think we agree on:
> > a) We need to signal that the device must stop processing requests.
> > b) We need a way for the device to dump the state.
> >
> > At this moment I think any proposal satisfies a), and the pipe satisfies b) better.
> > With proper backend feature flags, the device may support starting to write to
> > the pipe before SUSPEND so we can implement iterative migration on top.
> >
> > Does that make sense?
>
> Yes, and that sounds like what Hanna is proposing for b) plus our
> discussion about SUSPEND/RESUME in order to achieve a).
>
> Stefan
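For b), the pipe semantics described in the patch (the sender writes its
blob and closes its end, the receiver reads until EOF) can be sketched as
below.  The helper names write_all() and read_until_eof() are hypothetical
and are not taken from the patch; the out-of-band success check
(CHECK_DEVICE_STATE) is deliberately not shown.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Read from fd until EOF into a malloc()ed buffer stored in *out.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_until_eof(int fd, char **out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);

    if (!buf) {
        return -1;
    }
    for (;;) {
        ssize_t n;

        if (len == cap) { /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (!tmp) {
                free(buf);
                return -1;
            }
            buf = tmp;
            cap *= 2;
        }
        n = read(fd, buf + len, cap - len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            free(buf);
            return -1;
        }
        if (n == 0) { /* EOF: the writer closed its end */
            break;
        }
        len += (size_t)n;
    }
    *out = buf;
    return (ssize_t)len;
}
```

EOF only says that the writer closed the pipe, not that the state was
complete or valid, which is why the patch adds a separate check after the
transfer.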
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-05-08 20:10                     ` Stefan Hajnoczi
@ 2023-05-09  6:45                       ` Eugenio Perez Martin
  2023-05-09 15:09                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-09  6:45 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > wrote:
> > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > wrote:
> > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > wrote:
> > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > wrote:
> > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > transporting the
> > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > stream.  To do
> > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > state to and
> > > > > > > > > > > from virtiofsd.
> > > > > > > > > > >
> > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > believe it
> > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > streaming
> > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > user
> > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > addition to
> > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > >
> > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > > added so
> > > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > > support.
> > > > > > > > > > >
> > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > negotiate a pipe
> > > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > > FD to the
> > > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > > the back-end
> > > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > > for the
> > > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > but maybe
> > > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > > write/read
> > > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > > simple pipe.
> > > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > front-end
> > > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > > cases), in which
> > > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > > creating a
> > > > > > > > > > >    pipe.
> > > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > > plain
> > > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > >
> > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > through the
> > > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > > function
> > > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > > pipe) to
> > > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > >
> > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > migration
> > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > the reading
> > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > > check for
> > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > > includes
> > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > >
> > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > >
> > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > +
> > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > vDPA has:
> > > > > > > > > >
> > > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > > anymore
> > > > > > > > > >     *
> > > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > > necessary state
> > > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > > specific states) that is
> > > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > > change its
> > > > > > > > > >     * configuration after that point.
> > > > > > > > > >     */
> > > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > >
> > > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > requests
> > > > > > > > > >     *
> > > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > > restored all the
> > > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > > processing the
> > > > > > > > > >     * virtqueue descriptors.
> > > > > > > > > >     */
> > > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > >
> > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > that the
> > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > It's okay
> > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > avoid
> > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > >
> > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > VHOST_STOP
> > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > change
> > > > > > > > > to SUSPEND.
> > > > > > > > >
> > > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > > and
> > > > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > > > words, instead of
> > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > > send
> > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > > command.
> > > > > > > > >
> > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > > it
> > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > >
> > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > > ok.
> > > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > > number
> > > > > > > > > of states is high.
> > > > > > > > Hi Eugenio,
> > > > > > > > Another question about vDPA suspend/resume:
> > > > > > > >
> > > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > > bool vrings)
> > > > > > > >    {
> > > > > > > >        int i;
> > > > > > > >
> > > > > > > >        /* should only be called after backend is connected */
> > > > > > > >        assert(hdev->vhost_ops);
> > > > > > > >        event_notifier_test_and_clear(
> > > > > > > >            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > >
> > > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > >
> > > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > > >            ^^^ SUSPEND ^^^
> > > > > > > >        }
> > > > > > > >        if (vrings) {
> > > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > > >        }
> > > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > > >                                 vdev,
> > > > > > > >                                 hdev->vqs + i,
> > > > > > > >                                 hdev->vq_index + i);
> > > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > > >        }
> > > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > > >            ^^^ reset device^^^
> > > > > > > >
> > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > > ->
> > > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > >
> > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > >
> > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > > model too.
> > > > > > >
> > > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > > cross-backends migrations, etc.
> > > > > > >
> > > > > > > Does that answer your question?
> > > > > > I think you're confirming that changes would be necessary in order for
> > > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > >
> > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > where state can be routed through classical emulated devices. This is
> > > > > how vhost-kernel and vhost-user do classically. And it allows
> > > > > cross-backend, to not modify qemu migration state, etc.
> > > > >
> > > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > > suspend and not before, requires changes in vhost protocol, as
> > > > > discussed previously.
> > > > >
> > > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > > stateful
> > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > > correct?
> > > > > > > >
> > > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > > properly in the destination has not been merged.
> > > > > > I'm not sure what you mean by elsewhere?
> > > > > >
> > > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > > ioctls, but mostly in qemu.
> > > > >
> > > > > > If you meant stateful as "it must have a state blob that must be
> > > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > > the state blob at about the same time as the vq indexes. But yes, changes
> > > > > > (at least a new ioctl) are needed for that.
> > > > >
> > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > >
> > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > the device state can be saved before the device is reset.
> > > > > >
> > > > > > Does that sound right?
> > > > > >
> > > > > The split between suspend and reset was added recently for that very
> > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > > cleanup. Especially if we have already established that the state is
> > > > > > small enough not to need iterative migration from virtiofsd's point of view.
> > > > >
> > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > vhost-user follows them by using a shared memory region where their
> > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > as all the information is there to recover them.
> > > > >
> > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > a possibility is to synchronize this memory region after a
> > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > > devices are not going to crash in the software sense, so all use cases
> > > > > remain the same to qemu. And that shared memory information is
> > > > > recoverable after vhost_dev_stop.
> > > > >
> > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > region where it dumps the state, maybe only after the
> > > > > set_state(STATE_PHASE_STOPPED)?
> > > >
> > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > mandatory anyway.
> > > >
> > > > As for the shared memory, the RFC before this series used shared memory,
> > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > provide it to the front-end, is that right?  That could work like this:
> > > >
> > > > On the source side:
> > > >
> > > > S1. SUSPEND goes to virtiofsd
> > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > serializes its state into a newly allocated shared memory area[1]
> > > > S3. virtiofsd responds to SUSPEND
> > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > maybe already closes its reference
> > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > >
> > > > [1] Maybe virtiofsd can correctly determine the serialized state’s size, then
> > > > it can immediately allocate this area and serialize directly into it;
> > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > fundamental problem, but there are limitations around what you can do
> > > > with serde implementations in Rust…
> > > >
> > > > On the destination side:
> > > >
> > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > to SUSPEND
> > > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > previous area, and now uses this one
> > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > >
> > > > Couple of questions:
> > > >
> > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > would imply to deserialize a state, and the state is to be transferred
> > > > through SHM, this is what would need to be done.  So maybe we should
> > > > skip SUSPEND on the destination?
> > > > B. You described that the back-end should supply the SHM, which works
> > > > well on the source.  On the destination, only the front-end knows how
> > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > (In which case the front-end would need to tell the back-end how big the
> > > > state SHM needs to be.)
> > >
> > > How does this work for iterative live migration?
> > >
> >
> > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > Especially if we want to keep that opaqueness.
> >
> > But we will need to communicate with the HW device using shared memory sooner
> > or later for big states.  If we don't transform it in qemu, we will need to do
> > it in the kernel.  Also, the pipe will not support daemon crashes.
> >
> > Again I'm just putting this on the table, just in case it fits better or it is
> > convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> > missed some feedback useful here.  I think the pipe is a better solution in the
> > long run because of the iterative part.
>
> Pipes and shared memory are conceptually equivalent for building
> streaming interfaces. It's just more complex to design a shared memory
> interface and it reinvents what file descriptors already offer.
>
> I have no doubt we could design iterative migration over a shared memory
> interface if we needed to, but I'm not sure why? When you mention
> hardware, are you suggesting defining a standard memory/register layout
> that hardware implements and mapping it to userspace (QEMU)?
Right.
> Is there a
> big advantage to exposing memory versus a file descriptor?
>
For hardware it allows retrieving and setting the device state without
kernel intervention, saving context switches. For virtiofsd
this may not make a lot of sense, but I'm thinking of devices with big
states (virtio gpu, maybe?).
For software it allows the backend to survive a crash, as the old
state can be set directly to a fresh backend instance.
As I said, I'm not saying we must go with shared memory. We can always
add it on top, assuming the cost of maintaining both models. I'm just
trying to make sure we evaluate both.
Thanks!
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-08 21:10           ` Stefan Hajnoczi
@ 2023-05-09  8:53             ` Hanna Czenczek
  2023-05-09 14:53               ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-09  8:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eugenio Perez Martin, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 08.05.23 23:10, Stefan Hajnoczi wrote:
> On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote:
>> On 05.05.23 11:53, Eugenio Perez Martin wrote:
>>> On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:
>>>> On 04.05.23 23:14, Stefan Hajnoczi wrote:
>>>>> On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:
>> [...]
>>
>>>>> All state is lost and the Device Initialization process
>>>>> must be followed to make the device operational again.
>>>>>
>>>>> Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
>>>>>
>>>>> It's messy and not your fault. I think QEMU should solve this by
>>>>> treating stateful devices differently from non-stateful devices. That
>>>>> way existing vhost-user backends continue to work and new stateful
>>>>> devices can also be supported.
>>>> It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for
>>>> stateful devices.  In a previous email, you wrote that these should
>>>> implement SUSPEND+RESUME so qemu can use those instead.  But those are
>>>> separate things, so I assume we just use SET_STATUS 0 when stopping the
>>>> VM because this happens to also stop processing vrings as a side effect?
>>>>
>>>> I.e. I understand “treating stateful devices differently” to mean that
>>>> qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end
>>>> supports it, and stateful back-ends should support it.
>>>>
>>> Honestly I cannot think of any use case where the vhost-user backend
>>> did not ignore set_status(0) and had to retrieve vq states. So maybe
>>> we can totally remove that call from qemu?
>> I don’t know so I can’t really say; but I don’t quite understand why qemu
>> would reset a device at any point but perhaps VM reset (and even then I’d
>> expect the post-reset guest to just reset the device on boot by itself,
>> too).
> DPDK stores the Device Status field value and uses it later:
> https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791
>
> While DPDK performs no immediate action upon SET_STATUS 0, omitting the
> message will change the behavior of other DPDK code like
> virtio_is_ready().
>
> Changing the semantics of the vhost-user protocol in a way that's not
> backwards compatible is something we should avoid unless there is no
> other way.
Well, I have two opinions on this:
First, that in DPDK sounds wrong.  vhost_dev_stop() is called mostly by 
devices that call it when set_status is called on them.  But they don’t 
call it if status == 0, they call it if virtio_device_should_start() 
returns false, which is the case when the VM is stopped.  So basically 
we set a status value on the back-end device that is not the status 
value that is set in qemu. If DPDK makes actual use of this status value
that differs from that of the front-end in qemu, that sounds like it is
probably actually wrong.
Second, it’s entirely possible and probably probable that DPDK doesn’t 
make “actual use of this status value”; the only use it probably has is 
to determine whether the device is supposed to be stopped, which is 
exactly what qemu has tried to convey by setting it to 0.  So it’s
basically two implementations that have agreed on abusing a value to
emulate behavior that isn’t otherwise implemented (SUSPEND), and that
works because all devices are stateless.  Then, I agree, we can’t change 
this until it gets SUSPEND support.
> The fundamental problem is that QEMU's vhost code is designed to reset
> vhost devices because it assumes they are stateless. If an F_SUSPEND
> protocol feature bit is added, then it becomes possible to detect new
> backends and suspend/resume them rather than reset them.
>
> That's the solution that I favor because it's backwards compatible and
> the same model can be applied to stateful vDPA devices in the future.
So basically the idea is the following: vhost_dev_stop() should just 
suspend the device, not reset it.  For devices that don’t support 
SUSPEND, we still need to do something, and just calling GET_VRING_BASE 
on all vrings is deemed inadequate, so they are reset (SET_STATUS 0) as 
a work-around, assuming that stateful devices that care (i.e. implement 
SET_STATUS) will also implement SUSPEND to not have this “legacy reset” 
happen to them.
Sounds good to me.  (If I understood that right. :))
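
To double-check that reading, here is a rough sketch of the decision in C.
It is only an illustration under assumptions: EXAMPLE_PROTOCOL_F_SUSPEND is a
hypothetical bit (no such protocol feature exists yet), and the types and
function names are made up for this mail, they are not qemu's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bits for illustration only; F_SUSPEND does not exist yet. */
#define EXAMPLE_PROTOCOL_F_STATUS   (UINT64_C(1) << 0)
#define EXAMPLE_PROTOCOL_F_SUSPEND  (UINT64_C(1) << 1)

/* What the front-end does to the whole device when the VM is stopped. */
typedef enum ExampleStopAction {
    EXAMPLE_STOP_SUSPEND,       /* suspend; internal state is preserved */
    EXAMPLE_STOP_LEGACY_RESET,  /* SET_STATUS 0 as a work-around */
    EXAMPLE_STOP_VRINGS_ONLY,   /* nothing device-wide, vrings only */
} ExampleStopAction;

/* Made-up stand-in for the negotiated back-end features. */
typedef struct ExampleBackend {
    uint64_t protocol_features;
} ExampleBackend;

static ExampleStopAction example_choose_stop_action(const ExampleBackend *be)
{
    assert(be != NULL);

    /* New back-ends advertise SUSPEND and are never reset on VM stop. */
    if (be->protocol_features & EXAMPLE_PROTOCOL_F_SUSPEND) {
        return EXAMPLE_STOP_SUSPEND;
    }
    /* Back-ends that implement SET_STATUS but not SUSPEND keep getting
     * the "legacy reset" they get today. */
    if (be->protocol_features & EXAMPLE_PROTOCOL_F_STATUS) {
        return EXAMPLE_STOP_LEGACY_RESET;
    }
    /* Everything else only sees the per-vring stop. */
    return EXAMPLE_STOP_VRINGS_ONLY;
}
```

In all three cases the vrings would still be stopped individually via
GET_VRING_BASE, so that part does not change.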
Hanna
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-08 19:31                 ` Eugenio Perez Martin
@ 2023-05-09  8:59                   ` Hanna Czenczek
  0 siblings, 0 replies; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-09  8:59 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, Stefan Hajnoczi,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 08.05.23 21:31, Eugenio Perez Martin wrote:
> On Mon, May 8, 2023 at 7:51 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>> On Mon, May 8, 2023 at 7:00 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>>> On 05.05.23 16:37, Hanna Czenczek wrote:
>>>> On 05.05.23 16:26, Eugenio Perez Martin wrote:
>>>>> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com>
>>>>> wrote:
>>>>>> (By the way, thanks for the explanations :))
>>>>>>
>>>>>> On 05.05.23 11:03, Hanna Czenczek wrote:
>>>>>>> On 04.05.23 23:14, Stefan Hajnoczi wrote:
>>>>>> [...]
>>>>>>
>>>>>>>> I think it's better to change QEMU's vhost code
>>>>>>>> to leave stateful devices suspended (but not reset) across
>>>>>>>> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
>>>>>>>> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
>>>>>>>> this aspect?
>>>>>>> Yes and no; I mean, I haven’t in detail, but I thought this is what’s
>>>>>>> meant by suspending instead of resetting when the VM is stopped.
>>>>>> So, now looking at vhost_dev_stop(), one problem I can see is that
>>>>>> depending on the back-end, different operations it does will do
>>>>>> different things.
>>>>>>
>>>>>> It tries to stop the whole device via vhost_ops->vhost_dev_start(),
>>>>>> which for vDPA will suspend the device, but for vhost-user will
>>>>>> reset it
>>>>>> (if F_STATUS is there).
>>>>>>
>>>>>> It disables all vrings, which doesn’t mean stopping, but may be
>>>>>> necessary, too.  (I haven’t yet really understood the use of disabled
>>>>>> vrings, I heard that virtio-net would have a need for it.)
>>>>>>
>>>>>> It then also stops all vrings, though, so that’s OK.  And because this
>>>>>> will always do GET_VRING_BASE, this is actually always the same
>>>>>> regardless of transport.
>>>>>>
>>>>>> Finally (for this purpose), it resets the device status via
>>>>>> vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and
>>>>>> this is what resets the device there.
>>>>>>
>>>>>>
>>>>>> So vhost-user resets the device in .vhost_dev_start, but vDPA only does
>>>>>> so in .vhost_reset_status.  It would seem better to me if vhost-user
>>>>>> would also reset the device only in .vhost_reset_status, not in
>>>>>> .vhost_dev_start.  .vhost_dev_start seems precisely like the place to
>>>>>> run SUSPEND/RESUME.
>>>>>>
>>>>> I think the same. I just saw it's been proposed at [1].
>>>>>
>>>>>> Another question I have (but this is basically what I wrote in my last
>>>>>> email) is why we even call .vhost_reset_status here.  If the device
>>>>>> and/or all of the vrings are already stopped, why do we need to reset
>>>>>> it?  Naïvely, I had assumed we only really need to reset the device if
>>>>>> the guest changes, so that a new guest driver sees a freshly
>>>>>> initialized
>>>>>> device.
>>>>>>
>>>>> I don't know why we didn't need to call it :). I'm assuming the
>>>>> previous vhost-user net did fine resetting vq indexes, using
>>>>> VHOST_USER_SET_VRING_BASE. But I don't know about more complex
>>>>> devices.
>>>>>
>>>>> The guest can reset the device, or write 0 to the PCI config status,
>>>>> at any time. How does virtiofs handle it, being stateful?
>>>> Honestly a good question because virtiofsd implements neither
>>>> SET_STATUS nor RESET_DEVICE.  I’ll have to investigate that.
>>>>
>>>> I think when the guest resets the device, SET_VRING_BASE always comes
>>>> along some way or another, so that’s how the vrings are reset.  Maybe
>>>> the internal state is reset only following more high-level FUSE
>>>> commands like INIT.
>>> So a meeting and one session of looking-into-the-code later:
>>>
>>> We reset every virt queue on GET_VRING_BASE, which is wrong, but happens
>>> to serve the purpose.  (German is currently on that.)
>>>
>>> In our meeting, German said the reset would occur when the memory
>>> regions are changed, but I can’t see that in the code.
>> That would imply that the status is reset when the guest's memory is
>> added or removed?
No, rather that whenever the memory containing a vring is changed,
or whenever a vring’s address is changed, that vring is reset.
>>> I think it only
>>> happens implicitly through the SET_VRING_BASE call, which resets the
>>> internal avail/used pointers.
>>>
>>> [This doesn’t seem different from libvhost-user, though, which
>>> implements neither SET_STATUS nor RESET_DEVICE, and which pretends to
>>> reset the device on RESET_OWNER, but really doesn’t (its
>>> vu_reset_device_exec() function just disables all vrings, doesn’t reset
>>> or even stop them).]
>>>
>>> Consequently, the internal state is never reset.  It would be cleared on
>>> a FUSE Destroy message, but if you just force-reset the system, the
>>> state remains into the next reboot.  Not even FUSE Init clears it, which
>>> seems weird.  It happens to work because it’s still the same filesystem,
>>> so the existing state fits, but it kind of seems dangerous to keep e.g.
>>> files open.  I don’t think it’s really exploitable because everything
>>> still goes through the guest kernel, but, well.  We should clear the
>>> state on Init, and probably also implement SET_STATUS and clear the
>>> state there.
>>>
>> I see. That's in the line of assuming GET_VRING_BASE is the last
>> message received from qemu.
>>
> Actually, does it prevent device recovery after a failure in
> migration? Is the same state set for the device?
In theory no, because GET_VRING_BASE will return the current index, so 
it’ll be restored by SET_VRING_BASE even if the vring is reset in between.
In practice yes, because the current implementation has GET_VRING_BASE 
reset the vring before fetching the index, so the returned index is 
always 0, and it can’t be restored.  But this also prevents device 
recovery in successful migration.  German has sent a pull request for 
that: https://github.com/rust-vmm/vhost/pull/154
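
To illustrate the ordering problem, here is a small model in C. It is a
sketch under assumptions and not the rust-vmm code: ExampleVring and all
function names are invented, and the vring is reduced to a single index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up model of a vring: just the next index the back-end would process. */
typedef struct ExampleVring {
    uint16_t next_avail;
    int started;
} ExampleVring;

static void example_vring_reset(ExampleVring *vq)
{
    vq->next_avail = 0;
    vq->started = 0;
}

/* Problematic order: the vring is reset before the index is read, so the
 * reply is always 0 and the real position is lost. */
static uint16_t example_get_vring_base_reset_first(ExampleVring *vq)
{
    assert(vq != NULL);
    example_vring_reset(vq);
    return vq->next_avail;
}

/* Order that lets SET_VRING_BASE restore the position later: read the
 * index first, only then stop (here: reset) the vring. */
static uint16_t example_get_vring_base_read_first(ExampleVring *vq)
{
    uint16_t base;

    assert(vq != NULL);
    base = vq->next_avail;
    example_vring_reset(vq);
    return base;
}

/* Restores the position, on the destination or after a failed migration. */
static void example_set_vring_base(ExampleVring *vq, uint16_t base)
{
    assert(vq != NULL);
    vq->next_avail = base;
    vq->started = 1;
}
```

With the second variant, the value returned by GET_VRING_BASE can be passed
to SET_VRING_BASE afterwards and the vring continues where it left off, no
matter whether the vring was reset in between.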
Hanna
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-05-09  6:31                         ` Eugenio Perez Martin
@ 2023-05-09  9:01                           ` Hanna Czenczek
  2023-05-09 15:26                             ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Hanna Czenczek @ 2023-05-09  9:01 UTC (permalink / raw)
  To: Eugenio Perez Martin, Stefan Hajnoczi
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On 09.05.23 08:31, Eugenio Perez Martin wrote:
> On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
[...]
>> VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a
>> specific virtqueue but not the whole device. Unfortunately stopping all
>> virtqueues is not the same as SUSPEND since spontaneous device activity
>> is possible independent of any virtqueue (e.g. virtio-scsi events and
>> maybe virtio-net link status).
>>
>> That's why I think SUSPEND is necessary for a solution that's generic
>> enough to cover all device types.
>>
> I agree.
>
> In particular virtiofsd is already resetting the whole device at
> VHOST_USER_GET_VRING_BASE if I'm not wrong, so that's even more of a
> reason to implement a suspend call.
Oh, no, just the vring in question.  Not the whole device.
In addition, we still need the GET_VRING_BASE call anyway, because, 
well, we want to restore the vring on the destination via SET_VRING_BASE.
Hanna
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-09  8:53             ` Hanna Czenczek
@ 2023-05-09 14:53               ` Stefan Hajnoczi
  0 siblings, 0 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-09 14:53 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Eugenio Perez Martin, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, May 09, 2023 at 10:53:35AM +0200, Hanna Czenczek wrote:
> On 08.05.23 23:10, Stefan Hajnoczi wrote:
> > On Fri, May 05, 2023 at 02:51:55PM +0200, Hanna Czenczek wrote:
> > > On 05.05.23 11:53, Eugenio Perez Martin wrote:
> > > > On Fri, May 5, 2023 at 11:03 AM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > On 04.05.23 23:14, Stefan Hajnoczi wrote:
> > > > > > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > [...]
> > > 
> > > > > > All state is lost and the Device Initialization process
> > > > > > must be followed to make the device operational again.
> > > > > > 
> > > > > > Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
> > > > > > 
> > > > > > It's messy and not your fault. I think QEMU should solve this by
> > > > > > treating stateful devices differently from non-stateful devices. That
> > > > > > way existing vhost-user backends continue to work and new stateful
> > > > > > devices can also be supported.
> > > > > It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for
> > > > > stateful devices.  In a previous email, you wrote that these should
> > > > > implement SUSPEND+RESUME so qemu can use those instead.  But those are
> > > > > separate things, so I assume we just use SET_STATUS 0 when stopping the
> > > > > VM because this happens to also stop processing vrings as a side effect?
> > > > > 
> > > > > I.e. I understand “treating stateful devices differently” to mean that
> > > > > qemu should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end
> > > > > supports it, and stateful back-ends should support it.
> > > > > 
> > > > Honestly I cannot think of any use case where the vhost-user backend
> > > > did not ignore set_status(0) and had to retrieve vq states. So maybe
> > > > we can totally remove that call from qemu?
> > > I don’t know so I can’t really say; but I don’t quite understand why qemu
> > > would reset a device at any point but perhaps VM reset (and even then I’d
> > > expect the post-reset guest to just reset the device on boot by itself,
> > > too).
> > DPDK stores the Device Status field value and uses it later:
> > https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2791
> > 
> > While DPDK performs no immediate action upon SET_STATUS 0, omitting the
> > message will change the behavior of other DPDK code like
> > virtio_is_ready().
> > 
> > Changing the semantics of the vhost-user protocol in a way that's not
> > backwards compatible is something we should avoid unless there is no
> > other way.
> 
> Well, I have two opinions on this:
> 
> First, that in DPDK sounds wrong.  vhost_dev_stop() is called mostly by
> devices that call it when set_status is called on them.  But they don’t call
> it if status == 0, they call it if virtio_device_should_start() returns
> false, which is the case when the VM is stopped.  So basically we set a
> status value on the back-end device that is not the status value that is set
> in qemu. If DPDK makes actual use of this status value that differs from
> that of the front-end in qemu, that sounds like it is probably actually wrong.
> 
> Second, it’s entirely possible and probably probable that DPDK doesn’t make
> “actual use of this status value”; the only use it probably has is to
> determine whether the device is supposed to be stopped, which is exactly
> what qemu has tried to convey by setting it to 0.  So it’s basically two
> implementations that have agreed on abusing a value to emulate behavior that
> isn’t otherwise implemented (SUSPEND), and that works because all devices are
> stateless.  Then, I agree, we can’t change this until it gets SUSPEND
> support.
> 
> > The fundamental problem is that QEMU's vhost code is designed to reset
> > vhost devices because it assumes they are stateless. If an F_SUSPEND
> > protocol feature bit is added, then it becomes possible to detect new
> > backends and suspend/resume them rather than reset them.
> > 
> > That's the solution that I favor because it's backwards compatible and
> > the same model can be applied to stateful vDPA devices in the future.
> 
> So basically the idea is the following: vhost_dev_stop() should just suspend
> the device, not reset it.  For devices that don’t support SUSPEND, we still
> need to do something, and just calling GET_VRING_BASE on all vrings is
> deemed inadequate, so they are reset (SET_STATUS 0) as a work-around,
> assuming that stateful devices that care (i.e. implement SET_STATUS) will
> also implement SUSPEND to not have this “legacy reset” happen to them.
> 
> Sounds good to me.  (If I understood that right. :))
Yes.
Stefan
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-05-09  6:45                       ` Eugenio Perez Martin
@ 2023-05-09 15:09                         ` Stefan Hajnoczi
  2023-05-09 15:35                           ` Eugenio Perez Martin
  0 siblings, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-09 15:09 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote:
> On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > > wrote:
> > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > > wrote:
> > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > wrote:
> > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > > transporting the
> > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > > stream.  To do
> > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > > state to and
> > > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > >
> > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > > believe it
> > > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > > streaming
> > > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > > user
> > > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > > addition to
> > > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > >
> > > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > > > added so
> > > > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > > > support.
> > > > > > > > > > > >
> > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > > negotiate a pipe
> > > > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > > > FD to the
> > > > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > > > the back-end
> > > > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > > > for the
> > > > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > > but maybe
> > > > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > > > write/read
> > > > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > > > simple pipe.
> > > > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > > front-end
> > > > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > > > cases), in which
> > > > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > > > creating a
> > > > > > > > > > > >    pipe.
> > > > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > > > plain
> > > > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > > >
> > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > > through the
> > > > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > > > function
> > > > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > > > pipe) to
> > > > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > > >
> > > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > > migration
> > > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > > the reading
> > > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > > > check for
> > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > > > includes
> > > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > >
> > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > > >
> > > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > > +
> > > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > > vDPA has:
> > > > > > > > > > >
> > > > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > > > anymore
> > > > > > > > > > >     *
> > > > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > > > necessary state
> > > > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > > > specific states) that is
> > > > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > > > change its
> > > > > > > > > > >     * configuration after that point.
> > > > > > > > > > >     */
> > > > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > >
> > > > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > > requests
> > > > > > > > > > >     *
> > > > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > > > restored all the
> > > > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > > > processing the
> > > > > > > > > > >     * virtqueue descriptors.
> > > > > > > > > > >     */
> > > > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > >
> > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > > that the
> > > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > > It's okay
> > > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > > avoid
> > > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > >
> > > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > > VHOST_STOP
> > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > > change
> > > > > > > > > > to SUSPEND.
> > > > > > > > > >
> > > > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > > > and
> > > > > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > > > > words, instead of
> > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > > > send
> > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > > > command.
> > > > > > > > > >
> > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > > > it
> > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > > >
> > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > > > ok.
> > > > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > > > number
> > > > > > > > > > of states is high.
> > > > > > > > > Hi Eugenio,
> > > > > > > > > Another question about vDPA suspend/resume:
> > > > > > > > >
> > > > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> > > > > > > > >    {
> > > > > > > > >        int i;
> > > > > > > > >
> > > > > > > > >        /* should only be called after backend is connected */
> > > > > > > > >        assert(hdev->vhost_ops);
> > > > > > > > >        event_notifier_test_and_clear(
> > > > > > > > > >            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > > >
> > > > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > > >
> > > > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > > > >            ^^^ SUSPEND ^^^
> > > > > > > > >        }
> > > > > > > > >        if (vrings) {
> > > > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > > > >        }
> > > > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > > > >                                 vdev,
> > > > > > > > >                                 hdev->vqs + i,
> > > > > > > > >                                 hdev->vq_index + i);
> > > > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > > > >        }
> > > > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > > > >            ^^^ reset device^^^
> > > > > > > > >
> > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > > > ->
> > > > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > > >
> > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > > >
> > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > > > model too.
> > > > > > > >
> > > > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > > > cross-backends migrations, etc.
> > > > > > > >
> > > > > > > > Does that answer your question?
> > > > > > > I think you're confirming that changes would be necessary in order for
> > > > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > > >
> > > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > > where state can be routed through classical emulated devices. This is
> > > > > > how vhost-kernel and vhost-user do classically. And it allows
> > > > > > cross-backend, to not modify qemu migration state, etc.
> > > > > >
> > > > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > > > suspend and not before, requires changes in vhost protocol, as
> > > > > > discussed previously.
> > > > > >
> > > > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > > > stateful
> > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > > > correct?
> > > > > > > > >
> > > > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > > > properly in the destination has not been merged.
> > > > > > > I'm not sure what you mean by elsewhere?
> > > > > > >
> > > > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > > > ioctls, but mostly in qemu.
> > > > > >
> > > > > > If you meant stateful as "it must have a state blob that it must be
> > > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > > state blob about the same time as vq indexes. But yes, changes (at
> > > > > > least a new ioctl) is needed for that.
> > > > > >
> > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > > >
> > > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > > the device state can be saved before the device is reset.
> > > > > > >
> > > > > > > Does that sound right?
> > > > > > >
> > > > > > The split between suspend and reset was added recently for that very
> > > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > > cleanup. Especially if we have already set the state is small enough
> > > > > > to not needing iterative migration from virtiofsd point of view.
> > > > > >
> > > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > > vhost-user follows them by using a shared memory region where their
> > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > > as all the information is there to recover them.
> > > > > >
> > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > > a possibility is to synchronize this memory region after a
> > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > > > devices are not going to crash in the software sense, so all use cases
> > > > > > remain the same to qemu. And that shared memory information is
> > > > > > recoverable after vhost_dev_stop.
> > > > > >
> > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > > region where it dumps the state, maybe only after the
> > > > > > set_state(STATE_PHASE_STOPPED)?
> > > > >
> > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > > mandatory anyway.
> > > > >
> > > > > As for the shared memory, the RFC before this series used shared memory,
> > > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > > provide it to the front-end, is that right?  That could work like this:
> > > > >
> > > > > On the source side:
> > > > >
> > > > > S1. SUSPEND goes to virtiofsd
> > > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > > serializes its state into a newly allocated shared memory area[1]
> > > > > S3. virtiofsd responds to SUSPEND
> > > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > > maybe already closes its reference
> > > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > > >
> > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > > > it can immediately allocate this area and serialize directly into it;
> > > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > > fundamental problem, but there are limitations around what you can do
> > > > > with serde implementations in Rust…
> > > > >
> > > > > On the destination side:
> > > > >
> > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > > to SUSPEND
> > > > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > > previous area, and now uses this one
> > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > > >
> > > > > Couple of questions:
> > > > >
> > > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > > would imply to deserialize a state, and the state is to be transferred
> > > > > through SHM, this is what would need to be done.  So maybe we should
> > > > > skip SUSPEND on the destination?
> > > > > B. You described that the back-end should supply the SHM, which works
> > > > > well on the source.  On the destination, only the front-end knows how
> > > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > > (In which case the front-end would need to tell the back-end how big the
> > > > > state SHM needs to be.)
> > > >
> > > > How does this work for iterative live migration?
> > > >
> > >
> > > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > > Especially if we want to keep that opaqueness.
> > >
> > > But  we will need to communicate with the HW device using shared memory sooner
> > > or later for big states.  If we don't transform it in qemu, we will need to do
> > > it in the kernel.  Also, the pipe will not support daemon crashes.
> > >
> > > Again I'm just putting this on the table, just in case it fits better or it is
> > > convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> > > missed some feedback useful here.  I think the pipe is a better solution in the
> > > long run because of the iterative part.
> >
> > Pipes and shared memory are conceptually equivalent for building
> > streaming interfaces. It's just more complex to design a shared memory
> > interface and it reinvents what file descriptors already offer.
> >
> > I have no doubt we could design iterative migration over a shared memory
> > interface if we needed to, but I'm not sure why? When you mention
> > hardware, are you suggesting defining a standard memory/register layout
> > that hardware implements and mapping it to userspace (QEMU)?
> 
> Right.
> 
> > Is there a
> > big advantage to exposing memory versus a file descriptor?
> >
> 
> For hardware it allows to retrieve and set the device state without
> intervention of the kernel, saving context switches. For virtiofsd
> this may not make a lot of sense, but I'm thinking on devices with big
> states (virtio gpu, maybe?).
A streaming interface implemented using shared memory involves consuming
chunks of bytes. Each time data has been read, an action must be
performed to notify the device and receive a notification when more data
becomes available.
That notification involves the kernel (e.g. an eventfd that is triggered
by a hardware interrupt) and a read(2) syscall to reset the eventfd.
Unless userspace disables notifications and polls (busy waits) the
hardware registers, there is still going to be kernel involvement and a
context switch. For this reason, I think that shared memory vs pipes
will not be significantly different.
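To make the notification cost concrete, here is a minimal sketch of the
eventfd handshake described above. It assumes Linux and uses a plain
eventfd in place of the interrupt-driven notification; the helper names
(notify_chunk_ready, consume_notifications, demo_round_trip) are made up
for illustration and are not QEMU or vhost APIs:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Producer side (hypothetical): signal that one more chunk is available. */
static int notify_chunk_ready(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side (hypothetical): a single read(2) returns the number of
 * pending notifications and resets the eventfd counter to zero.
 * Returns 0 if nothing is pending, -1 on any other error.
 */
static int64_t consume_notifications(int efd)
{
    uint64_t count;
    ssize_t n = read(efd, &count, sizeof(count));

    if (n == (ssize_t)sizeof(count)) {
        return (int64_t)count;
    }
    if (n < 0 && errno == EAGAIN) {
        return 0;
    }
    return -1;
}

/* Signal `chunks` times, then drain; the second drain must see nothing. */
static int64_t demo_round_trip(unsigned chunks)
{
    int efd = eventfd(0, EFD_NONBLOCK);
    int64_t pending, again;
    unsigned i;

    if (efd < 0) {
        return -1;
    }
    for (i = 0; i < chunks; i++) {
        if (notify_chunk_ready(efd) < 0) {
            close(efd);
            return -1;
        }
    }
    pending = consume_notifications(efd);
    again = consume_notifications(efd);
    close(efd);

    return again == 0 ? pending : -1;
}
```

Every call to consume_notifications() is one syscall, which is the kernel
involvement I mean: unless the consumer busy-polls, a shared memory
stream pays it just like a read(2) on a pipe does.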
> For software it allows the backend to survive a crash, as the old
> state can be set directly to a fresh backend instance.
Can you explain by describing the steps involved? Are you sure it can
only be done with shared memory and not pipes?
Stefan
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-05-09  9:01                           ` Hanna Czenczek
@ 2023-05-09 15:26                             ` Eugenio Perez Martin
  0 siblings, 0 replies; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-09 15:26 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, May 9, 2023 at 11:01 AM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 09.05.23 08:31, Eugenio Perez Martin wrote:
> > On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> [...]
>
> >> VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a
> >> specific virtqueue but not the whole device. Unfortunately stopping all
> >> virtqueues is not the same as SUSPEND since spontaneous device activity
> >> is possible independent of any virtqueue (e.g. virtio-scsi events and
> >> maybe virtio-net link status).
> >>
> >> That's why I think SUSPEND is necessary for a solution that's generic
> >> enough to cover all device types.
> >>
> > I agree.
> >
> > In particular virtiofsd is already resetting all the device at
> > VHOST_USER_GET_VRING_BASE if I'm not wrong, so that's even more of a
> > reason to implement suspend call.
>
> Oh, no, just the vring in question.  Not the whole device.
>
> In addition, we still need the GET_VRING_BASE call anyway, because,
> well, we want to restore the vring on the destination via SET_VRING_BASE.
>
Ok, that makes sense, sorry for the confusion!
Thanks!
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05 14:26         ` Eugenio Perez Martin
  2023-05-05 14:37           ` Hanna Czenczek
@ 2023-05-09 15:30           ` Stefan Hajnoczi
  2023-05-09 15:43             ` Eugenio Perez Martin
  1 sibling, 1 reply; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-09 15:30 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Hanna Czenczek, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Fri, May 05, 2023 at 04:26:08PM +0200, Eugenio Perez Martin wrote:
> On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote:
> >
> > (By the way, thanks for the explanations :))
> >
> > On 05.05.23 11:03, Hanna Czenczek wrote:
> > > On 04.05.23 23:14, Stefan Hajnoczi wrote:
> >
> > [...]
> >
> > >> I think it's better to change QEMU's vhost code
> > >> to leave stateful devices suspended (but not reset) across
> > >> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
> > >> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
> > >> this aspect?
> > >
> > > Yes and no; I mean, I haven’t in detail, but I thought this is what’s
> > > meant by suspending instead of resetting when the VM is stopped.
> >
> > So, now looking at vhost_dev_stop(), one problem I can see is that
> > depending on the back-end, different operations it does will do
> > different things.
> >
> > It tries to stop the whole device via vhost_ops->vhost_dev_start(),
> > which for vDPA will suspend the device, but for vhost-user will reset it
> > (if F_STATUS is there).
> >
> > It disables all vrings, which doesn’t mean stopping, but may be
> > necessary, too.  (I haven’t yet really understood the use of disabled
> > vrings, I heard that virtio-net would have a need for it.)
> >
> > It then also stops all vrings, though, so that’s OK.  And because this
> > will always do GET_VRING_BASE, this is actually always the same
> > regardless of transport.
> >
> > Finally (for this purpose), it resets the device status via
> > vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and
> > this is what resets the device there.
> >
> >
> > So vhost-user resets the device in .vhost_dev_start, but vDPA only does
> > so in .vhost_reset_status.  It would seem better to me if vhost-user
> > would also reset the device only in .vhost_reset_status, not in
> > .vhost_dev_start.  .vhost_dev_start seems precisely like the place to
> > run SUSPEND/RESUME.
> >
> 
> I think the same. I just saw It's been proposed at [1].
> 
> > Another question I have (but this is basically what I wrote in my last
> > email) is why we even call .vhost_reset_status here.  If the device
> > and/or all of the vrings are already stopped, why do we need to reset
> > it?  Naïvely, I had assumed we only really need to reset the device if
> > the guest changes, so that a new guest driver sees a freshly initialized
> > device.
> >
> 
> I don't know why we didn't need to call it :). I'm assuming the
> previous vhost-user net did fine resetting vq indexes, using
> VHOST_USER_SET_VRING_BASE. But I don't know about more complex
> devices.
It was added so DPDK can batch rx virtqueue RSS updates:
commit 923b8921d210763359e96246a58658ac0db6c645
Author: Yajun Wu <yajunw@nvidia.com>
Date:   Mon Oct 17 14:44:52 2022 +0800
    vhost-user: Support vhost_dev_start
    
    The motivation of adding vhost-user vhost_dev_start support is to
    improve backend configuration speed and reduce live migration VM
    downtime.
    
    Today VQ configuration is issued one by one. For virtio net with
    multi-queue support, backend needs to update RSS (Receive side
    scaling) on every rx queue enable. Updating RSS is time-consuming
    (typical time like 7ms).
    
    Implement already defined vhost status and message in the vhost
    specification [1].
    (a) VHOST_USER_PROTOCOL_F_STATUS
    (b) VHOST_USER_SET_STATUS
    (c) VHOST_USER_GET_STATUS
    
    Send message VHOST_USER_SET_STATUS with VIRTIO_CONFIG_S_DRIVER_OK for
    device start and reset(0) for device stop.
    
    On reception of the DRIVER_OK message, backend can apply the needed setting
    only once (instead of incremental) and also utilize parallelism on enabling
    queues.
    
    This improves QEMU's live migration downtime with vhost user backend
    implementation by great margin, specially for the large number of VQs of 64
    from 800 msec to 250 msec.
    
    [1] https://qemu-project.gitlab.io/qemu/interop/vhost-user.html
    
    Signed-off-by: Yajun Wu <yajunw@nvidia.com>
    Acked-by: Parav Pandit <parav@nvidia.com>
    Message-Id: <20221017064452.1226514-3-yajunw@nvidia.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
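As a rough sketch of the behaviour that commit message describes (not the
actual QEMU code), the status value sent on start and stop can be modelled
as below. The bit values come from the VIRTIO specification; the helper
name status_after_dev_start is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device status bits as defined by the VIRTIO specification. */
#define VIRTIO_CONFIG_S_ACKNOWLEDGE 1
#define VIRTIO_CONFIG_S_DRIVER      2
#define VIRTIO_CONFIG_S_DRIVER_OK   4

/*
 * Hypothetical helper: the status a front-end would send via
 * VHOST_USER_SET_STATUS according to the commit message above.
 * Start adds DRIVER_OK; stop sends 0, which is a device reset.
 */
static uint8_t status_after_dev_start(uint8_t current, bool started)
{
    if (started) {
        return (uint8_t)(current | VIRTIO_CONFIG_S_DRIVER_OK);
    }
    return 0;
}
```

The stop branch is the relevant part for this thread: writing 0 resets
the device, so any internal back-end state is gone before it could be
saved.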
> 
> Thanks!
> 
> [1] https://lore.kernel.org/qemu-devel/20230501230409.274178-1-stefanha@redhat.com/
> 
^ permalink raw reply	[flat|nested] 93+ messages in thread
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-05-09 15:09                         ` Stefan Hajnoczi
@ 2023-05-09 15:35                           ` Eugenio Perez Martin
  2023-05-09 17:33                             ` Stefan Hajnoczi
  0 siblings, 1 reply; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-09 15:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, May 9, 2023 at 5:09 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote:
> > On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> > > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > > > wrote:
> > > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > > > wrote:
> > > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > wrote:
> > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > > > wrote:
> > > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > > > transporting the
> > > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > > > stream.  To do
> > > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > > > state to and
> > > > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > > > believe it
> > > > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > > > streaming
> > > > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > > > user
> > > > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > > > addition to
> > > > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > > > > added so
> > > > > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > > > > support.
> > > > > > > > > > > > >
> > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > > > negotiate a pipe
> > > > > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > > > > FD to the
> > > > > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > > > > the back-end
> > > > > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > > > > for the
> > > > > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > > > but maybe
> > > > > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > > > > write/read
> > > > > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > > > > simple pipe.
> > > > > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > > > front-end
> > > > > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > > > > cases), in which
> > > > > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > > > > creating a
> > > > > > > > > > > > >    pipe.
> > > > > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > > > > plain
> > > > > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > > > through the
> > > > > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > > > > function
> > > > > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > > > > pipe) to
> > > > > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > > > migration
> > > > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > > > the reading
> > > > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > > > > check for
> > > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > > > > includes
> > > > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > > > > >   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > > > >
> > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > > > vDPA has:
> > > > > > > > > > > >
> > > > > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > > > > anymore
> > > > > > > > > > > >     *
> > > > > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > > > > necessary state
> > > > > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > > > > specific states) that is
> > > > > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > > > > change its
> > > > > > > > > > > >     * configuration after that point.
> > > > > > > > > > > >     */
> > > > > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > > >
> > > > > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > > > requests
> > > > > > > > > > > >     *
> > > > > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > > > > restored all the
> > > > > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > > > > processing the
> > > > > > > > > > > >     * virtqueue descriptors.
> > > > > > > > > > > >     */
> > > > > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > > >
> > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > > > that the
> > > > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > > > It's okay
> > > > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > > > avoid
> > > > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > > >
> > > > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > > > VHOST_STOP
> > > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > > > change
> > > > > > > > > > > to SUSPEND.
> > > > > > > > > > >
> > > > > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > > > > and
> > > > > > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > > > > > words, instead of
> > > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > > > > send
> > > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > > > > command.
> > > > > > > > > > >
> > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > > > > it
> > > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > > > >
> > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > > > > ok.
> > > > > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > > > > number
> > > > > > > > > > > of states is high.
> > > > > > > > > > Hi Eugenio,
> > > > > > > > > > Another question about vDPA suspend/resume:
> > > > > > > > > >
> > > > > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > > > > bool vrings)
> > > > > > > > > >    {
> > > > > > > > > >        int i;
> > > > > > > > > >
> > > > > > > > > >        /* should only be called after backend is connected */
> > > > > > > > > >        assert(hdev->vhost_ops);
> > > > > > > > > >        event_notifier_test_and_clear(
> > > > > > > > > >            &hdev-
> > > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > > > >
> > > > > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > > > >
> > > > > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > > > > >            ^^^ SUSPEND ^^^
> > > > > > > > > >        }
> > > > > > > > > >        if (vrings) {
> > > > > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > > > > >        }
> > > > > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > > > > >                                 vdev,
> > > > > > > > > >                                 hdev->vqs + i,
> > > > > > > > > >                                 hdev->vq_index + i);
> > > > > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > > > > >        }
> > > > > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > > > > >            ^^^ reset device^^^
> > > > > > > > > >
> > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > > > > ->
> > > > > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > > > >
> > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > > > >
> > > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > > > > model too.
> > > > > > > > >
> > > > > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > > > > cross-backends migrations, etc.
> > > > > > > > >
> > > > > > > > > Does that answer your question?
> > > > > > > > I think you're confirming that changes would be necessary in order for
> > > > > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > > > >
> > > > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > > > where state can be routed through classical emulated devices. This is
> > > > > > > how vhost-kernel and vhost-user do classically. And it allows
> > > > > > > cross-backend, to not modify qemu migration state, etc.
> > > > > > >
> > > > > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > > > > suspend and not before, requires changes in vhost protocol, as
> > > > > > > discussed previously.
> > > > > > >
> > > > > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > > > > stateful
> > > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > > > > correct?
> > > > > > > > > >
> > > > > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > > > > properly in the destination has not been merged.
> > > > > > > > I'm not sure what you mean by elsewhere?
> > > > > > > >
> > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > > > > ioctls, but mostly in qemu.
> > > > > > >
> > > > > > > If you meant stateful as "it must have a state blob that it must be
> > > > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > > > state blob about the same time as vq indexes. But yes, changes (at
> > > > > > > > least a new ioctl) are needed for that.
> > > > > > >
> > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > > > >
> > > > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > > > the device state can be saved before the device is reset.
> > > > > > > >
> > > > > > > > Does that sound right?
> > > > > > > >
> > > > > > > The split between suspend and reset was added recently for that very
> > > > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > > > > cleanup. Especially if we have already settled that the state is small
> > > > > > > > enough to not need iterative migration from virtiofsd's point of view.
> > > > > > >
> > > > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > > > vhost-user follows them by using a shared memory region where their
> > > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > > > as all the information is there to recover them.
> > > > > > >
> > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > > > a possibility is to synchronize this memory region after a
> > > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > > > > devices are not going to crash in the software sense, so all use cases
> > > > > > > remain the same to qemu. And that shared memory information is
> > > > > > > recoverable after vhost_dev_stop.
> > > > > > >
> > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > > > region where it dumps the state, maybe only after the
> > > > > > > set_state(STATE_PHASE_STOPPED)?
> > > > > >
> > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > > > mandatory anyway.
> > > > > >
> > > > > > As for the shared memory, the RFC before this series used shared memory,
> > > > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > > > provide it to the front-end, is that right?  That could work like this:
> > > > > >
> > > > > > On the source side:
> > > > > >
> > > > > > S1. SUSPEND goes to virtiofsd
> > > > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > > > serializes its state into a newly allocated shared memory area[1]
> > > > > > S3. virtiofsd responds to SUSPEND
> > > > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > > > maybe already closes its reference
> > > > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > > > >
> > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > > > > it can immediately allocate this area and serialize directly into it;
> > > > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > > > fundamental problem, but there are limitations around what you can do
> > > > > > with serde implementations in Rust…
> > > > > >
> > > > > > On the destination side:
> > > > > >
> > > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > > > to SUSPEND
> > > > > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > > > previous area, and now uses this one
> > > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > > > >
> > > > > > Couple of questions:
> > > > > >
> > > > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > > > would imply to deserialize a state, and the state is to be transferred
> > > > > > through SHM, this is what would need to be done.  So maybe we should
> > > > > > skip SUSPEND on the destination?
> > > > > > B. You described that the back-end should supply the SHM, which works
> > > > > > well on the source.  On the destination, only the front-end knows how
> > > > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > > > (In which case the front-end would need to tell the back-end how big the
> > > > > > state SHM needs to be.)
> > > > >
> > > > > How does this work for iterative live migration?
> > > > >
> > > >
> > > > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > > > Especially if we want to keep that opaqueness.
> > > >
> > > > But we will need to communicate with the HW device using shared memory sooner
> > > > or later for big states.  If we don't transform it in qemu, we will need to do
> > > > it in the kernel.  Also, the pipe will not support daemon crashes.
> > > >
> > > > Again I'm just putting this on the table, just in case it fits better or it is
> > > > convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> > > > missed some feedback useful here.  I think the pipe is a better solution in the
> > > > long run because of the iterative part.
> > >
> > > Pipes and shared memory are conceptually equivalent for building
> > > streaming interfaces. It's just more complex to design a shared memory
> > > interface and it reinvents what file descriptors already offer.
> > >
> > > I have no doubt we could design iterative migration over a shared memory
> > > interface if we needed to, but I'm not sure why? When you mention
> > > hardware, are you suggesting defining a standard memory/register layout
> > > that hardware implements and mapping it to userspace (QEMU)?
> >
> > Right.
> >
> > > Is there a
> > > big advantage to exposing memory versus a file descriptor?
> > >
> >
> > For hardware it allows retrieving and setting the device state without
> > intervention of the kernel, saving context switches. For virtiofsd
> > this may not make a lot of sense, but I'm thinking of devices with big
> > states (virtio gpu, maybe?).
>
> A streaming interface implemented using shared memory involves consuming
> chunks of bytes. Each time data has been read, an action must be
> performed to notify the device and receive a notification when more data
> becomes available.
>
> That notification involves the kernel (e.g. an eventfd that is triggered
> by a hardware interrupt) and a read(2) syscall to reset the eventfd.
>
> Unless userspace disables notifications and polls (busy waits) the
> hardware registers, there is still going to be kernel involvement and a
> context switch. For this reason, I think that shared memory vs pipes
> will not be significantly different.
>
Yes, for big states that's right. I was thinking of not-so-big states,
where all of the state can be requested in one shot, but that may be
problematic with iterative migration for sure. In that regard, pipes
are way better.
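
To make the pipe variant concrete, here is a minimal sketch in C of the
pattern discussed in this thread: the sender writes its state into a pipe
and closes its end, and the receiver reads until EOF. All names
(write_all, read_until_eof, demo_transfer) are hypothetical and not part
of QEMU or virtiofsd. The sketch also assumes the state fits into the
pipe buffer, because both ends run in one process here; a real
front-end/back-end pair would read and write concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 on success. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read until EOF or until the buffer is full. Returns bytes read or -1. */
static ssize_t read_until_eof(int fd, void *buf, size_t cap)
{
    char *p = buf;
    size_t total = 0;

    while (total < cap) {
        ssize_t n = read(fd, p + total, cap - total);
        if (n < 0) {
            return -1;
        }
        if (n == 0) {
            break; /* EOF: the writer closed its end */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}

/*
 * Hypothetical demo: push `state` through a pipe and read it back.
 * Only valid for small states, since nobody drains the pipe while
 * we are writing.
 */
static ssize_t demo_transfer(const char *state, char *out, size_t cap)
{
    int fds[2];
    ssize_t n;

    if (pipe(fds) < 0) {
        return -1;
    }
    if (write_all(fds[1], state, strlen(state)) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]); /* closing the write end is what signals EOF */
    n = read_until_eof(fds[0], out, cap);
    close(fds[0]);
    return n;
}
```

EOF only marks the end of the stream; it says nothing about whether the
other side processed the data successfully, which is why a separate
check step is still needed after the transfer.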
> > For software it allows the backend to survive a crash, as the old
> > state can be set directly to a fresh backend instance.
>
> Can you explain by describing the steps involved?
It's how vhost-user inflight I/O tracking works [1]: QEMU and the
backend share a memory region where the backend continuously dumps its
state. In the event of a crash, this state can be handed directly to a
new vhost-user backend.
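
A rough sketch of that idea in C, under loud assumptions: the state
layout (struct demo_state) and all function names are made up for
illustration, and the region is an anonymous shared mapping inherited
across fork() rather than an fd passed over the vhost-user socket as in
the real inflight mechanism. The point is only that what the backend
wrote before dying is still readable by the front-end afterwards.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical state layout; not the real inflight region format. */
struct demo_state {
    uint32_t generation;   /* bumped once per backend run */
    uint32_t last_request; /* last request the backend recorded */
};

/* Create a zero-filled region shared between front-end and backend. */
static struct demo_state *demo_region_create(void)
{
    void *p = mmap(NULL, sizeof(struct demo_state),
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Fork a stand-in "backend" that records its progress in the shared
 * region and then exits abruptly without any cleanup, standing in for
 * a crash. Returns 0 once the backend process is gone.
 */
static int demo_run_backend_until_crash(struct demo_state *st,
                                        uint32_t requests)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        for (uint32_t i = 1; i <= requests; i++) {
            st->last_request = i; /* dump state continuously */
        }
        st->generation++;
        _exit(1); /* no orderly shutdown, no state hand-over */
    }
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return 0;
}
```

After demo_run_backend_until_crash() returns, the front-end can read
the region and pass it to a fresh backend instance.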
> Are you sure it can only be done with shared memory and not pipes?
>
Sorry for the confusion, but I never intended to say that :).
[1] https://qemu.readthedocs.io/en/latest/interop/vhost-user.html#inflight-i-o-tracking
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-05  9:03     ` Hanna Czenczek
  2023-05-05  9:51       ` Hanna Czenczek
  2023-05-05  9:53       ` Eugenio Perez Martin
@ 2023-05-09 15:41       ` Stefan Hajnoczi
  2 siblings, 0 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-09 15:41 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: Stefan Hajnoczi, qemu-devel, virtio-fs, German Maglione,
	Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella, Eugenio Perez Martin
On Fri, May 05, 2023 at 11:03:16AM +0200, Hanna Czenczek wrote:
> On 04.05.23 23:14, Stefan Hajnoczi wrote:
> > On Thu, 4 May 2023 at 13:39, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > On 11.04.23 17:05, Hanna Czenczek wrote:
> > > 
> > > [...]
> > > 
> > > > Hanna Czenczek (4):
> > > >     vhost: Re-enable vrings after setting features
> > > >     vhost-user: Interface for migration state transfer
> > > >     vhost: Add high-level state save/load functions
> > > >     vhost-user-fs: Implement internal migration
> > > I’m trying to write v2, and my intention was to keep the code
> > > conceptually largely the same, but include in the documentation change
> > > thoughts and notes on how this interface is to be used in the future,
> > > when e.g. vDPA “extensions” come over to vhost-user.  My plan was to,
> > > based on that documentation, discuss further.
> > > 
> > > But now I’m struggling to even write that documentation because it’s not
> > > clear to me what exactly the result of the discussion was, so I need to
> > > stop even before that.
> > > 
> > > So as far as I understand, we need/want SUSPEND/RESUME for two reasons:
> > > 1. As a signal to the back-end when virt queues are no longer to be
> > > processed, so that it is clear that it will not do that when asked for
> > > migration state.
> > > 2. Stateful devices that support SET_STATUS receive a status of 0 when
> > > the VM is stopped, which supposedly resets the internal state. While
> > > suspended, device state is frozen, so as far as I understand, SUSPEND
> > > before SET_STATUS would have the status change be deferred until RESUME.
> > I'm not sure about SUSPEND -> SET_STATUS 0 -> RESUME. I guess the
> > device would be reset right away and it would either remain suspended
> > or be resumed as part of reset :).
> > 
> > Unfortunately the concepts of SUSPEND/RESUME and the Device Status
> > Field are orthogonal and there is no spec that explains how they
> > interact.
> 
> Ah, OK.  So I guess it’s up to the implementation to decide whether the
> virtio device status counts as part of the “configuration” that “[it] must
> not change”.
> 
> > > I don’t want to hang myself up on 2 because it doesn’t really seem
> > > important to this series, but: Why does a status of 0 reset the internal
> > > state?  [Note: This is all virtio_reset() seems to do, set the status to
> > > 0.]  The vhost-user specification only points to the virtio
> > > specification, which doesn’t say anything to that effect. Instead, an
> > > explicit device reset is mentioned, which would be
> > > VHOST_USER_RESET_DEVICE, i.e. something completely different. Because
> > > RESET_DEVICE directly contradicts SUSPEND’s description, I would like to
> > > think that invoking RESET_DEVICE on a SUSPEND-ed device is just invalid.
> > The vhost-user protocol didn't have the concept of the VIRTIO Device
> > Status Field until SET_STATUS was added.
> > 
> > In order to participate in the VIRTIO device lifecycle to some extent,
> > the pre-SET_STATUS vhost-user protocol relied on vhost-user-specific
> > messages like RESET_DEVICE.
> > 
> > At the VIRTIO level, devices are reset by setting the Device Status
> > Field to 0.
> 
> (I didn’t find this in the virtio specification until today, turns out it’s
> under 4.1.4.3 “Common configuration structure layout”, not under 2.1 “Device
> Status Field”, where I was looking.)
> 
> > All state is lost and the Device Initialization process
> > must be followed to make the device operational again.
> > 
> > Existing vhost-user backends don't implement SET_STATUS 0 (it's new).
> > 
> > It's messy and not your fault. I think QEMU should solve this by
> > treating stateful devices differently from non-stateful devices. That
> > way existing vhost-user backends continue to work and new stateful
> > devices can also be supported.
> 
> It’s my understanding that SET_STATUS 0/RESET_DEVICE is problematic for
> stateful devices.  In a previous email, you wrote that these should
> implement SUSPEND+RESUME so qemu can use those instead.  But those are
> separate things, so I assume we just use SET_STATUS 0 when stopping the VM
> because this happens to also stop processing vrings as a side effect?
SET_STATUS 0 doesn't do anything in most existing vhost-user backends
and QEMU's vhost code doesn't rely on it doing anything. It was added as
an optimization hint for DPDK's vhost-user-net implementation without
noticing that it breaks stateful devices (see commit 923b8921d210).
> 
> I.e. I understand “treating stateful devices differently” to mean that qemu
> should use SUSPEND+RESUME instead of SET_STATUS 0 when the back-end supports
> it, and stateful back-ends should support it.
> 
> > > Is it that a status 0 won’t explicitly reset the internal state, but
> > > because it does mean that the driver is unbound, the state should
> > > implicitly be reset?
> > I think the fundamental problem is that transports like virtio-pci put
> > registers back in their initialization state upon reset, so internal
> > state is lost.
> > 
> > The VIRTIO spec does not go into detail on device state across reset
> > though, so I don't think much more can be said about the semantics.
> > 
> > > Anyway.  1 seems to be the relevant point for migration.  As far as I
> > > understand, currently, a vhost-user back-end has no way of knowing when
> > > to stop processing virt queues.  Basically, rings can only transition
> > > from stopped to started, but not vice versa.  The vhost-user
> > > specification has this bit: “Once the source has finished migration,
> > > rings will be stopped by the source. No further update must be done
> > > before rings are restarted.”  It just doesn’t say how the front-end lets
> > > the back-end know that the rings are (to be) stopped.  So this seems
> > > like a pre-existing problem for stateless migration.  Unless this is
> > > communicated precisely by setting the device status to 0?
> > No, my understanding is different. The vhost-user spec says the
> > backend must "stop [the] ring upon receiving
> > ``VHOST_USER_GET_VRING_BASE``".
> 
> Yes, I missed that part!
> 
> > The "Ring states" section goes into
> > more detail and adds the concept of enabled/disabled too.
> > 
> > SUSPEND is stronger than GET_VRING_BASE though. GET_VRING_BASE only
> > applies to a single virtqueue, whereas SUSPEND acts upon the entire
> > device, including non-virtqueue aspects like Configuration Change
> > Notifications (VHOST_USER_BACKEND_CONFIG_CHANGE_MSG).
> > 
> > You can approximate SUSPEND today by sending GET_VRING_BASE for all
> > virtqueues. I think in practice this does fully stop the device even
> > if the spec doesn't require it.
> > 
> > If we want minimal changes to vhost-user, then we could rely on
> > GET_VRING_BASE to suspend and SET_VRING_ENABLE to resume. And
> > SET_STATUS 0 must not be sent so that the device's state is not lost.
> 
> So you mean that we’d use SUSPEND instead of SET_STATUS 0, but because we
> have no SUSPEND, we’d ensure that GET_VRING_BASE is/was called on all
> vrings?
Yes. I prefer adding SUSPEND+RESUME to vhost-user, but if we were
limited to today's vhost-user commands, then relying on GET_VRING_BASE
and skipping SET_STATUS calls for vhost_dev_start/stop() would come
close to achieving the behavior needed by stateful backends.
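
As a toy model of that fallback (not QEMU code; every name below is
hypothetical), the stop path would fetch the vring base of every
virtqueue, which stops each ring, and would skip the status reset for
stateful backends so their internal state survives:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_VQS 8

/* Toy model of a backend; a real one lives in another process. */
struct demo_vq {
    bool started;
    uint16_t last_avail_idx;
};

struct demo_backend {
    struct demo_vq vqs[DEMO_MAX_VQS];
    size_t nvqs;
    bool state_lost; /* set once the backend has been reset */
};

/* Models GET_VRING_BASE: stops the ring and returns its base index. */
static uint16_t demo_get_vring_base(struct demo_backend *b, size_t idx)
{
    assert(idx < b->nvqs);
    b->vqs[idx].started = false;
    return b->vqs[idx].last_avail_idx;
}

/* Models SET_STATUS 0 on a backend that treats it as a reset. */
static void demo_set_status_zero(struct demo_backend *b)
{
    b->state_lost = true;
}

/*
 * Stop path: stopping every ring approximates SUSPEND. For a stateful
 * backend the reset is skipped so that internal state is preserved.
 */
static void demo_dev_stop(struct demo_backend *b, bool stateful,
                          uint16_t *bases)
{
    for (size_t i = 0; i < b->nvqs; i++) {
        bases[i] = demo_get_vring_base(b, i);
    }
    if (!stateful) {
        demo_set_status_zero(b);
    }
}

/* True if no ring is being processed anymore. */
static bool demo_all_stopped(const struct demo_backend *b)
{
    for (size_t i = 0; i < b->nvqs; i++) {
        if (b->vqs[i].started) {
            return false;
        }
    }
    return true;
}
```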
Stefan
* Re: [PATCH 0/4] vhost-user-fs: Internal migration
  2023-05-09 15:30           ` Stefan Hajnoczi
@ 2023-05-09 15:43             ` Eugenio Perez Martin
  0 siblings, 0 replies; 93+ messages in thread
From: Eugenio Perez Martin @ 2023-05-09 15:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Hanna Czenczek, Stefan Hajnoczi, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, May 9, 2023 at 5:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, May 05, 2023 at 04:26:08PM +0200, Eugenio Perez Martin wrote:
> > On Fri, May 5, 2023 at 11:51 AM Hanna Czenczek <hreitz@redhat.com> wrote:
> > >
> > > (By the way, thanks for the explanations :))
> > >
> > > On 05.05.23 11:03, Hanna Czenczek wrote:
> > > > On 04.05.23 23:14, Stefan Hajnoczi wrote:
> > >
> > > [...]
> > >
> > > >> I think it's better to change QEMU's vhost code
> > > >> to leave stateful devices suspended (but not reset) across
> > > >> vhost_dev_stop() -> vhost_dev_start(), maybe by introducing
> > > >> vhost_dev_suspend() and vhost_dev_resume(). Have you thought about
> > > >> this aspect?
> > > >
> > > > Yes and no; I mean, I haven’t in detail, but I thought this is what’s
> > > > meant by suspending instead of resetting when the VM is stopped.
> > >
> > > So, now looking at vhost_dev_stop(), one problem I can see is that
> > > depending on the back-end, different operations it does will do
> > > different things.
> > >
> > > It tries to stop the whole device via vhost_ops->vhost_dev_start(),
> > > which for vDPA will suspend the device, but for vhost-user will reset it
> > > (if F_STATUS is there).
> > >
> > > It disables all vrings, which doesn’t mean stopping, but may be
> > > necessary, too.  (I haven’t yet really understood the use of disabled
> > > vrings, I heard that virtio-net would have a need for it.)
> > >
> > > It then also stops all vrings, though, so that’s OK.  And because this
> > > will always do GET_VRING_BASE, this is actually always the same
> > > regardless of transport.
> > >
> > > Finally (for this purpose), it resets the device status via
> > > vhost_ops->vhost_reset_status().  This is only implemented on vDPA, and
> > > this is what resets the device there.
> > >
> > >
> > > So vhost-user resets the device in .vhost_dev_start, but vDPA only does
> > > so in .vhost_reset_status.  It would seem better to me if vhost-user
> > > would also reset the device only in .vhost_reset_status, not in
> > > .vhost_dev_start.  .vhost_dev_start seems precisely like the place to
> > > run SUSPEND/RESUME.
> > >
> >
> > > I think the same. I just saw it's been proposed at [1].
> >
> > > Another question I have (but this is basically what I wrote in my last
> > > email) is why we even call .vhost_reset_status here.  If the device
> > > and/or all of the vrings are already stopped, why do we need to reset
> > > it?  Naïvely, I had assumed we only really need to reset the device if
> > > the guest changes, so that a new guest driver sees a freshly initialized
> > > device.
> > >
> >
> > I don't know why we didn't need to call it :). I'm assuming the
> > previous vhost-user net did fine resetting vq indexes, using
> > VHOST_USER_SET_VRING_BASE. But I don't know about more complex
> > devices.
>
> It was added so DPDK can batch rx virtqueue RSS updates:
>
> commit 923b8921d210763359e96246a58658ac0db6c645
> Author: Yajun Wu <yajunw@nvidia.com>
> Date:   Mon Oct 17 14:44:52 2022 +0800
>
>     vhost-user: Support vhost_dev_start
>
>     The motivation of adding vhost-user vhost_dev_start support is to
>     improve backend configuration speed and reduce live migration VM
>     downtime.
>
>     Today VQ configuration is issued one by one. For virtio net with
>     multi-queue support, backend needs to update RSS (Receive side
>     scaling) on every rx queue enable. Updating RSS is time-consuming
>     (typical time like 7ms).
>
>     Implement already defined vhost status and message in the vhost
>     specification [1].
>     (a) VHOST_USER_PROTOCOL_F_STATUS
>     (b) VHOST_USER_SET_STATUS
>     (c) VHOST_USER_GET_STATUS
>
>     Send message VHOST_USER_SET_STATUS with VIRTIO_CONFIG_S_DRIVER_OK for
>     device start and reset(0) for device stop.
>
>     On reception of the DRIVER_OK message, backend can apply the needed setting
>     only once (instead of incremental) and also utilize parallelism on enabling
>     queues.
>
>     This improves QEMU's live migration downtime with vhost user backend
>     implementation by great margin, specially for the large number of VQs of 64
>     from 800 msec to 250 msec.
>
>     [1] https://qemu-project.gitlab.io/qemu/interop/vhost-user.html
>
>     Signed-off-by: Yajun Wu <yajunw@nvidia.com>
>     Acked-by: Parav Pandit <parav@nvidia.com>
>     Message-Id: <20221017064452.1226514-3-yajunw@nvidia.com>
>     Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
Sorry for the confusion. What I was wondering is why vhost-user
devices did not need any signal to reset the device before
VHOST_USER_SET_STATUS existed. My guess is that it is enough to get /
set the vq indexes.
vhost_user_reset_device is limited to scsi, so it would not work for
the rest of the devices.
Thanks!
* Re: [PATCH 2/4] vhost-user: Interface for migration state transfer
  2023-05-09 15:35                           ` Eugenio Perez Martin
@ 2023-05-09 17:33                             ` Stefan Hajnoczi
  0 siblings, 0 replies; 93+ messages in thread
From: Stefan Hajnoczi @ 2023-05-09 17:33 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Stefan Hajnoczi, Hanna Czenczek, qemu-devel, virtio-fs,
	German Maglione, Anton Kuchin, Juan Quintela, Michael S . Tsirkin,
	Stefano Garzarella
On Tue, 9 May 2023 at 11:35, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Tue, May 9, 2023 at 5:09 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote:
> > > On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> > > > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > > > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > > > > wrote:
> > > > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > > > > transporting the
> > > > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > > > > stream.  To do
> > > > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > > > > state to and
> > > > > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > > > > believe it
> > > > > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > > > > streaming
> > > > > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > > > > user
> > > > > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > > > > addition to
> > > > > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > > > > > added so
> > > > > > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > > > > > support.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > > > > negotiate a pipe
> > > > > > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > > > > > FD to the
> > > > > > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > > > > > the back-end
> > > > > > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > > > > > for the
> > > > > > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > > > > but maybe
> > > > > > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > > > > > write/read
> > > > > > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > > > > > simple pipe.
> > > > > > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > > > > front-end
> > > > > > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > > > > > cases), in which
> > > > > > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > > > > > creating a
> > > > > > > > > > > > > >    pipe.
> > > > > > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > > > > > plain
> > > > > > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > > > > through the
> > > > > > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > > > > > function
> > > > > > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > > > > > pipe) to
> > > > > > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > > > > migration
> > > > > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > > > > the reading
> > > > > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > > > > > check for
> > > > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > > > > > includes
> > > > > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > > > > vDPA has:
> > > > > > > > > > > > >
> > > > > > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > > > > > anymore
> > > > > > > > > > > > >     *
> > > > > > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > > > > > necessary state
> > > > > > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > > > > > specific states) that is
> > > > > > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > > > > > change its
> > > > > > > > > > > > >     * configuration after that point.
> > > > > > > > > > > > >     */
> > > > > > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > > > >
> > > > > > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > > > > requests
> > > > > > > > > > > > >     *
> > > > > > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > > > > > restored all the
> > > > > > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > > > > > processing the
> > > > > > > > > > > > >     * virtqueue descriptors.
> > > > > > > > > > > > >     */
> > > > > > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > > > >
> > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > > > > that the
> > > > > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > > > > It's okay
> > > > > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > > > > avoid
> > > > > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > > > >
> > > > > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > > > > VHOST_STOP
> > > > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > > > > change
> > > > > > > > > > > > to SUSPEND.
> > > > > > > > > > > >
> > > > > > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > > > > > and
> > > > > > > > > > > > we trust the messages and their semantics, in my opinion. In other
> > > > > > > > > > > > words, instead of
> > > > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > > > > > send
> > > > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > > > > > command.
> > > > > > > > > > > >
> > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > > > > > it
> > > > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > > > > >
> > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > > > > > ok.
> > > > > > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > > > > > number
> > > > > > > > > > > > of states is high.
> > > > > > > > > > > Hi Eugenio,
> > > > > > > > > > > Another question about vDPA suspend/resume:
> > > > > > > > > > >
> > > > > > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > > > > > bool vrings)
> > > > > > > > > > >    {
> > > > > > > > > > >        int i;
> > > > > > > > > > >
> > > > > > > > > > >        /* should only be called after backend is connected */
> > > > > > > > > > >        assert(hdev->vhost_ops);
> > > > > > > > > > >        event_notifier_test_and_clear(
> > > > > > > > > > >            &hdev-
> > > > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > > > > >
> > > > > > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > > > > >
> > > > > > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > > > > > >            ^^^ SUSPEND ^^^
> > > > > > > > > > >        }
> > > > > > > > > > >        if (vrings) {
> > > > > > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > > > > > >        }
> > > > > > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > > > > > >                                 vdev,
> > > > > > > > > > >                                 hdev->vqs + i,
> > > > > > > > > > >                                 hdev->vq_index + i);
> > > > > > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > > > > > >        }
> > > > > > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > > > > > >            ^^^ reset device^^^
> > > > > > > > > > >
> > > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > > > > > ->
> > > > > > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > > > > >
> > > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > > > > >
> > > > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > > > > > model too.
> > > > > > > > > >
> > > > > > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > > > > > cross-backends migrations, etc.
> > > > > > > > > >
> > > > > > > > > > Does that answer your question?
> > > > > > > > > I think you're confirming that changes would be necessary in order for
> > > > > > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > > > > >
> > > > > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > > > > where state can be routed through classical emulated devices. This is
> > > > > > > > how vhost-kernel and vhost-user classically do it. It also allows
> > > > > > > > cross-backend migration, avoids modifying the qemu migration state, etc.
> > > > > > > >
> > > > > > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > > > > > suspend and not before, requires changes in vhost protocol, as
> > > > > > > > discussed previously.
> > > > > > > >
> > > > > > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > > > > > stateful
> > > > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > > > > > correct?
> > > > > > > > > > >
> > > > > > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > > > > > properly in the destination has not been merged.
> > > > > > > > > I'm not sure what you mean by elsewhere?
> > > > > > > > >
> > > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > > > > > ioctls, but mostly in qemu.
> > > > > > > >
> > > > > > > > If you meant stateful as "it must have a state blob that must be
> > > > > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > > > > the state blob at about the same time as the vq indexes. But yes, changes
> > > > > > > > (at least a new ioctl) are needed for that.
> > > > > > > >
> > > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > > > > >
> > > > > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > > > > the device state can be saved before the device is reset.
> > > > > > > > >
> > > > > > > > > Does that sound right?
> > > > > > > > >
> > > > > > > > The split between suspend and reset was added recently for that very
> > > > > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > > > > cleanup. Especially since we have already established that the state is
> > > > > > > > small enough to not need iterative migration from virtiofsd's point of view.
> > > > > > > >
> > > > > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > > > > vhost-user follows them by using a shared memory region where their
> > > > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > > > > as all the information is there to recover them.
> > > > > > > >
> > > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > > > > a possibility is to synchronize this memory region after a
> > > > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > > > > > devices are not going to crash in the software sense, so all use cases
> > > > > > > > remain the same to qemu. And that shared memory information is
> > > > > > > > recoverable after vhost_dev_stop.
> > > > > > > >
> > > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > > > > region where it dumps the state, maybe only after the
> > > > > > > > set_state(STATE_PHASE_STOPPED)?
> > > > > > >
> > > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > > > > mandatory anyway.
> > > > > > >
> > > > > > > As for the shared memory, the RFC before this series used shared memory,
> > > > > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > > > > provide it to the front-end, is that right?  That could work like this:
> > > > > > >
> > > > > > > On the source side:
> > > > > > >
> > > > > > > S1. SUSPEND goes to virtiofsd
> > > > > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > > > > serializes its state into a newly allocated shared memory area[1]
> > > > > > > S3. virtiofsd responds to SUSPEND
> > > > > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > > > > maybe already closes its reference
> > > > > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > > > > >
> > > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > > > > > it can immediately allocate this area and serialize directly into it;
> > > > > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > > > > fundamental problem, but there are limitations around what you can do
> > > > > > > with serde implementations in Rust…
> > > > > > >
> > > > > > > On the destination side:
> > > > > > >
> > > > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > > > > to SUSPEND
> > > > > > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > > > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > > > > previous area, and now uses this one
> > > > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > > > > >
> > > > > > > Couple of questions:
> > > > > > >
> > > > > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > > > > would imply to deserialize a state, and the state is to be transferred
> > > > > > > through SHM, this is what would need to be done.  So maybe we should
> > > > > > > skip SUSPEND on the destination?
> > > > > > > B. You described that the back-end should supply the SHM, which works
> > > > > > > well on the source.  On the destination, only the front-end knows how
> > > > > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > > > > (In which case the front-end would need to tell the back-end how big the
> > > > > > > state SHM needs to be.)
> > > > > >
> > > > > > How does this work for iterative live migration?
> > > > > >
> > > > >
> > > > > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > > > > Especially if we want to keep that opaqueness.
> > > > >
> > > > > But we will need to communicate with the HW device using shared memory sooner
> > > > > or later for big states.  If we don't transform it in qemu, we will need to do
> > > > > it in the kernel.  Also, the pipe will not support daemon crashes.
> > > > >
> > > > > Again I'm just putting this on the table, just in case it fits better or it is
> > > > > convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> > > > > missed some feedback useful here.  I think the pipe is a better solution in the
> > > > > long run because of the iterative part.
> > > >
> > > > Pipes and shared memory are conceptually equivalent for building
> > > > streaming interfaces. It's just more complex to design a shared memory
> > > > interface and it reinvents what file descriptors already offer.
> > > >
> > > > I have no doubt we could design iterative migration over a shared memory
> > > > interface if we needed to, but I'm not sure why? When you mention
> > > > hardware, are you suggesting defining a standard memory/register layout
> > > > that hardware implements and mapping it to userspace (QEMU)?
> > >
> > > Right.
> > >
> > > > Is there a
> > > > big advantage to exposing memory versus a file descriptor?
> > > >
> > >
> > > For hardware it allows retrieving and setting the device state without
> > > intervention of the kernel, saving context switches. For virtiofsd
> > > this may not make a lot of sense, but I'm thinking of devices with big
> > > states (virtio gpu, maybe?).
> >
> > A streaming interface implemented using shared memory involves consuming
> > chunks of bytes. Each time data has been read, an action must be
> > performed to notify the device and receive a notification when more data
> > becomes available.
> >
> > That notification involves the kernel (e.g. an eventfd that is triggered
> > by a hardware interrupt) and a read(2) syscall to reset the eventfd.
> >
> > Unless userspace disables notifications and polls (busy waits) the
> > hardware registers, there is still going to be kernel involvement and a
> > context switch. For this reason, I think that shared memory vs pipes
> > will not be significantly different.
> >
>
> Yes, for big states that's right. I was thinking of not-so-big states,
> where all of it can be asked in one shot, but it may be problematic
> with iterative migration for sure. In that regard pipes are way
> better.
>
> > > For software it allows the backend to survive a crash, as the old
> > > state can be set directly to a fresh backend instance.
> >
> > Can you explain by describing the steps involved?
>
> It's how vhost-user inflight I/O tracking works [1]: QEMU and the
> backend share a memory region where the backend dumps its state
> continuously. In the event of a crash, this state can be passed
> directly to a new vhost-user backend.
Neither shared memory nor INFLIGHT_FD is required for crash recovery
because the backend can stash state elsewhere, like tmpfs or systemd's
FDSTORE=1 (https://www.freedesktop.org/software/systemd/man/sd_pid_notify_with_fds.html).
INFLIGHT_FD is just a mechanism to stash an fd (only the backend
interprets the contents of the fd and the frontend doesn't even know
whether the fd is shared memory or another type of file).
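To illustrate the "stash an fd" idea, here is a minimal sketch in plain POSIX C. It is not QEMU or virtiofsd code: stash_create(), stash_save() and stash_load() are hypothetical names, the /tmp path is an assumption, and the length-prefixed layout is just one possible format that only the backend would need to understand. Whoever holds the fd (the front-end, systemd, ...) never looks at the contents.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an anonymous file. The path is unlinked immediately, so the
 * file lives exactly as long as somebody keeps the fd open. */
static int stash_create(void)
{
    char path[] = "/tmp/backend-state-XXXXXX"; /* hypothetical location */
    int fd = mkstemp(path);

    if (fd >= 0) {
        unlink(path);
    }
    return fd;
}

/* First backend instance: overwrite the stash with a length header
 * followed by the state bytes. Short writes are treated as errors. */
static int stash_save(int fd, const void *state, size_t len)
{
    assert(fd >= 0);
    if (ftruncate(fd, 0) < 0) {
        return -1;
    }
    if (pwrite(fd, &len, sizeof(len), 0) != (ssize_t)sizeof(len)) {
        return -1;
    }
    if (pwrite(fd, state, len, sizeof(len)) != (ssize_t)len) {
        return -1;
    }
    return 0;
}

/* Fresh backend instance after a restart: read back whatever the
 * previous instance left behind. Returns the state length or -1. */
static ssize_t stash_load(int fd, void *out, size_t cap)
{
    size_t len;

    assert(fd >= 0);
    if (pread(fd, &len, sizeof(len), 0) != (ssize_t)sizeof(len)) {
        return -1;
    }
    if (len > cap) {
        return -1;
    }
    if (pread(fd, out, len, sizeof(len)) != (ssize_t)len) {
        return -1;
    }
    return (ssize_t)len;
}
```

The point is that the stash always holds a complete snapshot, so the reader does not depend on the writer still being alive.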
I think crash recovery is orthogonal to this discussion because we're
talking about a streaming interface. A streaming interface breaks when
a crash occurs (regardless of whether it's implemented via shared
memory or pipes) as it involves two entities coordinating with each
other. If an entity goes away then the stream is incomplete and cannot
be used for crash recovery. I guess you're thinking of an fd that
contains the full state of the device. That fd could be handed to the
backend after reconnection for crash recovery, but a streaming
interface doesn't support that.
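A small sketch of why that is, again in plain POSIX C and under stated assumptions: stream_state() and read_until_eof() are hypothetical helpers, not part of the vhost-user protocol, and the "crash" is simulated by closing the write end after only part of the state has been written. The reader sees a clean EOF either way, which is also why the patch needs an out-of-band CHECK_DEVICE_STATE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read from fd until EOF or until buf is full; returns bytes read or -1. */
static ssize_t read_until_eof(int fd, unsigned char *buf, size_t cap)
{
    size_t total = 0;

    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        if (n == 0) {
            break; /* EOF: the writer closed its end, for whatever reason */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}

/* Hypothetical helper: the sender writes only `sent` of `state_len` bytes
 * and then its end of the pipe is closed, as happens when a process dies.
 * `sent` is capped so the single write fits in the pipe without blocking. */
static ssize_t stream_state(const unsigned char *state, size_t state_len,
                            size_t sent, unsigned char *out, size_t cap)
{
    int fds[2];
    ssize_t got;

    assert(sent <= state_len && sent <= 4096);
    memset(out, 0, cap);
    if (pipe(fds) < 0) {
        return -1;
    }
    if (write(fds[1], state, sent) != (ssize_t)sent) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]); /* normal completion and a crash look the same from here */
    got = read_until_eof(fds[0], out, cap);
    close(fds[0]);
    return got;
}
```

Calling stream_state() with sent equal to state_len returns the full state; calling it with a smaller sent returns a truncated prefix with no in-band indication that anything went wrong, so the result is not usable as a recovery snapshot.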
I guess you're bringing up the idea of having the full device state
always up-to-date for crash recovery purposes? I think crash recovery
should be optional since it's complex and hard to test while many
(most?) backends don't implement it. It is likely that using the crash
recovery state for live migration is going to be even trickier because
live migration has additional requirements (e.g. compatibility). My
feeling is that it's too hard to satisfy both live migration and crash
recovery requirements for all vhost-user device types, but if you have
concrete ideas then let's discuss them.
Stefan
Thread overview: 93+ messages
2023-04-11 15:05 [PATCH 0/4] vhost-user-fs: Internal migration Hanna Czenczek
2023-04-11 15:05 ` [PATCH 1/4] vhost: Re-enable vrings after setting features Hanna Czenczek
2023-04-12 10:55   ` German Maglione
2023-04-12 12:18     ` Hanna Czenczek
2023-04-12 20:51   ` Stefan Hajnoczi
2023-04-13  7:17     ` Maxime Coquelin
2023-04-13  8:19     ` Hanna Czenczek
2023-04-13 11:03       ` Stefan Hajnoczi
2023-04-13 14:24         ` Anton Kuchin
2023-04-13 15:48           ` Michael S. Tsirkin
2023-04-13 11:03   ` Stefan Hajnoczi
2023-04-13 17:32     ` Hanna Czenczek
2023-04-13 13:19   ` Michael S. Tsirkin
2023-04-11 15:05 ` [PATCH 2/4] vhost-user: Interface for migration state transfer Hanna Czenczek
2023-04-12 21:06   ` Stefan Hajnoczi
2023-04-13  9:24     ` Hanna Czenczek
2023-04-13 11:38       ` Stefan Hajnoczi
2023-04-13 17:55         ` Hanna Czenczek
2023-04-13 20:42           ` Stefan Hajnoczi
2023-04-14 15:17           ` Eugenio Perez Martin
2023-04-17 15:18             ` Stefan Hajnoczi
2023-04-17 18:55               ` Eugenio Perez Martin
2023-04-17 19:08                 ` Stefan Hajnoczi
2023-04-17 19:11                   ` Eugenio Perez Martin
2023-04-17 19:46                     ` Stefan Hajnoczi
2023-04-18 10:09                       ` Eugenio Perez Martin
2023-04-19 10:45             ` Hanna Czenczek
2023-04-19 10:57               ` Stefan Hajnoczi
2023-04-13 10:14     ` Eugenio Perez Martin
2023-04-13 11:07       ` Stefan Hajnoczi
2023-04-13 17:31       ` Hanna Czenczek
2023-04-17 15:12         ` Stefan Hajnoczi
2023-04-19 10:47           ` Hanna Czenczek
2023-04-17 18:37         ` Eugenio Perez Martin
2023-04-17 15:38       ` Stefan Hajnoczi
2023-04-17 19:09         ` Eugenio Perez Martin
2023-04-17 19:33           ` Stefan Hajnoczi
2023-04-18  8:09             ` Eugenio Perez Martin
2023-04-18 17:59               ` Stefan Hajnoczi
2023-04-18 18:31                 ` Eugenio Perez Martin
2023-04-18 20:40                   ` Stefan Hajnoczi
2023-04-20 13:27                     ` Eugenio Pérez
2023-05-08 19:12                       ` Stefan Hajnoczi
2023-05-09  6:31                         ` Eugenio Perez Martin
2023-05-09  9:01                           ` Hanna Czenczek
2023-05-09 15:26                             ` Eugenio Perez Martin
2023-04-19 10:57                 ` [Virtio-fs] " Hanna Czenczek
2023-04-19 11:10                   ` Stefan Hajnoczi
2023-04-19 11:15                     ` Hanna Czenczek
2023-04-19 11:24                       ` Stefan Hajnoczi
2023-04-17 17:14       ` Stefan Hajnoczi
2023-04-17 19:06         ` Eugenio Perez Martin
2023-04-17 19:20           ` Stefan Hajnoczi
2023-04-18  7:54             ` Eugenio Perez Martin
2023-04-19 11:10               ` Hanna Czenczek
2023-04-19 11:21                 ` Stefan Hajnoczi
2023-04-19 11:24                   ` Hanna Czenczek
2023-04-20 13:29                   ` Eugenio Pérez
2023-05-08 20:10                     ` Stefan Hajnoczi
2023-05-09  6:45                       ` Eugenio Perez Martin
2023-05-09 15:09                         ` Stefan Hajnoczi
2023-05-09 15:35                           ` Eugenio Perez Martin
2023-05-09 17:33                             ` Stefan Hajnoczi
2023-04-20 10:44                 ` Eugenio Pérez
2023-04-13  8:50   ` Eugenio Perez Martin
2023-04-13  9:25     ` Hanna Czenczek
2023-04-11 15:05 ` [PATCH 3/4] vhost: Add high-level state save/load functions Hanna Czenczek
2023-04-12 21:14   ` Stefan Hajnoczi
2023-04-13  9:04     ` Hanna Czenczek
2023-04-13 11:22       ` Stefan Hajnoczi
2023-04-11 15:05 ` [PATCH 4/4] vhost-user-fs: Implement internal migration Hanna Czenczek
2023-04-12 21:00 ` [PATCH 0/4] vhost-user-fs: Internal migration Stefan Hajnoczi
2023-04-13  8:20   ` Hanna Czenczek
2023-04-13 16:11 ` Michael S. Tsirkin
2023-04-13 17:53   ` [Virtio-fs] " Hanna Czenczek
2023-05-04 16:05 ` Hanna Czenczek
2023-05-04 21:14   ` Stefan Hajnoczi
2023-05-05  9:03     ` Hanna Czenczek
2023-05-05  9:51       ` Hanna Czenczek
2023-05-05 14:26         ` Eugenio Perez Martin
2023-05-05 14:37           ` Hanna Czenczek
2023-05-08 17:00             ` Hanna Czenczek
2023-05-08 17:51               ` Eugenio Perez Martin
2023-05-08 19:31                 ` Eugenio Perez Martin
2023-05-09  8:59                   ` Hanna Czenczek
2023-05-09 15:30           ` Stefan Hajnoczi
2023-05-09 15:43             ` Eugenio Perez Martin
2023-05-05  9:53       ` Eugenio Perez Martin
2023-05-05 12:51         ` Hanna Czenczek
2023-05-08 21:10           ` Stefan Hajnoczi
2023-05-09  8:53             ` Hanna Czenczek
2023-05-09 14:53               ` Stefan Hajnoczi
2023-05-09 15:41       ` Stefan Hajnoczi