qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: qemu-devel@nongnu.org
Cc: Peter Maydell <peter.maydell@linaro.org>,
	Hanna Czenczek <hreitz@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: [PULL 04/63] vhost-user.rst: Migrating back-end-internal state
Date: Tue, 7 Nov 2023 05:09:41 -0500	[thread overview]
Message-ID: <019233096c03b826e0e677115b6e3c550a54a48d.1699351720.git.mst@redhat.com> (raw)
In-Reply-To: <cover.1699351720.git.mst@redhat.com>

From: Hanna Czenczek <hreitz@redhat.com>

For vhost-user devices, qemu can migrate the virtio state, but not the
back-end's internal state.  To do so, we need to be able to transfer
this internal state between front-end (qemu) and back-end.

At this point, this new feature is added for the purpose of virtio-fs
migration.  Because virtiofsd's internal state will not be too large, we
believe it is best to transfer it as a single binary blob after the
streaming phase.

These are the additions to the protocol:
- New vhost-user protocol feature VHOST_USER_PROTOCOL_F_DEVICE_STATE
- SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a file
  descriptor over which to transfer the state.
- CHECK_DEVICE_STATE: After the state has been transferred through the
  file descriptor, the front-end invokes this function to verify
  success.  There is no in-band way (through the file descriptor) to
  indicate failure, so we need to check explicitly.

Once the transfer FD has been established via SET_DEVICE_STATE_FD
(which includes establishing the direction of transfer and migration
phase), the sending side writes its data into it, and the reading side
reads it until it sees an EOF.  Then, the front-end will check for
success via CHECK_DEVICE_STATE, which on the destination side includes
checking for integrity (i.e. errors during deserialization).

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20231016134243.68248-5-hreitz@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 docs/interop/vhost-user.rst | 172 ++++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 035a23ed35..9f1103f85a 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -322,6 +322,32 @@ VhostUserShared
 :UUID: 16 bytes UUID, whose first three components (a 32-bit value, then
   two 16-bit values) are stored in big endian.
 
+Device state transfer parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++--------------------+-----------------+
+| transfer direction | migration phase |
++--------------------+-----------------+
+
+:transfer direction: a 32-bit enum, describing the direction in which
+  the state is transferred:
+
+  - 0: Save: Transfer the state from the back-end to the front-end,
+    which happens on the source side of migration
+  - 1: Load: Transfer the state from the front-end to the back-end,
+    which happens on the destination side of migration
+
+:migration phase: a 32-bit enum, describing the state in which the VM
+  guest and devices are:
+
+  - 0: Stopped (in the period after the transfer of memory-mapped
+    regions before switch-over to the destination): The VM guest is
+    stopped, and the vhost-user device is suspended (see
+    :ref:`Suspended device state <suspended_device_state>`).
+
+  In the future, additional phases might be added e.g. to allow
+  iterative migration while the device is running.
+
 C structure
 -----------
 
@@ -381,6 +407,7 @@ in the ancillary data:
 * ``VHOST_USER_SET_VRING_ERR``
 * ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
 * ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
+* ``VHOST_USER_SET_DEVICE_STATE_FD``
 
 If *front-end* is unable to send the full message or receives a wrong
 reply it will close the connection. An optional reconnection mechanism
@@ -555,6 +582,80 @@ it performs WAKE ioctl's on the userfaultfd to wake the stalled
 back-end.  The front-end indicates support for this via the
 ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
 
+.. _migrating_backend_state:
+
+Migrating back-end state
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Migrating device state involves transferring the state from one
+back-end, called the source, to another back-end, called the
+destination.  After migration, the destination transparently resumes
+operation without requiring the driver to re-initialize the device at
+the VIRTIO level.  If the migration fails, then the source can
+transparently resume operation until another migration attempt is made.
+
+Generally, the front-end is connected to a virtual machine guest (which
+contains the driver), which has its own state to transfer between source
+and destination, and therefore will have an implementation-specific
+mechanism to do so.  The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature
+provides functionality to have the front-end include the back-end's
+state in this transfer operation so the back-end does not need to
+implement its own mechanism, and so the virtual machine may have its
+complete state, including vhost-user devices' states, contained within a
+single stream of data.
+
+To do this, the back-end state is transferred from back-end to front-end
+on the source side, and vice versa on the destination side.  This
+transfer happens over a channel that is negotiated using the
+``VHOST_USER_SET_DEVICE_STATE_FD`` message.  This message has two
+parameters:
+
+* Direction of transfer: On the source, the data is saved, transferring
+  it from the back-end to the front-end.  On the destination, the data
+  is loaded, transferring it from the front-end to the back-end.
+
+* Migration phase: Currently, the only supported phase is the period
+  after the transfer of memory-mapped regions before switch-over to the
+  destination, when both the source and destination devices are
+  suspended (:ref:`Suspended device state <suspended_device_state>`).
+  In the future, additional phases might be supported to allow iterative
+  migration while the device is running.
+
+The nature of the channel is implementation-defined, but it must
+generally behave like a pipe: The writing end will write all the data it
+has into it, signalling the end of data by closing its end.  The reading
+end must read all of this data (until encountering the end of file) and
+process it.
+
+* When saving, the writing end is the source back-end, and the reading
+  end is the source front-end.  After reading the state data from the
+  channel, the source front-end must transfer it to the destination
+  front-end through an implementation-defined mechanism.
+
+* When loading, the writing end is the destination front-end, and the
+  reading end is the destination back-end.  After reading the state data
+  from the channel, the destination back-end must deserialize its
+  internal state from that data and set itself up to allow the driver to
+  seamlessly resume operation on the VIRTIO level.
+
+Seamlessly resuming operation means that the migration must be
+transparent to the guest driver, which operates on the VIRTIO level.
+This driver will not perform any re-initialization steps, but continue
+to use the device as if no migration had occurred.  The vhost-user
+front-end, however, will re-initialize the vhost state on the
+destination, following the usual protocol for establishing a connection
+to a vhost-user back-end: This includes, for example, setting up memory
+mappings and kick and call FDs as necessary, negotiating protocol
+features, or setting the initial vring base indices (to the same value
+as on the source side, so that operation can resume).
+
+Both on the source and on the destination side, after the respective
+front-end has seen all data transferred (when the transfer FD has been
+closed), it sends the ``VHOST_USER_CHECK_DEVICE_STATE`` message to
+verify that data transfer was successful in the back-end, too.  The
+back-end responds once it knows whether the transfer and processing was
+successful or not.
+
 Memory access
 -------------
 
@@ -949,6 +1050,7 @@ Protocol features
   #define VHOST_USER_PROTOCOL_F_STATUS               16
   #define VHOST_USER_PROTOCOL_F_XEN_MMAP             17
   #define VHOST_USER_PROTOCOL_F_SHARED_OBJECT        18
+  #define VHOST_USER_PROTOCOL_F_DEVICE_STATE         19
 
 Front-end message types
 -----------------------
@@ -1553,6 +1655,76 @@ Front-end message types
   the requested UUID. Back-end will reply passing the fd when the operation
   is successful, or no fd otherwise.
 
+``VHOST_USER_SET_DEVICE_STATE_FD``
+  :id: 42
+  :equivalent ioctl: N/A
+  :request payload: device state transfer parameters
+  :reply payload: ``u64``
+
+  Front-end and back-end negotiate a channel over which to transfer the
+  back-end’s internal state during migration.  Either side (front-end or
+  back-end) may create the channel.  The nature of this channel is not
+  restricted or defined in this document, but whichever side creates it
+  must create a file descriptor that is provided to the respectively
+  other side, allowing access to the channel.  This FD must behave as
+  follows:
+
+  * For the writing end, it must allow writing the whole back-end state
+    sequentially.  Closing the file descriptor signals the end of
+    transfer.
+
+  * For the reading end, it must allow reading the whole back-end state
+    sequentially.  The end of file signals the end of the transfer.
+
+  For example, the channel may be a pipe, in which case the two ends of
+  the pipe fulfill these requirements respectively.
+
+  Initially, the front-end creates a channel along with such an FD.  It
+  passes the FD to the back-end as ancillary data of a
+  ``VHOST_USER_SET_DEVICE_STATE_FD`` message.  The back-end may create a
+  different transfer channel, passing the respective FD back to the
+  front-end as ancillary data of the reply.  If so, the front-end must
+  then discard its channel and use the one provided by the back-end.
+
+  Whether the back-end should decide to use its own channel is decided
+  based on efficiency: If the channel is a pipe, both ends will most
+  likely need to copy data into and out of it.  Any channel that allows
+  for more efficient processing on at least one end, e.g. through
+  zero-copy, is considered more efficient and thus preferred.  If the
+  back-end can provide such a channel, it should decide to use it.
+
+  The request payload contains parameters for the subsequent data
+  transfer, as described in the :ref:`Migrating back-end state
+  <migrating_backend_state>` section.
+
+  The value returned is both an indication for success, and whether a
+  file descriptor for a back-end-provided channel is returned: Bits 0–7
+  are 0 on success, and non-zero on error.  Bit 8 is the invalid FD
+  flag; this flag is set when there is no file descriptor returned.
+  When this flag is not set, the front-end must use the returned file
+  descriptor as its end of the transfer channel.  The back-end must not
+  both indicate an error and return a file descriptor.
+
+  Using this function requires prior negotiation of the
+  ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
+
+``VHOST_USER_CHECK_DEVICE_STATE``
+  :id: 43
+  :equivalent ioctl: N/A
+  :request payload: N/A
+  :reply payload: ``u64``
+
+  After transferring the back-end’s internal state during migration (see
+  the :ref:`Migrating back-end state <migrating_backend_state>`
+  section), check whether the back-end was able to successfully fully
+  process the state.
+
+  The value returned indicates success or error; 0 is success, any
+  non-zero value is an error.
+
+  Using this function requires prior negotiation of the
+  ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
+
 Back-end message types
 ----------------------
 
-- 
MST



  parent reply	other threads:[~2023-11-07 10:11 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-07 10:09 [PULL 00/63] virtio,pc,pci: features, fixes Michael S. Tsirkin
2023-11-07 10:09 ` [PULL 01/63] vhost-user.rst: Improve [GS]ET_VRING_BASE doc Michael S. Tsirkin
2023-11-07 10:09 ` [PULL 02/63] vhost-user.rst: Clarify enabling/disabling vrings Michael S. Tsirkin
2023-11-07 10:09 ` [PULL 03/63] vhost-user.rst: Introduce suspended state Michael S. Tsirkin
2023-11-07 10:09 ` Michael S. Tsirkin [this message]
2023-11-07 10:09 ` [PULL 05/63] vhost-user: Interface for migration state transfer Michael S. Tsirkin
2023-11-07 10:09 ` [PULL 06/63] vhost: Add high-level state save/load functions Michael S. Tsirkin
2023-11-07 10:09 ` [PULL 07/63] vhost-user-fs: Implement internal migration Michael S. Tsirkin
2023-11-07 10:09 ` [PULL 08/63] Add virtio-sound device stub Michael S. Tsirkin
2023-11-09 14:30   ` Peter Maydell
2023-11-09 15:50     ` Manos Pitsidianakis
2023-11-09 16:06       ` Peter Maydell
2023-11-09 16:10       ` Alex Bennée
2023-11-07 10:10 ` [PULL 09/63] Add virtio-sound-pci device Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 10/63] virtio-sound: handle control messages and streams Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 11/63] virtio-sound: handle VIRTIO_SND_R_PCM_INFO request Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 12/63] virtio-sound: handle VIRTIO_SND_R_PCM_{START,STOP} Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 13/63] virtio-sound: handle VIRTIO_SND_R_PCM_SET_PARAMS Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 14/63] virtio-sound: handle VIRTIO_SND_R_PCM_PREPARE Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 15/63] virtio-sound: handle VIRTIO_SND_R_PCM_RELEASE Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 16/63] virtio-sound: implement audio output (TX) Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 17/63] virtio-sound: implement audio capture (RX) Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 18/63] docs/system: add basic virtio-snd documentation Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 19/63] vdpa: Restore hash calculation state Michael S. Tsirkin
2023-11-07 10:10 ` [PULL 20/63] vdpa: Allow VIRTIO_NET_F_HASH_REPORT in SVQ Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 21/63] vdpa: Add SetSteeringEBPF method for NetClientState Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 22/63] vdpa: Restore receive-side scaling state Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 23/63] vdpa: Allow VIRTIO_NET_F_RSS in SVQ Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 24/63] tests: test-smp-parse: Add the test for cores/threads per socket helpers Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 25/63] tests: bios-tables-test: Prepare the ACPI table change for smbios type4 count test Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 26/63] tests: bios-tables-test: Add test for smbios type4 count Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 27/63] tests: bios-tables-test: Add ACPI table binaries for smbios type4 count test Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 28/63] tests: bios-tables-test: Prepare the ACPI table change for smbios type4 core " Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 29/63] tests: bios-tables-test: Add test for smbios type4 core count Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 30/63] tests: bios-tables-test: Add ACPI table binaries for smbios type4 core count test Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 31/63] tests: bios-tables-test: Prepare the ACPI table change for smbios type4 core count2 test Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 32/63] tests: bios-tables-test: Extend smbios core count2 test to cover general topology Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 33/63] tests: bios-tables-test: Update ACPI table binaries for smbios core count2 test Michael S. Tsirkin
2023-11-07 10:11 ` [PULL 34/63] tests: bios-tables-test: Prepare the ACPI table change for smbios type4 thread count test Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 35/63] tests: bios-tables-test: Add test for smbios type4 thread count Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 36/63] tests: bios-tables-test: Add ACPI table binaries for smbios type4 thread count test Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 37/63] tests: bios-tables-test: Prepare the ACPI table change for smbios type4 thread count2 test Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 38/63] tests: bios-tables-test: Add test for smbios type4 thread count2 Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 39/63] tests: bios-tables-test: Add ACPI table binaries for smbios type4 thread count2 test Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 40/63] hw/cxl: Use a switch to explicitly check size in caps_reg_read() Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 41/63] hw/cxl: Use switch statements for read and write of cachemem registers Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 42/63] hw/cxl: CXLDVSECPortExtensions renamed to CXLDVSECPortExt Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 43/63] hw/cxl: Line length reductions Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 44/63] hw/cxl: Fix a QEMU_BUILD_BUG_ON() in switch statement scope issue Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 45/63] hw/cxl/mbox: Pull the payload out of struct cxl_cmd and make instances constant Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 46/63] hw/cxl/mbox: Split mailbox command payload into separate input and output Michael S. Tsirkin
2023-11-07 10:12 ` [PULL 47/63] hw/cxl/mbox: Pull the CCI definition out of the CXLDeviceState Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 48/63] hw/cxl/mbox: Generalize the CCI command processing Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 49/63] hw/pci-bridge/cxl_upstream: Move defintion of device to header Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 50/63] hw/cxl: Add a switch mailbox CCI function Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 51/63] hw/cxl/mbox: Add Information and Status / Identify command Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 52/63] hw/cxl/mbox: Add Physical Switch " Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 53/63] hw/pci-bridge/cxl_downstream: Set default link width and link speed Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 54/63] hw/cxl: Implement Physical Ports status retrieval Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 55/63] hw/cxl/mbox: Add support for background operations Michael S. Tsirkin
2023-11-09 14:44   ` Peter Maydell
2023-11-10  4:25     ` Davidlohr Bueso
2023-11-07 10:13 ` [PULL 56/63] hw/cxl/mbox: Wire up interrupts for background completion Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 57/63] hw/cxl: Add support for device sanitation Michael S. Tsirkin
2023-11-09 14:39   ` Peter Maydell
2023-11-10  4:14     ` Davidlohr Bueso
2023-11-07 10:13 ` [PULL 58/63] hw/cxl/mbox: Add Get Background Operation Status Command Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 59/63] hw/cxl/type3: Cleanup multiple CXL_TYPE3() calls in read/write functions Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 60/63] hw/cxl: Add dummy security state get Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 61/63] hw/cxl: Add tunneled command support to mailbox for switch cci Michael S. Tsirkin
2023-11-07 10:13 ` [PULL 62/63] acpi/tests/avocado/bits: enforce 32-bit SMBIOS entry point Michael S. Tsirkin
2023-11-07 10:14 ` [PULL 63/63] acpi/tests/avocado/bits: enable console logging from bits VM Michael S. Tsirkin
2023-11-07 13:40 ` [PULL 00/63] virtio,pc,pci: features, fixes Stefan Hajnoczi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=019233096c03b826e0e677115b6e3c550a54a48d.1699351720.git.mst@redhat.com \
    --to=mst@redhat.com \
    --cc=hreitz@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).