From: Kirti Wankhede <kwankhede@nvidia.com>
To: <alex.williamson@redhat.com>
Cc: mcrossley@nvidia.com, cjia@nvidia.com, cohuck@redhat.com,
	qemu-devel@nongnu.org, Kirti Wankhede <kwankhede@nvidia.com>,
	dnigam@nvidia.com, philmd@redhat.com
Subject: [PATCH v1] docs/devel: Add VFIO device migration documentation
Date: Thu, 29 Oct 2020 11:23:11 +0530
Message-ID: <1603950791-27236-1-git-send-email-kwankhede@nvidia.com>

Document the interfaces used for VFIO device migration. Add the flow of
state changes during live migration with a VFIO device.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 MAINTAINERS                   |   1 +
 docs/devel/vfio-migration.rst | 119 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 120 insertions(+)
 create mode 100644 docs/devel/vfio-migration.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 6a197bd358d6..6f3fcffc6b3d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1728,6 +1728,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
 S: Supported
 F: hw/vfio/*
 F: include/hw/vfio/
+F: docs/devel/vfio-migration.rst
 
 vfio-ccw
 M: Cornelia Huck <cohuck@redhat.com>
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index 000000000000..dab9127825e4
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,119 @@
+=====================
+VFIO device Migration
+=====================
+
+VFIO devices use an iterative approach for migration because certain VFIO
+devices (e.g. GPUs) have a large amount of data to be transferred. The
+iterative pre-copy phase of migration allows the guest to continue running
+while the VFIO device state is transferred to the destination; this helps to
+reduce the total downtime of the VM. VFIO devices can choose to skip the
+pre-copy phase of migration by returning pending_bytes as zero during the
+pre-copy phase.
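+
+As a minimal sketch of how that works (modeled on the ``save_live_pending``
+hook signature of that era; ``vfio_update_pending`` and the ``VFIOMigration``
+fields are illustrative), a device skips pre-copy simply by reporting no
+pending data::
+
+    static void vfio_save_pending(QEMUFile *f, void *opaque,
+                                  uint64_t threshold_size,
+                                  uint64_t *res_precopy_only,
+                                  uint64_t *res_compatible,
+                                  uint64_t *res_postcopy_only)
+    {
+        VFIODevice *vbasedev = opaque;
+        VFIOMigration *migration = vbasedev->migration;
+
+        /* Re-read pending_bytes from the vendor driver through the
+         * migration region. */
+        if (vfio_update_pending(vbasedev)) {
+            return;
+        }
+
+        /* Reporting 0 here during pre-copy means the device has nothing
+         * to send until the stop-and-copy phase. */
+        *res_precopy_only += migration->pending_bytes;
+    }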
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment above the ``vfio_device_migration_info`` structure definition in
+the header file linux-headers/linux/vfio.h.
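+
+For reference, an abridged sketch of that structure (fields as in the Linux
+v5.8 UAPI; see linux-headers/linux/vfio.h for the authoritative definition)::
+
+    struct vfio_device_migration_info {
+        __u32 device_state;         /* VFIO device state */
+    #define VFIO_DEVICE_STATE_STOP      (0)
+    #define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+    #define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+    #define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+        __u32 reserved;
+        __u64 pending_bytes;   /* data the vendor driver has yet to save */
+        __u64 data_offset;     /* offset of the data section in the region */
+        __u64 data_size;       /* size of the current data chunk */
+    };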
+
+VFIO device hooks for the iterative approach (a sketch of how they are
+registered follows this list):
+
+- A ``save_setup`` function that sets up the migration region, sets the
+  _SAVING flag in the VFIO device state and informs the VFIO IOMMU module to
+  start dirty page tracking.
+
+- A ``load_setup`` function that sets up the migration region on the
+  destination and sets the _RESUMING flag in the VFIO device state.
+
+- A ``save_live_pending`` function that reads pending_bytes from the vendor
+  driver, which indicates how much more data the vendor driver has yet to
+  save for the VFIO device.
+
+- A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver through the migration region during the iterative phase.
+
+- A ``save_live_complete_precopy`` function that clears the _RUNNING flag in
+  the VFIO device state, saves the device config space, if any, and
+  iteratively copies the remaining data for the VFIO device until
+  pending_bytes returned by the vendor driver is zero.
+
+- A ``load_state`` function that loads the config section and the data
+  sections that are generated by the save functions above.
+
+- ``cleanup`` functions for both save and load that unmap the migration
+  region.
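+
+A minimal sketch of how these hooks could be wired up, modeled on QEMU's
+``SaveVMHandlers`` (the ``vfio_*`` handler names are illustrative)::
+
+    static SaveVMHandlers savevm_vfio_handlers = {
+        .save_setup = vfio_save_setup,
+        .save_cleanup = vfio_save_cleanup,
+        .save_live_pending = vfio_save_pending,
+        .save_live_iterate = vfio_save_iterate,
+        .save_live_complete_precopy = vfio_save_complete_precopy,
+        .load_setup = vfio_load_setup,
+        .load_cleanup = vfio_load_cleanup,
+        .load_state = vfio_load_state,
+    };
+
+    /* Register the VFIO device as an iterable "vfio" migration stream */
+    register_savevm_live("vfio", VMSTATE_INSTANCE_ID_ANY, 1,
+                         &savevm_vfio_handlers, vbasedev);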
+
+A VM state change handler is registered to change the VFIO device state when
+the VM state changes.
+
+Similarly, a migration state change notifier is added to get a notification on
+migration state change. These states are translated to the corresponding VFIO
+device state and conveyed to the vendor driver.
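+
+A sketch of both registrations (``vfio_vmstate_change`` and
+``vfio_migration_state_notifier`` are illustrative callback names)::
+
+    /* Invoked on VM state transitions (e.g. running <-> stopped) to set or
+     * clear the _RUNNING flag in the VFIO device state. */
+    migration->vm_state =
+        qemu_add_vm_change_state_handler(vfio_vmstate_change, vbasedev);
+
+    /* Invoked on migration state transitions; the migration state is
+     * translated to a VFIO device state and conveyed to the vendor driver. */
+    migration->migration_state.notify = vfio_migration_state_notifier;
+    add_migration_state_change_notifier(&migration->migration_state);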
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback marks those system memory pages as
+dirty which are used for DMA by the VFIO device. The dirty pages bitmap is
+queried per container. All pages pinned by the vendor driver through the
+vfio_pin_pages() external API have to be marked as dirty during migration.
+When there are CPU writes, CPU dirty page tracking can identify dirtied
+pages, but any page pinned by the vendor driver can also be written by the
+device. There is currently no device which has hardware support for dirty
+page tracking, so all pages which are pinned by the vendor driver are
+considered as dirty.
+
+Dirty pages are tracked only when the device is in the stop-and-copy phase:
+if pages were marked dirty during the pre-copy phase and their content
+transferred from source to destination, there would be no way to know which
+of those pages were dirtied again between the time they were copied and the
+time the device stops. To avoid repeatedly copying the same content, pinned
+pages are therefore marked dirty only during the stop-and-copy phase.
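+
+A minimal sketch of the listener hook (``vfio_listener_log_sync`` is an
+illustrative callback that fetches the dirty bitmap from the VFIO IOMMU
+module for the section's IOVA range and marks the corresponding guest pages
+dirty)::
+
+    static MemoryListener vfio_memory_listener = {
+        .region_add = vfio_listener_region_add,
+        .region_del = vfio_listener_region_del,
+        /* Called by the migration code whenever dirty state is synced */
+        .log_sync = vfio_listener_log_sync,
+    };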
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in the
+pre-copy phase of migration. In that case, the unmap ioctl returns the pages
+pinned in that range and QEMU reports the corresponding guest physical pages
+dirty. During the stop-and-copy phase, an IOMMU notifier is used to get a
+callback for mapped pages, and then the dirty pages bitmap is fetched from
+the VFIO IOMMU module for those mapped ranges.
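+
+A sketch of such an unmap call, modeled on the Linux v5.8 UAPI (error
+handling omitted; ``iova``, ``size`` and ``translated_addr`` stand for the
+range being unmapped)::
+
+    struct vfio_iommu_type1_dma_unmap *unmap;
+    struct vfio_bitmap *bitmap;
+    uint64_t pages = REAL_HOST_PAGE_ALIGN(size) / qemu_real_host_page_size;
+
+    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
+    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
+    /* Ask the kernel to report pages pinned within the unmapped range */
+    unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
+    unmap->iova = iova;
+    unmap->size = size;
+
+    bitmap = (struct vfio_bitmap *)&unmap->data;
+    bitmap->pgsize = qemu_real_host_page_size;
+    bitmap->size = ROUND_UP(pages, 64) / 8;   /* one bit per page */
+    bitmap->data = g_try_malloc0(bitmap->size);
+
+    if (!ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap)) {
+        /* Mark the reported pages dirty in QEMU's RAM dirty bitmap */
+        cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
+                                               translated_addr, pages);
+    }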
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the flow of state changes during live migration. The values in the
+brackets represent the VM state, the migration state, and the VFIO device
+state, as:
+
+                (VM state, MIGRATION_STATUS, VFIO_DEVICE_STATE)
+
+Live migration save path
+------------------------
+
+::
+
+                        QEMU normal running state
+                        (RUNNING, _NONE, _RUNNING)
+                                    |
+                       migrate_init spawns migration_thread
+                Migration thread then calls each device's .save_setup()
+                        (RUNNING, _SETUP, _RUNNING|_SAVING)
+                                    |
+                        (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+          If the device is active, get pending_bytes by .save_live_pending()
+         If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  Data of VFIO device for pre-copy phase is copied
+     Iterate till total pending bytes converge and are less than threshold
+                                    |
+ On migration completion, vCPUs are stopped and .save_live_complete_precopy is
+  called for each active device. The VFIO device then transitions to _SAVING
+                    (FINISH_MIGRATE, _DEVICE, _SAVING)
+                                    |
+ For the VFIO device, iterate in .save_live_complete_precopy until pending
+                                  data is 0
+                    (FINISH_MIGRATE, _DEVICE, _STOPPED)
+                                    |
+                    (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+           The migration thread schedules the cleanup bottom half and exits
+
+Live migration resume path
+--------------------------
+
+::
+
+             Incoming migration calls .load_setup for each device
+                        (RESTORE_VM, _ACTIVE, _STOPPED)
+                                    |
+    For each device, .load_state is called for that device section data
+                        (RESTORE_VM, _ACTIVE, _RESUMING)
+                                    |
+     At the end, .load_cleanup is called for each device and vCPUs are started
+                        (RUNNING, _NONE, _RUNNING)
+
+
+Postcopy
+========
+
+Postcopy migration is not supported for VFIO devices.
-- 
2.7.0


