* [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2
@ 2023-02-16 14:36 Avihai Horon
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Hello,
Following the acceptance of VFIO migration protocol v2 in the kernel,
this series implements VFIO migration according to the new v2 protocol
and replaces the now-deprecated v1 implementation.
The main differences between v1 and v2 migration protocols are:
1. VFIO device state is represented as a finite state machine instead of
a bitmap.
2. The migration interface with the kernel uses the VFIO_DEVICE_FEATURE
ioctl and plain read() and write() instead of the migration region
used in v1.
3. Pre-copy is made optional in the v2 protocol. Support for pre-copy
will be added later on.
Full description of the v2 protocol and the differences from v1 can be
found here [1].
Patch list:
Patch 1 updates the Linux headers so we have the MIG_DATA_SIZE ioctl.
Patches 2-7 are prep patches that fix bugs and refactor the v1 protocol
code to make it easier to add the v2 protocol.
Patches 8-11 implement the v2 protocol and remove the v1 protocol.
Thanks.
Changes from v10 [2]:
- Rebased on latest master branch and Juan's pull request [3].
- Changed linux headers sync patch (#1) to sync to v6.2-rc8.
- Dropped patch #4 (qemu_file_get_to_fd()) as it is included in Juan's
pull request.
- Added Juan's/Cedric's R-bs.
Changes from v9 [4]:
- Rebased on latest master branch. As part of it:
1. Dropped patch #2 as it was already merged via Juan's pull request.
2. Dropped patch #11 (vfio_save_pending() optimization). With Juan's
changes we can achieve this optimization by implementing only
.state_pending_exact() handler and not a .state_pending_estimate()
handler (see comment in code).
3. Some function refactoring to be aligned with Juan's changes.
- Made multiple_devices_migration_blocker static (Cedric).
- Addressed Alex's comments:
1. Changed multiple devices migration block/unblock logic.
2. Removed recover_state local variable in
vfio_save_complete_precopy().
3. Added comments explaining the purpose of ERROR state usage as a
recover state (three places).
4. Changed vfio_migration_set_state() to directly reset the device if
ERROR is given as recover_state, instead of setting the device in
ERROR state, failing and only then resetting the device.
5. Added a note in VFIO migration documentation that migration is
supported only with a single device.
Changes from v8 [5]:
- Added patch that blocks migration of multiple devices. As discussed,
this is necessary since VFIO migration doesn't support P2P yet.
- Removed unnecessary P2P code. This should be added when P2P support is
added.
- Fixed vfio_save_block() comment to say -errno is returned on error.
- Added Reviewed-by tag to linux headers sync patch.
Changes from v7 [6]:
- Fixed compilation error on Windows in patch #9 reported by Cedric.
Changes from v6 [7]:
- Fixed another compilation error in patch #9 reported by Cedric.
- Added Reviewed-by tags.
Changes from v5 [8]:
- Dropped patch #3.
- Simplified patch #5 as per Alex's suggestion.
- Changed qemu_file_get_to_fd() to return -EIO instead of -1, as
suggested by Cedric.
Also changed it so that write now returns -errno instead of -1 on
error.
- Fixed compilation error reported by Cedric.
- Changed vfio_migration_query_flags() to print error message and return
-errno in error case as suggested by Cedric.
- Added Reviewed-by tags.
Changes from v4 [9]:
- Rebased on latest master branch.
- Added linux header update to kernel v6.2-rc1.
- Merged preview patches (#13-14) into this series.
Changes from v3 [10]:
- Rebased on latest master branch.
- Dropped patch #1 "migration: Remove res_compatible parameter" as
it's not mandatory to this series and needs some further discussion.
- Dropped patch #3 "migration: Block migration comment or code is
wrong" as it has been merged already.
- Addressed overlooked corner case reported by Vladimir in patch #4
"migration: Simplify migration_iteration_run()".
- Dropped patch #5 "vfio/migration: Fix wrong enum usage" as it has
been merged already.
- In patch #12 "vfio/migration: Implement VFIO migration protocol v2":
1. Changed vfio_save_pending() to update res_precopy_only instead of
res_postcopy_only (as VFIO migration doesn’t support postcopy).
2. Moved VFIOMigration->data_buffer allocation to vfio_save_setup()
and its de-allocation to vfio_save_cleanup(), so now it's
allocated when actually used (during migration and only on source
side).
- Addressed Alex's comments:
1. Eliminated code duplication in patch #7 "vfio/migration: Allow
migration without VFIO IOMMU dirty tracking support".
2. Removed redundant initialization of vfio_region_info in patch #10
"vfio/migration: Move migration v1 logic to vfio_migration_init()".
3. Added comment about VFIO_MIG_DATA_BUFFER_SIZE heuristic (and
renamed to VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE).
4. Cast migration structs to their actual types instead of void *.
5. Return -errno and -EBADF instead of -1 in vfio_migration_set_state().
6. Set migration->device_state to new_state even in case of data_fd
out of sync. Although migration will be aborted, setting device
state succeeded so we should reflect that.
7. Renamed VFIO_MIG_PENDING_SIZE to VFIO_MIG_STOP_COPY_SIZE, set it
to 100G and added a comment about the size choice.
8. Changed vfio_save_block() to return -errno on error.
9. Squashed Patch #14 to patch #12.
10. Adjusted migration data buffer size according to MIG_DATA_SIZE
ioctl.
- In preview patch #17 "vfio/migration: Query device data size in
vfio_save_pending()" - changed vfio_save_pending() to report
VFIO_MIG_STOP_COPY_SIZE on any error.
- Added another preview patch "vfio/migration: Optimize
vfio_save_pending()".
- Added ret value on some traces as suggested by David.
- Added Reviewed-By tags.
Changes from v2 [11]:
- Rebased on top of latest master branch.
- Added relevant patches from Juan's RFC [12] with minor changes:
1. Added Reviewed-by tag to patch #3 in the RFC.
2. Adjusted patch #6 to work without patch #4 in the RFC.
- Added a new patch "vfio/migration: Fix wrong enum usage" that fixes a
small bug in v1 code. This patch was sent a few weeks ago [13] but
hasn't been taken yet.
- Patch #2 (vfio/migration: Skip pre-copy if dirty page tracking is not
supported):
1. Dropped this patch and replaced it with
"vfio/migration: Allow migration without VFIO IOMMU dirty tracking
support".
The new patch uses a different approach – instead of skipping the
pre-copy phase completely, QEMU VFIO code will mark RAM dirty
(instead of the kernel). This ensures that the current migration
behavior is not changed and the SLA is taken into account.
- Patch #4 (vfio/common: Change vfio_devices_all_running_and_saving()
logic to equivalent one):
1. Improved commit message to better explain the change.
- Patch #7 (vfio/migration: Implement VFIO migration protocol v2):
1. Enhanced vfio_migration_set_state() error reporting.
2. In vfio_save_complete_precopy() of v2 protocol - when changing
device state to STOP, set recover state to ERROR instead of STOP as
suggested by Joao.
3. Constify SaveVMHandlers of v2 protocol.
4. Modified trace_vfio_vmstate_change and
trace_vfio_migration_set_state
to print device state string instead of enum.
5. Replaced qemu_put_buffer_async() with qemu_put_buffer() in
vfio_save_block(), as requested by Juan.
6. Implemented v2 protocol version of vfio_save_pending() as requested
by Juan. Until an ioctl to get the device state size is added, we
just report a big hard-coded value, as agreed in the KVM call.
- Patch #9 (vfio/migration: Reset device if setting recover state
fails):
1. Enhanced error reporting.
2. Set VFIOMigration->device_state to RUNNING after device reset.
- Patch #11 (docs/devel: Align vfio-migration docs to VFIO migration
v2):
1. Adjusted vfio migration documentation to the added
vfio_save_pending().
- Added the last patch (which is not for merging yet) that demonstrates
how the new ioctl to get device state size will work once added.
Changes from v1 [14]:
- Split the big patch that replaced v1 with v2 into several patches as
suggested by Joao, to make review easier.
- Changed warn_report to warn_report_once when the container doesn't
support dirty tracking.
- Added Reviewed-by tag.
[1]
https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/
[2]
https://lore.kernel.org/qemu-devel/20230209192043.14885-1-avihaih@nvidia.com/
[3]
https://lore.kernel.org/all/20230215200554.1365-1-quintela@redhat.com/
[4]
https://lore.kernel.org/qemu-devel/20230206123137.31149-1-avihaih@nvidia.com/
[5]
https://lore.kernel.org/qemu-devel/20230116141135.12021-1-avihaih@nvidia.com/
[6]
https://lore.kernel.org/qemu-devel/20230115183556.7691-1-avihaih@nvidia.com/
[7]
https://lore.kernel.org/qemu-devel/20230112085020.15866-1-avihaih@nvidia.com/
[8]
https://lore.kernel.org/qemu-devel/20221229110345.12480-1-avihaih@nvidia.com/
[9]
https://lore.kernel.org/qemu-devel/20221130094414.27247-1-avihaih@nvidia.com/
[10]
https://lore.kernel.org/qemu-devel/20221103161620.13120-1-avihaih@nvidia.com/
[11]
https://lore.kernel.org/all/20220530170739.19072-1-avihaih@nvidia.com/
[12]
https://lore.kernel.org/qemu-devel/20221003031600.20084-1-quintela@redhat.com/T/
[13]
https://lore.kernel.org/all/20221016085752.32740-1-avihaih@nvidia.com/
[14]
https://lore.kernel.org/all/20220512154320.19697-1-avihaih@nvidia.com/
Avihai Horon (11):
linux-headers: Update to v6.2-rc8
vfio/migration: Fix NULL pointer dereference bug
vfio/migration: Allow migration without VFIO IOMMU dirty tracking
support
vfio/common: Change vfio_devices_all_running_and_saving() logic to
equivalent one
vfio/migration: Block multiple devices migration
vfio/migration: Move migration v1 logic to vfio_migration_init()
vfio/migration: Rename functions/structs related to v1 protocol
vfio/migration: Implement VFIO migration protocol v2
vfio/migration: Remove VFIO migration protocol v1
vfio: Alphabetize migration section of VFIO trace-events file
docs/devel: Align VFIO migration docs to v2 protocol
docs/devel/vfio-migration.rst | 72 +-
include/hw/vfio/vfio-common.h | 10 +-
include/standard-headers/drm/drm_fourcc.h | 29 +
include/standard-headers/linux/ethtool.h | 18 +-
include/standard-headers/linux/fuse.h | 16 +-
.../linux/input-event-codes.h | 3 +
include/standard-headers/linux/pci_regs.h | 2 +
include/standard-headers/linux/virtio_bt.h | 8 +
include/standard-headers/linux/virtio_net.h | 4 +
linux-headers/asm-arm64/kvm.h | 1 +
linux-headers/asm-riscv/kvm.h | 3 +
linux-headers/asm-x86/kvm.h | 11 +-
linux-headers/linux/kvm.h | 34 +-
linux-headers/linux/psci.h | 4 +-
linux-headers/linux/vfio.h | 136 +++-
hw/vfio/common.c | 92 ++-
hw/vfio/migration.c | 736 ++++++------------
hw/vfio/trace-events | 28 +-
18 files changed, 611 insertions(+), 596 deletions(-)
--
2.26.3
* [PATCH v11 01/11] linux-headers: Update to v6.2-rc8
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Update to commit ceaa837f96ad ("Linux 6.2-rc8").
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
include/standard-headers/drm/drm_fourcc.h | 29 ++++
include/standard-headers/linux/ethtool.h | 18 ++-
include/standard-headers/linux/fuse.h | 16 ++-
.../linux/input-event-codes.h | 3 +
include/standard-headers/linux/pci_regs.h | 2 +
include/standard-headers/linux/virtio_bt.h | 8 ++
include/standard-headers/linux/virtio_net.h | 4 +
linux-headers/asm-arm64/kvm.h | 1 +
linux-headers/asm-riscv/kvm.h | 3 +
linux-headers/asm-x86/kvm.h | 11 +-
linux-headers/linux/kvm.h | 34 ++---
linux-headers/linux/psci.h | 4 +-
linux-headers/linux/vfio.h | 136 +++++++++++++++++-
13 files changed, 230 insertions(+), 39 deletions(-)
diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index b868488f93..69cab17b38 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -743,6 +743,35 @@ extern "C" {
*/
#define DRM_FORMAT_MOD_VIVANTE_SPLIT_SUPER_TILED fourcc_mod_code(VIVANTE, 4)
+/*
+ * Vivante TS (tile-status) buffer modifiers. They can be combined with all of
+ * the color buffer tiling modifiers defined above. When TS is present it's a
+ * separate buffer containing the clear/compression status of each tile. The
+ * modifiers are defined as VIVANTE_MOD_TS_c_s, where c is the color buffer
+ * tile size in bytes covered by one entry in the status buffer and s is the
+ * number of status bits per entry.
+ * We reserve the top 8 bits of the Vivante modifier space for tile status
+ * clear/compression modifiers, as future cores might add some more TS layout
+ * variations.
+ */
+#define VIVANTE_MOD_TS_64_4 (1ULL << 48)
+#define VIVANTE_MOD_TS_64_2 (2ULL << 48)
+#define VIVANTE_MOD_TS_128_4 (3ULL << 48)
+#define VIVANTE_MOD_TS_256_4 (4ULL << 48)
+#define VIVANTE_MOD_TS_MASK (0xfULL << 48)
+
+/*
+ * Vivante compression modifiers. Those depend on a TS modifier being present
+ * as the TS bits get reinterpreted as compression tags instead of simple
+ * clear markers when compression is enabled.
+ */
+#define VIVANTE_MOD_COMP_DEC400 (1ULL << 52)
+#define VIVANTE_MOD_COMP_MASK (0xfULL << 52)
+
+/* Masking out the extension bits will yield the base modifier. */
+#define VIVANTE_MOD_EXT_MASK (VIVANTE_MOD_TS_MASK | \
+ VIVANTE_MOD_COMP_MASK)
+
/* NVIDIA frame buffer modifiers */
/*
diff --git a/include/standard-headers/linux/ethtool.h b/include/standard-headers/linux/ethtool.h
index 1dc56cdc0a..87176ab075 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -159,8 +159,10 @@ static inline uint32_t ethtool_cmd_speed(const struct ethtool_cmd *ep)
* in its bus driver structure (e.g. pci_driver::name). Must
* not be an empty string.
* @version: Driver version string; may be an empty string
- * @fw_version: Firmware version string; may be an empty string
- * @erom_version: Expansion ROM version string; may be an empty string
+ * @fw_version: Firmware version string; driver defined; may be an
+ * empty string
+ * @erom_version: Expansion ROM version string; driver defined; may be
+ * an empty string
* @bus_info: Device bus address. This should match the dev_name()
* string for the underlying bus device, if there is one. May be
* an empty string.
@@ -179,10 +181,6 @@ static inline uint32_t ethtool_cmd_speed(const struct ethtool_cmd *ep)
*
* Users can use the %ETHTOOL_GSSET_INFO command to get the number of
* strings in any string set (from Linux 2.6.34).
- *
- * Drivers should set at most @driver, @version, @fw_version and
- * @bus_info in their get_drvinfo() implementation. The ethtool
- * core fills in the other fields using other driver operations.
*/
struct ethtool_drvinfo {
uint32_t cmd;
@@ -1737,6 +1735,13 @@ enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_100baseFX_Half_BIT = 90,
ETHTOOL_LINK_MODE_100baseFX_Full_BIT = 91,
ETHTOOL_LINK_MODE_10baseT1L_Full_BIT = 92,
+ ETHTOOL_LINK_MODE_800000baseCR8_Full_BIT = 93,
+ ETHTOOL_LINK_MODE_800000baseKR8_Full_BIT = 94,
+ ETHTOOL_LINK_MODE_800000baseDR8_Full_BIT = 95,
+ ETHTOOL_LINK_MODE_800000baseDR8_2_Full_BIT = 96,
+ ETHTOOL_LINK_MODE_800000baseSR8_Full_BIT = 97,
+ ETHTOOL_LINK_MODE_800000baseVR8_Full_BIT = 98,
+
/* must be last entry */
__ETHTOOL_LINK_MODE_MASK_NBITS
};
@@ -1848,6 +1853,7 @@ enum ethtool_link_mode_bit_indices {
#define SPEED_100000 100000
#define SPEED_200000 200000
#define SPEED_400000 400000
+#define SPEED_800000 800000
#define SPEED_UNKNOWN -1
diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h
index 713d259768..a1af78d989 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -197,6 +197,10 @@
*
* 7.37
* - add FUSE_TMPFILE
+ *
+ * 7.38
+ * - add FUSE_EXPIRE_ONLY flag to fuse_notify_inval_entry
+ * - add FOPEN_PARALLEL_DIRECT_WRITES
*/
#ifndef _LINUX_FUSE_H
@@ -228,7 +232,7 @@
#define FUSE_KERNEL_VERSION 7
/** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 37
+#define FUSE_KERNEL_MINOR_VERSION 38
/** The node ID of the root inode */
#define FUSE_ROOT_ID 1
@@ -300,6 +304,7 @@ struct fuse_file_lock {
* FOPEN_CACHE_DIR: allow caching this directory
* FOPEN_STREAM: the file is stream-like (no file position at all)
* FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
+ * FOPEN_PARALLEL_DIRECT_WRITES: Allow concurrent direct writes on the same inode
*/
#define FOPEN_DIRECT_IO (1 << 0)
#define FOPEN_KEEP_CACHE (1 << 1)
@@ -307,6 +312,7 @@ struct fuse_file_lock {
#define FOPEN_CACHE_DIR (1 << 3)
#define FOPEN_STREAM (1 << 4)
#define FOPEN_NOFLUSH (1 << 5)
+#define FOPEN_PARALLEL_DIRECT_WRITES (1 << 6)
/**
* INIT request/reply flags
@@ -487,6 +493,12 @@ struct fuse_file_lock {
*/
#define FUSE_SETXATTR_ACL_KILL_SGID (1 << 0)
+/**
+ * notify_inval_entry flags
+ * FUSE_EXPIRE_ONLY
+ */
+#define FUSE_EXPIRE_ONLY (1 << 0)
+
enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2, /* no reply */
@@ -915,7 +927,7 @@ struct fuse_notify_inval_inode_out {
struct fuse_notify_inval_entry_out {
uint64_t parent;
uint32_t namelen;
- uint32_t padding;
+ uint32_t flags;
};
struct fuse_notify_delete_out {
diff --git a/include/standard-headers/linux/input-event-codes.h b/include/standard-headers/linux/input-event-codes.h
index 815f7a1dff..f6bab08540 100644
--- a/include/standard-headers/linux/input-event-codes.h
+++ b/include/standard-headers/linux/input-event-codes.h
@@ -614,6 +614,9 @@
#define KEY_KBD_LAYOUT_NEXT 0x248 /* AC Next Keyboard Layout Select */
#define KEY_EMOJI_PICKER 0x249 /* Show/hide emoji picker (HUTRR101) */
#define KEY_DICTATE 0x24a /* Start or Stop Voice Dictation Session (HUTRR99) */
+#define KEY_CAMERA_ACCESS_ENABLE 0x24b /* Enables programmatic access to camera devices. (HUTRR72) */
+#define KEY_CAMERA_ACCESS_DISABLE 0x24c /* Disables programmatic access to camera devices. (HUTRR72) */
+#define KEY_CAMERA_ACCESS_TOGGLE 0x24d /* Toggles the current state of the camera access control. (HUTRR72) */
#define KEY_BRIGHTNESS_MIN 0x250 /* Set Brightness to Minimum */
#define KEY_BRIGHTNESS_MAX 0x251 /* Set Brightness to Maximum */
diff --git a/include/standard-headers/linux/pci_regs.h b/include/standard-headers/linux/pci_regs.h
index 57b8e2ffb1..85ab127881 100644
--- a/include/standard-headers/linux/pci_regs.h
+++ b/include/standard-headers/linux/pci_regs.h
@@ -1058,6 +1058,7 @@
/* Precision Time Measurement */
#define PCI_PTM_CAP 0x04 /* PTM Capability */
#define PCI_PTM_CAP_REQ 0x00000001 /* Requester capable */
+#define PCI_PTM_CAP_RES 0x00000002 /* Responder capable */
#define PCI_PTM_CAP_ROOT 0x00000004 /* Root capable */
#define PCI_PTM_GRANULARITY_MASK 0x0000FF00 /* Clock granularity */
#define PCI_PTM_CTRL 0x08 /* PTM Control */
@@ -1119,6 +1120,7 @@
#define PCI_DOE_STATUS_DATA_OBJECT_READY 0x80000000 /* Data Object Ready */
#define PCI_DOE_WRITE 0x10 /* DOE Write Data Mailbox Register */
#define PCI_DOE_READ 0x14 /* DOE Read Data Mailbox Register */
+#define PCI_DOE_CAP_SIZEOF 0x18 /* Size of DOE register block */
/* DOE Data Object - note not actually registers */
#define PCI_DOE_DATA_OBJECT_HEADER_1_VID 0x0000ffff
diff --git a/include/standard-headers/linux/virtio_bt.h b/include/standard-headers/linux/virtio_bt.h
index 245e1eff4b..a11ecc3f92 100644
--- a/include/standard-headers/linux/virtio_bt.h
+++ b/include/standard-headers/linux/virtio_bt.h
@@ -9,6 +9,7 @@
#define VIRTIO_BT_F_VND_HCI 0 /* Indicates vendor command support */
#define VIRTIO_BT_F_MSFT_EXT 1 /* Indicates MSFT vendor support */
#define VIRTIO_BT_F_AOSP_EXT 2 /* Indicates AOSP vendor support */
+#define VIRTIO_BT_F_CONFIG_V2 3 /* Use second version configuration */
enum virtio_bt_config_type {
VIRTIO_BT_CONFIG_TYPE_PRIMARY = 0,
@@ -28,4 +29,11 @@ struct virtio_bt_config {
uint16_t msft_opcode;
} QEMU_PACKED;
+struct virtio_bt_config_v2 {
+ uint8_t type;
+ uint8_t alignment;
+ uint16_t vendor;
+ uint16_t msft_opcode;
+};
+
#endif /* _LINUX_VIRTIO_BT_H */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 42c68caf71..c0e797067a 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,9 @@
* Steering */
#define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
#define VIRTIO_NET_F_NOTF_COAL 53 /* Device supports notifications coalescing */
+#define VIRTIO_NET_F_GUEST_USO4 54 /* Guest can handle USOv4 in. */
+#define VIRTIO_NET_F_GUEST_USO6 55 /* Guest can handle USOv6 in. */
+#define VIRTIO_NET_F_HOST_USO 56 /* Host can handle USO in. */
#define VIRTIO_NET_F_HASH_REPORT 57 /* Supports hash report */
#define VIRTIO_NET_F_RSS 60 /* Supports RSS RX steering */
#define VIRTIO_NET_F_RSC_EXT 61 /* extended coalescing info */
@@ -130,6 +133,7 @@ struct virtio_net_hdr_v1 {
#define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */
#define VIRTIO_NET_HDR_GSO_UDP 3 /* GSO frame, IPv4 UDP (UFO) */
#define VIRTIO_NET_HDR_GSO_TCPV6 4 /* GSO frame, IPv6 TCP */
+#define VIRTIO_NET_HDR_GSO_UDP_L4 5 /* GSO frame, IPv4& IPv6 UDP (USO) */
#define VIRTIO_NET_HDR_GSO_ECN 0x80 /* TCP has ECN set */
uint8_t gso_type;
__virtio16 hdr_len; /* Ethernet + IP + tcp/udp hdrs */
diff --git a/linux-headers/asm-arm64/kvm.h b/linux-headers/asm-arm64/kvm.h
index 4bf2d7246e..a7cfefb3a8 100644
--- a/linux-headers/asm-arm64/kvm.h
+++ b/linux-headers/asm-arm64/kvm.h
@@ -43,6 +43,7 @@
#define __KVM_HAVE_VCPU_EVENTS
#define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
#define KVM_REG_SIZE(id) \
(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/linux-headers/asm-riscv/kvm.h b/linux-headers/asm-riscv/kvm.h
index 8985ff234c..92af6f3f05 100644
--- a/linux-headers/asm-riscv/kvm.h
+++ b/linux-headers/asm-riscv/kvm.h
@@ -49,6 +49,9 @@ struct kvm_sregs {
struct kvm_riscv_config {
unsigned long isa;
unsigned long zicbom_block_size;
+ unsigned long mvendorid;
+ unsigned long marchid;
+ unsigned long mimpid;
};
/* CORE registers for KVM_GET_ONE_REG and KVM_SET_ONE_REG */
diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 46de10a809..2747d2ce14 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -53,14 +53,6 @@
/* Architectural interrupt line count. */
#define KVM_NR_INTERRUPTS 256
-struct kvm_memory_alias {
- __u32 slot; /* this has a different namespace than memory slots */
- __u32 flags;
- __u64 guest_phys_addr;
- __u64 memory_size;
- __u64 target_phys_addr;
-};
-
/* for KVM_GET_IRQCHIP and KVM_SET_IRQCHIP */
struct kvm_pic_state {
__u8 last_irr; /* edge detection */
@@ -214,6 +206,8 @@ struct kvm_msr_list {
struct kvm_msr_filter_range {
#define KVM_MSR_FILTER_READ (1 << 0)
#define KVM_MSR_FILTER_WRITE (1 << 1)
+#define KVM_MSR_FILTER_RANGE_VALID_MASK (KVM_MSR_FILTER_READ | \
+ KVM_MSR_FILTER_WRITE)
__u32 flags;
__u32 nmsrs; /* number of msrs in bitmap */
__u32 base; /* MSR index the bitmap starts at */
@@ -224,6 +218,7 @@ struct kvm_msr_filter_range {
struct kvm_msr_filter {
#define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0)
#define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0)
+#define KVM_MSR_FILTER_VALID_MASK (KVM_MSR_FILTER_DEFAULT_DENY)
__u32 flags;
struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
};
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index b2783c5202..1e2c16cfe3 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -86,14 +86,6 @@ struct kvm_debug_guest {
/* *** End of deprecated interfaces *** */
-/* for KVM_CREATE_MEMORY_REGION */
-struct kvm_memory_region {
- __u32 slot;
- __u32 flags;
- __u64 guest_phys_addr;
- __u64 memory_size; /* bytes */
-};
-
/* for KVM_SET_USER_MEMORY_REGION */
struct kvm_userspace_memory_region {
__u32 slot;
@@ -104,9 +96,9 @@ struct kvm_userspace_memory_region {
};
/*
- * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
- * other bits are reserved for kvm internal use which are defined in
- * include/linux/kvm_host.h.
+ * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
+ * userspace, other bits are reserved for kvm internal use which are defined
+ * in include/linux/kvm_host.h.
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
@@ -483,6 +475,9 @@ struct kvm_run {
#define KVM_MSR_EXIT_REASON_INVAL (1 << 0)
#define KVM_MSR_EXIT_REASON_UNKNOWN (1 << 1)
#define KVM_MSR_EXIT_REASON_FILTER (1 << 2)
+#define KVM_MSR_EXIT_REASON_VALID_MASK (KVM_MSR_EXIT_REASON_INVAL | \
+ KVM_MSR_EXIT_REASON_UNKNOWN | \
+ KVM_MSR_EXIT_REASON_FILTER)
__u32 reason; /* kernel -> user */
__u32 index; /* kernel -> user */
__u64 data; /* kernel <-> user */
@@ -1176,6 +1171,8 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_ZPCI_OP 221
#define KVM_CAP_S390_CPU_TOPOLOGY 222
#define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
+#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1265,6 +1262,7 @@ struct kvm_x86_mce {
#define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 3)
#define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4)
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
+#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)
struct kvm_xen_hvm_config {
__u32 flags;
@@ -1435,18 +1433,12 @@ struct kvm_vfio_spapr_tce {
__s32 tablefd;
};
-/*
- * ioctls for VM fds
- */
-#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 0x40, struct kvm_memory_region)
/*
* KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
* a vcpu fd.
*/
#define KVM_CREATE_VCPU _IO(KVMIO, 0x41)
#define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log)
-/* KVM_SET_MEMORY_ALIAS is obsolete: */
-#define KVM_SET_MEMORY_ALIAS _IOW(KVMIO, 0x43, struct kvm_memory_alias)
#define KVM_SET_NR_MMU_PAGES _IO(KVMIO, 0x44)
#define KVM_GET_NR_MMU_PAGES _IO(KVMIO, 0x45)
#define KVM_SET_USER_MEMORY_REGION _IOW(KVMIO, 0x46, \
@@ -1738,6 +1730,8 @@ enum pv_cmd_id {
KVM_PV_UNSHARE_ALL,
KVM_PV_INFO,
KVM_PV_DUMP,
+ KVM_PV_ASYNC_CLEANUP_PREPARE,
+ KVM_PV_ASYNC_CLEANUP_PERFORM,
};
struct kvm_pv_cmd {
@@ -1768,8 +1762,10 @@ struct kvm_xen_hvm_attr {
union {
__u8 long_mode;
__u8 vector;
+ __u8 runstate_update_flag;
struct {
__u64 gfn;
+#define KVM_XEN_INVALID_GFN ((__u64)-1)
} shared_info;
struct {
__u32 send_port;
@@ -1801,6 +1797,7 @@ struct kvm_xen_hvm_attr {
} u;
};
+
/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO */
#define KVM_XEN_ATTR_TYPE_LONG_MODE 0x0
#define KVM_XEN_ATTR_TYPE_SHARED_INFO 0x1
@@ -1808,6 +1805,8 @@ struct kvm_xen_hvm_attr {
/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_EVTCHN_SEND */
#define KVM_XEN_ATTR_TYPE_EVTCHN 0x3
#define KVM_XEN_ATTR_TYPE_XEN_VERSION 0x4
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG */
+#define KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG 0x5
/* Per-vCPU Xen attributes */
#define KVM_XEN_VCPU_GET_ATTR _IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr)
@@ -1824,6 +1823,7 @@ struct kvm_xen_vcpu_attr {
__u16 pad[3];
union {
__u64 gpa;
+#define KVM_XEN_INVALID_GPA ((__u64)-1)
__u64 pad[8];
struct {
__u64 state;
diff --git a/linux-headers/linux/psci.h b/linux-headers/linux/psci.h
index e60dfd8907..74f3cb5007 100644
--- a/linux-headers/linux/psci.h
+++ b/linux-headers/linux/psci.h
@@ -58,7 +58,7 @@
#define PSCI_1_1_FN_SYSTEM_RESET2 PSCI_0_2_FN(18)
#define PSCI_1_1_FN_MEM_PROTECT PSCI_0_2_FN(19)
-#define PSCI_1_1_FN_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN(19)
+#define PSCI_1_1_FN_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN(20)
#define PSCI_1_0_FN64_CPU_DEFAULT_SUSPEND PSCI_0_2_FN64(12)
#define PSCI_1_0_FN64_NODE_HW_STATE PSCI_0_2_FN64(13)
@@ -67,7 +67,7 @@
#define PSCI_1_0_FN64_STAT_COUNT PSCI_0_2_FN64(17)
#define PSCI_1_1_FN64_SYSTEM_RESET2 PSCI_0_2_FN64(18)
-#define PSCI_1_1_FN64_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN64(19)
+#define PSCI_1_1_FN64_MEM_PROTECT_CHECK_RANGE PSCI_0_2_FN64(20)
/* PSCI v0.2 power state encoding for CPU_SUSPEND function */
#define PSCI_0_2_POWER_STATE_ID_MASK 0xffff
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index bee7e42198..c59692ce0b 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -819,12 +819,20 @@ struct vfio_device_feature {
* VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P
* is supported in addition to the STOP_COPY states.
*
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_PRE_COPY means that
+ * PRE_COPY is supported in addition to the STOP_COPY states.
+ *
+ * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY
+ * means that RUNNING_P2P, PRE_COPY and PRE_COPY_P2P are supported
+ * in addition to the STOP_COPY states.
+ *
* Other combinations of flags have behavior to be defined in the future.
*/
struct vfio_device_feature_migration {
__aligned_u64 flags;
#define VFIO_MIGRATION_STOP_COPY (1 << 0)
#define VFIO_MIGRATION_P2P (1 << 1)
+#define VFIO_MIGRATION_PRE_COPY (1 << 2)
};
#define VFIO_DEVICE_FEATURE_MIGRATION 1
@@ -875,8 +883,13 @@ struct vfio_device_feature_mig_state {
* RESUMING - The device is stopped and is loading a new internal state
* ERROR - The device has failed and must be reset
*
- * And 1 optional state to support VFIO_MIGRATION_P2P:
+ * And optional states to support VFIO_MIGRATION_P2P:
* RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA
+ * And VFIO_MIGRATION_PRE_COPY:
+ * PRE_COPY - The device is running normally but tracking internal state
+ * changes
+ * And VFIO_MIGRATION_P2P | VFIO_MIGRATION_PRE_COPY:
+ * PRE_COPY_P2P - PRE_COPY, except the device cannot do peer to peer DMA
*
* The FSM takes actions on the arcs between FSM states. The driver implements
* the following behavior for the FSM arcs:
@@ -908,20 +921,48 @@ struct vfio_device_feature_mig_state {
*
* To abort a RESUMING session the device must be reset.
*
+ * PRE_COPY -> RUNNING
* RUNNING_P2P -> RUNNING
* While in RUNNING the device is fully operational, the device may generate
* interrupts, DMA, respond to MMIO, all vfio device regions are functional,
* and the device may advance its internal state.
*
+ * The PRE_COPY arc will terminate a data transfer session.
+ *
+ * PRE_COPY_P2P -> RUNNING_P2P
* RUNNING -> RUNNING_P2P
* STOP -> RUNNING_P2P
* While in RUNNING_P2P the device is partially running in the P2P quiescent
* state defined below.
*
+ * The PRE_COPY_P2P arc will terminate a data transfer session.
+ *
+ * RUNNING -> PRE_COPY
+ * RUNNING_P2P -> PRE_COPY_P2P
* STOP -> STOP_COPY
- * This arc begin the process of saving the device state and will return a
- * new data_fd.
+ * PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of states
+ * which share a data transfer session. Moving between these states alters
+ * what is streamed in session, but does not terminate or otherwise affect
+ * the associated fd.
+ *
+ * These arcs begin the process of saving the device state and will return a
+ * new data_fd. The migration driver may perform actions such as enabling
+ * dirty logging of device state when entering PRE_COPY or PRE_COPY_P2P.
+ *
+ * Each arc does not change the device operation, the device remains
+ * RUNNING, P2P quiesced or in STOP. The STOP_COPY state is described below
+ * in PRE_COPY_P2P -> STOP_COPY.
*
+ * PRE_COPY -> PRE_COPY_P2P
+ * Entering PRE_COPY_P2P continues all the behaviors of PRE_COPY above.
+ * However, while in the PRE_COPY_P2P state, the device is partially running
+ * in the P2P quiescent state defined below, like RUNNING_P2P.
+ *
+ * PRE_COPY_P2P -> PRE_COPY
+ * This arc allows returning the device to a full RUNNING behavior while
+ * continuing all the behaviors of PRE_COPY.
+ *
+ * PRE_COPY_P2P -> STOP_COPY
* While in the STOP_COPY state the device has the same behavior as STOP
* with the addition that the data transfer session continues to stream the
* migration state. End of stream on the FD indicates the entire device
@@ -939,6 +980,13 @@ struct vfio_device_feature_mig_state {
* device state for this arc if required to prepare the device to receive the
* migration data.
*
+ * STOP_COPY -> PRE_COPY
+ * STOP_COPY -> PRE_COPY_P2P
+ * These arcs are not permitted and return an error if requested. Future
+ * revisions of this API may define behaviors for these arcs, in this case
+ * support will be discoverable by a new flag in
+ * VFIO_DEVICE_FEATURE_MIGRATION.
+ *
* any -> ERROR
* ERROR cannot be specified as a device state, however any transition request
* can be failed with an errno return and may then move the device_state into
@@ -950,7 +998,7 @@ struct vfio_device_feature_mig_state {
* The optional peer to peer (P2P) quiescent state is intended to be a quiescent
* state for the device for the purposes of managing multiple devices within a
* user context where peer-to-peer DMA between devices may be active. The
- * RUNNING_P2P states must prevent the device from initiating
+ * RUNNING_P2P and PRE_COPY_P2P states must prevent the device from initiating
* any new P2P DMA transactions. If the device can identify P2P transactions
* then it can stop only P2P DMA, otherwise it must stop all DMA. The migration
* driver must complete any such outstanding operations prior to completing the
@@ -963,6 +1011,8 @@ struct vfio_device_feature_mig_state {
* above FSM arcs. As there are multiple paths through the FSM arcs the path
* should be selected based on the following rules:
* - Select the shortest path.
+ * - The path cannot have saving group states as interior arcs, only
+ * starting/end states.
* Refer to vfio_mig_get_next_state() for the result of the algorithm.
*
* The automatic transit through the FSM arcs that make up the combination
@@ -976,6 +1026,9 @@ struct vfio_device_feature_mig_state {
* support them. The user can discover if these states are supported by using
* VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can
* avoid knowing about these optional states if the kernel driver supports them.
+ *
+ * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for PRE_COPY
+ * is not present.
*/
enum vfio_device_mig_state {
VFIO_DEVICE_STATE_ERROR = 0,
@@ -984,8 +1037,70 @@ enum vfio_device_mig_state {
VFIO_DEVICE_STATE_STOP_COPY = 3,
VFIO_DEVICE_STATE_RESUMING = 4,
VFIO_DEVICE_STATE_RUNNING_P2P = 5,
+ VFIO_DEVICE_STATE_PRE_COPY = 6,
+ VFIO_DEVICE_STATE_PRE_COPY_P2P = 7,
+};
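The combination-transition rules above (select the shortest path; saving-group states cannot be interior steps) can be sketched as a small breadth-first search over the arcs. This is an illustrative model only, with a simplified reading of the interior-state rule, and is not the kernel's actual vfio_mig_get_next_state() (which uses a precomputed table); the shortened state names and the helper names are this sketch's own.

```c
#include <assert.h>
#include <string.h>

/* Shortened mirrors of enum vfio_device_mig_state. ERROR is excluded
 * from the arc table since it is never a valid requested state. */
enum {
    ST_ERROR, ST_STOP, ST_RUNNING, ST_STOP_COPY, ST_RESUMING,
    ST_RUNNING_P2P, ST_PRE_COPY, ST_PRE_COPY_P2P, ST_NR
};

/* Direct FSM arcs described above, assuming both VFIO_MIGRATION_P2P and
 * VFIO_MIGRATION_PRE_COPY are supported. */
static const int arcs[ST_NR][ST_NR] = {
    [ST_STOP]         = { [ST_RUNNING_P2P] = 1, [ST_STOP_COPY] = 1,
                          [ST_RESUMING] = 1 },
    [ST_RUNNING]      = { [ST_RUNNING_P2P] = 1, [ST_PRE_COPY] = 1 },
    [ST_RUNNING_P2P]  = { [ST_RUNNING] = 1, [ST_STOP] = 1,
                          [ST_PRE_COPY_P2P] = 1 },
    [ST_PRE_COPY]     = { [ST_RUNNING] = 1, [ST_PRE_COPY_P2P] = 1 },
    [ST_PRE_COPY_P2P] = { [ST_RUNNING_P2P] = 1, [ST_PRE_COPY] = 1,
                          [ST_STOP_COPY] = 1 },
    [ST_STOP_COPY]    = { [ST_STOP] = 1 },
    [ST_RESUMING]     = { [ST_STOP] = 1 },
};

static int in_saving_group(int s)
{
    return s == ST_PRE_COPY || s == ST_PRE_COPY_P2P || s == ST_STOP_COPY;
}

/* First hop on the shortest path from cur to target, or ST_ERROR. */
static int next_state(int cur, int target)
{
    int prev[ST_NR], queue[ST_NR], head = 0, tail = 0, s, n;
    /* Simplified interior rule: saving-group states may appear mid-path
     * only when both endpoints are already in the saving group. */
    int saving_interior_ok = in_saving_group(cur) && in_saving_group(target);

    /* STOP_COPY -> PRE_COPY(_P2P) is explicitly not permitted. */
    if (cur == ST_STOP_COPY && target != ST_STOP_COPY &&
        in_saving_group(target)) {
        return ST_ERROR;
    }

    memset(prev, -1, sizeof(prev));
    prev[cur] = cur;
    queue[tail++] = cur;
    while (head < tail) {
        s = queue[head++];
        for (n = 0; n < ST_NR; n++) {
            if (!arcs[s][n] || prev[n] != -1) {
                continue;
            }
            if (n != target && in_saving_group(n) && !saving_interior_ok) {
                continue;
            }
            prev[n] = s;
            queue[tail++] = n;
        }
    }
    if (prev[target] == -1) {
        return ST_ERROR;
    }
    for (s = target; prev[s] != cur; s = prev[s]) {
        ;
    }
    return s;
}
```

For example, RUNNING -> STOP_COPY must step through RUNNING_P2P and STOP rather than through PRE_COPY, because the saving-group path would implicitly start a data transfer session mid-route.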
+
+/**
+ * VFIO_MIG_GET_PRECOPY_INFO - _IO(VFIO_TYPE, VFIO_BASE + 21)
+ *
+ * This ioctl is used on the migration data FD in the precopy phase of the
+ * migration data transfer. It returns an estimate of the current data sizes
+ * remaining to be transferred. It allows the user to judge when it is
+ * appropriate to leave PRE_COPY for STOP_COPY.
+ *
+ * This ioctl is valid only in the PRE_COPY states and the kernel driver
+ * should return -EINVAL from any other migration state.
+ *
+ * The vfio_precopy_info data structure returned by this ioctl provides
+ * estimates of data available from the device during the PRE_COPY states.
+ * This estimate is split into two categories, initial_bytes and
+ * dirty_bytes.
+ *
+ * The initial_bytes field indicates the amount of initial precopy
+ * data available from the device. This field should have a non-zero initial
+ * value and decrease as migration data is read from the device.
+ * It is recommended to leave PRE_COPY for STOP_COPY only after this field
+ * reaches zero. Leaving PRE_COPY earlier might make things slower.
+ *
+ * The dirty_bytes field tracks device state changes relative to data
+ * previously retrieved. This field starts at zero and may increase as
+ * the internal device state is modified or decrease as that modified
+ * state is read from the device.
+ *
+ * Userspace may use the combination of these fields to estimate the
+ * potential data size available during the PRE_COPY phases, as well as
+ * trends relative to the rate the device is dirtying its internal
+ * state, but these fields are not required to have any bearing relative
+ * to the data size available during the STOP_COPY phase.
+ *
+ * Drivers have a lot of flexibility in when and what they transfer during the
+ * PRE_COPY phase, and how they report this from VFIO_MIG_GET_PRECOPY_INFO.
+ *
+ * During pre-copy the migration data FD has a temporary "end of stream" that is
+ * reached when both initial_bytes and dirty_bytes are zero. For instance, this
+ * may indicate that the device is idle and not currently dirtying any internal
+ * state. When read() is done on this temporary end of stream the kernel driver
+ * should return ENOMSG from read(). Userspace can wait for more data (which may
+ * never come) by using poll.
+ *
+ * Once in STOP_COPY the migration data FD has a permanent end of stream
+ * signaled in the usual way by read() always returning 0 and poll always
+ * returning readable. ENOMSG may not be returned in STOP_COPY.
+ *
+ * Support for this ioctl is mandatory if a driver claims to support
+ * VFIO_MIGRATION_PRE_COPY.
+ *
+ * Return: 0 on success, -1 and errno set on failure.
+ */
+struct vfio_precopy_info {
+ __u32 argsz;
+ __u32 flags;
+ __aligned_u64 initial_bytes;
+ __aligned_u64 dirty_bytes;
};
+#define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)
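The semantics above can be illustrated with a minimal userspace sketch. The struct layout and ioctl number are mirrored from the header for self-containment (real code should include <linux/vfio.h> instead); query_precopy(), ready_for_stop_copy() and at_temporary_end_of_stream() are hypothetical helper names for this sketch, not part of any API.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Mirrored from the UAPI above; real users include <linux/vfio.h>. */
#define VFIO_TYPE (';')
#define VFIO_BASE 100
#define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)

struct vfio_precopy_info {
    uint32_t argsz;
    uint32_t flags;
    uint64_t initial_bytes;
    uint64_t dirty_bytes;
};

/* Ask the migration data FD for the current pre-copy estimates.
 * Valid only in the PRE_COPY states; fails with EINVAL elsewhere. */
static int query_precopy(int data_fd, struct vfio_precopy_info *info)
{
    info->argsz = sizeof(*info);
    info->flags = 0;
    return ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, info);
}

/* Recommended exit condition: leave PRE_COPY for STOP_COPY only after
 * the initial pre-copy data has been fully drained. */
static int ready_for_stop_copy(const struct vfio_precopy_info *info)
{
    return info->initial_bytes == 0;
}

/* Temporary end of stream: read() returns ENOMSG here, and userspace
 * can poll() for more data (which may never come). */
static int at_temporary_end_of_stream(const struct vfio_precopy_info *info)
{
    return info->initial_bytes == 0 && info->dirty_bytes == 0;
}
```

A typical pre-copy loop would alternate read() on the data FD with query_precopy(), moving the device to STOP_COPY once ready_for_stop_copy() holds, or poll()-ing when at_temporary_end_of_stream() holds.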
+
/*
* Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
* state with the platform-based power management. Device use of lower power
@@ -1128,6 +1243,19 @@ struct vfio_device_feature_dma_logging_report {
#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 8
+/*
+ * Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will
+ * be required to complete STOP_COPY.
+ *
+ * Note: Can be called in any device state.
+ */
+
+struct vfio_device_feature_mig_data_size {
+ __aligned_u64 stop_copy_length;
+};
+
+#define VFIO_DEVICE_FEATURE_MIG_DATA_SIZE 9
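The feature above is queried through the generic VFIO_DEVICE_FEATURE ioctl, with the mig_data_size payload trailing the feature header. The sketch below mirrors the relevant UAPI definitions locally (real code should include <linux/vfio.h>); build_mig_data_size_query() is a hypothetical helper name for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrored from linux/vfio.h for self-containment. */
#define VFIO_DEVICE_FEATURE_GET            (1 << 17)
#define VFIO_DEVICE_FEATURE_MIG_DATA_SIZE  9

struct vfio_device_feature {
    uint32_t argsz;
    uint32_t flags;
    uint8_t  data[];  /* feature-specific payload follows the header */
};

struct vfio_device_feature_mig_data_size {
    uint64_t stop_copy_length;
};

/* Fill a VFIO_DEVICE_FEATURE_GET request for MIG_DATA_SIZE into buf and
 * return the argsz to pass to the VFIO_DEVICE_FEATURE ioctl. */
static uint32_t build_mig_data_size_query(void *buf)
{
    struct vfio_device_feature *feature = buf;

    feature->argsz = sizeof(*feature) +
                     sizeof(struct vfio_device_feature_mig_data_size);
    feature->flags = VFIO_DEVICE_FEATURE_GET |
                     VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
    return feature->argsz;
}
```

After a successful ioctl(device_fd, VFIO_DEVICE_FEATURE, feature), the estimate is read from ((struct vfio_device_feature_mig_data_size *)feature->data)->stop_copy_length.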
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.26.3
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v11 02/11] vfio/migration: Fix NULL pointer dereference bug
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2023-02-16 14:36 ` [PATCH v11 01/11] linux-headers: Update to v6.2-rc8 Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 14:36 ` [PATCH v11 03/11] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support Avihai Horon
` (8 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
As part of its error flow, vfio_vmstate_change() accesses
MigrationState->to_dst_file without any checks. This can cause a NULL
pointer dereference if the error flow is taken and
MigrationState->to_dst_file is not set.
For example, this can happen if the VM is started or stopped while no
migration is in progress and the vfio_vmstate_change() error flow is taken,
as MigrationState->to_dst_file is not set at that time.
Fix it by checking that MigrationState->to_dst_file is set before using
it.
Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
---
hw/vfio/migration.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 83d2d44080..65f3f3bef7 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -741,7 +741,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
*/
error_report("%s: Failed to set device state 0x%x", vbasedev->name,
(migration->device_state & mask) | value);
- qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+ if (migrate_get_current()->to_dst_file) {
+ qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+ }
}
vbasedev->migration->vm_running = running;
trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
--
2.26.3
* [PATCH v11 03/11] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2023-02-16 14:36 ` [PATCH v11 01/11] linux-headers: Update to v6.2-rc8 Avihai Horon
2023-02-16 14:36 ` [PATCH v11 02/11] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 14:36 ` [PATCH v11 04/11] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one Avihai Horon
` (7 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Currently, if the IOMMU of a VFIO container doesn't support dirty page
tracking, migration is blocked. This is because a DMA-able VFIO device
can dirty RAM pages without updating QEMU about it, thus breaking the
migration.
However, this doesn't mean that migration can't be done at all.
In such a case, allow migration and let the QEMU VFIO code mark all pages
dirty.
This guarantees that all pages that might have gotten dirty are reported
back, and thus guarantees a valid migration even without VFIO IOMMU
dirty tracking support.
The motivation for this patch is the introduction of iommufd [1].
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into its internal ops, allowing the usage of these IOCTLs
over iommufd. However, VFIO IOMMU dirty tracking is not supported by
this VFIO compatibility API.
This patch will allow migration by hosts that use the VFIO compatibility
API and prevent migration regressions caused by the lack of VFIO IOMMU
dirty tracking support.
[1]
https://lore.kernel.org/kvm/0-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com/
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
hw/vfio/common.c | 20 ++++++++++++++++++--
hw/vfio/migration.c | 3 +--
2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 130e5d1dc7..f6dd571549 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -488,6 +488,12 @@ static int vfio_dma_unmap(VFIOContainer *container,
return -errno;
}
+ if (iotlb && vfio_devices_all_running_and_saving(container)) {
+ cpu_physical_memory_set_dirty_range(iotlb->translated_addr, size,
+ tcg_enabled() ? DIRTY_CLIENTS_ALL :
+ DIRTY_CLIENTS_NOCODE);
+ }
+
return 0;
}
@@ -1201,6 +1207,10 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
.argsz = sizeof(dirty),
};
+ if (!container->dirty_pages_supported) {
+ return;
+ }
+
if (start) {
dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START;
} else {
@@ -1236,6 +1246,13 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
uint64_t pages;
int ret;
+ if (!container->dirty_pages_supported) {
+ cpu_physical_memory_set_dirty_range(ram_addr, size,
+ tcg_enabled() ? DIRTY_CLIENTS_ALL :
+ DIRTY_CLIENTS_NOCODE);
+ return 0;
+ }
+
dbitmap = g_malloc0(sizeof(*dbitmap) + sizeof(*range));
dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
@@ -1409,8 +1426,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
{
VFIOContainer *container = container_of(listener, VFIOContainer, listener);
- if (vfio_listener_skipped_section(section) ||
- !container->dirty_pages_supported) {
+ if (vfio_listener_skipped_section(section)) {
return;
}
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 65f3f3bef7..e56eef1ee8 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -858,11 +858,10 @@ int64_t vfio_mig_bytes_transferred(void)
int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
{
- VFIOContainer *container = vbasedev->group->container;
struct vfio_region_info *info = NULL;
int ret = -ENOTSUP;
- if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
+ if (!vbasedev->enable_migration) {
goto add_blocker;
}
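The fallback in this patch boils down to one conservative rule: when the container cannot track dirty pages, report the whole range dirty so no guest write is ever missed. A standalone sketch of that rule (report_dirty_pages() is a hypothetical helper, not QEMU's API; QEMU's actual code calls cpu_physical_memory_set_dirty_range() as shown in the diff):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Conservative dirty reporting: when the IOMMU cannot track dirty pages,
 * every page in the range must be reported dirty so migration remains
 * correct; otherwise trust the hardware-provided bitmap. */
static void report_dirty_pages(bool tracking_supported,
                               const uint8_t *hw_bitmap, uint8_t *out,
                               size_t nr_pages)
{
    size_t bytes = (nr_pages + 7) / 8;

    if (!tracking_supported) {
        memset(out, 0xff, bytes);   /* any page might be dirty */
        return;
    }
    memcpy(out, hw_bitmap, bytes);  /* use the IOMMU's tracking result */
}
```

The cost of the fallback is that every iteration resends the whole range, so pre-copy convergence suffers, but the migrated state is guaranteed consistent.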
--
2.26.3
* [PATCH v11 04/11] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
` (2 preceding siblings ...)
2023-02-16 14:36 ` [PATCH v11 03/11] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 14:53 ` Juan Quintela
2023-02-16 14:36 ` [PATCH v11 05/11] vfio/migration: Block multiple devices migration Avihai Horon
` (6 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
vfio_devices_all_running_and_saving() is used to check if migration is
in pre-copy phase. This is done by checking if migration is in setup or
active states and if all VFIO devices are in pre-copy state, i.e.
_SAVING | _RUNNING.
In VFIO migration protocol v2 pre-copy support is made optional. Hence,
a matching v2 protocol pre-copy state can't be used here.
As preparation for adding v2 protocol, change
vfio_devices_all_running_and_saving() logic such that it doesn't use the
VFIO pre-copy state.
The new equivalent logic checks if migration is in active state and if
all VFIO devices are in running state [1]. No functional changes
intended.
[1] Note that checking if migration is in the setup or active states and if
all VFIO devices are in the running state doesn't guarantee that we are in
the pre-copy phase, thus we check only that migration is in the active state.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/common.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f6dd571549..3a35f4afad 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
#include "trace.h"
#include "qapi/error.h"
#include "migration/migration.h"
+#include "migration/misc.h"
#include "sysemu/tpm.h"
VFIOGroupList vfio_group_list =
@@ -363,13 +364,16 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
return true;
}
-static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
+/*
+ * Check if all VFIO devices are running and migration is active, which is
+ * essentially equivalent to the migration being in pre-copy phase.
+ */
+static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
{
VFIOGroup *group;
VFIODevice *vbasedev;
- MigrationState *ms = migrate_get_current();
- if (!migration_is_setup_or_active(ms->state)) {
+ if (!migration_is_active(migrate_get_current())) {
return false;
}
@@ -381,8 +385,7 @@ static bool vfio_devices_all_running_and_saving(VFIOContainer *container)
return false;
}
- if ((migration->device_state & VFIO_DEVICE_STATE_V1_SAVING) &&
- (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
+ if (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING) {
continue;
} else {
return false;
@@ -461,7 +464,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
};
if (iotlb && container->dirty_pages_supported &&
- vfio_devices_all_running_and_saving(container)) {
+ vfio_devices_all_running_and_mig_active(container)) {
return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
}
@@ -488,7 +491,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
return -errno;
}
- if (iotlb && vfio_devices_all_running_and_saving(container)) {
+ if (iotlb && vfio_devices_all_running_and_mig_active(container)) {
cpu_physical_memory_set_dirty_range(iotlb->translated_addr, size,
tcg_enabled() ? DIRTY_CLIENTS_ALL :
DIRTY_CLIENTS_NOCODE);
--
2.26.3
* [PATCH v11 05/11] vfio/migration: Block multiple devices migration
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
` (3 preceding siblings ...)
2023-02-16 14:36 ` [PATCH v11 04/11] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-05-16 10:03 ` Shameerali Kolothum Thodi via
2023-02-16 14:36 ` [PATCH v11 06/11] vfio/migration: Move migration v1 logic to vfio_migration_init() Avihai Horon
` (5 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Currently VFIO migration doesn't implement an intermediate quiescent
state in which P2P DMAs are quiesced before stopping or running the
device. This can cause problems in multi-device migration where the
devices are doing P2P DMAs, since the devices are not stopped together
at the same time.
Until such support is added, block migration of multiple devices.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
include/hw/vfio/vfio-common.h | 2 ++
hw/vfio/common.c | 53 +++++++++++++++++++++++++++++++++++
hw/vfio/migration.c | 6 ++++
3 files changed, 61 insertions(+)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e573f5a9f1..56b1683824 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -218,6 +218,8 @@ typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
extern VFIOGroupList vfio_group_list;
bool vfio_mig_active(void);
+int vfio_block_multiple_devices_migration(Error **errp);
+void vfio_unblock_multiple_devices_migration(void);
int64_t vfio_mig_bytes_transferred(void);
#ifdef CONFIG_LINUX
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3a35f4afad..fe80ccf914 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -41,6 +41,7 @@
#include "qapi/error.h"
#include "migration/migration.h"
#include "migration/misc.h"
+#include "migration/blocker.h"
#include "sysemu/tpm.h"
VFIOGroupList vfio_group_list =
@@ -337,6 +338,58 @@ bool vfio_mig_active(void)
return true;
}
+static Error *multiple_devices_migration_blocker;
+
+static unsigned int vfio_migratable_device_num(void)
+{
+ VFIOGroup *group;
+ VFIODevice *vbasedev;
+ unsigned int device_num = 0;
+
+ QLIST_FOREACH(group, &vfio_group_list, next) {
+ QLIST_FOREACH(vbasedev, &group->device_list, next) {
+ if (vbasedev->migration) {
+ device_num++;
+ }
+ }
+ }
+
+ return device_num;
+}
+
+int vfio_block_multiple_devices_migration(Error **errp)
+{
+ int ret;
+
+ if (multiple_devices_migration_blocker ||
+ vfio_migratable_device_num() <= 1) {
+ return 0;
+ }
+
+ error_setg(&multiple_devices_migration_blocker,
+ "Migration is currently not supported with multiple "
+ "VFIO devices");
+ ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
+ if (ret < 0) {
+ error_free(multiple_devices_migration_blocker);
+ multiple_devices_migration_blocker = NULL;
+ }
+
+ return ret;
+}
+
+void vfio_unblock_multiple_devices_migration(void)
+{
+ if (!multiple_devices_migration_blocker ||
+ vfio_migratable_device_num() > 1) {
+ return;
+ }
+
+ migrate_del_blocker(multiple_devices_migration_blocker);
+ error_free(multiple_devices_migration_blocker);
+ multiple_devices_migration_blocker = NULL;
+}
+
static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
{
VFIOGroup *group;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index e56eef1ee8..8e96999669 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -878,6 +878,11 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
goto add_blocker;
}
+ ret = vfio_block_multiple_devices_migration(errp);
+ if (ret) {
+ return ret;
+ }
+
trace_vfio_migration_probe(vbasedev->name, info->index);
g_free(info);
return 0;
@@ -904,6 +909,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
qemu_del_vm_change_state_handler(migration->vm_state);
unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);
vfio_migration_exit(vbasedev);
+ vfio_unblock_multiple_devices_migration();
}
if (vbasedev->migration_blocker) {
--
2.26.3
* [PATCH v11 06/11] vfio/migration: Move migration v1 logic to vfio_migration_init()
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
` (4 preceding siblings ...)
2023-02-16 14:36 ` [PATCH v11 05/11] vfio/migration: Block multiple devices migration Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 14:50 ` Juan Quintela
2023-02-16 14:36 ` [PATCH v11 07/11] vfio/migration: Rename functions/structs related to v1 protocol Avihai Horon
` (4 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
vfio_migration_init(). This logic is specific to v1 protocol and moving
it will make it easier to add the v2 protocol implementation later.
No functional changes intended.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/migration.c | 30 +++++++++++++++---------------
hw/vfio/trace-events | 2 +-
2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 8e96999669..a3bf7327a1 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -786,14 +786,14 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
vbasedev->migration = NULL;
}
-static int vfio_migration_init(VFIODevice *vbasedev,
- struct vfio_region_info *info)
+static int vfio_migration_init(VFIODevice *vbasedev)
{
int ret;
Object *obj;
VFIOMigration *migration;
char id[256] = "";
g_autofree char *path = NULL, *oid = NULL;
+ struct vfio_region_info *info;
if (!vbasedev->ops->vfio_get_object) {
return -EINVAL;
@@ -804,6 +804,14 @@ static int vfio_migration_init(VFIODevice *vbasedev,
return -EINVAL;
}
+ ret = vfio_get_dev_region_info(vbasedev,
+ VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
+ VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
+ &info);
+ if (ret) {
+ return ret;
+ }
+
vbasedev->migration = g_new0(VFIOMigration, 1);
vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
vbasedev->migration->vm_running = runstate_is_running();
@@ -823,6 +831,8 @@ static int vfio_migration_init(VFIODevice *vbasedev,
goto err;
}
+ g_free(info);
+
migration = vbasedev->migration;
migration->vbasedev = vbasedev;
@@ -845,6 +855,7 @@ static int vfio_migration_init(VFIODevice *vbasedev,
return 0;
err:
+ g_free(info);
vfio_migration_exit(vbasedev);
return ret;
}
@@ -858,22 +869,13 @@ int64_t vfio_mig_bytes_transferred(void)
int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
{
- struct vfio_region_info *info = NULL;
int ret = -ENOTSUP;
if (!vbasedev->enable_migration) {
goto add_blocker;
}
- ret = vfio_get_dev_region_info(vbasedev,
- VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
- VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
- &info);
- if (ret) {
- goto add_blocker;
- }
-
- ret = vfio_migration_init(vbasedev, info);
+ ret = vfio_migration_init(vbasedev);
if (ret) {
goto add_blocker;
}
@@ -883,14 +885,12 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
return ret;
}
- trace_vfio_migration_probe(vbasedev->name, info->index);
- g_free(info);
+ trace_vfio_migration_probe(vbasedev->name);
return 0;
add_blocker:
error_setg(&vbasedev->migration_blocker,
"VFIO device doesn't support migration");
- g_free(info);
ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
if (ret < 0) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 90a8aecb37..6be5381cc9 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -148,7 +148,7 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
vfio_display_edid_write_error(void) ""
# migration.c
-vfio_migration_probe(const char *name, uint32_t index) " (%s) Region %d"
+vfio_migration_probe(const char *name) " (%s)"
vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
--
2.26.3
* [PATCH v11 07/11] vfio/migration: Rename functions/structs related to v1 protocol
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
` (5 preceding siblings ...)
2023-02-16 14:36 ` [PATCH v11 06/11] vfio/migration: Move migration v1 logic to vfio_migration_init() Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 14:54 ` Juan Quintela
2023-02-16 14:36 ` [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
` (3 subsequent siblings)
10 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
To avoid name collisions, rename functions and structs related to VFIO
migration protocol v1. This will allow the two protocols to co-exist
when v2 protocol is added, until v1 is removed. No functional changes
intended.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-common.h | 2 +-
hw/vfio/common.c | 6 +-
hw/vfio/migration.c | 102 +++++++++++++++++-----------------
hw/vfio/trace-events | 12 ++--
4 files changed, 61 insertions(+), 61 deletions(-)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 56b1683824..2c0fb1d622 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -62,7 +62,7 @@ typedef struct VFIOMigration {
struct VFIODevice *vbasedev;
VMChangeStateEntry *vm_state;
VFIORegion region;
- uint32_t device_state;
+ uint32_t device_state_v1;
int vm_running;
Notifier migration_state;
uint64_t pending_bytes;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fe80ccf914..1c974e9c5a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -408,8 +408,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
return false;
}
- if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF)
- && (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING)) {
+ if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+ (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
return false;
}
}
@@ -438,7 +438,7 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
return false;
}
- if (migration->device_state & VFIO_DEVICE_STATE_V1_RUNNING) {
+ if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
continue;
} else {
return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a3bf7327a1..a5fe285721 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -107,8 +107,8 @@ static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
* an error is returned.
*/
-static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
- uint32_t value)
+static int vfio_migration_v1_set_state(VFIODevice *vbasedev, uint32_t mask,
+ uint32_t value)
{
VFIOMigration *migration = vbasedev->migration;
VFIORegion *region = &migration->region;
@@ -145,8 +145,8 @@ static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t mask,
return ret;
}
- migration->device_state = device_state;
- trace_vfio_migration_set_state(vbasedev->name, device_state);
+ migration->device_state_v1 = device_state;
+ trace_vfio_migration_v1_set_state(vbasedev->name, device_state);
return 0;
}
@@ -260,8 +260,8 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
return ret;
}
-static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
- uint64_t data_size)
+static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+ uint64_t data_size)
{
VFIORegion *region = &vbasedev->migration->region;
uint64_t data_offset = 0, size, report_size;
@@ -288,7 +288,7 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
data_size = 0;
}
- trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
+ trace_vfio_v1_load_state_device_data(vbasedev->name, data_offset, size);
while (size) {
void *buf;
@@ -394,7 +394,7 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
return qemu_file_get_error(f);
}
-static void vfio_migration_cleanup(VFIODevice *vbasedev)
+static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
{
VFIOMigration *migration = vbasedev->migration;
@@ -405,13 +405,13 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
/* ---------------------------------------------------------------------- */
-static int vfio_save_setup(QEMUFile *f, void *opaque)
+static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
int ret;
- trace_vfio_save_setup(vbasedev->name);
+ trace_vfio_v1_save_setup(vbasedev->name);
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
@@ -431,8 +431,8 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
}
}
- ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
- VFIO_DEVICE_STATE_V1_SAVING);
+ ret = vfio_migration_v1_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
+ VFIO_DEVICE_STATE_V1_SAVING);
if (ret) {
error_report("%s: Failed to set state SAVING", vbasedev->name);
return ret;
@@ -448,16 +448,16 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
return 0;
}
-static void vfio_save_cleanup(void *opaque)
+static void vfio_v1_save_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
- vfio_migration_cleanup(vbasedev);
+ vfio_migration_v1_cleanup(vbasedev);
trace_vfio_save_cleanup(vbasedev->name);
}
-static void vfio_state_pending(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
+static void vfio_v1_state_pending(void *opaque, uint64_t *must_precopy,
+ uint64_t *can_postcopy)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
@@ -470,7 +470,7 @@ static void vfio_state_pending(void *opaque, uint64_t *must_precopy,
*must_precopy += migration->pending_bytes;
- trace_vfio_state_pending(vbasedev->name, *must_precopy, *can_postcopy);
+ trace_vfio_v1_state_pending(vbasedev->name, *must_precopy, *can_postcopy);
}
static int vfio_save_iterate(QEMUFile *f, void *opaque)
@@ -520,15 +520,15 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
return 0;
}
-static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
uint64_t data_size;
int ret;
- ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
- VFIO_DEVICE_STATE_V1_SAVING);
+ ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
+ VFIO_DEVICE_STATE_V1_SAVING);
if (ret) {
error_report("%s: Failed to set state STOP and SAVING",
vbasedev->name);
@@ -565,13 +565,14 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
return ret;
}
- ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING, 0);
+ ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING,
+ 0);
if (ret) {
error_report("%s: Failed to set state STOPPED", vbasedev->name);
return ret;
}
- trace_vfio_save_complete_precopy(vbasedev->name);
+ trace_vfio_v1_save_complete_precopy(vbasedev->name);
return ret;
}
@@ -588,7 +589,7 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
}
}
-static int vfio_load_setup(QEMUFile *f, void *opaque)
+static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
@@ -604,8 +605,8 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
}
}
- ret = vfio_migration_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
- VFIO_DEVICE_STATE_V1_RESUMING);
+ ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
+ VFIO_DEVICE_STATE_V1_RESUMING);
if (ret) {
error_report("%s: Failed to set state RESUMING", vbasedev->name);
if (migration->region.mmaps) {
@@ -615,11 +616,11 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
return ret;
}
-static int vfio_load_cleanup(void *opaque)
+static int vfio_v1_load_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
- vfio_migration_cleanup(vbasedev);
+ vfio_migration_v1_cleanup(vbasedev);
trace_vfio_load_cleanup(vbasedev->name);
return 0;
}
@@ -657,7 +658,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
uint64_t data_size = qemu_get_be64(f);
if (data_size) {
- ret = vfio_load_buffer(f, vbasedev, data_size);
+ ret = vfio_v1_load_buffer(f, vbasedev, data_size);
if (ret < 0) {
return ret;
}
@@ -678,22 +679,22 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
-static SaveVMHandlers savevm_vfio_handlers = {
- .save_setup = vfio_save_setup,
- .save_cleanup = vfio_save_cleanup,
- .state_pending_exact = vfio_state_pending,
- .state_pending_estimate = vfio_state_pending,
+static SaveVMHandlers savevm_vfio_v1_handlers = {
+ .save_setup = vfio_v1_save_setup,
+ .save_cleanup = vfio_v1_save_cleanup,
+ .state_pending_exact = vfio_v1_state_pending,
+ .state_pending_estimate = vfio_v1_state_pending,
.save_live_iterate = vfio_save_iterate,
- .save_live_complete_precopy = vfio_save_complete_precopy,
+ .save_live_complete_precopy = vfio_v1_save_complete_precopy,
.save_state = vfio_save_state,
- .load_setup = vfio_load_setup,
- .load_cleanup = vfio_load_cleanup,
+ .load_setup = vfio_v1_load_setup,
+ .load_cleanup = vfio_v1_load_cleanup,
.load_state = vfio_load_state,
};
/* ---------------------------------------------------------------------- */
-static void vfio_vmstate_change(void *opaque, bool running, RunState state)
+static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
{
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
@@ -733,21 +734,21 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
}
}
- ret = vfio_migration_set_state(vbasedev, mask, value);
+ ret = vfio_migration_v1_set_state(vbasedev, mask, value);
if (ret) {
/*
* Migration should be aborted in this case, but vm_state_notify()
* currently does not support reporting failures.
*/
error_report("%s: Failed to set device state 0x%x", vbasedev->name,
- (migration->device_state & mask) | value);
+ (migration->device_state_v1 & mask) | value);
if (migrate_get_current()->to_dst_file) {
qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
}
}
vbasedev->migration->vm_running = running;
- trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
- (migration->device_state & mask) | value);
+ trace_vfio_v1_vmstate_change(vbasedev->name, running, RunState_str(state),
+ (migration->device_state_v1 & mask) | value);
}
static void vfio_migration_state_notifier(Notifier *notifier, void *data)
@@ -766,10 +767,10 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
case MIGRATION_STATUS_CANCELLED:
case MIGRATION_STATUS_FAILED:
bytes_transferred = 0;
- ret = vfio_migration_set_state(vbasedev,
- ~(VFIO_DEVICE_STATE_V1_SAVING |
- VFIO_DEVICE_STATE_V1_RESUMING),
- VFIO_DEVICE_STATE_V1_RUNNING);
+ ret = vfio_migration_v1_set_state(vbasedev,
+ ~(VFIO_DEVICE_STATE_V1_SAVING |
+ VFIO_DEVICE_STATE_V1_RESUMING),
+ VFIO_DEVICE_STATE_V1_RUNNING);
if (ret) {
error_report("%s: Failed to set state RUNNING", vbasedev->name);
}
@@ -813,7 +814,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
}
vbasedev->migration = g_new0(VFIOMigration, 1);
- vbasedev->migration->device_state = VFIO_DEVICE_STATE_V1_RUNNING;
+ vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
vbasedev->migration->vm_running = runstate_is_running();
ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
@@ -844,12 +845,11 @@ static int vfio_migration_init(VFIODevice *vbasedev)
}
strpadcpy(id, sizeof(id), path, '\0');
- register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
- vbasedev);
+ register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+ &savevm_vfio_v1_handlers, vbasedev);
- migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
- vfio_vmstate_change,
- vbasedev);
+ migration->vm_state = qdev_add_vm_change_state_handler(
+ vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
migration->migration_state.notify = vfio_migration_state_notifier;
add_migration_state_change_notifier(&migration->migration_state);
return 0;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 6be5381cc9..c8739449df 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,20 +149,20 @@ vfio_display_edid_write_error(void) ""
# migration.c
vfio_migration_probe(const char *name) " (%s)"
-vfio_migration_set_state(const char *name, uint32_t state) " (%s) state %d"
-vfio_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
+vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
+vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
-vfio_save_setup(const char *name) " (%s)"
+vfio_v1_save_setup(const char *name) " (%s)"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
vfio_save_device_config_state(const char *name) " (%s)"
-vfio_state_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64
+vfio_v1_state_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64
vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
-vfio_save_complete_precopy(const char *name) " (%s)"
+vfio_v1_save_complete_precopy(const char *name) " (%s)"
vfio_load_device_config_state(const char *name) " (%s)"
vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
vfio_load_cleanup(const char *name) " (%s)"
vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
--
2.26.3
* [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2023-02-16 14:36 ` [PATCH v11 07/11] vfio/migration: Rename functions/structs related to v1 protocol Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 15:43 ` Juan Quintela
2024-09-04 13:00 ` Peter Xu
2023-02-16 14:36 ` [PATCH v11 09/11] vfio/migration: Remove VFIO migration protocol v1 Avihai Horon
` (2 subsequent siblings)
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Implement the basic mandatory part of the VFIO migration protocol v2.
This includes all the functionality that is necessary to support the
VFIO_MIGRATION_STOP_COPY part of the v2 protocol.
The two protocols, v1 and v2, will co-exist, and in the following
patches the v1 protocol code will be removed.
The main differences between the v1 and v2 protocols are:
- VFIO device state is now represented as a finite state machine
instead of a bitmap.
- The migration interface with the kernel is now done using the
VFIO_DEVICE_FEATURE ioctl and normal read() and write() instead of the
migration region.
- Pre-copy is made optional in the v2 protocol. Support for pre-copy
will be added later on.
Detailed information about VFIO migration protocol v2 and its
differences compared to the v1 protocol can be found here [1].
[1]
https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com/
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
include/hw/vfio/vfio-common.h | 5 +
hw/vfio/common.c | 17 +-
hw/vfio/migration.c | 479 +++++++++++++++++++++++++++++++---
hw/vfio/trace-events | 7 +
4 files changed, 469 insertions(+), 39 deletions(-)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2c0fb1d622..c4eab55af9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,11 @@ typedef struct VFIOMigration {
int vm_running;
Notifier migration_state;
uint64_t pending_bytes;
+ uint32_t device_state;
+ int data_fd;
+ void *data_buffer;
+ size_t data_buffer_size;
+ bool v2;
} VFIOMigration;
typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 1c974e9c5a..54fee2d5de 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -408,10 +408,17 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
return false;
}
- if ((vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+ if (!migration->v2 &&
+ (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
(migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
return false;
}
+
+ if (migration->v2 &&
+ vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
+ migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+ return false;
+ }
}
}
return true;
@@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
return false;
}
- if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
+ if (!migration->v2 &&
+ migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
+ continue;
+ }
+
+ if (migration->v2 &&
+ migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
continue;
} else {
return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index a5fe285721..452d96a520 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -10,6 +10,7 @@
#include "qemu/osdep.h"
#include "qemu/main-loop.h"
#include "qemu/cutils.h"
+#include "qemu/units.h"
#include <linux/vfio.h>
#include <sys/ioctl.h>
@@ -44,8 +45,114 @@
#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
+/*
+ * This is an arbitrary size based on migration of mlx5 devices, where typically
+ * total device migration size is on the order of 100s of MB. Testing with
+ * larger values, e.g. 128MB and 1GB, did not show a performance improvement.
+ */
+#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
+
static int64_t bytes_transferred;
+static const char *mig_state_to_str(enum vfio_device_mig_state state)
+{
+ switch (state) {
+ case VFIO_DEVICE_STATE_ERROR:
+ return "ERROR";
+ case VFIO_DEVICE_STATE_STOP:
+ return "STOP";
+ case VFIO_DEVICE_STATE_RUNNING:
+ return "RUNNING";
+ case VFIO_DEVICE_STATE_STOP_COPY:
+ return "STOP_COPY";
+ case VFIO_DEVICE_STATE_RESUMING:
+ return "RESUMING";
+ default:
+ return "UNKNOWN STATE";
+ }
+}
+
+static int vfio_migration_set_state(VFIODevice *vbasedev,
+ enum vfio_device_mig_state new_state,
+ enum vfio_device_mig_state recover_state)
+{
+ VFIOMigration *migration = vbasedev->migration;
+ uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+ sizeof(struct vfio_device_feature_mig_state),
+ sizeof(uint64_t))] = {};
+ struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+ struct vfio_device_feature_mig_state *mig_state =
+ (struct vfio_device_feature_mig_state *)feature->data;
+ int ret;
+
+ feature->argsz = sizeof(buf);
+ feature->flags =
+ VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
+ mig_state->device_state = new_state;
+ if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+ /* Try to set the device in some good state */
+ ret = -errno;
+
+ if (recover_state == VFIO_DEVICE_STATE_ERROR) {
+ error_report("%s: Failed setting device state to %s, err: %s. "
+ "Recover state is ERROR. Resetting device",
+ vbasedev->name, mig_state_to_str(new_state),
+ strerror(errno));
+
+ goto reset_device;
+ }
+
+ error_report(
+ "%s: Failed setting device state to %s, err: %s. Setting device in recover state %s",
+ vbasedev->name, mig_state_to_str(new_state),
+ strerror(errno), mig_state_to_str(recover_state));
+
+ mig_state->device_state = recover_state;
+ if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+ ret = -errno;
+ error_report(
+ "%s: Failed setting device in recover state, err: %s. Resetting device",
+ vbasedev->name, strerror(errno));
+
+ goto reset_device;
+ }
+
+ migration->device_state = recover_state;
+
+ return ret;
+ }
+
+ migration->device_state = new_state;
+ if (mig_state->data_fd != -1) {
+ if (migration->data_fd != -1) {
+ /*
+ * This can happen if the device is asynchronously reset and
+ * terminates a data transfer.
+ */
+ error_report("%s: data_fd out of sync", vbasedev->name);
+ close(mig_state->data_fd);
+
+ return -EBADF;
+ }
+
+ migration->data_fd = mig_state->data_fd;
+ }
+
+ trace_vfio_migration_set_state(vbasedev->name, mig_state_to_str(new_state));
+
+ return 0;
+
+reset_device:
+ if (ioctl(vbasedev->fd, VFIO_DEVICE_RESET)) {
+ hw_error("%s: Failed resetting device, err: %s", vbasedev->name,
+ strerror(errno));
+ }
+
+ migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+
+ return ret;
+}
+
static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
off_t off, bool iswrite)
{
@@ -260,6 +367,18 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
return ret;
}
+static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
+ uint64_t data_size)
+{
+ VFIOMigration *migration = vbasedev->migration;
+ int ret;
+
+ ret = qemu_file_get_to_fd(f, migration->data_fd, data_size);
+ trace_vfio_load_state_device_data(vbasedev->name, data_size, ret);
+
+ return ret;
+}
+
static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
uint64_t data_size)
{
@@ -394,6 +513,14 @@ static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
return qemu_file_get_error(f);
}
+static void vfio_migration_cleanup(VFIODevice *vbasedev)
+{
+ VFIOMigration *migration = vbasedev->migration;
+
+ close(migration->data_fd);
+ migration->data_fd = -1;
+}
+
static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
{
VFIOMigration *migration = vbasedev->migration;
@@ -403,8 +530,80 @@ static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
}
}
+static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
+ uint64_t *stop_copy_size)
+{
+ uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+ sizeof(struct vfio_device_feature_mig_data_size),
+ sizeof(uint64_t))] = {};
+ struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+ struct vfio_device_feature_mig_data_size *mig_data_size =
+ (struct vfio_device_feature_mig_data_size *)feature->data;
+
+ feature->argsz = sizeof(buf);
+ feature->flags =
+ VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIG_DATA_SIZE;
+
+ if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+ return -errno;
+ }
+
+ *stop_copy_size = mig_data_size->stop_copy_length;
+
+ return 0;
+}
+
+/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */
+static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+{
+ ssize_t data_size;
+
+ data_size = read(migration->data_fd, migration->data_buffer,
+ migration->data_buffer_size);
+ if (data_size < 0) {
+ return -errno;
+ }
+ if (data_size == 0) {
+ return 1;
+ }
+
+ qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
+ qemu_put_be64(f, data_size);
+ qemu_put_buffer(f, migration->data_buffer, data_size);
+ bytes_transferred += data_size;
+
+ trace_vfio_save_block(migration->vbasedev->name, data_size);
+
+ return qemu_file_get_error(f);
+}
+
/* ---------------------------------------------------------------------- */
+static int vfio_save_setup(QEMUFile *f, void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ uint64_t stop_copy_size = VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE;
+
+ qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
+
+ vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
+ migration->data_buffer_size = MIN(VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE,
+ stop_copy_size);
+ migration->data_buffer = g_try_malloc0(migration->data_buffer_size);
+ if (!migration->data_buffer) {
+ error_report("%s: Failed to allocate migration data buffer",
+ vbasedev->name);
+ return -ENOMEM;
+ }
+
+ trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
+
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ return qemu_file_get_error(f);
+}
+
static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -448,6 +647,17 @@ static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
return 0;
}
+static void vfio_save_cleanup(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ g_free(migration->data_buffer);
+ migration->data_buffer = NULL;
+ vfio_migration_cleanup(vbasedev);
+ trace_vfio_save_cleanup(vbasedev->name);
+}
+
static void vfio_v1_save_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -456,6 +666,35 @@ static void vfio_v1_save_cleanup(void *opaque)
trace_vfio_save_cleanup(vbasedev->name);
}
+/*
+ * Migration size of VFIO devices can be as little as a few KBs or as big as
+ * many GBs. This value should be big enough to cover the worst case.
+ */
+#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
+
+/*
+ * Only exact function is implemented and not estimate function. The reason is
+ * that during pre-copy phase of migration the estimate function is called
+ * repeatedly while pending RAM size is over the threshold, thus migration
+ * can't converge and querying the VFIO device pending data size is useless.
+ */
+static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
+ uint64_t *can_postcopy)
+{
+ VFIODevice *vbasedev = opaque;
+ uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
+
+ /*
+ * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
+ * reported so downtime limit won't be violated.
+ */
+ vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
+ *must_precopy += stop_copy_size;
+
+ trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
+ stop_copy_size);
+}
+
static void vfio_v1_state_pending(void *opaque, uint64_t *must_precopy,
uint64_t *can_postcopy)
{
@@ -520,6 +759,42 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
return 0;
}
+static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ int ret;
+
+ /* We reach here with device state STOP only */
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
+ VFIO_DEVICE_STATE_STOP);
+ if (ret) {
+ return ret;
+ }
+
+ do {
+ ret = vfio_save_block(f, vbasedev->migration);
+ if (ret < 0) {
+ return ret;
+ }
+ } while (!ret);
+
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ ret = qemu_file_get_error(f);
+ if (ret) {
+ return ret;
+ }
+
+ /*
+ * If setting the device in STOP state fails, the device should be reset.
+ * To do so, use ERROR state as a recover state.
+ */
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
+ VFIO_DEVICE_STATE_ERROR);
+ trace_vfio_save_complete_precopy(vbasedev->name, ret);
+
+ return ret;
+}
+
static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -589,6 +864,14 @@ static void vfio_save_state(QEMUFile *f, void *opaque)
}
}
+static int vfio_load_setup(QEMUFile *f, void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+
+ return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
+ vbasedev->migration->device_state);
+}
+
static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -616,6 +899,16 @@ static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
return ret;
}
+static int vfio_load_cleanup(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+
+ vfio_migration_cleanup(vbasedev);
+ trace_vfio_load_cleanup(vbasedev->name);
+
+ return 0;
+}
+
static int vfio_v1_load_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -658,7 +951,11 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
uint64_t data_size = qemu_get_be64(f);
if (data_size) {
- ret = vfio_v1_load_buffer(f, vbasedev, data_size);
+ if (vbasedev->migration->v2) {
+ ret = vfio_load_buffer(f, vbasedev, data_size);
+ } else {
+ ret = vfio_v1_load_buffer(f, vbasedev, data_size);
+ }
if (ret < 0) {
return ret;
}
@@ -679,6 +976,17 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
+static const SaveVMHandlers savevm_vfio_handlers = {
+ .save_setup = vfio_save_setup,
+ .save_cleanup = vfio_save_cleanup,
+ .state_pending_exact = vfio_state_pending_exact,
+ .save_live_complete_precopy = vfio_save_complete_precopy,
+ .save_state = vfio_save_state,
+ .load_setup = vfio_load_setup,
+ .load_cleanup = vfio_load_cleanup,
+ .load_state = vfio_load_state,
+};
+
static SaveVMHandlers savevm_vfio_v1_handlers = {
.save_setup = vfio_v1_save_setup,
.save_cleanup = vfio_v1_save_cleanup,
@@ -694,6 +1002,38 @@ static SaveVMHandlers savevm_vfio_v1_handlers = {
/* ---------------------------------------------------------------------- */
+static void vfio_vmstate_change(void *opaque, bool running, RunState state)
+{
+ VFIODevice *vbasedev = opaque;
+ enum vfio_device_mig_state new_state;
+ int ret;
+
+ if (running) {
+ new_state = VFIO_DEVICE_STATE_RUNNING;
+ } else {
+ new_state = VFIO_DEVICE_STATE_STOP;
+ }
+
+ /*
+ * If setting the device in new_state fails, the device should be reset.
+ * To do so, use ERROR state as a recover state.
+ */
+ ret = vfio_migration_set_state(vbasedev, new_state,
+ VFIO_DEVICE_STATE_ERROR);
+ if (ret) {
+ /*
+ * Migration should be aborted in this case, but vm_state_notify()
+ * currently does not support reporting failures.
+ */
+ if (migrate_get_current()->to_dst_file) {
+ qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+ }
+ }
+
+ trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
+ mig_state_to_str(new_state));
+}
+
static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
{
VFIODevice *vbasedev = opaque;
@@ -767,12 +1107,21 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
case MIGRATION_STATUS_CANCELLED:
case MIGRATION_STATUS_FAILED:
bytes_transferred = 0;
- ret = vfio_migration_v1_set_state(vbasedev,
- ~(VFIO_DEVICE_STATE_V1_SAVING |
- VFIO_DEVICE_STATE_V1_RESUMING),
- VFIO_DEVICE_STATE_V1_RUNNING);
- if (ret) {
- error_report("%s: Failed to set state RUNNING", vbasedev->name);
+ if (migration->v2) {
+ /*
+ * If setting the device in RUNNING state fails, the device should
+ * be reset. To do so, use ERROR state as a recover state.
+ */
+ vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
+ VFIO_DEVICE_STATE_ERROR);
+ } else {
+ ret = vfio_migration_v1_set_state(vbasedev,
+ ~(VFIO_DEVICE_STATE_V1_SAVING |
+ VFIO_DEVICE_STATE_V1_RESUMING),
+ VFIO_DEVICE_STATE_V1_RUNNING);
+ if (ret) {
+ error_report("%s: Failed to set state RUNNING", vbasedev->name);
+ }
}
}
}
@@ -781,12 +1130,42 @@ static void vfio_migration_exit(VFIODevice *vbasedev)
{
VFIOMigration *migration = vbasedev->migration;
- vfio_region_exit(&migration->region);
- vfio_region_finalize(&migration->region);
+ if (!migration->v2) {
+ vfio_region_exit(&migration->region);
+ vfio_region_finalize(&migration->region);
+ }
g_free(vbasedev->migration);
vbasedev->migration = NULL;
}
+static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags)
+{
+ uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
+ sizeof(struct vfio_device_feature_migration),
+ sizeof(uint64_t))] = {};
+ struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
+ struct vfio_device_feature_migration *mig =
+ (struct vfio_device_feature_migration *)feature->data;
+
+ feature->argsz = sizeof(buf);
+ feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;
+ if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature)) {
+ if (errno == ENOTTY) {
+ error_report("%s: VFIO migration is not supported in kernel",
+ vbasedev->name);
+ } else {
+ error_report("%s: Failed to query VFIO migration support, err: %s",
+ vbasedev->name, strerror(errno));
+ }
+
+ return -errno;
+ }
+
+ *mig_flags = mig->flags;
+
+ return 0;
+}
+
static int vfio_migration_init(VFIODevice *vbasedev)
{
int ret;
@@ -795,6 +1174,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
char id[256] = "";
g_autofree char *path = NULL, *oid = NULL;
struct vfio_region_info *info;
+ uint64_t mig_flags = 0;
if (!vbasedev->ops->vfio_get_object) {
return -EINVAL;
@@ -805,34 +1185,50 @@ static int vfio_migration_init(VFIODevice *vbasedev)
return -EINVAL;
}
- ret = vfio_get_dev_region_info(vbasedev,
- VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
- VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
- &info);
- if (ret) {
- return ret;
- }
+ ret = vfio_migration_query_flags(vbasedev, &mig_flags);
+ if (!ret) {
+ /* Migration v2 */
+ /* Basic migration functionality must be supported */
+ if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
+ return -EOPNOTSUPP;
+ }
+ vbasedev->migration = g_new0(VFIOMigration, 1);
+ vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+ vbasedev->migration->data_fd = -1;
+ vbasedev->migration->v2 = true;
+ } else if (ret == -ENOTTY) {
+ /* Migration v1 */
+ ret = vfio_get_dev_region_info(vbasedev,
+ VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
+ VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
+ &info);
+ if (ret) {
+ return ret;
+ }
- vbasedev->migration = g_new0(VFIOMigration, 1);
- vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
- vbasedev->migration->vm_running = runstate_is_running();
+ vbasedev->migration = g_new0(VFIOMigration, 1);
+ vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
+ vbasedev->migration->vm_running = runstate_is_running();
- ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
- info->index, "migration");
- if (ret) {
- error_report("%s: Failed to setup VFIO migration region %d: %s",
- vbasedev->name, info->index, strerror(-ret));
- goto err;
- }
+ ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
+ info->index, "migration");
+ if (ret) {
+ error_report("%s: Failed to setup VFIO migration region %d: %s",
+ vbasedev->name, info->index, strerror(-ret));
+ goto err;
+ }
- if (!vbasedev->migration->region.size) {
- error_report("%s: Invalid zero-sized VFIO migration region %d",
- vbasedev->name, info->index);
- ret = -EINVAL;
- goto err;
- }
+ if (!vbasedev->migration->region.size) {
+ error_report("%s: Invalid zero-sized VFIO migration region %d",
+ vbasedev->name, info->index);
+ ret = -EINVAL;
+ goto err;
+ }
- g_free(info);
+ g_free(info);
+ } else {
+ return ret;
+ }
migration = vbasedev->migration;
migration->vbasedev = vbasedev;
@@ -845,11 +1241,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
}
strpadcpy(id, sizeof(id), path, '\0');
- register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
- &savevm_vfio_v1_handlers, vbasedev);
+ if (migration->v2) {
+ register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+ &savevm_vfio_handlers, vbasedev);
+
+ migration->vm_state = qdev_add_vm_change_state_handler(
+ vbasedev->dev, vfio_vmstate_change, vbasedev);
+ } else {
+ register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
+ &savevm_vfio_v1_handlers, vbasedev);
+
+ migration->vm_state = qdev_add_vm_change_state_handler(
+ vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
+ }
- migration->vm_state = qdev_add_vm_change_state_handler(
- vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
migration->migration_state.notify = vfio_migration_state_notifier;
add_migration_state_change_notifier(&migration->migration_state);
return 0;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index c8739449df..b24e448534 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -149,20 +149,27 @@ vfio_display_edid_write_error(void) ""
# migration.c
vfio_migration_probe(const char *name) " (%s)"
+vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
+vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
+vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
vfio_v1_save_setup(const char *name) " (%s)"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
vfio_save_device_config_state(const char *name) " (%s)"
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
vfio_v1_state_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64
vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
+vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
vfio_v1_save_complete_precopy(const char *name) " (%s)"
vfio_load_device_config_state(const char *name) " (%s)"
vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
vfio_load_cleanup(const char *name) " (%s)"
vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
+vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
--
2.26.3
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v11 09/11] vfio/migration: Remove VFIO migration protocol v1
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
` (7 preceding siblings ...)
2023-02-16 14:36 ` [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 14:36 ` [PATCH v11 10/11] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
2023-02-16 14:36 ` [PATCH v11 11/11] docs/devel: Align VFIO migration docs to v2 protocol Avihai Horon
10 siblings, 0 replies; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Now that the v2 protocol implementation has been added, remove the
deprecated v1 implementation.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
include/hw/vfio/vfio-common.h | 5 -
hw/vfio/common.c | 17 +-
hw/vfio/migration.c | 700 ++--------------------------------
hw/vfio/trace-events | 9 -
4 files changed, 24 insertions(+), 707 deletions(-)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c4eab55af9..87524c64a4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -61,16 +61,11 @@ typedef struct VFIORegion {
typedef struct VFIOMigration {
struct VFIODevice *vbasedev;
VMChangeStateEntry *vm_state;
- VFIORegion region;
- uint32_t device_state_v1;
- int vm_running;
Notifier migration_state;
- uint64_t pending_bytes;
uint32_t device_state;
int data_fd;
void *data_buffer;
size_t data_buffer_size;
- bool v2;
} VFIOMigration;
typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 54fee2d5de..bab83c0e55 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -408,14 +408,7 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
return false;
}
- if (!migration->v2 &&
- (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
- (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING)) {
- return false;
- }
-
- if (migration->v2 &&
- vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
+ if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
return false;
}
@@ -445,13 +438,7 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
return false;
}
- if (!migration->v2 &&
- migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
- continue;
- }
-
- if (migration->v2 &&
- migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+ if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
continue;
} else {
return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 452d96a520..a2c3d9bade 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -153,220 +153,6 @@ reset_device:
return ret;
}
-static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
- off_t off, bool iswrite)
-{
- int ret;
-
- ret = iswrite ? pwrite(vbasedev->fd, val, count, off) :
- pread(vbasedev->fd, val, count, off);
- if (ret < count) {
- error_report("vfio_mig_%s %d byte %s: failed at offset 0x%"
- HWADDR_PRIx", err: %s", iswrite ? "write" : "read", count,
- vbasedev->name, off, strerror(errno));
- return (ret < 0) ? ret : -EINVAL;
- }
- return 0;
-}
-
-static int vfio_mig_rw(VFIODevice *vbasedev, __u8 *buf, size_t count,
- off_t off, bool iswrite)
-{
- int ret, done = 0;
- __u8 *tbuf = buf;
-
- while (count) {
- int bytes = 0;
-
- if (count >= 8 && !(off % 8)) {
- bytes = 8;
- } else if (count >= 4 && !(off % 4)) {
- bytes = 4;
- } else if (count >= 2 && !(off % 2)) {
- bytes = 2;
- } else {
- bytes = 1;
- }
-
- ret = vfio_mig_access(vbasedev, tbuf, bytes, off, iswrite);
- if (ret) {
- return ret;
- }
-
- count -= bytes;
- done += bytes;
- off += bytes;
- tbuf += bytes;
- }
- return done;
-}
-
-#define vfio_mig_read(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, false)
-#define vfio_mig_write(f, v, c, o) vfio_mig_rw(f, (__u8 *)v, c, o, true)
-
-#define VFIO_MIG_STRUCT_OFFSET(f) \
- offsetof(struct vfio_device_migration_info, f)
-/*
- * Change the device_state register for device @vbasedev. Bits set in @mask
- * are preserved, bits set in @value are set, and bits not set in either @mask
- * or @value are cleared in device_state. If the register cannot be accessed,
- * the resulting state would be invalid, or the device enters an error state,
- * an error is returned.
- */
-
-static int vfio_migration_v1_set_state(VFIODevice *vbasedev, uint32_t mask,
- uint32_t value)
-{
- VFIOMigration *migration = vbasedev->migration;
- VFIORegion *region = &migration->region;
- off_t dev_state_off = region->fd_offset +
- VFIO_MIG_STRUCT_OFFSET(device_state);
- uint32_t device_state;
- int ret;
-
- ret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
- dev_state_off);
- if (ret < 0) {
- return ret;
- }
-
- device_state = (device_state & mask) | value;
-
- if (!VFIO_DEVICE_STATE_VALID(device_state)) {
- return -EINVAL;
- }
-
- ret = vfio_mig_write(vbasedev, &device_state, sizeof(device_state),
- dev_state_off);
- if (ret < 0) {
- int rret;
-
- rret = vfio_mig_read(vbasedev, &device_state, sizeof(device_state),
- dev_state_off);
-
- if ((rret < 0) || (VFIO_DEVICE_STATE_IS_ERROR(device_state))) {
- hw_error("%s: Device in error state 0x%x", vbasedev->name,
- device_state);
- return rret ? rret : -EIO;
- }
- return ret;
- }
-
- migration->device_state_v1 = device_state;
- trace_vfio_migration_v1_set_state(vbasedev->name, device_state);
- return 0;
-}
-
-static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
- uint64_t data_size, uint64_t *size)
-{
- void *ptr = NULL;
- uint64_t limit = 0;
- int i;
-
- if (!region->mmaps) {
- if (size) {
- *size = MIN(data_size, region->size - data_offset);
- }
- return ptr;
- }
-
- for (i = 0; i < region->nr_mmaps; i++) {
- VFIOMmap *map = region->mmaps + i;
-
- if ((data_offset >= map->offset) &&
- (data_offset < map->offset + map->size)) {
-
- /* check if data_offset is within sparse mmap areas */
- ptr = map->mmap + data_offset - map->offset;
- if (size) {
- *size = MIN(data_size, map->offset + map->size - data_offset);
- }
- break;
- } else if ((data_offset < map->offset) &&
- (!limit || limit > map->offset)) {
- /*
- * data_offset is not within sparse mmap areas, find size of
- * non-mapped area. Check through all list since region->mmaps list
- * is not sorted.
- */
- limit = map->offset;
- }
- }
-
- if (!ptr && size) {
- *size = limit ? MIN(data_size, limit - data_offset) : data_size;
- }
- return ptr;
-}
-
-static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
-{
- VFIOMigration *migration = vbasedev->migration;
- VFIORegion *region = &migration->region;
- uint64_t data_offset = 0, data_size = 0, sz;
- int ret;
-
- ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
- region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
- if (ret < 0) {
- return ret;
- }
-
- ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
- region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
- if (ret < 0) {
- return ret;
- }
-
- trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
- migration->pending_bytes);
-
- qemu_put_be64(f, data_size);
- sz = data_size;
-
- while (sz) {
- void *buf;
- uint64_t sec_size;
- bool buf_allocated = false;
-
- buf = get_data_section_size(region, data_offset, sz, &sec_size);
-
- if (!buf) {
- buf = g_try_malloc(sec_size);
- if (!buf) {
- error_report("%s: Error allocating buffer ", __func__);
- return -ENOMEM;
- }
- buf_allocated = true;
-
- ret = vfio_mig_read(vbasedev, buf, sec_size,
- region->fd_offset + data_offset);
- if (ret < 0) {
- g_free(buf);
- return ret;
- }
- }
-
- qemu_put_buffer(f, buf, sec_size);
-
- if (buf_allocated) {
- g_free(buf);
- }
- sz -= sec_size;
- data_offset += sec_size;
- }
-
- ret = qemu_file_get_error(f);
-
- if (!ret && size) {
- *size = data_size;
- }
-
- bytes_transferred += data_size;
- return ret;
-}
-
static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
uint64_t data_size)
{
@@ -379,96 +165,6 @@ static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
return ret;
}
-static int vfio_v1_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
- uint64_t data_size)
-{
- VFIORegion *region = &vbasedev->migration->region;
- uint64_t data_offset = 0, size, report_size;
- int ret;
-
- do {
- ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
- region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
- if (ret < 0) {
- return ret;
- }
-
- if (data_offset + data_size > region->size) {
- /*
- * If data_size is greater than the data section of migration region
- * then iterate the write buffer operation. This case can occur if
- * size of migration region at destination is smaller than size of
- * migration region at source.
- */
- report_size = size = region->size - data_offset;
- data_size -= size;
- } else {
- report_size = size = data_size;
- data_size = 0;
- }
-
- trace_vfio_v1_load_state_device_data(vbasedev->name, data_offset, size);
-
- while (size) {
- void *buf;
- uint64_t sec_size;
- bool buf_alloc = false;
-
- buf = get_data_section_size(region, data_offset, size, &sec_size);
-
- if (!buf) {
- buf = g_try_malloc(sec_size);
- if (!buf) {
- error_report("%s: Error allocating buffer ", __func__);
- return -ENOMEM;
- }
- buf_alloc = true;
- }
-
- qemu_get_buffer(f, buf, sec_size);
-
- if (buf_alloc) {
- ret = vfio_mig_write(vbasedev, buf, sec_size,
- region->fd_offset + data_offset);
- g_free(buf);
-
- if (ret < 0) {
- return ret;
- }
- }
- size -= sec_size;
- data_offset += sec_size;
- }
-
- ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
- region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
- if (ret < 0) {
- return ret;
- }
- } while (data_size);
-
- return 0;
-}
-
-static int vfio_update_pending(VFIODevice *vbasedev)
-{
- VFIOMigration *migration = vbasedev->migration;
- VFIORegion *region = &migration->region;
- uint64_t pending_bytes = 0;
- int ret;
-
- ret = vfio_mig_read(vbasedev, &pending_bytes, sizeof(pending_bytes),
- region->fd_offset + VFIO_MIG_STRUCT_OFFSET(pending_bytes));
- if (ret < 0) {
- migration->pending_bytes = 0;
- return ret;
- }
-
- migration->pending_bytes = pending_bytes;
- trace_vfio_update_pending(vbasedev->name, pending_bytes);
- return 0;
-}
-
static int vfio_save_device_config_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -521,15 +217,6 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
migration->data_fd = -1;
}
-static void vfio_migration_v1_cleanup(VFIODevice *vbasedev)
-{
- VFIOMigration *migration = vbasedev->migration;
-
- if (migration->region.mmaps) {
- vfio_region_unmap(&migration->region);
- }
-}
-
static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
uint64_t *stop_copy_size)
{
@@ -604,49 +291,6 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
return qemu_file_get_error(f);
}
-static int vfio_v1_save_setup(QEMUFile *f, void *opaque)
-{
- VFIODevice *vbasedev = opaque;
- VFIOMigration *migration = vbasedev->migration;
- int ret;
-
- trace_vfio_v1_save_setup(vbasedev->name);
-
- qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
-
- if (migration->region.mmaps) {
- /*
- * Calling vfio_region_mmap() from migration thread. Memory API called
- * from this function require locking the iothread when called from
- * outside the main loop thread.
- */
- qemu_mutex_lock_iothread();
- ret = vfio_region_mmap(&migration->region);
- qemu_mutex_unlock_iothread();
- if (ret) {
- error_report("%s: Failed to mmap VFIO migration region: %s",
- vbasedev->name, strerror(-ret));
- error_report("%s: Falling back to slow path", vbasedev->name);
- }
- }
-
- ret = vfio_migration_v1_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
- VFIO_DEVICE_STATE_V1_SAVING);
- if (ret) {
- error_report("%s: Failed to set state SAVING", vbasedev->name);
- return ret;
- }
-
- qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-
- ret = qemu_file_get_error(f);
- if (ret) {
- return ret;
- }
-
- return 0;
-}
-
static void vfio_save_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -658,14 +302,6 @@ static void vfio_save_cleanup(void *opaque)
trace_vfio_save_cleanup(vbasedev->name);
}
-static void vfio_v1_save_cleanup(void *opaque)
-{
- VFIODevice *vbasedev = opaque;
-
- vfio_migration_v1_cleanup(vbasedev);
- trace_vfio_save_cleanup(vbasedev->name);
-}
-
/*
* Migration size of VFIO devices can be as little as a few KBs or as big as
* many GBs. This value should be big enough to cover the worst case.
@@ -695,70 +331,6 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
stop_copy_size);
}
-static void vfio_v1_state_pending(void *opaque, uint64_t *must_precopy,
- uint64_t *can_postcopy)
-{
- VFIODevice *vbasedev = opaque;
- VFIOMigration *migration = vbasedev->migration;
- int ret;
-
- ret = vfio_update_pending(vbasedev);
- if (ret) {
- return;
- }
-
- *must_precopy += migration->pending_bytes;
-
- trace_vfio_v1_state_pending(vbasedev->name, *must_precopy, *can_postcopy);
-}
-
-static int vfio_save_iterate(QEMUFile *f, void *opaque)
-{
- VFIODevice *vbasedev = opaque;
- VFIOMigration *migration = vbasedev->migration;
- uint64_t data_size;
- int ret;
-
- qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
-
- if (migration->pending_bytes == 0) {
- ret = vfio_update_pending(vbasedev);
- if (ret) {
- return ret;
- }
-
- if (migration->pending_bytes == 0) {
- qemu_put_be64(f, 0);
- qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
- /* indicates data finished, goto complete phase */
- return 1;
- }
- }
-
- ret = vfio_save_buffer(f, vbasedev, &data_size);
- if (ret) {
- error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
- strerror(errno));
- return ret;
- }
-
- qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-
- ret = qemu_file_get_error(f);
- if (ret) {
- return ret;
- }
-
- /*
- * Reset pending_bytes as state_pending* are not called during
- * savevm or snapshot case, in such case vfio_update_pending() at
- * the start of this function updates pending_bytes.
- */
- migration->pending_bytes = 0;
- trace_vfio_save_iterate(vbasedev->name, data_size);
- return 0;
-}
-
static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -795,62 +367,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
return ret;
}
-static int vfio_v1_save_complete_precopy(QEMUFile *f, void *opaque)
-{
- VFIODevice *vbasedev = opaque;
- VFIOMigration *migration = vbasedev->migration;
- uint64_t data_size;
- int ret;
-
- ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_RUNNING,
- VFIO_DEVICE_STATE_V1_SAVING);
- if (ret) {
- error_report("%s: Failed to set state STOP and SAVING",
- vbasedev->name);
- return ret;
- }
-
- ret = vfio_update_pending(vbasedev);
- if (ret) {
- return ret;
- }
-
- while (migration->pending_bytes > 0) {
- qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
- ret = vfio_save_buffer(f, vbasedev, &data_size);
- if (ret < 0) {
- error_report("%s: Failed to save buffer", vbasedev->name);
- return ret;
- }
-
- if (data_size == 0) {
- break;
- }
-
- ret = vfio_update_pending(vbasedev);
- if (ret) {
- return ret;
- }
- }
-
- qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
-
- ret = qemu_file_get_error(f);
- if (ret) {
- return ret;
- }
-
- ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_V1_SAVING,
- 0);
- if (ret) {
- error_report("%s: Failed to set state STOPPED", vbasedev->name);
- return ret;
- }
-
- trace_vfio_v1_save_complete_precopy(vbasedev->name);
- return ret;
-}
-
static void vfio_save_state(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -872,33 +388,6 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
vbasedev->migration->device_state);
}
-static int vfio_v1_load_setup(QEMUFile *f, void *opaque)
-{
- VFIODevice *vbasedev = opaque;
- VFIOMigration *migration = vbasedev->migration;
- int ret = 0;
-
- if (migration->region.mmaps) {
- ret = vfio_region_mmap(&migration->region);
- if (ret) {
- error_report("%s: Failed to mmap VFIO migration region %d: %s",
- vbasedev->name, migration->region.nr,
- strerror(-ret));
- error_report("%s: Falling back to slow path", vbasedev->name);
- }
- }
-
- ret = vfio_migration_v1_set_state(vbasedev, ~VFIO_DEVICE_STATE_MASK,
- VFIO_DEVICE_STATE_V1_RESUMING);
- if (ret) {
- error_report("%s: Failed to set state RESUMING", vbasedev->name);
- if (migration->region.mmaps) {
- vfio_region_unmap(&migration->region);
- }
- }
- return ret;
-}
-
static int vfio_load_cleanup(void *opaque)
{
VFIODevice *vbasedev = opaque;
@@ -909,15 +398,6 @@ static int vfio_load_cleanup(void *opaque)
return 0;
}
-static int vfio_v1_load_cleanup(void *opaque)
-{
- VFIODevice *vbasedev = opaque;
-
- vfio_migration_v1_cleanup(vbasedev);
- trace_vfio_load_cleanup(vbasedev->name);
- return 0;
-}
-
static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
{
VFIODevice *vbasedev = opaque;
@@ -951,11 +431,7 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
uint64_t data_size = qemu_get_be64(f);
if (data_size) {
- if (vbasedev->migration->v2) {
- ret = vfio_load_buffer(f, vbasedev, data_size);
- } else {
- ret = vfio_v1_load_buffer(f, vbasedev, data_size);
- }
+ ret = vfio_load_buffer(f, vbasedev, data_size);
if (ret < 0) {
return ret;
}
@@ -987,19 +463,6 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_state = vfio_load_state,
};
-static SaveVMHandlers savevm_vfio_v1_handlers = {
- .save_setup = vfio_v1_save_setup,
- .save_cleanup = vfio_v1_save_cleanup,
- .state_pending_exact = vfio_v1_state_pending,
- .state_pending_estimate = vfio_v1_state_pending,
- .save_live_iterate = vfio_save_iterate,
- .save_live_complete_precopy = vfio_v1_save_complete_precopy,
- .save_state = vfio_save_state,
- .load_setup = vfio_v1_load_setup,
- .load_cleanup = vfio_v1_load_cleanup,
- .load_state = vfio_load_state,
-};
-
/* ---------------------------------------------------------------------- */
static void vfio_vmstate_change(void *opaque, bool running, RunState state)
@@ -1034,70 +497,12 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
mig_state_to_str(new_state));
}
-static void vfio_v1_vmstate_change(void *opaque, bool running, RunState state)
-{
- VFIODevice *vbasedev = opaque;
- VFIOMigration *migration = vbasedev->migration;
- uint32_t value, mask;
- int ret;
-
- if (vbasedev->migration->vm_running == running) {
- return;
- }
-
- if (running) {
- /*
- * Here device state can have one of _SAVING, _RESUMING or _STOP bit.
- * Transition from _SAVING to _RUNNING can happen if there is migration
- * failure, in that case clear _SAVING bit.
- * Transition from _RESUMING to _RUNNING occurs during resuming
- * phase, in that case clear _RESUMING bit.
- * In both the above cases, set _RUNNING bit.
- */
- mask = ~VFIO_DEVICE_STATE_MASK;
- value = VFIO_DEVICE_STATE_V1_RUNNING;
- } else {
- /*
- * Here device state could be either _RUNNING or _SAVING|_RUNNING. Reset
- * _RUNNING bit
- */
- mask = ~VFIO_DEVICE_STATE_V1_RUNNING;
-
- /*
- * When VM state transition to stop for savevm command, device should
- * start saving data.
- */
- if (state == RUN_STATE_SAVE_VM) {
- value = VFIO_DEVICE_STATE_V1_SAVING;
- } else {
- value = 0;
- }
- }
-
- ret = vfio_migration_v1_set_state(vbasedev, mask, value);
- if (ret) {
- /*
- * Migration should be aborted in this case, but vm_state_notify()
- * currently does not support reporting failures.
- */
- error_report("%s: Failed to set device state 0x%x", vbasedev->name,
- (migration->device_state_v1 & mask) | value);
- if (migrate_get_current()->to_dst_file) {
- qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
- }
- }
- vbasedev->migration->vm_running = running;
- trace_vfio_v1_vmstate_change(vbasedev->name, running, RunState_str(state),
- (migration->device_state_v1 & mask) | value);
-}
-
static void vfio_migration_state_notifier(Notifier *notifier, void *data)
{
MigrationState *s = data;
VFIOMigration *migration = container_of(notifier, VFIOMigration,
migration_state);
VFIODevice *vbasedev = migration->vbasedev;
- int ret;
trace_vfio_migration_state_notifier(vbasedev->name,
MigrationStatus_str(s->state));
@@ -1107,33 +512,17 @@ static void vfio_migration_state_notifier(Notifier *notifier, void *data)
case MIGRATION_STATUS_CANCELLED:
case MIGRATION_STATUS_FAILED:
bytes_transferred = 0;
- if (migration->v2) {
- /*
- * If setting the device in RUNNING state fails, the device should
- * be reset. To do so, use ERROR state as a recover state.
- */
- vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
- VFIO_DEVICE_STATE_ERROR);
- } else {
- ret = vfio_migration_v1_set_state(vbasedev,
- ~(VFIO_DEVICE_STATE_V1_SAVING |
- VFIO_DEVICE_STATE_V1_RESUMING),
- VFIO_DEVICE_STATE_V1_RUNNING);
- if (ret) {
- error_report("%s: Failed to set state RUNNING", vbasedev->name);
- }
- }
+ /*
+ * If setting the device in RUNNING state fails, the device should
+ * be reset. To do so, use ERROR state as a recover state.
+ */
+ vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RUNNING,
+ VFIO_DEVICE_STATE_ERROR);
}
}
static void vfio_migration_exit(VFIODevice *vbasedev)
{
- VFIOMigration *migration = vbasedev->migration;
-
- if (!migration->v2) {
- vfio_region_exit(&migration->region);
- vfio_region_finalize(&migration->region);
- }
g_free(vbasedev->migration);
vbasedev->migration = NULL;
}
@@ -1173,7 +562,6 @@ static int vfio_migration_init(VFIODevice *vbasedev)
VFIOMigration *migration;
char id[256] = "";
g_autofree char *path = NULL, *oid = NULL;
- struct vfio_region_info *info;
uint64_t mig_flags = 0;
if (!vbasedev->ops->vfio_get_object) {
@@ -1186,52 +574,20 @@ static int vfio_migration_init(VFIODevice *vbasedev)
}
ret = vfio_migration_query_flags(vbasedev, &mig_flags);
- if (!ret) {
- /* Migration v2 */
- /* Basic migration functionality must be supported */
- if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
- return -EOPNOTSUPP;
- }
- vbasedev->migration = g_new0(VFIOMigration, 1);
- vbasedev->migration->device_state = VFIO_DEVICE_STATE_RUNNING;
- vbasedev->migration->data_fd = -1;
- vbasedev->migration->v2 = true;
- } else if (ret == -ENOTTY) {
- /* Migration v1 */
- ret = vfio_get_dev_region_info(vbasedev,
- VFIO_REGION_TYPE_MIGRATION_DEPRECATED,
- VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED,
- &info);
- if (ret) {
- return ret;
- }
-
- vbasedev->migration = g_new0(VFIOMigration, 1);
- vbasedev->migration->device_state_v1 = VFIO_DEVICE_STATE_V1_RUNNING;
- vbasedev->migration->vm_running = runstate_is_running();
-
- ret = vfio_region_setup(obj, vbasedev, &vbasedev->migration->region,
- info->index, "migration");
- if (ret) {
- error_report("%s: Failed to setup VFIO migration region %d: %s",
- vbasedev->name, info->index, strerror(-ret));
- goto err;
- }
-
- if (!vbasedev->migration->region.size) {
- error_report("%s: Invalid zero-sized VFIO migration region %d",
- vbasedev->name, info->index);
- ret = -EINVAL;
- goto err;
- }
-
- g_free(info);
- } else {
+ if (ret) {
return ret;
}
+ /* Basic migration functionality must be supported */
+ if (!(mig_flags & VFIO_MIGRATION_STOP_COPY)) {
+ return -EOPNOTSUPP;
+ }
+
+ vbasedev->migration = g_new0(VFIOMigration, 1);
migration = vbasedev->migration;
migration->vbasedev = vbasedev;
+ migration->device_state = VFIO_DEVICE_STATE_RUNNING;
+ migration->data_fd = -1;
oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
if (oid) {
@@ -1241,28 +597,16 @@ static int vfio_migration_init(VFIODevice *vbasedev)
}
strpadcpy(id, sizeof(id), path, '\0');
- if (migration->v2) {
- register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
- &savevm_vfio_handlers, vbasedev);
-
- migration->vm_state = qdev_add_vm_change_state_handler(
- vbasedev->dev, vfio_vmstate_change, vbasedev);
- } else {
- register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1,
- &savevm_vfio_v1_handlers, vbasedev);
-
- migration->vm_state = qdev_add_vm_change_state_handler(
- vbasedev->dev, vfio_v1_vmstate_change, vbasedev);
- }
+ register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
+ vbasedev);
+ migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
+ vfio_vmstate_change,
+ vbasedev);
migration->migration_state.notify = vfio_migration_state_notifier;
add_migration_state_change_notifier(&migration->migration_state);
- return 0;
-err:
- g_free(info);
- vfio_migration_exit(vbasedev);
- return ret;
+ return 0;
}
/* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b24e448534..9d65c2da2e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -150,24 +150,15 @@ vfio_display_edid_write_error(void) ""
# migration.c
vfio_migration_probe(const char *name) " (%s)"
vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
-vfio_migration_v1_set_state(const char *name, uint32_t state) " (%s) state %d"
vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-vfio_v1_vmstate_change(const char *name, int running, const char *reason, uint32_t dev_state) " (%s) running %d reason %s device state %d"
vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
-vfio_v1_save_setup(const char *name) " (%s)"
vfio_save_cleanup(const char *name) " (%s)"
-vfio_save_buffer(const char *name, uint64_t data_offset, uint64_t data_size, uint64_t pending) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64" pending 0x%"PRIx64
-vfio_update_pending(const char *name, uint64_t pending) " (%s) pending 0x%"PRIx64
vfio_save_device_config_state(const char *name) " (%s)"
vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
-vfio_v1_state_pending(const char *name, uint64_t precopy, uint64_t postcopy) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64
-vfio_save_iterate(const char *name, int data_size) " (%s) data_size %d"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
-vfio_v1_save_complete_precopy(const char *name) " (%s)"
vfio_load_device_config_state(const char *name) " (%s)"
vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_v1_load_state_device_data(const char *name, uint64_t data_offset, uint64_t data_size) " (%s) Offset 0x%"PRIx64" size 0x%"PRIx64
vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
vfio_load_cleanup(const char *name) " (%s)"
vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
--
2.26.3
* [PATCH v11 10/11] vfio: Alphabetize migration section of VFIO trace-events file
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
` (8 preceding siblings ...)
2023-02-16 14:36 ` [PATCH v11 09/11] vfio/migration: Remove VFIO migration protocol v1 Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 14:36 ` [PATCH v11 11/11] docs/devel: Align VFIO migration docs to v2 protocol Avihai Horon
10 siblings, 0 replies; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Sort the migration section of the VFIO trace-events file alphabetically
and move two misplaced traces to the common.c section.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
hw/vfio/trace-events | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 9d65c2da2e..669d9fe07c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -119,6 +119,8 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
vfio_dma_unmap_overflow_workaround(void) ""
+vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
+vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
# platform.c
vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
@@ -148,19 +150,17 @@ vfio_display_edid_update(uint32_t prefx, uint32_t prefy) "%ux%u"
vfio_display_edid_write_error(void) ""
# migration.c
+vfio_load_cleanup(const char *name) " (%s)"
+vfio_load_device_config_state(const char *name) " (%s)"
+vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
+vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
vfio_migration_probe(const char *name) " (%s)"
vfio_migration_set_state(const char *name, const char *state) " (%s) state %s"
-vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
vfio_migration_state_notifier(const char *name, const char *state) " (%s) state %s"
-vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
+vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
+vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
-vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
-vfio_load_device_config_state(const char *name) " (%s)"
-vfio_load_state(const char *name, uint64_t data) " (%s) data 0x%"PRIx64
-vfio_load_state_device_data(const char *name, uint64_t data_size, int ret) " (%s) size 0x%"PRIx64" ret %d"
-vfio_load_cleanup(const char *name) " (%s)"
-vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
-vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
-vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
+vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
--
2.26.3
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v11 11/11] docs/devel: Align VFIO migration docs to v2 protocol
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
` (9 preceding siblings ...)
2023-02-16 14:36 ` [PATCH v11 10/11] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
@ 2023-02-16 14:36 ` Avihai Horon
2023-02-16 14:57 ` Juan Quintela
10 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 14:36 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Avihai Horon, Kirti Wankhede,
Tarun Gupta, Joao Martins
Now that VFIO migration protocol v2 has been implemented and v1 protocol
has been removed, update the documentation according to v2 protocol.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
docs/devel/vfio-migration.rst | 72 +++++++++++++++++------------------
1 file changed, 34 insertions(+), 38 deletions(-)
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 673057c90d..c214c73e28 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,46 +7,43 @@ the guest is running on source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.
-Migration of VFIO devices consists of two phases: the optional pre-copy phase,
-and the stop-and-copy phase. The pre-copy phase is iterative and allows to
-accommodate VFIO devices that have a large amount of data that needs to be
-transferred. The iterative pre-copy phase of migration allows for the guest to
-continue whilst the VFIO device state is transferred to the destination, this
-helps to reduce the total downtime of the VM. VFIO devices can choose to skip
-the pre-copy phase of migration by returning pending_bytes as zero during the
-pre-copy phase.
+Migration of VFIO devices currently consists of a single stop-and-copy phase.
+During the stop-and-copy phase the guest is stopped and the entire VFIO device
+data is transferred to the destination.
+
+The pre-copy phase of migration is currently not supported for VFIO devices.
+Support for VFIO pre-copy will be added later on.
+
+Note that currently VFIO migration is supported only for a single device. This
+is due to VFIO migration's lack of P2P support. However, P2P support is planned
+to be added later on.
A detailed description of the UAPI for VFIO device migration can be found in
-the comment for the ``vfio_device_migration_info`` structure in the header
-file linux-headers/linux/vfio.h.
+the comment for the ``vfio_device_mig_state`` structure in the header file
+linux-headers/linux/vfio.h.
VFIO implements the device hooks for the iterative approach as follows:
-* A ``save_setup`` function that sets up the migration region and sets _SAVING
- flag in the VFIO device state.
+* A ``save_setup`` function that sets up migration on the source.
-* A ``load_setup`` function that sets up the migration region on the
- destination and sets _RESUMING flag in the VFIO device state.
+* A ``load_setup`` function that sets the VFIO device on the destination in
+ _RESUMING state.
* A ``state_pending_exact`` function that reads pending_bytes from the vendor
driver, which indicates the amount of data that the vendor driver has yet to
save for the VFIO device.
-* A ``save_live_iterate`` function that reads the VFIO device's data from the
- vendor driver through the migration region during iterative phase.
-
* A ``save_state`` function to save the device config space if it is present.
-* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
- VFIO device state and iteratively copies the remaining data for the VFIO
- device until the vendor driver indicates that no data remains (pending bytes
- is zero).
+* A ``save_live_complete_precopy`` function that sets the VFIO device in
+ _STOP_COPY state and iteratively copies the data for the VFIO device until
+ the vendor driver indicates that no data remains.
* A ``load_state`` function that loads the config section and the data
- sections that are generated by the save functions above
+ sections that are generated by the save functions above.
* ``cleanup`` functions for both save and load that perform any migration
- related cleanup, including unmapping the migration region
+ related cleanup.
The VFIO migration code uses a VM state change handler to change the VFIO
@@ -71,13 +68,13 @@ tracking can identify dirtied pages, but any page pinned by the vendor driver
can also be written by the device. There is currently no device or IOMMU
support for dirty page tracking in hardware.
-By default, dirty pages are tracked when the device is in pre-copy as well as
-stop-and-copy phase. So, a page pinned by the vendor driver will be copied to
-the destination in both phases. Copying dirty pages in pre-copy phase helps
-QEMU to predict if it can achieve its downtime tolerances. If QEMU during
-pre-copy phase keeps finding dirty pages continuously, then it understands
-that even in stop-and-copy phase, it is likely to find dirty pages and can
-predict the downtime accordingly.
+By default, dirty pages are tracked during pre-copy as well as stop-and-copy
+phase. So, a page pinned by the vendor driver will be copied to the destination
+in both phases. Copying dirty pages in pre-copy phase helps QEMU to predict if
+it can achieve its downtime tolerances. If QEMU during pre-copy phase keeps
+finding dirty pages continuously, then it understands that even in stop-and-copy
+phase, it is likely to find dirty pages and can predict the downtime
+accordingly.
QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
which disables querying the dirty bitmap during pre-copy phase. If it is set to
@@ -111,23 +108,22 @@ Live migration save path
|
migrate_init spawns migration_thread
Migration thread then calls each device's .save_setup()
- (RUNNING, _SETUP, _RUNNING|_SAVING)
+ (RUNNING, _SETUP, _RUNNING)
|
- (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+ (RUNNING, _ACTIVE, _RUNNING)
If device is active, get pending_bytes by .state_pending_exact()
If total pending_bytes >= threshold_size, call .save_live_iterate()
- Data of VFIO device for pre-copy phase is copied
Iterate till total pending bytes converge and are less than threshold
|
On migration completion, vCPU stops and calls .save_live_complete_precopy for
- each active device. The VFIO device is then transitioned into _SAVING state
- (FINISH_MIGRATE, _DEVICE, _SAVING)
+ each active device. The VFIO device is then transitioned into _STOP_COPY state
+ (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
|
For the VFIO device, iterate in .save_live_complete_precopy until
pending data is 0
- (FINISH_MIGRATE, _DEVICE, _STOPPED)
+ (FINISH_MIGRATE, _DEVICE, _STOP)
|
- (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+ (FINISH_MIGRATE, _COMPLETED, _STOP)
Migraton thread schedules cleanup bottom half and exits
Live migration resume path
@@ -136,7 +132,7 @@ Live migration resume path
::
Incoming migration calls .load_setup for each device
- (RESTORE_VM, _ACTIVE, _STOPPED)
+ (RESTORE_VM, _ACTIVE, _STOP)
|
For each device, .load_state is called for that device section data
(RESTORE_VM, _ACTIVE, _RESUMING)
--
2.26.3
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH v11 06/11] vfio/migration: Move migration v1 logic to vfio_migration_init()
2023-02-16 14:36 ` [PATCH v11 06/11] vfio/migration: Move migration v1 logic to vfio_migration_init() Avihai Horon
@ 2023-02-16 14:50 ` Juan Quintela
0 siblings, 0 replies; 41+ messages in thread
From: Juan Quintela @ 2023-02-16 14:50 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
Avihai Horon <avihaih@nvidia.com> wrote:
> Move vfio_dev_get_region_info() logic from vfio_migration_probe() to
> vfio_migration_init(). This logic is specific to v1 protocol and moving
> it will make it easier to add the v2 protocol implementation later.
> No functional changes intended.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 04/11] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one
2023-02-16 14:36 ` [PATCH v11 04/11] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one Avihai Horon
@ 2023-02-16 14:53 ` Juan Quintela
0 siblings, 0 replies; 41+ messages in thread
From: Juan Quintela @ 2023-02-16 14:53 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
Avihai Horon <avihaih@nvidia.com> wrote:
> vfio_devices_all_running_and_saving() is used to check if migration is
> in pre-copy phase. This is done by checking if migration is in setup or
> active states and if all VFIO devices are in pre-copy state, i.e.
> _SAVING | _RUNNING.
>
> In VFIO migration protocol v2 pre-copy support is made optional. Hence,
> a matching v2 protocol pre-copy state can't be used here.
>
> As preparation for adding v2 protocol, change
> vfio_devices_all_running_and_saving() logic such that it doesn't use the
> VFIO pre-copy state.
>
> The new equivalent logic checks if migration is in active state and if
> all VFIO devices are in running state [1]. No functional changes
> intended.
>
> [1] Note that checking if migration is in setup or active states and if
> all VFIO devices are in running state doesn't guarantee that we are in
> pre-copy phase, thus we check if migration is only in active state.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 07/11] vfio/migration: Rename functions/structs related to v1 protocol
2023-02-16 14:36 ` [PATCH v11 07/11] vfio/migration: Rename functions/structs related to v1 protocol Avihai Horon
@ 2023-02-16 14:54 ` Juan Quintela
0 siblings, 0 replies; 41+ messages in thread
From: Juan Quintela @ 2023-02-16 14:54 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
Avihai Horon <avihaih@nvidia.com> wrote:
> To avoid name collisions, rename functions and structs related to VFIO
> migration protocol v1. This will allow the two protocols to co-exist
> when v2 protocol is added, until v1 is removed. No functional changes
> intended.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 11/11] docs/devel: Align VFIO migration docs to v2 protocol
2023-02-16 14:36 ` [PATCH v11 11/11] docs/devel: Align VFIO migration docs to v2 protocol Avihai Horon
@ 2023-02-16 14:57 ` Juan Quintela
0 siblings, 0 replies; 41+ messages in thread
From: Juan Quintela @ 2023-02-16 14:57 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
Avihai Horon <avihaih@nvidia.com> wrote:
> Now that VFIO migration protocol v2 has been implemented and v1 protocol
> has been removed, update the documentation according to v2 protocol.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2023-02-16 14:36 ` [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
@ 2023-02-16 15:43 ` Juan Quintela
2023-02-16 16:40 ` Avihai Horon
2024-09-04 13:00 ` Peter Xu
1 sibling, 1 reply; 41+ messages in thread
From: Juan Quintela @ 2023-02-16 15:43 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
Avihai Horon <avihaih@nvidia.com> wrote:
Reviewed-by: Juan Quintela <quintela@redhat.com>.
My first comment points at a bug, but you remove that code anyway.
Not that it matters a lot in the end (you are removing v1 in the
following patch).
> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> return false;
> }
>
> - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> + if (!migration->v2 &&
> + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> + continue;
> + }
Are you missing here:
} else {
return false;
}
Or am I misunderstanding the code.
> + if (migration->v2 &&
> + migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> continue;
> } else {
> return false;
[...]
> +/*
> + * This is an arbitrary size based on migration of mlx5 devices, where typically
> + * total device migration size is on the order of 100s of MB. Testing with
> + * larger values, e.g. 128MB and 1GB, did not show a performance improvement.
> + */
> +#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
Just in case that it makes a difference. You are using here a 1MB
buffer.
But qemu_file_get_to_fd() uses a 32KB buffer.
> }
> }
>
> +static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
> + uint64_t *stop_copy_size)
> +{
> + uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> + sizeof(struct vfio_device_feature_mig_data_size),
> + sizeof(uint64_t))] = {};
> + struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
suggestion:
size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
sizeof(struct vfio_device_feature_mig_data_size),
sizeof(uint64_t));
g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t));
I have to reread several times to see what was going on there.
And call it a day?
Later, Juan.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2023-02-16 15:43 ` Juan Quintela
@ 2023-02-16 16:40 ` Avihai Horon
2023-02-16 16:52 ` Juan Quintela
0 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2023-02-16 16:40 UTC (permalink / raw)
To: quintela
Cc: qemu-devel, Alex Williamson, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
On 16/02/2023 17:43, Juan Quintela wrote:
> External email: Use caution opening links or attachments
>
>
> Avihai Horon <avihaih@nvidia.com> wrote:
>
> Reviewed-by: Juan Quintela <quintela@redhat.com>.
>
>
> 1st comment is a bug, but as you just remove it.
>
>
> Not that it matters a lot in the end (you are removing v1 on the
> following patch).
>
>> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>> return false;
>> }
>>
>> - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>> + if (!migration->v2 &&
>> + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>> + continue;
>> + }
> Are you missing here:
>
>
> } else {
> return false;
> }
>
> Or I am missunderstanding the code.
I think the code is OK:
If the device uses v1 migration and is running, the first if is true and
we continue.
If the device uses v1 migration and is not running, the first if is
false and the second if is false (since the device doesn't use v2
migration) and we return false.
If the device uses v2 migration and is running, the first if is false
and the second if is true, so we continue.
If the device uses v2 migration and is not running, first if is false
and the second if is false, so we return false.
>
>> + if (migration->v2 &&
>> + migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> continue;
>> } else {
>> return false;
>
> [...]
>
>
>
>> +/*
>> + * This is an arbitrary size based on migration of mlx5 devices, where typically
>> + * total device migration size is on the order of 100s of MB. Testing with
>> + * larger values, e.g. 128MB and 1GB, did not show a performance improvement.
>> + */
>> +#define VFIO_MIG_DEFAULT_DATA_BUFFER_SIZE (1 * MiB)
> Just in case that it makes a difference. You are using here a 1MB
> buffer.
> But qemu_file_get_to_fd() uses a 32KB buffer.
Yes, I'm aware of that.
Down the road we will address the performance subject more thoroughly.
In the meantime, it seems like a reasonable size according to the tests
we did.
>
>
>> }
>> }
>>
>> +static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>> + uint64_t *stop_copy_size)
>> +{
>> + uint64_t buf[DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>> + sizeof(struct vfio_device_feature_mig_data_size),
>> + sizeof(uint64_t))] = {};
>> + struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;
> suggestion:
>
> size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> sizeof(struct vfio_device_feature_mig_data_size),
> sizeof(uint64_t));
> g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t));
>
> I have to reread several times to see what was going on there.
>
>
> And call it a day?
I don't have a strong opinion here.
We can do whatever you/Alex like more.
Thanks for reviewing!
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2023-02-16 16:40 ` Avihai Horon
@ 2023-02-16 16:52 ` Juan Quintela
2023-02-16 19:53 ` Alex Williamson
0 siblings, 1 reply; 41+ messages in thread
From: Juan Quintela @ 2023-02-16 16:52 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
Avihai Horon <avihaih@nvidia.com> wrote:
> On 16/02/2023 17:43, Juan Quintela wrote:
>> Avihai Horon <avihaih@nvidia.com> wrote:
>>
>> Reviewed-by: Juan Quintela <quintela@redhat.com>.
>>
>>
>> 1st comment is a bug, but as you just remove it.
>>
>>
>> Not that it matters a lot in the end (you are removing v1 on the
>> following patch).
>>
>>> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>> return false;
>>> }
>>>
>>> - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>>> + if (!migration->v2 &&
>>> + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
>>> + continue;
>>> + }
>> Are you missing here:
>>
>>
>> } else {
>> return false;
>> }
>>
>> Or I am missunderstanding the code.
>
> I think the code is OK:
>
> If the device uses v1 migration and is running, the first if is true
> and we continue.
> If the device uses v1 migration and is not running, the first if is
> false and the second if is false (since the device doesn't use v2
> migration) and we return false.
>
> If the device uses v2 migration and is running, the first if is false
> and the second if is true, so we continue.
> If the device uses v2 migration and is not running, first if is false
> and the second if is false, so we return false.
You win O:-)
I was looking at the C level, not at the "semantic" level.
>>
>> size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
>> sizeof(struct vfio_device_feature_mig_data_size),
>> sizeof(uint64_t));
>> g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t));
>>
>> I have to reread several times to see what was going on there.
>>
>>
>> And call it a day?
>
> I don't have a strong opinion here.
> We can do whatever you/Alex like more.
I think it is more readable, and we normally don't put big chunks on
the stack, but that said, whoever wrote the code has more say O:-)
Later, Juan.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2023-02-16 16:52 ` Juan Quintela
@ 2023-02-16 19:53 ` Alex Williamson
0 siblings, 0 replies; 41+ messages in thread
From: Alex Williamson @ 2023-02-16 19:53 UTC (permalink / raw)
To: Juan Quintela
Cc: Avihai Horon, qemu-devel, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
On Thu, 16 Feb 2023 17:52:33 +0100
Juan Quintela <quintela@redhat.com> wrote:
> Avihai Horon <avihaih@nvidia.com> wrote:
> > On 16/02/2023 17:43, Juan Quintela wrote:
> >> Avihai Horon <avihaih@nvidia.com> wrote:
> >>
> >> Reviewed-by: Juan Quintela <quintela@redhat.com>.
> >>
> >>
> >> 1st comment is a bug, but as you just remove it.
> >>
> >>
> >> Not that it matters a lot in the end (you are removing v1 on the
> >> following patch).
> >>
> >>> @@ -438,7 +445,13 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> >>> return false;
> >>> }
> >>>
> >>> - if (migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> >>> + if (!migration->v2 &&
> >>> + migration->device_state_v1 & VFIO_DEVICE_STATE_V1_RUNNING) {
> >>> + continue;
> >>> + }
> >> Are you missing here:
> >>
> >>
> >> } else {
> >> return false;
> >> }
> >>
> >> Or I am missunderstanding the code.
> >
> > I think the code is OK:
> >
> > If the device uses v1 migration and is running, the first if is true
> > and we continue.
> > If the device uses v1 migration and is not running, the first if is
> > false and the second if is false (since the device doesn't use v2
> > migration) and we return false.
> >
> > If the device uses v2 migration and is running, the first if is false
> > and the second if is true, so we continue.
> > If the device uses v2 migration and is not running, first if is false
> > and the second if is false, so we return false.
>
> You win O:-)
>
> I was looking at C level, not at the "semantic" level.
>
> >>
> >> size_t size = DIV_ROUND_UP(sizeof(struct vfio_device_feature) +
> >> sizeof(struct vfio_device_feature_mig_data_size),
> >> sizeof(uint64_t));
> >> g_autofree struct vfio_device_feature *feature = g_malloc0(size * sizeof(uint64_t));
> >>
> >> I have to reread several times to see what was going on there.
> >>
> >>
> >> And call it a day?
> >
> > I don't have a strong opinion here.
> > We can do whatever you/Alex like more.
>
> I think it is more readable, and we don't put (normally) big chunks in
> the stack, but once told that, who wrotes the code has more to say O:-)
This is not a huge amount of data on the stack, so I'm fine with it as
is. Thanks,
Alex
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v11 05/11] vfio/migration: Block multiple devices migration
2023-02-16 14:36 ` [PATCH v11 05/11] vfio/migration: Block multiple devices migration Avihai Horon
@ 2023-05-16 10:03 ` Shameerali Kolothum Thodi via
2023-05-16 11:59 ` Jason Gunthorpe
0 siblings, 1 reply; 41+ messages in thread
From: Shameerali Kolothum Thodi via @ 2023-05-16 10:03 UTC (permalink / raw)
To: Avihai Horon, qemu-devel@nongnu.org
Cc: Alex Williamson, Juan Quintela, Dr. David Alan Gilbert,
Michael S. Tsirkin, Cornelia Huck, Paolo Bonzini,
Vladimir Sementsov-Ogievskiy, Cédric Le Goater, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins
> -----Original Message-----
> From:
> qemu-devel-bounces+shameerali.kolothum.thodi=huawei.com@nongnu.org
> [mailto:qemu-devel-bounces+shameerali.kolothum.thodi=huawei.com@nong
> nu.org] On Behalf Of Avihai Horon
> Sent: 16 February 2023 14:36
> To: qemu-devel@nongnu.org
> Cc: Alex Williamson <alex.williamson@redhat.com>; Juan Quintela
> <quintela@redhat.com>; Dr. David Alan Gilbert <dgilbert@redhat.com>;
> Michael S. Tsirkin <mst@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> Paolo Bonzini <pbonzini@redhat.com>; Vladimir Sementsov-Ogievskiy
> <vsementsov@yandex-team.ru>; Cédric Le Goater <clg@redhat.com>; Yishai
> Hadas <yishaih@nvidia.com>; Jason Gunthorpe <jgg@nvidia.com>; Maor
> Gottlieb <maorg@nvidia.com>; Avihai Horon <avihaih@nvidia.com>; Kirti
> Wankhede <kwankhede@nvidia.com>; Tarun Gupta <targupta@nvidia.com>;
> Joao Martins <joao.m.martins@oracle.com>
> Subject: [PATCH v11 05/11] vfio/migration: Block multiple devices migration
>
> Currently VFIO migration doesn't implement some kind of intermediate
> quiescent state in which P2P DMAs are quiesced before stopping or
> running the device. This can cause problems in multi-device migration
> where the devices are doing P2P DMAs, since the devices are not stopped
> together at the same time.
>
> Until such support is added, block migration of multiple devices.
Missed this one. Currently this blocks even if the attached devices are not
capable of P2P DMAs. eg; HiSilicon ACC devices. These are integrated end point
devices without any P2P capability between them. Is it Ok to check for
VFIO_MIGRATION_P2P flag and allow if the devices are not supporting that?
I can send a patch if that’s fine.
Thanks,
Shameer
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
> include/hw/vfio/vfio-common.h | 2 ++
> hw/vfio/common.c | 53
> +++++++++++++++++++++++++++++++++++
> hw/vfio/migration.c | 6 ++++
> 3 files changed, 61 insertions(+)
>
> diff --git a/include/hw/vfio/vfio-common.h
> b/include/hw/vfio/vfio-common.h
> index e573f5a9f1..56b1683824 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -218,6 +218,8 @@ typedef QLIST_HEAD(VFIOGroupList, VFIOGroup)
> VFIOGroupList;
> extern VFIOGroupList vfio_group_list;
>
> bool vfio_mig_active(void);
> +int vfio_block_multiple_devices_migration(Error **errp);
> +void vfio_unblock_multiple_devices_migration(void);
> int64_t vfio_mig_bytes_transferred(void);
>
> #ifdef CONFIG_LINUX
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 3a35f4afad..fe80ccf914 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -41,6 +41,7 @@
> #include "qapi/error.h"
> #include "migration/migration.h"
> #include "migration/misc.h"
> +#include "migration/blocker.h"
> #include "sysemu/tpm.h"
>
> VFIOGroupList vfio_group_list =
> @@ -337,6 +338,58 @@ bool vfio_mig_active(void)
> return true;
> }
>
> +static Error *multiple_devices_migration_blocker;
> +
> +static unsigned int vfio_migratable_device_num(void)
> +{
> + VFIOGroup *group;
> + VFIODevice *vbasedev;
> + unsigned int device_num = 0;
> +
> + QLIST_FOREACH(group, &vfio_group_list, next) {
> + QLIST_FOREACH(vbasedev, &group->device_list, next) {
> + if (vbasedev->migration) {
> + device_num++;
> + }
> + }
> + }
> +
> + return device_num;
> +}
> +
> +int vfio_block_multiple_devices_migration(Error **errp)
> +{
> + int ret;
> +
> + if (multiple_devices_migration_blocker ||
> + vfio_migratable_device_num() <= 1) {
> + return 0;
> + }
> +
> + error_setg(&multiple_devices_migration_blocker,
> + "Migration is currently not supported with multiple "
> + "VFIO devices");
> + ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
> + if (ret < 0) {
> + error_free(multiple_devices_migration_blocker);
> + multiple_devices_migration_blocker = NULL;
> + }
> +
> + return ret;
> +}
> +
> +void vfio_unblock_multiple_devices_migration(void)
> +{
> + if (!multiple_devices_migration_blocker ||
> + vfio_migratable_device_num() > 1) {
> + return;
> + }
> +
> + migrate_del_blocker(multiple_devices_migration_blocker);
> + error_free(multiple_devices_migration_blocker);
> + multiple_devices_migration_blocker = NULL;
> +}
> +
> static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> {
> VFIOGroup *group;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index e56eef1ee8..8e96999669 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -878,6 +878,11 @@ int vfio_migration_probe(VFIODevice *vbasedev,
> Error **errp)
> goto add_blocker;
> }
>
> + ret = vfio_block_multiple_devices_migration(errp);
> + if (ret) {
> + return ret;
> + }
> +
> trace_vfio_migration_probe(vbasedev->name, info->index);
> g_free(info);
> return 0;
> @@ -904,6 +909,7 @@ void vfio_migration_finalize(VFIODevice *vbasedev)
> qemu_del_vm_change_state_handler(migration->vm_state);
> unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio",
> vbasedev);
> vfio_migration_exit(vbasedev);
> + vfio_unblock_multiple_devices_migration();
> }
>
> if (vbasedev->migration_blocker) {
> --
> 2.26.3
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 05/11] vfio/migration: Block multiple devices migration
2023-05-16 10:03 ` Shameerali Kolothum Thodi via
@ 2023-05-16 11:59 ` Jason Gunthorpe
2023-05-16 13:57 ` Shameerali Kolothum Thodi via
0 siblings, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2023-05-16 11:59 UTC (permalink / raw)
To: Shameerali Kolothum Thodi
Cc: Avihai Horon, qemu-devel@nongnu.org, Alex Williamson,
Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Maor Gottlieb,
Kirti Wankhede, Tarun Gupta, Joao Martins
On Tue, May 16, 2023 at 10:03:54AM +0000, Shameerali Kolothum Thodi wrote:
> > Currently VFIO migration doesn't implement some kind of intermediate
> > quiescent state in which P2P DMAs are quiesced before stopping or
> > running the device. This can cause problems in multi-device migration
> > where the devices are doing P2P DMAs, since the devices are not stopped
> > together at the same time.
> >
> > Until such support is added, block migration of multiple devices.
>
> Missed this one. Currently this blocks even if the attached devices are not
> capable of P2P DMAs. eg; HiSilicon ACC devices. These are integrated end point
> devices without any P2P capability between them. Is it Ok to check for
> VFIO_MIGRATION_P2P flag and allow if the devices are not supporting that?
Lacking VFIO_MIGRATION_P2P doesn't mean the device is incapable of
P2P; it means the migration can't support P2P.
We'd need some kind of new flag to check, and such devices should be
blocked from creating P2P mappings. Basically, we don't currently
fully support devices that are incapable of P2P operations.
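The check being discussed could be sketched as below. VFIO_MIGRATION_STOP_COPY and VFIO_MIGRATION_P2P match the real linux/vfio.h flags, but VFIO_MIGRATION_NO_P2P is a hypothetical stand-in for the "new flag" Jason mentions (a device declared incapable of P2P); it does not exist in the kernel today:

```c
#include <assert.h>
#include <stdbool.h>

/* First two values match linux/vfio.h; the third is an assumption. */
#define VFIO_MIGRATION_STOP_COPY (1 << 0)
#define VFIO_MIGRATION_P2P       (1 << 1)
#define VFIO_MIGRATION_NO_P2P    (1 << 2) /* assumed, not a real kernel flag */

/*
 * Multi-device migration would be allowed only if every device either
 * supports the P2P quiescent states or is declared incapable of P2P.
 */
static bool multi_device_migration_allowed(const unsigned *mig_flags, int n)
{
    for (int i = 0; i < n; i++) {
        if (!(mig_flags[i] & (VFIO_MIGRATION_P2P | VFIO_MIGRATION_NO_P2P))) {
            return false;
        }
    }
    return true;
}
```

With such a flag, devices like the HiSilicon ACC could migrate together, while devices that merely lack VFIO_MIGRATION_P2P support would still be blocked.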
What happens on your platform if a guest tries to do P2P? Does the
platform crash?
Jason
* RE: [PATCH v11 05/11] vfio/migration: Block multiple devices migration
2023-05-16 11:59 ` Jason Gunthorpe
@ 2023-05-16 13:57 ` Shameerali Kolothum Thodi via
2023-05-16 14:04 ` Jason Gunthorpe
2023-05-16 14:27 ` Alex Williamson
0 siblings, 2 replies; 41+ messages in thread
From: Shameerali Kolothum Thodi via @ 2023-05-16 13:57 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Avihai Horon, qemu-devel@nongnu.org, Alex Williamson,
Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Maor Gottlieb,
Kirti Wankhede, Tarun Gupta, Joao Martins
> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 16 May 2023 13:00
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Avihai Horon <avihaih@nvidia.com>; qemu-devel@nongnu.org; Alex
> Williamson <alex.williamson@redhat.com>; Juan Quintela
> <quintela@redhat.com>; Dr. David Alan Gilbert <dgilbert@redhat.com>;
> Michael S. Tsirkin <mst@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> Paolo Bonzini <pbonzini@redhat.com>; Vladimir Sementsov-Ogievskiy
> <vsementsov@yandex-team.ru>; Cédric Le Goater <clg@redhat.com>; Yishai
> Hadas <yishaih@nvidia.com>; Maor Gottlieb <maorg@nvidia.com>; Kirti
> Wankhede <kwankhede@nvidia.com>; Tarun Gupta <targupta@nvidia.com>;
> Joao Martins <joao.m.martins@oracle.com>
> Subject: Re: [PATCH v11 05/11] vfio/migration: Block multiple devices
> migration
>
> On Tue, May 16, 2023 at 10:03:54AM +0000, Shameerali Kolothum Thodi
> wrote:
>
> > > Currently VFIO migration doesn't implement some kind of intermediate
> > > quiescent state in which P2P DMAs are quiesced before stopping or
> > > running the device. This can cause problems in multi-device migration
> > > where the devices are doing P2P DMAs, since the devices are not stopped
> > > together at the same time.
> > >
> > > Until such support is added, block migration of multiple devices.
> >
> > Missed this one. Currently this blocks even if the attached devices are not
> > capable of P2P DMAs. eg; HiSilicon ACC devices. These are integrated end
> point
> > devices without any P2P capability between them. Is it Ok to check for
> > VFIO_MIGRATION_P2P flag and allow if the devices are not supporting
> that?
>
> Lacking VFIO_MIGRATION_P2P doesn't mean the device is incapable of
> P2P, it means the migration can't support P2P.
>
> We'd need some kind of new flag to check and such devices should be
> blocked from creating P2P mappings. Basically we don't currently
> fully support devices that are incapable of P2P operations.
Ok. I will take a look.
> What happens on your platform if a guest tries to do P2P? Does the
> platform crash?
I am not sure. Since the devices are behind the SMMU, I was under the assumption
that we have the guarantee of isolation here (grouping). Or is this something
we worry about only during migration?
Thanks,
Shameer
* Re: [PATCH v11 05/11] vfio/migration: Block multiple devices migration
2023-05-16 13:57 ` Shameerali Kolothum Thodi via
@ 2023-05-16 14:04 ` Jason Gunthorpe
2023-05-16 14:27 ` Alex Williamson
1 sibling, 0 replies; 41+ messages in thread
From: Jason Gunthorpe @ 2023-05-16 14:04 UTC (permalink / raw)
To: Shameerali Kolothum Thodi
Cc: Avihai Horon, qemu-devel@nongnu.org, Alex Williamson,
Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Maor Gottlieb,
Kirti Wankhede, Tarun Gupta, Joao Martins
On Tue, May 16, 2023 at 01:57:22PM +0000, Shameerali Kolothum Thodi wrote:
> > What happens on your platform if a guest tries to do P2P? Does the
> > platform crash?
>
> I am not sure. Since the devices are behind SMMU, I was under the assumption
> that we do have the guarantee of isolation here(grouping).
That is not the case. The SMMU will contain an IOPTE with the CPU
address of the P2P memory, and when the SMMU executes that operation
the system does .. ??
Hopefully it responds back with error and keeps going, or it
crashes..
Jason
* Re: [PATCH v11 05/11] vfio/migration: Block multiple devices migration
2023-05-16 13:57 ` Shameerali Kolothum Thodi via
2023-05-16 14:04 ` Jason Gunthorpe
@ 2023-05-16 14:27 ` Alex Williamson
2023-05-16 14:35 ` Shameerali Kolothum Thodi via
1 sibling, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2023-05-16 14:27 UTC (permalink / raw)
To: Shameerali Kolothum Thodi
Cc: Jason Gunthorpe, Avihai Horon, qemu-devel@nongnu.org,
Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Maor Gottlieb,
Kirti Wankhede, Tarun Gupta, Joao Martins
On Tue, 16 May 2023 13:57:22 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> > On Tue, May 16, 2023 at 10:03:54AM +0000, Shameerali Kolothum Thodi
> > wrote:
> >
> > > > Currently VFIO migration doesn't implement some kind of intermediate
> > > > quiescent state in which P2P DMAs are quiesced before stopping or
> > > > running the device. This can cause problems in multi-device migration
> > > > where the devices are doing P2P DMAs, since the devices are not stopped
> > > > together at the same time.
> > > >
> > > > Until such support is added, block migration of multiple devices.
> > >
> > > Missed this one. Currently this blocks even if the attached devices are not
> > > capable of P2P DMAs. eg; HiSilicon ACC devices. These are integrated end
> > point
> > > devices without any P2P capability between them. Is it Ok to check for
> > > VFIO_MIGRATION_P2P flag and allow if the devices are not supporting
> > that?
> >
> > Lacking VFIO_MIGRATION_P2P doesn't mean the device is incapable of
> > P2P, it means the migration can't support P2P.
> >
> > We'd need some kind of new flag to check and such devices should be
> > blocked from creating P2P mappings. Basically we don't currently
> > fully support devices that are incapable of P2P operations.
>
> Ok. I will take a look.
>
> > What happens on your platform if a guest tries to do P2P? Does the
> > platform crash?
>
> I am not sure. Since the devices are behind SMMU, I was under the assumption
> that we do have the guarantee of isolation here(grouping). Or this is something
> we are worried only during migration?
Grouping doesn't guarantee that mappings cannot be created through the
SMMU between devices. When we refer to devices being isolated between
groups, that only excludes internal P2P between devices, for example
across the internal link between functions with implementation specific
routing. For group isolation, the guarantee is that DMA is always
routed upstream, not that the ultimate target cannot be another device.
To guarantee lack of P2P the SMMU would need to reject non-memory
translation targets. Thanks,
Alex
* RE: [PATCH v11 05/11] vfio/migration: Block multiple devices migration
2023-05-16 14:27 ` Alex Williamson
@ 2023-05-16 14:35 ` Shameerali Kolothum Thodi via
2023-05-16 16:11 ` Jason Gunthorpe
0 siblings, 1 reply; 41+ messages in thread
From: Shameerali Kolothum Thodi via @ 2023-05-16 14:35 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Avihai Horon, qemu-devel@nongnu.org,
Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Maor Gottlieb,
Kirti Wankhede, Tarun Gupta, Joao Martins
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 16 May 2023 15:28
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; Avihai Horon <avihaih@nvidia.com>;
> qemu-devel@nongnu.org; Juan Quintela <quintela@redhat.com>; Dr. David
> Alan Gilbert <dgilbert@redhat.com>; Michael S. Tsirkin <mst@redhat.com>;
> Cornelia Huck <cohuck@redhat.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Vladimir Sementsov-Ogievskiy
> <vsementsov@yandex-team.ru>; Cédric Le Goater <clg@redhat.com>; Yishai
> Hadas <yishaih@nvidia.com>; Maor Gottlieb <maorg@nvidia.com>; Kirti
> Wankhede <kwankhede@nvidia.com>; Tarun Gupta <targupta@nvidia.com>;
> Joao Martins <joao.m.martins@oracle.com>
> Subject: Re: [PATCH v11 05/11] vfio/migration: Block multiple devices
> migration
>
> On Tue, 16 May 2023 13:57:22 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> wrote:
>
> > > On Tue, May 16, 2023 at 10:03:54AM +0000, Shameerali Kolothum Thodi
> > > wrote:
> > >
> > > > > Currently VFIO migration doesn't implement some kind of
> intermediate
> > > > > quiescent state in which P2P DMAs are quiesced before stopping or
> > > > > running the device. This can cause problems in multi-device migration
> > > > > where the devices are doing P2P DMAs, since the devices are not
> stopped
> > > > > together at the same time.
> > > > >
> > > > > Until such support is added, block migration of multiple devices.
> > > >
> > > > Missed this one. Currently this blocks even if the attached devices are
> not
> > > > capable of P2P DMAs. eg; HiSilicon ACC devices. These are integrated
> end
> > > point
> > > > devices without any P2P capability between them. Is it Ok to check for
> > > > VFIO_MIGRATION_P2P flag and allow if the devices are not supporting
> > > that?
> > >
> > > Lacking VFIO_MIGRATION_P2P doesn't mean the device is incapable of
> > > P2P, it means the migration can't support P2P.
> > >
> > > We'd need some kind of new flag to check and such devices should be
> > > blocked from creating P2P mappings. Basically we don't currently
> > > fully support devices that are incapable of P2P operations.
> >
> > Ok. I will take a look.
> >
> > > What happens on your platform if a guest tries to do P2P? Does the
> > > platform crash?
> >
> > I am not sure. Since the devices are behind SMMU, I was under the
> assumption
> > that we do have the guarantee of isolation here(grouping). Or this is
> something
> > we are worried only during migration?
>
> Grouping doesn't guarantee that mappings cannot be created through the
> SMMU between devices. When we refer to devices being isolated between
> groups, that only excludes internal P2P between devices, for example
> across the internal link between functions with implementation specific
> routing. For group isolation, the guarantee is that DMA is always
> routed upstream, not that the ultimate target cannot be another device.
> To guarantee lack of P2P the SMMU would need to reject non-memory
> translation targets. Thanks,
Ok. Got it. So it depends on what the SMMU does for that mapping; it is not
related to migration per se and has the potential to crash the system if
the SMMU goes ahead with that memory access. Isn't it a more generic problem,
then, when we have multiple devices attached to the VM? I need to check whether
there is anything in the SMMU spec that forbids this access.
Thanks,
Shameer
* Re: [PATCH v11 05/11] vfio/migration: Block multiple devices migration
2023-05-16 14:35 ` Shameerali Kolothum Thodi via
@ 2023-05-16 16:11 ` Jason Gunthorpe
0 siblings, 0 replies; 41+ messages in thread
From: Jason Gunthorpe @ 2023-05-16 16:11 UTC (permalink / raw)
To: Shameerali Kolothum Thodi
Cc: Alex Williamson, Avihai Horon, qemu-devel@nongnu.org,
Juan Quintela, Dr. David Alan Gilbert, Michael S. Tsirkin,
Cornelia Huck, Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Maor Gottlieb,
Kirti Wankhede, Tarun Gupta, Joao Martins
On Tue, May 16, 2023 at 02:35:21PM +0000, Shameerali Kolothum Thodi wrote:
> Ok. Got it. So it depends on what SMMU does for that mapping and is not
> related to migration per se and has the potential to crash the system if
> SMMU go ahead with that memory access. Isn't it a more generic problem
> then when we have multiple devices attached to the VM?
It is, and I am hoping to solve it when I get P2P support into iommufd
> I need to check if there is anything in SMMU spec that forbids this
> access.
There isn't; it is up to the integration how it handles 'hairpin'
traffic. A good implementation will work correctly, as Intel and other
CPUs do. A very bad implementation will crash. The middle ground is to
return error TLPs.
Jason
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2023-02-16 14:36 ` [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2023-02-16 15:43 ` Juan Quintela
@ 2024-09-04 13:00 ` Peter Xu
2024-09-04 15:41 ` Avihai Horon
1 sibling, 1 reply; 41+ messages in thread
From: Peter Xu @ 2024-09-04 13:00 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
Hello, Avihai,
Reviving this thread just to discuss one issue below..
On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
> +/*
> + * Migration size of VFIO devices can be as little as a few KBs or as big as
> + * many GBs. This value should be big enough to cover the worst case.
> + */
> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> +
> +/*
> + * Only exact function is implemented and not estimate function. The reason is
> + * that during pre-copy phase of migration the estimate function is called
> + * repeatedly while pending RAM size is over the threshold, thus migration
> + * can't converge and querying the VFIO device pending data size is useless.
> + */
> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> + uint64_t *can_postcopy)
> +{
> + VFIODevice *vbasedev = opaque;
> + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
> +
> + /*
> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
> + * reported so downtime limit won't be violated.
> + */
> + vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> + *must_precopy += stop_copy_size;
Is this the chunk of data that can only be copied while the VM is stopped? If
so, I wonder why it's reported as "must precopy" if we know precopy won't ever
move it..
The issue is that with such reporting (and in the latest master branch we now
have the precopy size too, reported both in exact() and
estimate()), we can observe weird reports like this:
23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
So estimate() keeps reporting zero while exact() reports something much larger,
and it keeps spinning like this. I think that's not how it was designed to be
used..
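The spinning in the trace above can be sketched with a toy model of the convergence check; the names here are invented for illustration, not QEMU's actual code. estimate() is the cheap fast path, and only when it falls below the threshold does the loop pay for exact(), which implies a dirty-bitmap sync:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t estimate;  /* cheap, possibly stale */
    uint64_t exact;     /* expensive: triggers a bitmap sync */
} Pending;

/* One migration iteration's decision; *syncs counts exact() queries. */
static bool iteration_converged(Pending p, uint64_t threshold, int *syncs)
{
    uint64_t pending = p.estimate;

    if (pending <= threshold) {
        (*syncs)++;          /* estimate said "close": pay for exact() */
        pending = p.exact;
    }
    return pending <= threshold;
}
```

With estimate() pinned at 0 and exact() at ~21 GiB, every single iteration takes the expensive branch and still fails to converge, which is exactly the loop shown in the trace.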
Does this stop copy size change for a VFIO device or not?
IIUC, we may want some other mechanism to report stop copy size for a
device, rather than reporting it with the current exact()/estimate() API.
That's, per my understanding, only used for iterable data, while
stop-copy-size may not fall into that category if so.
> +
> + trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> + stop_copy_size);
> +}
--
Peter Xu
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-04 13:00 ` Peter Xu
@ 2024-09-04 15:41 ` Avihai Horon
2024-09-04 16:16 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2024-09-04 15:41 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On 04/09/2024 16:00, Peter Xu wrote:
>
>
> Hello, Avihai,
>
> Reviving this thread just to discuss one issue below..
>
> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
>> +/*
>> + * Migration size of VFIO devices can be as little as a few KBs or as big as
>> + * many GBs. This value should be big enough to cover the worst case.
>> + */
>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>> +
>> +/*
>> + * Only exact function is implemented and not estimate function. The reason is
>> + * that during pre-copy phase of migration the estimate function is called
>> + * repeatedly while pending RAM size is over the threshold, thus migration
>> + * can't converge and querying the VFIO device pending data size is useless.
>> + */
>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>> + uint64_t *can_postcopy)
>> +{
>> + VFIODevice *vbasedev = opaque;
>> + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>> +
>> + /*
>> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
>> + * reported so downtime limit won't be violated.
>> + */
>> + vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>> + *must_precopy += stop_copy_size;
> Is this the chunk of data only can be copied during VM stopped? If so, I
> wonder why it's reported as "must precopy" if we know precopy won't ever
> move them..
A VFIO device that doesn't support precopy will send this data only when
the VM is stopped.
A VFIO device that supports precopy may or may not send this data (or
part of it) during precopy; it depends on the specific VFIO device.
According to the state_pending_{estimate,exact} documentation, must_precopy
is the amount of data that must be migrated before the target starts, and
indeed this VFIO data must be migrated before the target starts.
>
> The issue is if with such reporting (and now in latest master branch we do
> have the precopy size too, which was reported both in exact() and
> estimate()), we can observe weird reports like this:
>
> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>
> So estimate() keeps reporting zero but the exact() reports much larger, and
> it keeps spinning like this. I think that's not how it was designed to be
> used..
It keeps spinning and migration doesn't converge?
If so, configuring a higher downtime limit or the
avail-switchover-bandwidth parameter may solve it.
>
> Does this stop copy size change for a VFIO device or not?
It depends on the specific VFIO device.
If the device supports precopy and all (or part) of its data is
precopy-able, then stopcopy size will change.
Besides that, the amount of resources currently used by the VFIO device
can also affect the stopcopy size, and it may increase or decrease as
resources are created or destroyed.
> IIUC, we may want some other mechanism to report stop copy size for a
> device, rather than reporting it with the current exact()/estimate() api.
> That's, per my undertanding, only used for iterable data, while
> stop-copy-size may not fall into that category if so.
The above situation is caused by the fact that VFIO data may not be
fully precopy-able (as opposed to RAM data).
I don't think reporting the stop-copy-size in a different API will help
the above situation -- we would still have to take stop-copy-size into
account before converging, to not violate downtime.
Thanks.
>
>> +
>> + trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>> + stop_copy_size);
>> +}
> --
> Peter Xu
>
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-04 15:41 ` Avihai Horon
@ 2024-09-04 16:16 ` Peter Xu
2024-09-05 11:41 ` Avihai Horon
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2024-09-04 16:16 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
>
> On 04/09/2024 16:00, Peter Xu wrote:
> >
> >
> > Hello, Avihai,
> >
> > Reviving this thread just to discuss one issue below..
> >
> > On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
> > > +/*
> > > + * Migration size of VFIO devices can be as little as a few KBs or as big as
> > > + * many GBs. This value should be big enough to cover the worst case.
> > > + */
> > > +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> > > +
> > > +/*
> > > + * Only exact function is implemented and not estimate function. The reason is
> > > + * that during pre-copy phase of migration the estimate function is called
> > > + * repeatedly while pending RAM size is over the threshold, thus migration
> > > + * can't converge and querying the VFIO device pending data size is useless.
> > > + */
> > > +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> > > + uint64_t *can_postcopy)
> > > +{
> > > + VFIODevice *vbasedev = opaque;
> > > + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
> > > +
> > > + /*
> > > + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
> > > + * reported so downtime limit won't be violated.
> > > + */
> > > + vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> > > + *must_precopy += stop_copy_size;
> > Is this the chunk of data only can be copied during VM stopped? If so, I
> > wonder why it's reported as "must precopy" if we know precopy won't ever
> > move them..
>
> A VFIO device that doesn't support precopy will send this data only when VM
> is stopped.
> A VFIO device that supports precopy may or may not send this data (or part
> of it) during precopy, and it depends on the specific VFIO device.
>
> According to state_pending_{estimate,exact} documentation, must_precopy is
> the amount of data that must be migrated before target starts, and indeed
> this VFIO data must be migrated before target starts.
Correct. It's just that the estimate/exact API is most suitable when
exact() reports something close to estimate(), with estimate() serving only
as a fast path for exact(). It causes chaos like this when that doesn't
hold.
Since we have a discussion elsewhere on the fact that we currently ignore
non-iterative device state when migration decides to switch over, I was
wondering, while replying, whether this stop-size could be the first thing
to start providing such non-iterable pending data. But that might indeed
need more thought; at the least we may want to collect more outliers of
non-iterative device state outside VFIO that can cause downtime to be too
large.
>
> >
> > The issue is if with such reporting (and now in latest master branch we do
> > have the precopy size too, which was reported both in exact() and
> > estimate()), we can observe weird reports like this:
> >
> > 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> > 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
> > 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> > 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
> > 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
> > 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> >
> > So estimate() keeps reporting zero but the exact() reports much larger, and
> > it keeps spinning like this. I think that's not how it was designed to be
> > used..
>
> It keeps spinning and migration doesn't converge?
> If so, configuring a higher downtime limit or the avail-switchover-bandwidth
> parameter may solve it.
Yes, this is the only way to go, but it's a separate issue on reportings of
estimate()/exact(). More below.
>
> >
> > Does this stop copy size change for a VFIO device or not?
>
> It depends on the specific VFIO device.
> If the device supports precopy and all (or part) of its data is
> precopy-able, then stopcopy size will change.
> Besides that, the amount of resources currently used by the VFIO device can
> also affect the stopcopy size, and it may increase or decrease as resources
> are created or destroyed.
I see, thanks.
>
> > IIUC, we may want some other mechanism to report stop copy size for a
> > device, rather than reporting it with the current exact()/estimate() api.
> > That's, per my understanding, only used for iterable data, while
> > stop-copy-size may not fall into that category if so.
>
> The above situation is caused by the fact that VFIO data may not be fully
> precopy-able (as opposed to RAM data).
> I don't think reporting the stop-copy-size in a different API will help the
> above situation -- we would still have to take stop-copy-size into account
> before converging, to not violate downtime.
It will help some other situation, though.
One issue with the above frequent estimate()/exact() calls is that QEMU
goes into a madness loop, thinking both "we're close" and "we're far away
from convergence", even though the reality is "we're far away". The bad
side effect is that when this loop happens it affects not only VFIO but
also other devices (e.g. KVM, vhost, etc.), so we do the high-overhead
sync() extremely frequently. IMHO that is a total waste of resources,
because all the rest of the modules follow the default rules of
estimate()/exact().
One simple but efficient fix, IMHO, is that VFIO should at least cache
the stop-size internally and report it in estimate() too, e.g.:
    /* Give an estimate of the amount left to be transferred,
     * the result is split into the amount for units that can and
     * for units that can't do postcopy.
     */
    void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
                                            uint64_t *can_postcopy)
    {
    }
If it is justified for the stop-size to be reported in exact(), it should
also be reported in estimate(), per the comment above. It should fall into
the precopy category in this case as well.
Then with that we would avoid calling exact() frequently, not only for
VFIO but for others too (especially the KVM GET_DIRTY_LOG /
CLEAR_DIRTY_LOG ioctls), and we would know it won't converge anyway
without the help of tuning the downtime limit up or adjusting
avail-switchover-bandwidth.
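The caching idea could look like the following sketch; all names here are invented for illustration and this is not QEMU's actual implementation. The point is only that estimate() reports the last value exact() queried, so the two callbacks stay consistent without extra device queries:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t stop_copy_cached;  /* refreshed on each exact() query */
} VFIODevSketch;

static void pending_exact(VFIODevSketch *dev, uint64_t queried_size,
                          uint64_t *must_precopy)
{
    dev->stop_copy_cached = queried_size;  /* query device, then cache */
    *must_precopy += queried_size;
}

static void pending_estimate(VFIODevSketch *dev, uint64_t *must_precopy)
{
    /* no device query: report the last known stop-copy size */
    *must_precopy += dev->stop_copy_cached;
}
```

With this, an estimate() that used to report 0 would instead report the ~21 GiB seen by the previous exact() call, and the migration loop would stop taking the "almost converged" fast path on every iteration.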
This may improve the situation but still leaves one other issue: IIUC, even
with the above change, and even if we avoid syncing the dirty bitmap
frequently, the migration thread can still spin at 100% calling estimate()
and keep seeing pending data (which is not iterable...). For the longer
term we may still need to report the non-iterable stop-size in another way,
so QEMU knows that iterating over all the VMState registrations won't help
in this case and it should go to sleep without eating the cores. I hope
that explains why I think a new API is still needed for the long run.
Thanks,
--
Peter Xu
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-04 16:16 ` Peter Xu
@ 2024-09-05 11:41 ` Avihai Horon
2024-09-05 15:17 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2024-09-05 11:41 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On 04/09/2024 19:16, Peter Xu wrote:
> On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
>> On 04/09/2024 16:00, Peter Xu wrote:
>>> Hello, Avihai,
>>>
>>> Reviving this thread just to discuss one issue below..
>>>
>>> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
>>>> +/*
>>>> + * Migration size of VFIO devices can be as little as a few KBs or as big as
>>>> + * many GBs. This value should be big enough to cover the worst case.
>>>> + */
>>>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>> +
>>>> +/*
>>>> + * Only exact function is implemented and not estimate function. The reason is
>>>> + * that during pre-copy phase of migration the estimate function is called
>>>> + * repeatedly while pending RAM size is over the threshold, thus migration
>>>> + * can't converge and querying the VFIO device pending data size is useless.
>>>> + */
>>>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>> + uint64_t *can_postcopy)
>>>> +{
>>>> + VFIODevice *vbasedev = opaque;
>>>> + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>> +
>>>> + /*
>>>> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
>>>> + * reported so downtime limit won't be violated.
>>>> + */
>>>> + vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>> + *must_precopy += stop_copy_size;
>>> Is this the chunk of data only can be copied during VM stopped? If so, I
>>> wonder why it's reported as "must precopy" if we know precopy won't ever
>>> move them..
>> A VFIO device that doesn't support precopy will send this data only when VM
>> is stopped.
>> A VFIO device that supports precopy may or may not send this data (or part
>> of it) during precopy, and it depends on the specific VFIO device.
>>
>> According to state_pending_{estimate,exact} documentation, must_precopy is
>> the amount of data that must be migrated before target starts, and indeed
>> this VFIO data must be migrated before target starts.
> Correct. It's just that estimate/exact API will be more suitable when the
> exact() gets the report closer to estimate(), while the estimate() is only
> a fast-path of exact(). It'll cause some chaos like this if it doesn't do
> as so.
>
> Since we have discussion elsewhere on the fact that we currently ignore
> non-iterative device states during migration decides to switchover, I was
> wondering when reply on whether this stop-size could be the first thing
> that will start to provide such non-iterable pending data. But that might
> indeed need more thoughts, at least we may want to collect more outliers of
> non-iterative device states outside VFIO that can cause downtime to be too
> large.
Ah, I see.
>
>>> The issue is if with such reporting (and now in latest master branch we do
>>> have the precopy size too, which was reported both in exact() and
>>> estimate()), we can observe weird reports like this:
>>>
>>> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
>>> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
>>> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
>>> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>
>>> So estimate() keeps reporting zero but the exact() reports much larger, and
>>> it keeps spinning like this. I think that's not how it was designed to be
>>> used..
>> It keeps spinning and migration doesn't converge?
>> If so, configuring a higher downtime limit or the avail-switchover-bandwidth
>> parameter may solve it.
> Yes, this is the only way to go, but it's a separate issue on reportings of
> estimate()/exact(). More below.
>
>>> Does this stop copy size change for a VFIO device or not?
>> It depends on the specific VFIO device.
>> If the device supports precopy and all (or part) of its data is
>> precopy-able, then stopcopy size will change.
>> Besides that, the amount of resources currently used by the VFIO device can
>> also affect the stopcopy size, and it may increase or decrease as resources
>> are created or destroyed.
> I see, thanks.
>
>>> IIUC, we may want some other mechanism to report stop copy size for a
>>> device, rather than reporting it with the current exact()/estimate() api.
>>> That's, per my understanding, only used for iterable data, while
>>> stop-copy-size may not fall into that category if so.
>> The above situation is caused by the fact that VFIO data may not be fully
>> precopy-able (as opposed to RAM data).
>> I don't think reporting the stop-copy-size in a different API will help the
>> above situation -- we would still have to take stop-copy-size into account
>> before converging, to not violate downtime.
> It will help some other situation, though.
>
> One issue with above frequent estimate()/exact() call is that QEMU will go
> into madness loop thinking "we're close" and "we're far away from converge"
> even if the reality is "we're far away". The bad side effect is when this
> loop happens it'll not only affect VFIO but also other devices (e.g. KVM,
> vhost, etc.) so we'll do high overhead sync() in an extremely frequent
> manner. IMHO they're totally a waste of resource, because all the rest of
> the modules are following the default rules of estimate()/exact().
>
> One simple but efficient fix for VFIO, IMHO, is at least VFIO should also
> cache the stop-size internally and report in estimate(), e.g.:
>
> /* Give an estimate of the amount left to be transferred,
> * the result is split into the amount for units that can and
> * for units that can't do postcopy.
> */
> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> uint64_t *can_postcopy)
> {
> }
>
> If it's justified that the stop-size to be reported in exact(), it should
> also be reported in estimate() per the comment above. It should also fall
> into precopy category in this case.
>
> Then with that we should avoid calling exact() frequently for not only VFIO
> but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls),
> then we know it won't converge anyway without the help of tuning downtime
> upper, or adjust avail-switchover-bandwidth.
Yes, it will solve this problem, but IIRC, the reason I didn't cache and
return stop-copy-size in estimate in the first place was that it wasn't
fully precopy-able, and that could cause some problems:
Once we cache and report this size in estimate, it may never decrease again
-- we can't send the data during precopy, and we might not be able to call
get stop-copy-size again from the exact path (because estimate might never
reach below the threshold).
For example, assume the threshold for convergence is 1GB.
A VFIO device may report a 2GB stop-copy-size (not precopy-able) at the
beginning of the migration.
If the VFIO device's stop-copy-size changes in the middle of migration
(e.g., some of its resources are destroyed) and drops to 900MB, we will
still see the 2GB report in estimate.
Only when the exact handler is called again will we get the updated 900MB
value and be able to converge. But that won't happen, because the estimate
size will always stay above the 1GB threshold.
We could prevent these situations by calling get stop-copy-size once every
X seconds, but it feels a bit awkward.
>
> This may improve situation but still leave one other issue, that IIUC even
> with above change and even if we can avoid sync dirty bitmap frequently,
> the migration thread can still spinning 100% calling estimate() and keep
> seeing data (which is not iterable...). For the longer term we may still
> need to report non-iterable stop-size in another way so QEMU knows that
> iterate() over all the VMState registers won't help in this case, so it
> should go into sleep without eating the cores. I hope that explains why I
> think a new API should be still needed for the long run.
Yes, I get your point.
But is the spinning case very common? AFAIU, if it happens, it may
indicate some misconfiguration of the migration parameters.
Anyway, I think your idea may fit VFIO, but we still need to think through
all the small details.
Thanks.
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-05 11:41 ` Avihai Horon
@ 2024-09-05 15:17 ` Peter Xu
2024-09-05 16:07 ` Avihai Horon
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2024-09-05 15:17 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On Thu, Sep 05, 2024 at 02:41:09PM +0300, Avihai Horon wrote:
>
> On 04/09/2024 19:16, Peter Xu wrote:
> > On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
> > > On 04/09/2024 16:00, Peter Xu wrote:
> > > > Hello, Avihai,
> > > >
> > > > Reviving this thread just to discuss one issue below..
> > > >
> > > > On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
> > > > > +/*
> > > > > + * Migration size of VFIO devices can be as little as a few KBs or as big as
> > > > > + * many GBs. This value should be big enough to cover the worst case.
> > > > > + */
> > > > > +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
> > > > > +
> > > > > +/*
> > > > > + * Only exact function is implemented and not estimate function. The reason is
> > > > > + * that during pre-copy phase of migration the estimate function is called
> > > > > + * repeatedly while pending RAM size is over the threshold, thus migration
> > > > > + * can't converge and querying the VFIO device pending data size is useless.
> > > > > + */
> > > > > +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
> > > > > + uint64_t *can_postcopy)
> > > > > +{
> > > > > + VFIODevice *vbasedev = opaque;
> > > > > + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
> > > > > +
> > > > > + /*
> > > > > + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
> > > > > + * reported so downtime limit won't be violated.
> > > > > + */
> > > > > + vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> > > > > + *must_precopy += stop_copy_size;
> > > > Is this the chunk of data only can be copied during VM stopped? If so, I
> > > > wonder why it's reported as "must precopy" if we know precopy won't ever
> > > > move them..
> > > A VFIO device that doesn't support precopy will send this data only when VM
> > > is stopped.
> > > A VFIO device that supports precopy may or may not send this data (or part
> > > of it) during precopy, and it depends on the specific VFIO device.
> > >
> > > According to state_pending_{estimate,exact} documentation, must_precopy is
> > > the amount of data that must be migrated before target starts, and indeed
> > > this VFIO data must be migrated before target starts.
> > Correct. It's just that estimate/exact API will be more suitable when the
> > exact() gets the report closer to estimate(), while the estimate() is only
> > a fast-path of exact(). It'll cause some chaos like this if it doesn't do
> > as so.
> >
> > Since we have discussion elsewhere on the fact that we currently ignore
> > non-iterative device states during migration decides to switchover, I was
> > wondering when reply on whether this stop-size could be the first thing
> > that will start to provide such non-iterable pending data. But that might
> > indeed need more thoughts, at least we may want to collect more outliers of
> > non-iterative device states outside VFIO that can cause downtime to be too
> > large.
>
> Ah, I see.
>
> >
> > > > The issue is if with such reporting (and now in latest master branch we do
> > > > have the precopy size too, which was reported both in exact() and
> > > > estimate()), we can observe weird reports like this:
> > > >
> > > > 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> > > > 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
> > > > 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
> > > > 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
> > > > 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > > 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
> > > > 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
> > > >
> > > > So estimate() keeps reporting zero but the exact() reports much larger, and
> > > > it keeps spinning like this. I think that's not how it was designed to be
> > > > used..
> > > It keeps spinning and migration doesn't converge?
> > > If so, configuring a higher downtime limit or the avail-switchover-bandwidth
> > > parameter may solve it.
> > Yes, this is the only way to go, but it's a separate issue on reportings of
> > estimate()/exact(). More below.
> >
> > > > Does this stop copy size change for a VFIO device or not?
> > > It depends on the specific VFIO device.
> > > If the device supports precopy and all (or part) of its data is
> > > precopy-able, then stopcopy size will change.
> > > Besides that, the amount of resources currently used by the VFIO device can
> > > also affect the stopcopy size, and it may increase or decrease as resources
> > > are created or destroyed.
> > I see, thanks.
> >
> > > > IIUC, we may want some other mechanism to report stop copy size for a
> > > > device, rather than reporting it with the current exact()/estimate() api.
> > > > That's, per my understanding, only used for iterable data, while
> > > > stop-copy-size may not fall into that category if so.
> > > The above situation is caused by the fact that VFIO data may not be fully
> > > precopy-able (as opposed to RAM data).
> > > I don't think reporting the stop-copy-size in a different API will help the
> > > above situation -- we would still have to take stop-copy-size into account
> > > before converging, to not violate downtime.
> > It will help some other situation, though.
> >
> > One issue with above frequent estimate()/exact() call is that QEMU will go
> > into madness loop thinking "we're close" and "we're far away from converge"
> > even if the reality is "we're far away". The bad side effect is when this
> > loop happens it'll not only affect VFIO but also other devices (e.g. KVM,
> > vhost, etc.) so we'll do high overhead sync() in an extremely frequent
> > manner. IMHO they're totally a waste of resource, because all the rest of
> > the modules are following the default rules of estimate()/exact().
> >
> > One simple but efficient fix for VFIO, IMHO, is at least VFIO should also
> > cache the stop-size internally and report in estimate(), e.g.:
> >
> > /* Give an estimate of the amount left to be transferred,
> > * the result is split into the amount for units that can and
> > * for units that can't do postcopy.
> > */
> > void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
> > uint64_t *can_postcopy)
> > {
> > }
> >
> > If it's justified that the stop-size to be reported in exact(), it should
> > also be reported in estimate() per the comment above. It should also fall
> > into precopy category in this case.
> >
> > Then with that we should avoid calling exact() frequently for not only VFIO
> > but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls),
> > then we know it won't converge anyway without the help of tuning downtime
> > upper, or adjust avail-switchover-bandwidth.
>
> Yes, it will solve this problem, but IIRC, the reason I didn't cache and
> return stop-copy-size in estimate from the first place was that it wasn't
> fully precopy-able, and that could cause some problems:
> Once we cache and report this size in estimate, it may not decrease anymore
> -- we can't send it during precopy and we might not be able to call get
> stop-copy-size again from the exact path (because estimate might now reach
> below the threshold).
>
> For example, assume the threshold for convergence is 1GB.
> A VFIO device may report 2GB stop-copy-size (not precopy-able) in the
> beginning of the migration.
> If the VFIO device stop-copy-size changes in the middle of migration (e.g.,
> some of its resources are destroyed) and drops to 900MB, we will still see
> 2GB report in estimate.
> Only when calling the exact handler again we will get the updated 900MB
> value and be able to converge. But that won't happen, because estimate size
> will always be above the 1GB threshold.
>
> We could prevent these situations and call the get stop-copy-size once every
> X seconds, but it feels a bit awkward.
IMHO the confusion is caused by an unclear definition of stop-size and
precopy-size in VFIO terms.
What I would hope is that the reported stop-size stays constant and is not
affected by the precopy process as it happens. Whatever _can_ change should
be reported as precopy-size.
So I wonder why the stop-size can change from the driver's side, and
whether it can be reported in a more predictable fashion. Otherwise I see
little point in providing both a stop-size and a precopy-size, since we'll
always just add them up into VFIO's total pending size. The line deciding
which data falls into which bucket doesn't seem clear to me.
>
> >
> > This may improve situation but still leave one other issue, that IIUC even
> > with above change and even if we can avoid sync dirty bitmap frequently,
> > the migration thread can still spinning 100% calling estimate() and keep
> > seeing data (which is not iterable...). For the longer term we may still
> > need to report non-iterable stop-size in another way so QEMU knows that
> > iterate() over all the VMState registers won't help in this case, so it
> > should go into sleep without eating the cores. I hope that explains why I
> > think a new API should be still needed for the long run.
>
> Yes, I get your point.
> But is the spinning case very common? AFAIU, if it happens, it may indicate
> some misconfiguration of the migration parameters.
Agreed.
It's just that our QE actively tests such migrations and there are always
weird, similar reports of CPU spinning, so it might be nice to still fix it
at some point.
This shouldn't be super urgent, indeed. It's just that it can start to make
sense when we have other discussions where reporting non-iterable pending
data might be pretty useful on its own as an idea. It's just that we first
need to figure out, per the above, how VFIO can report a predictable
stop-size from the drivers, and I wonder whether the drivers can be fixed
(without breaking existing qemu; after all, they currently always add both
counters..).
>
> Anyway, I think your idea may fit VFIO, but still need to think about all
> the small details.
Thanks,
--
Peter Xu
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-05 15:17 ` Peter Xu
@ 2024-09-05 16:07 ` Avihai Horon
2024-09-05 16:23 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2024-09-05 16:07 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On 05/09/2024 18:17, Peter Xu wrote:
> On Thu, Sep 05, 2024 at 02:41:09PM +0300, Avihai Horon wrote:
>> On 04/09/2024 19:16, Peter Xu wrote:
>>> On Wed, Sep 04, 2024 at 06:41:03PM +0300, Avihai Horon wrote:
>>>> On 04/09/2024 16:00, Peter Xu wrote:
>>>>> Hello, Avihai,
>>>>>
>>>>> Reviving this thread just to discuss one issue below..
>>>>>
>>>>> On Thu, Feb 16, 2023 at 04:36:27PM +0200, Avihai Horon wrote:
>>>>>> +/*
>>>>>> + * Migration size of VFIO devices can be as little as a few KBs or as big as
>>>>>> + * many GBs. This value should be big enough to cover the worst case.
>>>>>> + */
>>>>>> +#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>>>> +
>>>>>> +/*
>>>>>> + * Only exact function is implemented and not estimate function. The reason is
>>>>>> + * that during pre-copy phase of migration the estimate function is called
>>>>>> + * repeatedly while pending RAM size is over the threshold, thus migration
>>>>>> + * can't converge and querying the VFIO device pending data size is useless.
>>>>>> + */
>>>>>> +static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>>>> + uint64_t *can_postcopy)
>>>>>> +{
>>>>>> + VFIODevice *vbasedev = opaque;
>>>>>> + uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>>>> +
>>>>>> + /*
>>>>>> + * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
>>>>>> + * reported so downtime limit won't be violated.
>>>>>> + */
>>>>>> + vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>>>> + *must_precopy += stop_copy_size;
>>>>> Is this the chunk of data only can be copied during VM stopped? If so, I
>>>>> wonder why it's reported as "must precopy" if we know precopy won't ever
>>>>> move them..
>>>> A VFIO device that doesn't support precopy will send this data only when VM
>>>> is stopped.
>>>> A VFIO device that supports precopy may or may not send this data (or part
>>>> of it) during precopy, and it depends on the specific VFIO device.
>>>>
>>>> According to state_pending_{estimate,exact} documentation, must_precopy is
>>>> the amount of data that must be migrated before target starts, and indeed
>>>> this VFIO data must be migrated before target starts.
>>> Correct. It's just that estimate/exact API will be more suitable when the
>>> exact() gets the report closer to estimate(), while the estimate() is only
>>> a fast-path of exact(). It'll cause some chaos like this if it doesn't do
>>> as so.
>>>
>>> Since we have discussion elsewhere on the fact that we currently ignore
>>> non-iterative device states during migration decides to switchover, I was
>>> wondering when reply on whether this stop-size could be the first thing
>>> that will start to provide such non-iterable pending data. But that might
>>> indeed need more thoughts, at least we may want to collect more outliers of
>>> non-iterative device states outside VFIO that can cause downtime to be too
>>> large.
>> Ah, I see.
>>
>>>>> The issue is if with such reporting (and now in latest master branch we do
>>>>> have the precopy size too, which was reported both in exact() and
>>>>> estimate()), we can observe weird reports like this:
>>>>>
>>>>> 23411@1725380798968696657 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799050766000 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>>>> 23411@1725380799050896975 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799138657103 migrate_pending_exact exact pending size 21040144384 (pre = 21040144384 post=0)
>>>>> 23411@1725380799140166709 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799217246861 migrate_pending_exact exact pending size 21038628864 (pre = 21038628864 post=0)
>>>>> 23411@1725380799217384969 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799305147722 migrate_pending_exact exact pending size 21039976448 (pre = 21039976448 post=0)
>>>>> 23411@1725380799306639956 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>> 23411@1725380799385118245 migrate_pending_exact exact pending size 21038796800 (pre = 21038796800 post=0)
>>>>> 23411@1725380799385709382 migrate_pending_estimate estimate pending size 0 (pre = 0 post=0)
>>>>>
>>>>> So estimate() keeps reporting zero but the exact() reports much larger, and
>>>>> it keeps spinning like this. I think that's not how it was designed to be
>>>>> used..
>>>> It keeps spinning and migration doesn't converge?
>>>> If so, configuring a higher downtime limit or the avail-switchover-bandwidth
>>>> parameter may solve it.
>>> Yes, this is the only way to go, but it's a separate issue on reportings of
>>> estimate()/exact(). More below.
>>>
>>>>> Does this stop copy size change for a VFIO device or not?
>>>> It depends on the specific VFIO device.
>>>> If the device supports precopy and all (or part) of its data is
>>>> precopy-able, then stopcopy size will change.
>>>> Besides that, the amount of resources currently used by the VFIO device can
>>>> also affect the stopcopy size, and it may increase or decrease as resources
>>>> are created or destroyed.
>>> I see, thanks.
>>>
>>>>> IIUC, we may want some other mechanism to report stop copy size for a
>>>>> device, rather than reporting it with the current exact()/estimate() api.
>>>>> That's, per my understanding, only used for iterable data, while
>>>>> stop-copy-size may not fall into that category if so.
>>>> The above situation is caused by the fact that VFIO data may not be fully
>>>> precopy-able (as opposed to RAM data).
>>>> I don't think reporting the stop-copy-size in a different API will help the
>>>> above situation -- we would still have to take stop-copy-size into account
>>>> before converging, to not violate downtime.
>>> It will help some other situation, though.
>>>
>>> One issue with above frequent estimate()/exact() call is that QEMU will go
>>> into madness loop thinking "we're close" and "we're far away from converge"
>>> even if the reality is "we're far away". The bad side effect is when this
>>> loop happens it'll not only affect VFIO but also other devices (e.g. KVM,
>>> vhost, etc.) so we'll do high overhead sync() in an extremely frequent
>>> manner. IMHO they're totally a waste of resource, because all the rest of
>>> the modules are following the default rules of estimate()/exact().
>>>
>>> One simple but efficient fix for VFIO, IMHO, is at least VFIO should also
>>> cache the stop-size internally and report in estimate(), e.g.:
>>>
>>> /* Give an estimate of the amount left to be transferred,
>>> * the result is split into the amount for units that can and
>>> * for units that can't do postcopy.
>>> */
>>> void qemu_savevm_state_pending_estimate(uint64_t *must_precopy,
>>> uint64_t *can_postcopy)
>>> {
>>> }
>>>
>>> If it's justified that the stop-size to be reported in exact(), it should
>>> also be reported in estimate() per the comment above. It should also fall
>>> into precopy category in this case.
>>>
>>> Then with that we should avoid calling exact() frequently for not only VFIO
>>> but also others (especially, KVM GET_DIRTY_LOG / CLEAR_DIRTY_LOG ioctls),
>>> then we know it won't converge anyway without the help of tuning downtime
>>> upper, or adjust avail-switchover-bandwidth.
>> Yes, it will solve this problem, but IIRC, the reason I didn't cache and
>> return stop-copy-size in estimate from the first place was that it wasn't
>> fully precopy-able, and that could cause some problems:
>> Once we cache and report this size in estimate, it may not decrease anymore
>> -- we can't send it during precopy and we might not be able to call get
>> stop-copy-size again from the exact path (because estimate might now reach
>> below the threshold).
>>
>> For example, assume the threshold for convergence is 1GB.
>> A VFIO device may report 2GB stop-copy-size (not precopy-able) in the
>> beginning of the migration.
>> If the VFIO device stop-copy-size changes in the middle of migration (e.g.,
>> some of its resources are destroyed) and drops to 900MB, we will still see
>> 2GB report in estimate.
>> Only when calling the exact handler again we will get the updated 900MB
>> value and be able to converge. But that won't happen, because estimate size
>> will always be above the 1GB threshold.
>>
>> We could prevent these situations and call the get stop-copy-size once every
>> X seconds, but it feels a bit awkward.
> IMHO the confusion is caused by an unclear definition of stop-size and
> precopy-size in VFIO terms.
>
> What I would hope is the stop-size reported should be constant and not be
> affected by precopy process happening. Whatever _can_ change at all should
> be reported as precopy-size.
>
> So I wonder why stop-size can change from a driver, and whether that can be
> reported in a more predictable fashion. Otherwise I see little point in
> providing both stop-size and precopy-size, otherwise we'll always add them
> up into VFIO's total pending-size. The line on provision which data falls
> into which bucket doesn't seem to be clear to me.
Stopcopy-size is reported by the VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl,
whose documentation states that it is "[...] the estimated data length that
will be required to complete stop copy".
So by this definition, stopcopy-size can change during precopy (it can also
change if device resources are created or destroyed).
Precopy-size is reported by the VFIO_MIG_GET_PRECOPY_INFO ioctl, whose
documentation states that it provides "[...] estimates of data available
from the device during the PRE_COPY states".
Maybe the confusion was caused by this line in vfio_state_pending_exact():

    *must_precopy += migration->precopy_init_size +
                     migration->precopy_dirty_size;

which I think should be removed, as the stopcopy-size should already cover
the precopy data.
Thanks.
>
>>> This may improve the situation but still leaves one other issue: IIUC, even
>>> with the above change, and even if we can avoid syncing the dirty bitmap
>>> frequently, the migration thread can still spin at 100% calling estimate()
>>> and keep seeing data (which is not iterable...). For the longer term we may
>>> still need to report the non-iterable stop-size in another way, so QEMU
>>> knows that iterating over all the VMState registers won't help in this
>>> case and it should go to sleep without eating the cores. I hope that
>>> explains why I think a new API is still needed for the long run.
>> Yes, I get your point.
>> But is the spinning case very common? AFAIU, if it happens, it may indicate
>> some misconfiguration of the migration parameters.
> Agreed.
>
> It's just that our QE actively tests such migrations and there are always
> weird and similar reports of CPU spinning, so it might be nice to still fix
> it at some point.
>
> This shouldn't be super urgent, indeed. It's just that it can start to
> make sense when we have other discussions where reporting non-iterable
> pending data might be pretty useful on its own as an idea. It's just that
> we need to figure out how VFIO can report a predictable stop-size in
> the drivers, and I wonder whether the drivers can be fixed (without
> breaking existing qemu; after all they currently always add both
> counters..).
>
>> Anyway, I think your idea may fit VFIO, but still need to think about all
>> the small details.
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-05 16:07 ` Avihai Horon
@ 2024-09-05 16:23 ` Peter Xu
2024-09-05 16:45 ` Avihai Horon
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2024-09-05 16:23 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On Thu, Sep 05, 2024 at 07:07:28PM +0300, Avihai Horon wrote:
> > So I wonder why stop-size can change from a driver, and whether that can be
> > reported in a more predictable fashion. Otherwise I see little point in
> > providing both stop-size and precopy-size, otherwise we'll always add them
> > up into VFIO's total pending-size. The line on provision which data falls
> > into which bucket doesn't seem to be clear to me.
>
> Stopcopy-size is reported by VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, which
> states that this is "[...] the estimated data length that will be required
> to complete stop copy".
> So by this definition, stopcopy-size can change during precopy (it can also
> change if device resources are created or destroyed).
>
> Precopy-size is reported by VFIO_MIG_GET_PRECOPY_INFO ioctl, which states
> that this is "[...] estimates of data available from the device during the
> PRE_COPY states".
>
> Maybe the confusion was caused by this line in vfio_state_pending_exact():
> *must_precopy += migration->precopy_init_size +
> migration->precopy_dirty_size;
>
> Which I think should be removed, as stopcopy-size should already cover
> precopy data.
Would you please help double check on the kernel drivers' side? If that's
true I agree that should be dropped.
Does it also mean then that the currently reported stop-size - precopy-size
will be very close to the constant non-iterable data size? If so, is it
cache-able? It doesn't need to be 100% accurate as the ioctls are not
atomic, but if it's mostly usable as that (again, I will need you all to
help justify it..), then it looks like we have a way to report that
without the concern you raised before.
Thanks,
--
Peter Xu
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-05 16:23 ` Peter Xu
@ 2024-09-05 16:45 ` Avihai Horon
2024-09-05 18:31 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2024-09-05 16:45 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On 05/09/2024 19:23, Peter Xu wrote:
> On Thu, Sep 05, 2024 at 07:07:28PM +0300, Avihai Horon wrote:
>>> So I wonder why stop-size can change from a driver, and whether that can be
>>> reported in a more predictable fashion. Otherwise I see little point in
>>> providing both stop-size and precopy-size, otherwise we'll always add them
>>> up into VFIO's total pending-size. The line on provision which data falls
>>> into which bucket doesn't seem to be clear to me.
>> Stopcopy-size is reported by VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl, which
>> states that this is "[...] the estimated data length that will be required
>> to complete stop copy".
>> So by this definition, stopcopy-size can change during precopy (it can also
>> change if device resources are created or destroyed).
>>
>> Precopy-size is reported by VFIO_MIG_GET_PRECOPY_INFO ioctl, which states
>> that this is "[...] estimates of data available from the device during the
>> PRE_COPY states".
>>
>> Maybe the confusion was caused by this line in vfio_state_pending_exact():
>> *must_precopy += migration->precopy_init_size +
>> migration->precopy_dirty_size;
>>
>> Which I think should be removed, as stopcopy-size should already cover
>> precopy data.
> Would you please help double check on the kernel drivers' side? If that's
> true I agree that should be dropped.
Sure.
>
> Does it also mean then that the currently reported stop-size - precopy-size
> will be very close to the constant non-iterable data size?
It's not constant; it can change while the VM is running.
Thanks.
> If so, is it
> cache-able? It doesn't need to be 100% accurate as the ioctls are not
> atomic, but still if it's mostly usable (again, will need you all to help
> justify that..) as that then it looks like we have some way to report that
> without the concern you raised before.
>
> Thanks,
>
> --
> Peter Xu
>
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-05 16:45 ` Avihai Horon
@ 2024-09-05 18:31 ` Peter Xu
2024-09-09 12:52 ` Avihai Horon
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2024-09-05 18:31 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
> > Does it also mean then that the currently reported stop-size - precopy-size
> > will be very close to the constant non-iterable data size?
>
> It's not constant, while the VM is running it can change.
I wonder how heavy the VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl is.
I just gave it a quick shot with a busy VM migrating and estimate() is
invoked only every ~100ms.
VFIO might be different, but I wonder whether we can fetch stop-size in
estimate() somehow, so it's still a pretty fast estimate() meanwhile we
avoid the rest of exact() calls (which are destined to be useless without
VFIO).
IIUC, so far the estimate()/exact() split exists because the RAM sync is
heavy in exact(). When idle it's 80+ms now for a 32G VM with current master
(which has a bug and I'm fixing it up [1]..); even after the fix it's 3ms (I
think both numbers include the dirty bitmap sync for both vfio and kvm). So
in that case maybe we can still fetch the stop-size in both
estimate() and exact(), but only sync the bitmap in exact().
[1] https://lore.kernel.org/r/20240904223510.3519358-1-peterx@redhat.com
Thanks,
--
Peter Xu
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-05 18:31 ` Peter Xu
@ 2024-09-09 12:52 ` Avihai Horon
2024-09-09 15:11 ` Peter Xu
0 siblings, 1 reply; 41+ messages in thread
From: Avihai Horon @ 2024-09-09 12:52 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On 05/09/2024 21:31, Peter Xu wrote:
> On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
>>> Does it also mean then that the currently reported stop-size - precopy-size
>>> will be very close to the constant non-iterable data size?
>> It's not constant, while the VM is running it can change.
> I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
>
> I just gave it a quick shot with a busy VM migrating and estimate() is
> invoked only every ~100ms.
>
> VFIO might be different, but I wonder whether we can fetch stop-size in
> estimate() somehow, so it's still a pretty fast estimate() meanwhile we
> avoid the rest of exact() calls (which are destined to be useless without
> VFIO).
>
> IIUC so far the estimate()/exact() was because ram sync is heavy when
> exact(). When idle it's 80+ms now for 32G VM with current master (which
> has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
> think both numbers contain dirty bitmap sync for both vfio and kvm). So in
> that case maybe we can still try fetching stop-size only for both
> estimate() and exact(), but only sync bitmap in exact().
IIUC, the end goal is to prevent the migration thread from spinning
uselessly in pre-copy in such scenarios, right?
If we eventually do call get stop-copy-size in estimate(), we will move
the spinning from "exact() -> estimate() -> exact() -> estimate() ..."
to "estimate() -> estimate() -> ...".
If so, what benefit would we get from this? We would only move the useless
work elsewhere.
Shouldn't we directly go for the non precopy-able vs precopy-able report
that you suggested?
Thanks.
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-09 12:52 ` Avihai Horon
@ 2024-09-09 15:11 ` Peter Xu
2024-09-12 8:09 ` Avihai Horon
0 siblings, 1 reply; 41+ messages in thread
From: Peter Xu @ 2024-09-09 15:11 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote:
>
> On 05/09/2024 21:31, Peter Xu wrote:
> > On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
> > > > Does it also mean then that the currently reported stop-size - precopy-size
> > > > will be very close to the constant non-iterable data size?
> > > It's not constant, while the VM is running it can change.
> > I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
> >
> > I just gave it a quick shot with a busy VM migrating and estimate() is
> > invoked only every ~100ms.
> >
> > VFIO might be different, but I wonder whether we can fetch stop-size in
> > estimate() somehow, so it's still a pretty fast estimate() meanwhile we
> > avoid the rest of exact() calls (which are destined to be useless without
> > VFIO).
> >
> > IIUC so far the estimate()/exact() was because ram sync is heavy when
> > exact(). When idle it's 80+ms now for 32G VM with current master (which
> > has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
> > think both numbers contain dirty bitmap sync for both vfio and kvm). So in
> > that case maybe we can still try fetching stop-size only for both
> > estimate() and exact(), but only sync bitmap in exact().
>
> IIUC, the end goal is to prevent migration thread spinning uselessly in
> pre-copy in such scenarios, right?
> If eventually we do call get stop-copy-size in estimate(), we will move the
> spinning from "exact() -> estimate() -> exact() -> estimate() ..." to
> "estimate() -> estimate() -> ...".
> If so, what benefit would we get from this? We only move the useless work to
> other place.
We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM,
which can be much heavier and can require BQL during the slow process,
which can further block more vcpu operations during migration.
And as mentioned previously, VFIO is, AFAIK, the only handler that provides
different definitions of estimate() and exact(), which can be confusing,
and goes against the "estimate() is the fast-path" logic.
But I agree it's not fundamentally changing much..
> Shouldn't we directly go for the non precopy-able vs precopy-able report
> that you suggested?
Yep, I just thought the previous one would be much easier to achieve. And
as you said, VFIO is still pretty special in that the user will need manual
involvement anyway to specify e.g. very large downtimes, so this condition
shouldn't be a major case to happen. That said, if you have a solid idea
on this, please feel free to go ahead directly with a complete solution.
Thanks,
--
Peter Xu
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-09 15:11 ` Peter Xu
@ 2024-09-12 8:09 ` Avihai Horon
2024-09-12 9:41 ` Cédric Le Goater
2024-09-12 13:45 ` Peter Xu
0 siblings, 2 replies; 41+ messages in thread
From: Avihai Horon @ 2024-09-12 8:09 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On 09/09/2024 18:11, Peter Xu wrote:
> On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote:
>> On 05/09/2024 21:31, Peter Xu wrote:
>>> On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
>>>>> Does it also mean then that the currently reported stop-size - precopy-size
>>>>> will be very close to the constant non-iterable data size?
>>>> It's not constant, while the VM is running it can change.
>>> I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
>>>
>>> I just gave it a quick shot with a busy VM migrating and estimate() is
>>> invoked only every ~100ms.
>>>
>>> VFIO might be different, but I wonder whether we can fetch stop-size in
>>> estimate() somehow, so it's still a pretty fast estimate() meanwhile we
>>> avoid the rest of exact() calls (which are destined to be useless without
>>> VFIO).
>>>
>>> IIUC so far the estimate()/exact() was because ram sync is heavy when
>>> exact(). When idle it's 80+ms now for 32G VM with current master (which
>>> has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
>>> think both numbers contain dirty bitmap sync for both vfio and kvm). So in
>>> that case maybe we can still try fetching stop-size only for both
>>> estimate() and exact(), but only sync bitmap in exact().
>> IIUC, the end goal is to prevent migration thread spinning uselessly in
>> pre-copy in such scenarios, right?
>> If eventually we do call get stop-copy-size in estimate(), we will move the
>> spinning from "exact() -> estimate() -> exact() -> estimate() ..." to
>> "estimate() -> estimate() -> ...".
>> If so, what benefit would we get from this? We only move the useless work to
>> other place.
> We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM,
> which can be much heavier and can require BQL during the slow process,
> which can further block more vcpu operations during migration.
>
> And as mentioned previously, VFIO is, AFAIK, the only handler that provide
> different definitions of estimate() and exact(), which can be confusing,
> and it's against the "estimate() is the fast-path" logic.
>
> But I agree it's not fundamentally changing much..
>
>> Shouldn't we directly go for the non precopy-able vs precopy-able report
>> that you suggested?
> Yep, I just thought the previous one would be much easier to achieve.
Yes, though I prefer not to add the get stop-copy-size ioctl to the
estimate() flow because: a) it's guaranteed to be called (possibly many
times) in every migration (either well configured, which is the probable
case, or misconfigured, which spins), and b) how "heavy" get
stop-copy-size is may differ from one VFIO device to another.
Maybe I am being a bit overcautious here, but let's explore other
options first :)
> And
> as you said, VFIO is still pretty special that the user will need manual
> involvements anyway to specify e.g. very large downtimes, so this condition
> shouldn't be a major case to happen. Said that, if you have a solid idea
> on this please feel free to go ahead directly with a complete solution.
I think it's possible to do it with what we currently have (VFIO
uAPI-wise); I will try to think of one.
BTW, I checked again and I think we should drop this line from
vfio_state_pending_exact():
*must_precopy += migration->precopy_init_size +
migration->precopy_dirty_size;
I can send a patch for that.
Thanks.
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-12 8:09 ` Avihai Horon
@ 2024-09-12 9:41 ` Cédric Le Goater
2024-09-12 13:45 ` Peter Xu
1 sibling, 0 replies; 41+ messages in thread
From: Cédric Le Goater @ 2024-09-12 9:41 UTC (permalink / raw)
To: Avihai Horon, Peter Xu
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy, Yishai Hadas,
Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
Joao Martins, Fabiano Rosas, Zhiyi Guo
On 9/12/24 10:09, Avihai Horon wrote:
>
> On 09/09/2024 18:11, Peter Xu wrote:
>> On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote:
>>> On 05/09/2024 21:31, Peter Xu wrote:
>>>> On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
>>>>>> Does it also mean then that the currently reported stop-size - precopy-size
>>>>>> will be very close to the constant non-iterable data size?
>>>>> It's not constant, while the VM is running it can change.
>>>> I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
>>>>
>>>> I just gave it a quick shot with a busy VM migrating and estimate() is
>>>> invoked only every ~100ms.
>>>>
>>>> VFIO might be different, but I wonder whether we can fetch stop-size in
>>>> estimate() somehow, so it's still a pretty fast estimate() meanwhile we
>>>> avoid the rest of exact() calls (which are destined to be useless without
>>>> VFIO).
>>>>
>>>> IIUC so far the estimate()/exact() was because ram sync is heavy when
>>>> exact(). When idle it's 80+ms now for 32G VM with current master (which
>>>> has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
>>>> think both numbers contain dirty bitmap sync for both vfio and kvm). So in
>>>> that case maybe we can still try fetching stop-size only for both
>>>> estimate() and exact(), but only sync bitmap in exact().
>>> IIUC, the end goal is to prevent migration thread spinning uselessly in
>>> pre-copy in such scenarios, right?
>>> If eventually we do call get stop-copy-size in estimate(), we will move the
>>> spinning from "exact() -> estimate() -> exact() -> estimate() ..." to
>>> "estimate() -> estimate() -> ...".
>>> If so, what benefit would we get from this? We only move the useless work to
>>> other place.
>> We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM,
>> which can be much heavier and can require BQL during the slow process,
>> which can further block more vcpu operations during migration.
>>
>> And as mentioned previously, VFIO is, AFAIK, the only handler that provide
>> different definitions of estimate() and exact(), which can be confusing,
>> and it's against the "estimate() is the fast-path" logic.
>>
>> But I agree it's not fundamentally changing much..
>>
>>> Shouldn't we directly go for the non precopy-able vs precopy-able report
>>> that you suggested?
>> Yep, I just thought the previous one would be much easier to achieve.
>
> Yes, though I prefer not to add the get stop-copy-size ioctl in the estimate() flow because: a) it's guaranteed to be called (possibly many times) in every migration (either well configured which is the probable case or misconfigured which spins), and b) because how "heavy" get stop-copy-size is may differ from VFIO device to the other.
>
> Maybe I am being a bit overcautious here, but let's explore other options first :)
>
>> And
>> as you said, VFIO is still pretty special that the user will need manual
>> involvements anyway to specify e.g. very large downtimes, so this condition
>> shouldn't be a major case to happen. Said that, if you have a solid idea
>> on this please feel free to go ahead directly with a complete solution.
>
> I think it's possible to do it with what we currently have (VFIO uAPI-wise), I will try to think of one.
>
> BTW, I checked again and I think we should drop this line from vfio_state_pending_exact():
> *must_precopy += migration->precopy_init_size + migration->precopy_dirty_size;
>
> I can send a patch for that.
Please do. We can then provide a scratch build for further testing
and experiments with vGPUs.
Thanks,
C.
* Re: [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2
2024-09-12 8:09 ` Avihai Horon
2024-09-12 9:41 ` Cédric Le Goater
@ 2024-09-12 13:45 ` Peter Xu
1 sibling, 0 replies; 41+ messages in thread
From: Peter Xu @ 2024-09-12 13:45 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Juan Quintela,
Dr. David Alan Gilbert, Michael S. Tsirkin, Cornelia Huck,
Paolo Bonzini, Vladimir Sementsov-Ogievskiy,
Cédric Le Goater, Yishai Hadas, Jason Gunthorpe,
Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
Fabiano Rosas, Zhiyi Guo
On Thu, Sep 12, 2024 at 11:09:12AM +0300, Avihai Horon wrote:
>
> On 09/09/2024 18:11, Peter Xu wrote:
> > On Mon, Sep 09, 2024 at 03:52:39PM +0300, Avihai Horon wrote:
> > > On 05/09/2024 21:31, Peter Xu wrote:
> > > > On Thu, Sep 05, 2024 at 07:45:43PM +0300, Avihai Horon wrote:
> > > > > > Does it also mean then that the currently reported stop-size - precopy-size
> > > > > > will be very close to the constant non-iterable data size?
> > > > > It's not constant, while the VM is running it can change.
> > > > I wonder how heavy is VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl.
> > > >
> > > > I just gave it a quick shot with a busy VM migrating and estimate() is
> > > > invoked only every ~100ms.
> > > >
> > > > VFIO might be different, but I wonder whether we can fetch stop-size in
> > > > estimate() somehow, so it's still a pretty fast estimate() meanwhile we
> > > > avoid the rest of exact() calls (which are destined to be useless without
> > > > VFIO).
> > > >
> > > > IIUC so far the estimate()/exact() was because ram sync is heavy when
> > > > exact(). When idle it's 80+ms now for 32G VM with current master (which
> > > > has a bug and I'm fixing it up [1]..), even if after the fix it's 3ms (I
> > > > think both numbers contain dirty bitmap sync for both vfio and kvm). So in
> > > > that case maybe we can still try fetching stop-size only for both
> > > > estimate() and exact(), but only sync bitmap in exact().
> > > IIUC, the end goal is to prevent migration thread spinning uselessly in
> > > pre-copy in such scenarios, right?
> > > If eventually we do call get stop-copy-size in estimate(), we will move the
> > > spinning from "exact() -> estimate() -> exact() -> estimate() ..." to
> > > "estimate() -> estimate() -> ...".
> > > If so, what benefit would we get from this? We only move the useless work to
> > > other place.
> > We can avoid exact() calls invoked for other vmstate handlers, e.g. RAM,
> > which can be much heavier and can require BQL during the slow process,
> > which can further block more vcpu operations during migration.
> >
> > And as mentioned previously, VFIO is, AFAIK, the only handler that provide
> > different definitions of estimate() and exact(), which can be confusing,
> > and it's against the "estimate() is the fast-path" logic.
> >
> > But I agree it's not fundamentally changing much..
> >
> > > Shouldn't we directly go for the non precopy-able vs precopy-able report
> > > that you suggested?
> > Yep, I just thought the previous one would be much easier to achieve.
>
> Yes, though I prefer not to add the get stop-copy-size ioctl in the
> estimate() flow because: a) it's guaranteed to be called (possibly many
> times) in every migration (either well configured which is the probable case
> or misconfigured which spins), and b) because how "heavy" get stop-copy-size
> is may differ from VFIO device to the other.
>
> Maybe I am being a bit overcautious here, but let's explore other options
> first :)
Nope; that's totally ok.
>
> > And
> > as you said, VFIO is still pretty special that the user will need manual
> > involvements anyway to specify e.g. very large downtimes, so this condition
> > shouldn't be a major case to happen. Said that, if you have a solid idea
> > on this please feel free to go ahead directly with a complete solution.
>
> I think it's possible to do it with what we currently have (VFIO uAPI-wise),
> I will try to think of one.
>
> BTW, I checked again and I think we should drop this line from
> vfio_state_pending_exact():
> *must_precopy += migration->precopy_init_size +
> migration->precopy_dirty_size;
>
> I can send a patch for that.
Yes please.
I also wonder whether it'd be good to also update the uAPI header in Linux,
for vfio_device_feature_mig_data_size / VFIO_DEVICE_FEATURE_MIG_DATA_SIZE.
Currently it reads a bit ambiguously to me, while IIUC it could make a huge
difference when the precopy size is non-trivial.
/*
* Upon VFIO_DEVICE_FEATURE_GET read back the estimated data length that will
* be required to complete stop copy.
*
* Note: Can be called on each device state.
*/
Maybe that'll also be helpful when other drivers implement this, to make
sure both sides (user / kernel) are crystal clear on the definition.
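One possible clarified wording, purely illustrative rather than a proposed kernel patch, that spells out whether pre-copy data is included:

```c
/*
 * Upon VFIO_DEVICE_FEATURE_GET read back the estimated total data length
 * that will be required to complete stop copy, including any data that is
 * still available for transfer in the PRE_COPY states (i.e. pre-copy data
 * must not be added on top of this value by the caller).
 *
 * Note: Can be called on each device state.
 */
```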
--
Peter Xu
Thread overview: 41+ messages
2023-02-16 14:36 [PATCH v11 00/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2023-02-16 14:36 ` [PATCH v11 01/11] linux-headers: Update to v6.2-rc8 Avihai Horon
2023-02-16 14:36 ` [PATCH v11 02/11] vfio/migration: Fix NULL pointer dereference bug Avihai Horon
2023-02-16 14:36 ` [PATCH v11 03/11] vfio/migration: Allow migration without VFIO IOMMU dirty tracking support Avihai Horon
2023-02-16 14:36 ` [PATCH v11 04/11] vfio/common: Change vfio_devices_all_running_and_saving() logic to equivalent one Avihai Horon
2023-02-16 14:53 ` Juan Quintela
2023-02-16 14:36 ` [PATCH v11 05/11] vfio/migration: Block multiple devices migration Avihai Horon
2023-05-16 10:03 ` Shameerali Kolothum Thodi via
2023-05-16 11:59 ` Jason Gunthorpe
2023-05-16 13:57 ` Shameerali Kolothum Thodi via
2023-05-16 14:04 ` Jason Gunthorpe
2023-05-16 14:27 ` Alex Williamson
2023-05-16 14:35 ` Shameerali Kolothum Thodi via
2023-05-16 16:11 ` Jason Gunthorpe
2023-02-16 14:36 ` [PATCH v11 06/11] vfio/migration: Move migration v1 logic to vfio_migration_init() Avihai Horon
2023-02-16 14:50 ` Juan Quintela
2023-02-16 14:36 ` [PATCH v11 07/11] vfio/migration: Rename functions/structs related to v1 protocol Avihai Horon
2023-02-16 14:54 ` Juan Quintela
2023-02-16 14:36 ` [PATCH v11 08/11] vfio/migration: Implement VFIO migration protocol v2 Avihai Horon
2023-02-16 15:43 ` Juan Quintela
2023-02-16 16:40 ` Avihai Horon
2023-02-16 16:52 ` Juan Quintela
2023-02-16 19:53 ` Alex Williamson
2024-09-04 13:00 ` Peter Xu
2024-09-04 15:41 ` Avihai Horon
2024-09-04 16:16 ` Peter Xu
2024-09-05 11:41 ` Avihai Horon
2024-09-05 15:17 ` Peter Xu
2024-09-05 16:07 ` Avihai Horon
2024-09-05 16:23 ` Peter Xu
2024-09-05 16:45 ` Avihai Horon
2024-09-05 18:31 ` Peter Xu
2024-09-09 12:52 ` Avihai Horon
2024-09-09 15:11 ` Peter Xu
2024-09-12 8:09 ` Avihai Horon
2024-09-12 9:41 ` Cédric Le Goater
2024-09-12 13:45 ` Peter Xu
2023-02-16 14:36 ` [PATCH v11 09/11] vfio/migration: Remove VFIO migration protocol v1 Avihai Horon
2023-02-16 14:36 ` [PATCH v11 10/11] vfio: Alphabetize migration section of VFIO trace-events file Avihai Horon
2023-02-16 14:36 ` [PATCH v11 11/11] docs/devel: Align VFIO migration docs to v2 protocol Avihai Horon
2023-02-16 14:57 ` Juan Quintela