public inbox for kvm@vger.kernel.org
* [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization
@ 2026-03-10 16:40 Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 1/6] vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase Yishai Hadas
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Yishai Hadas @ 2026-03-10 16:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: kvm, kevin.tian, joao.m.martins, leonro, yishaih, maorg, avihaih,
	clg, peterx, liulongfang, giovanni.cabiddu, kwankhede

This series introduces support for re-initializing the initial_bytes
value during the VFIO PRE_COPY migration phase.

Background
==========
As currently defined, initial_bytes is monotonically decreasing and
precedes dirty_bytes when reading from the saving file descriptor.
The transition from initial_bytes to dirty_bytes is unidirectional and
irreversible.

The initial_bytes are considered critical data that is highly
recommended to be transferred to the target as part of PRE_COPY.
Without this data, the PRE_COPY phase would be ineffective.

Problem Statement
=================
In some cases, a new chunk of critical data may appear during the
PRE_COPY phase. The current API does not provide a mechanism for the
driver to report an updated initial_bytes value when this occurs.

Solution
========
For that, we extend the VFIO_MIG_GET_PRECOPY_INFO ioctl with an output
flag named VFIO_PRECOPY_INFO_REINIT, allowing drivers to report a new
initial_bytes value during the PRE_COPY phase.

However, existing VFIO_MIG_GET_PRECOPY_INFO implementations don't
assign info.flags before copy_to_user(). This effectively echoes
userspace-provided flags back as output, preventing the field from being
used to reliably report new data from the drivers.

Reliable use of the new VFIO_PRECOPY_INFO_REINIT flag requires userspace
to explicitly opt in. For that, we introduce a new feature named
VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2.

Userspace opts in to the above feature with a SET operation; no data is
required and any supplied data is ignored.

When the caller opts in:
- We set info.flags to zero; otherwise we keep the v1 behaviour as is
  for compatibility reasons.
- The new output flag VFIO_PRECOPY_INFO_REINIT can be used reliably.
- The VFIO_PRECOPY_INFO_REINIT output flag indicates that new initial
  data is present on the stream. The initial_bytes value should be
  re-evaluated relative to the readiness state for transition to
  STOP_COPY.
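The v1-vs-v2 flags handling described above can be sketched as a small
userspace simulation. This is a hypothetical stand-in, not the kernel
helper itself: the struct mirrors the vfio_precopy_info uAPI layout, and
precopy_prepare_flags() models only the decision of whether flags is
cleared on entry (v2 opt-in) or left echoing userspace input (v1
bug-compatibility).

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical stand-in mirroring the struct vfio_precopy_info layout. */
struct precopy_info {
	uint32_t argsz;
	uint32_t flags;
	uint64_t initial_bytes;
	uint64_t dirty_bytes;
};

/*
 * Sketch of the flags handling: with the v1 behaviour the kernel leaves
 * whatever userspace wrote in flags, so it is echoed back on output;
 * with the v2 opt-in the field is zeroed on entry and becomes a pure
 * output field the driver can set reliably.
 */
void precopy_prepare_flags(struct precopy_info *info, int precopy_info_v2)
{
	if (precopy_info_v2)
		info->flags = 0;	/* output field, start clean */
	/* v1: leave flags untouched for bug-compatibility */
}
```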

The mlx5 VFIO driver is extended to support this case when the
underlying firmware also supports the REINIT migration state.

As part of this series, a core helper function is introduced to provide
shared functionality for implementing the VFIO_MIG_GET_PRECOPY_INFO
ioctl, and all drivers have been updated to use it.

Changes from V0:
https://patchwork.kernel.org/project/kvm/cover/20260224082019.25772-1-yishaih@nvidia.com/

Patch #1:
-Fix a typo in a comment (prepcopy_info_v2 -> precopy_info_v2)

Patch #2,#3,#6:
-Rename 'precopy_info_flags_fix' to 'precopy_info_v2'.

The corresponding QEMU series can be found here [1].
[1] https://github.com/avihai1122/qemu/commits/vfio_precopy_info_reinit/

Note:
We may need to send the net/mlx5 patch to VFIO as a pull request to
avoid conflicts prior to acceptance.

Yishai

Yishai Hadas (6):
  vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase
  vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2
  vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl
  net/mlx5: Add IFC bits for migration state
  vfio/mlx5: consider inflight SAVE during PRE_COPY
  vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO

 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    |  17 +--
 drivers/vfio/pci/mlx5/cmd.c                   |  25 +++-
 drivers/vfio/pci/mlx5/cmd.h                   |   6 +-
 drivers/vfio/pci/mlx5/main.c                  | 118 +++++++++++-------
 drivers/vfio/pci/qat/main.c                   |  17 +--
 drivers/vfio/pci/vfio_pci_core.c              |   1 +
 drivers/vfio/pci/virtio/migrate.c             |  17 +--
 drivers/vfio/vfio_main.c                      |  20 +++
 include/linux/mlx5/mlx5_ifc.h                 |  16 ++-
 include/linux/vfio.h                          |  40 ++++++
 include/uapi/linux/vfio.h                     |  22 ++++
 samples/vfio-mdev/mtty.c                      |  16 +--
 12 files changed, 217 insertions(+), 98 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH V1 vfio 1/6] vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase
  2026-03-10 16:40 [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization Yishai Hadas
@ 2026-03-10 16:40 ` Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 2/6] vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 Yishai Hadas
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Yishai Hadas @ 2026-03-10 16:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: kvm, kevin.tian, joao.m.martins, leonro, yishaih, maorg, avihaih,
	clg, peterx, liulongfang, giovanni.cabiddu, kwankhede

As currently defined, initial_bytes is monotonically decreasing and
precedes dirty_bytes when reading from the saving file descriptor.
The transition from initial_bytes to dirty_bytes is unidirectional and
irreversible.

The initial_bytes are considered critical data that is highly
recommended to be transferred to the target as part of PRE_COPY. Without
this data, the PRE_COPY phase would be ineffective.

Handle the case where a new chunk of critical data is introduced during
the PRE_COPY phase and the driver would like to report an entirely new
value for initial_bytes.

For that, we extend the VFIO_MIG_GET_PRECOPY_INFO ioctl with an output
flag named VFIO_PRECOPY_INFO_REINIT, allowing drivers to report a new
initial_bytes value during the PRE_COPY phase.

Currently, existing VFIO_MIG_GET_PRECOPY_INFO implementations don't
assign info.flags before copy_to_user(). This effectively echoes
userspace-provided flags back as output, preventing the field from being
used to reliably report new data from the drivers.

Reliable use of the new VFIO_PRECOPY_INFO_REINIT flag requires userspace
to explicitly opt in by enabling the
VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 device feature.

When the caller opts in, the driver may report an entirely new value for
initial_bytes. The new value may be larger or smaller, and may include
or discard the previously unread initial_bytes, depending on the
driver's logic and state.
The presence of the VFIO_PRECOPY_INFO_REINIT output flag set by the
driver indicates that new initial data is present on the stream.

Once the caller sees this flag, the initial_bytes value should be
re-evaluated relative to the readiness state for transition to
STOP_COPY.
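The userspace reaction described above can be sketched as a small pure
function. This is a hypothetical illustration of the intended uAPI
contract, not code from this series: the flag value mirrors
VFIO_PRECOPY_INFO_REINIT, and the names precopy_update() and
precopy_ready_for_stop_copy() are invented for the sketch.

```c
#include <stdint.h>
#include <assert.h>

#define PRECOPY_INFO_REINIT (1u << 0)	/* mirrors VFIO_PRECOPY_INFO_REINIT */

/* Hypothetical userspace-side accounting of pending initial data. */
struct precopy_progress {
	uint64_t pending_initial;
};

/*
 * Take the latest initial_bytes as the pending amount. Without REINIT it
 * decreases monotonically; with REINIT it is an entirely new baseline.
 * Returns nonzero when the caller should re-evaluate its readiness for
 * the transition to STOP_COPY.
 */
int precopy_update(struct precopy_progress *p, uint32_t flags,
		   uint64_t initial_bytes)
{
	p->pending_initial = initial_bytes;
	return !!(flags & PRECOPY_INFO_REINIT);
}

/* Leave PRE_COPY for STOP_COPY only once the initial data is drained. */
int precopy_ready_for_stop_copy(const struct precopy_progress *p)
{
	return p->pending_initial == 0;
}
```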

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 include/uapi/linux/vfio.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index bb7b89330d35..90e51e84539d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1266,6 +1266,17 @@ enum vfio_device_mig_state {
  * The initial_bytes field indicates the amount of initial precopy
  * data available from the device. This field should have a non-zero initial
  * value and decrease as migration data is read from the device.
+ * The presence of the VFIO_PRECOPY_INFO_REINIT output flag indicates
+ * that new initial data is present on the stream.
+ * In that case initial_bytes may report a non-zero value irrespective of
+ * any previously reported values, which progresses towards zero as precopy
+ * data is read from the data stream. dirty_bytes is also reset
+ * to zero and represents the state change of the device relative to the new
+ * initial_bytes.
+ * VFIO_PRECOPY_INFO_REINIT can be reported only after userspace opts in to
+ * VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2. Without this opt-in, the flags field
+ * of struct vfio_precopy_info is reserved for bug-compatibility reasons.
+ *
  * It is recommended to leave PRE_COPY for STOP_COPY only after this field
  * reaches zero. Leaving PRE_COPY earlier might make things slower.
  *
@@ -1301,6 +1312,7 @@ enum vfio_device_mig_state {
 struct vfio_precopy_info {
 	__u32 argsz;
 	__u32 flags;
+#define VFIO_PRECOPY_INFO_REINIT (1 << 0) /* output - new initial data is present */
 	__aligned_u64 initial_bytes;
 	__aligned_u64 dirty_bytes;
 };
@@ -1510,6 +1522,16 @@ struct vfio_device_feature_dma_buf {
 	struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
 };
 
+/*
+ * Enables the migration precopy_info_v2 behaviour.
+ *
+ * VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2.
+ *
+ * On SET, this enables the v2 precopy_info behaviour, where
+ * vfio_precopy_info.flags is a valid output field.
+ */
+#define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2  12
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V1 vfio 2/6] vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2
  2026-03-10 16:40 [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 1/6] vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase Yishai Hadas
@ 2026-03-10 16:40 ` Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 3/6] vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl Yishai Hadas
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Yishai Hadas @ 2026-03-10 16:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: kvm, kevin.tian, joao.m.martins, leonro, yishaih, maorg, avihaih,
	clg, peterx, liulongfang, giovanni.cabiddu, kwankhede

Currently, existing VFIO_MIG_GET_PRECOPY_INFO implementations don't
assign info.flags before copy_to_user().

Because they copy the struct in from userspace first, this effectively
echoes userspace-provided flags back as output, preventing the field
from being used to report new reliable data from the drivers.

Add support for a new device feature named
VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2.

On SET, this enables the v2 precopy_info behaviour, where
vfio_precopy_info.flags is a valid output field.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c |  1 +
 drivers/vfio/vfio_main.c         | 20 ++++++++++++++++++++
 include/linux/vfio.h             |  1 +
 3 files changed, 22 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d43745fe4c84..1daaceb5b2c8 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -736,6 +736,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 #endif
 	vfio_pci_core_disable(vdev);
 
+	core_vdev->precopy_info_v2 = 0;
 	vfio_pci_dma_buf_cleanup(vdev);
 
 	mutex_lock(&vdev->igate);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 742477546b15..dcb879018f27 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -964,6 +964,23 @@ vfio_ioctl_device_feature_migration_data_size(struct vfio_device *device,
 	return 0;
 }
 
+static int
+vfio_ioctl_device_feature_migration_precopy_info_v2(struct vfio_device *device,
+						    u32 flags, size_t argsz)
+{
+	int ret;
+
+	if (!(device->migration_flags & VFIO_MIGRATION_PRE_COPY))
+		return -EINVAL;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret != 1)
+		return ret;
+
+	device->precopy_info_v2 = 1;
+	return 0;
+}
+
 static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
 					       u32 flags, void __user *arg,
 					       size_t argsz)
@@ -1251,6 +1268,9 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 		return vfio_ioctl_device_feature_migration_data_size(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2:
+		return vfio_ioctl_device_feature_migration_precopy_info_v2(
+			device, feature.flags, feature.argsz - minsz);
 	default:
 		if (unlikely(!device->ops->device_feature))
 			return -ENOTTY;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e90859956514..7c1d33283e04 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -52,6 +52,7 @@ struct vfio_device {
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
 	unsigned int migration_flags;
+	u8 precopy_info_v2;
 	struct kvm *kvm;
 
 	/* Members below here are private, not for driver use */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V1 vfio 3/6] vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl
  2026-03-10 16:40 [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 1/6] vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 2/6] vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 Yishai Hadas
@ 2026-03-10 16:40 ` Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 4/6] net/mlx5: Add IFC bits for migration state Yishai Hadas
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Yishai Hadas @ 2026-03-10 16:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: kvm, kevin.tian, joao.m.martins, leonro, yishaih, maorg, avihaih,
	clg, peterx, liulongfang, giovanni.cabiddu, kwankhede

Introduce a core helper function for VFIO_MIG_GET_PRECOPY_INFO and adapt
all drivers to use it.

It centralizes the common code and ensures that output flags are cleared
on entry when the user opts in to
VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2. This prevents any unintended
echoing of userspace data back to userspace.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    | 17 +++-----
 drivers/vfio/pci/mlx5/main.c                  | 18 +++------
 drivers/vfio/pci/qat/main.c                   | 17 +++-----
 drivers/vfio/pci/virtio/migrate.c             | 17 +++-----
 include/linux/vfio.h                          | 39 +++++++++++++++++++
 samples/vfio-mdev/mtty.c                      | 16 +++-----
 6 files changed, 68 insertions(+), 56 deletions(-)

diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index 1d367cff7dcf..bb121f635b9f 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -857,18 +857,12 @@ static long hisi_acc_vf_precopy_ioctl(struct file *filp,
 	struct hisi_acc_vf_core_device *hisi_acc_vdev = migf->hisi_acc_vdev;
 	loff_t *pos = &filp->f_pos;
 	struct vfio_precopy_info info;
-	unsigned long minsz;
 	int ret;
 
-	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
-		return -ENOTTY;
-
-	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
-
-	if (copy_from_user(&info, (void __user *)arg, minsz))
-		return -EFAULT;
-	if (info.argsz < minsz)
-		return -EINVAL;
+	ret = vfio_check_precopy_ioctl(&hisi_acc_vdev->core_device.vdev, cmd,
+				       arg, &info);
+	if (ret)
+		return ret;
 
 	mutex_lock(&hisi_acc_vdev->state_mutex);
 	if (hisi_acc_vdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY) {
@@ -893,7 +887,8 @@ static long hisi_acc_vf_precopy_ioctl(struct file *filp,
 	mutex_unlock(&migf->lock);
 	mutex_unlock(&hisi_acc_vdev->state_mutex);
 
-	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+	return copy_to_user((void __user *)arg, &info,
+		offsetofend(struct vfio_precopy_info, dirty_bytes)) ? -EFAULT : 0;
 out:
 	mutex_unlock(&migf->lock);
 	mutex_unlock(&hisi_acc_vdev->state_mutex);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index dbba6173894b..fb541c17c712 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -463,21 +463,14 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 	struct mlx5_vhca_data_buffer *buf;
 	struct vfio_precopy_info info = {};
 	loff_t *pos = &filp->f_pos;
-	unsigned long minsz;
 	size_t inc_length = 0;
 	bool end_of_data = false;
 	int ret;
 
-	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
-		return -ENOTTY;
-
-	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
-
-	if (copy_from_user(&info, (void __user *)arg, minsz))
-		return -EFAULT;
-
-	if (info.argsz < minsz)
-		return -EINVAL;
+	ret = vfio_check_precopy_ioctl(&mvdev->core_device.vdev, cmd, arg,
+				       &info);
+	if (ret)
+		return ret;
 
 	mutex_lock(&mvdev->state_mutex);
 	if (mvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY &&
@@ -545,7 +538,8 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 
 done:
 	mlx5vf_state_mutex_unlock(mvdev);
-	if (copy_to_user((void __user *)arg, &info, minsz))
+	if (copy_to_user((void __user *)arg, &info,
+			 offsetofend(struct vfio_precopy_info, dirty_bytes)))
 		return -EFAULT;
 	return 0;
 
diff --git a/drivers/vfio/pci/qat/main.c b/drivers/vfio/pci/qat/main.c
index b982d4ae666c..b3a4b7a55696 100644
--- a/drivers/vfio/pci/qat/main.c
+++ b/drivers/vfio/pci/qat/main.c
@@ -121,18 +121,12 @@ static long qat_vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 	struct qat_mig_dev *mig_dev = qat_vdev->mdev;
 	struct vfio_precopy_info info;
 	loff_t *pos = &filp->f_pos;
-	unsigned long minsz;
 	int ret = 0;
 
-	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
-		return -ENOTTY;
-
-	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
-
-	if (copy_from_user(&info, (void __user *)arg, minsz))
-		return -EFAULT;
-	if (info.argsz < minsz)
-		return -EINVAL;
+	ret = vfio_check_precopy_ioctl(&qat_vdev->core_device.vdev, cmd, arg,
+				       &info);
+	if (ret)
+		return ret;
 
 	mutex_lock(&qat_vdev->state_mutex);
 	if (qat_vdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY &&
@@ -160,7 +154,8 @@ static long qat_vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 	mutex_unlock(&qat_vdev->state_mutex);
 	if (ret)
 		return ret;
-	return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+	return copy_to_user((void __user *)arg, &info,
+		offsetofend(struct vfio_precopy_info, dirty_bytes)) ? -EFAULT : 0;
 }
 
 static ssize_t qat_vf_save_read(struct file *filp, char __user *buf,
diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
index 35fa2d6ed611..7e11834ad512 100644
--- a/drivers/vfio/pci/virtio/migrate.c
+++ b/drivers/vfio/pci/virtio/migrate.c
@@ -443,19 +443,13 @@ static long virtiovf_precopy_ioctl(struct file *filp, unsigned int cmd,
 	struct vfio_precopy_info info = {};
 	loff_t *pos = &filp->f_pos;
 	bool end_of_data = false;
-	unsigned long minsz;
 	u32 ctx_size = 0;
 	int ret;
 
-	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
-		return -ENOTTY;
-
-	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
-	if (copy_from_user(&info, (void __user *)arg, minsz))
-		return -EFAULT;
-
-	if (info.argsz < minsz)
-		return -EINVAL;
+	ret = vfio_check_precopy_ioctl(&virtvdev->core_device.vdev, cmd, arg,
+				       &info);
+	if (ret)
+		return ret;
 
 	mutex_lock(&virtvdev->state_mutex);
 	if (virtvdev->mig_state != VFIO_DEVICE_STATE_PRE_COPY &&
@@ -514,7 +508,8 @@ static long virtiovf_precopy_ioctl(struct file *filp, unsigned int cmd,
 
 done:
 	virtiovf_state_mutex_unlock(virtvdev);
-	if (copy_to_user((void __user *)arg, &info, minsz))
+	if (copy_to_user((void __user *)arg, &info,
+			 offsetofend(struct vfio_precopy_info, dirty_bytes)))
 		return -EFAULT;
 	return 0;
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 7c1d33283e04..50b474334a19 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -16,6 +16,7 @@
 #include <linux/cdev.h>
 #include <uapi/linux/vfio.h>
 #include <linux/iova_bitmap.h>
+#include <linux/uaccess.h>
 
 struct kvm;
 struct iommufd_ctx;
@@ -285,6 +286,44 @@ static inline int vfio_check_feature(u32 flags, size_t argsz, u32 supported_ops,
 	return 1;
 }
 
+/**
+ * vfio_check_precopy_ioctl - Validate user input for the VFIO_MIG_GET_PRECOPY_INFO ioctl
+ * @vdev: The vfio device
+ * @cmd: Cmd from the ioctl
+ * @arg: Arg from the ioctl
+ * @info: Driver pointer to hold the userspace input to the ioctl
+ *
+ * For use in a driver's get_precopy_info. Checks that the inputs to the
+ * VFIO_MIG_GET_PRECOPY_INFO ioctl are correct.
+ *
+ * Returns 0 on success, otherwise errno.
+ */
+
+static inline int
+vfio_check_precopy_ioctl(struct vfio_device *vdev, unsigned int cmd,
+			 unsigned long arg, struct vfio_precopy_info *info)
+{
+	unsigned long minsz;
+
+	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
+		return -ENOTTY;
+
+	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
+
+	if (copy_from_user(info, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (info->argsz < minsz)
+		return -EINVAL;
+
+	/* keep v1 behaviour as is for compatibility reasons */
+	if (vdev->precopy_info_v2)
+		/* flags are output, set its initial value to 0 */
+		info->flags = 0;
+
+	return 0;
+}
+
 struct vfio_device *_vfio_alloc_device(size_t size, struct device *dev,
 				       const struct vfio_device_ops *ops);
 #define vfio_alloc_device(dev_struct, member, dev, ops)				\
diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
index bd92c38379b8..c1070af69544 100644
--- a/samples/vfio-mdev/mtty.c
+++ b/samples/vfio-mdev/mtty.c
@@ -837,18 +837,11 @@ static long mtty_precopy_ioctl(struct file *filp, unsigned int cmd,
 	struct mdev_state *mdev_state = migf->mdev_state;
 	loff_t *pos = &filp->f_pos;
 	struct vfio_precopy_info info = {};
-	unsigned long minsz;
 	int ret;
 
-	if (cmd != VFIO_MIG_GET_PRECOPY_INFO)
-		return -ENOTTY;
-
-	minsz = offsetofend(struct vfio_precopy_info, dirty_bytes);
-
-	if (copy_from_user(&info, (void __user *)arg, minsz))
-		return -EFAULT;
-	if (info.argsz < minsz)
-		return -EINVAL;
+	ret = vfio_check_precopy_ioctl(&mdev_state->vdev, cmd, arg, &info);
+	if (ret)
+		return ret;
 
 	mutex_lock(&mdev_state->state_mutex);
 	if (mdev_state->state != VFIO_DEVICE_STATE_PRE_COPY &&
@@ -875,7 +868,8 @@ static long mtty_precopy_ioctl(struct file *filp, unsigned int cmd,
 	info.initial_bytes = migf->filled_size - *pos;
 	mutex_unlock(&migf->lock);
 
-	ret = copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+	ret = copy_to_user((void __user *)arg, &info,
+		offsetofend(struct vfio_precopy_info, dirty_bytes)) ? -EFAULT : 0;
 unlock:
 	mtty_state_mutex_unlock(mdev_state);
 	return ret;
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V1 vfio 4/6] net/mlx5: Add IFC bits for migration state
  2026-03-10 16:40 [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization Yishai Hadas
                   ` (2 preceding siblings ...)
  2026-03-10 16:40 ` [PATCH V1 vfio 3/6] vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl Yishai Hadas
@ 2026-03-10 16:40 ` Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 5/6] vfio/mlx5: consider inflight SAVE during PRE_COPY Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO Yishai Hadas
  5 siblings, 0 replies; 16+ messages in thread
From: Yishai Hadas @ 2026-03-10 16:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: kvm, kevin.tian, joao.m.martins, leonro, yishaih, maorg, avihaih,
	clg, peterx, liulongfang, giovanni.cabiddu, kwankhede

Add the relevant IFC bits for querying an extra migration state from the
device.
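Per the IFC layout below, the new 4-bit migration_state field sits at bit
offset 0x60 of the query output (after status, syndrome and one reserved
dword). A hypothetical userspace decode, assuming the usual mlx5 IFC
convention of big-endian dwords with fields laid out MSB-first (this is a
sketch of the bit arithmetic, not the real MLX5_GET() macro):

```c
#include <stdint.h>
#include <assert.h>

/*
 * Extract migration_state[0x4] at bit offset 0x60 of the raw query
 * output: byte 12 (0x60 / 8), top 4 bits of that byte under the
 * MSB-first IFC layout. Assumption: layout matches the IFC struct
 * in this patch.
 */
static inline uint8_t get_migration_state(const uint8_t *out)
{
	return out[12] >> 4;
}
```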

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 775cb0c56865..1c8922c58c8f 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -2173,7 +2173,8 @@ struct mlx5_ifc_cmd_hca_cap_2_bits {
 	u8	   sf_eq_usage[0x1];
 	u8	   reserved_at_d3[0x5];
 	u8	   multiplane[0x1];
-	u8	   reserved_at_d9[0x7];
+	u8	   migration_state[0x1];
+	u8	   reserved_at_da[0x6];
 
 	u8	   cross_vhca_object_to_object_supported[0x20];
 
@@ -13280,13 +13281,24 @@ struct mlx5_ifc_query_vhca_migration_state_in_bits {
 	u8         reserved_at_60[0x20];
 };
 
+enum {
+	MLX5_QUERY_VHCA_MIG_STATE_UNINITIALIZED = 0x0,
+	MLX5_QUERY_VHCA_MIG_STATE_OPER_MIGRATION_IDLE = 0x1,
+	MLX5_QUERY_VHCA_MIG_STATE_OPER_MIGRATION_READY = 0x2,
+	MLX5_QUERY_VHCA_MIG_STATE_OPER_MIGRATION_DIRTY = 0x3,
+	MLX5_QUERY_VHCA_MIG_STATE_OPER_MIGRATION_INIT = 0x4,
+};
+
 struct mlx5_ifc_query_vhca_migration_state_out_bits {
 	u8         status[0x8];
 	u8         reserved_at_8[0x18];
 
 	u8         syndrome[0x20];
 
-	u8         reserved_at_40[0x40];
+	u8         reserved_at_40[0x20];
+
+	u8         migration_state[0x4];
+	u8         reserved_at_64[0x1c];
 
 	u8         required_umem_size[0x20];
 
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V1 vfio 5/6] vfio/mlx5: consider inflight SAVE during PRE_COPY
  2026-03-10 16:40 [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization Yishai Hadas
                   ` (3 preceding siblings ...)
  2026-03-10 16:40 ` [PATCH V1 vfio 4/6] net/mlx5: Add IFC bits for migration state Yishai Hadas
@ 2026-03-10 16:40 ` Yishai Hadas
  2026-03-10 16:40 ` [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO Yishai Hadas
  5 siblings, 0 replies; 16+ messages in thread
From: Yishai Hadas @ 2026-03-10 16:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: kvm, kevin.tian, joao.m.martins, leonro, yishaih, maorg, avihaih,
	clg, peterx, liulongfang, giovanni.cabiddu, kwankhede

Consider an inflight SAVE operation during the PRE_COPY phase, so that
the caller will wait when no data is currently available but more is
expected to arrive.

This enables a follow-up patch to avoid returning -ENOMSG while a new
*initial_bytes* chunk is still pending from an asynchronous SAVE command
issued by the VFIO_MIG_GET_PRECOPY_INFO ioctl.
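The wake-up condition this patch changes can be viewed as a pure
predicate. Below is a simplified mirror of the mlx5vf_save_read()
wait condition from the hunk in this patch; the enum names are local
stand-ins for the MLX5_MIGF_STATE_* values, and the function name is
invented for the sketch.

```c
#include <assert.h>

/* Local stand-ins for the MLX5_MIGF_STATE_* values. */
enum migf_state {
	STATE_SAVING,
	STATE_PRE_COPY,
	STATE_PRE_COPY_ERROR,
	STATE_ERROR,
	STATE_COMPLETE,
};

/*
 * A read may proceed when buffers are queued, on error/completion, or in
 * PRE_COPY only once no SAVE command is still inflight - the key change
 * of this patch.
 */
int save_read_may_proceed(enum migf_state state, int buf_list_nonempty,
			  int inflight_save)
{
	return buf_list_nonempty ||
	       state == STATE_ERROR ||
	       state == STATE_PRE_COPY_ERROR ||
	       (state == STATE_PRE_COPY && !inflight_save) ||
	       state == STATE_COMPLETE;
}
```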

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 5 +++++
 drivers/vfio/pci/mlx5/cmd.h  | 1 +
 drivers/vfio/pci/mlx5/main.c | 3 ++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index ca6d95f293cd..18b8d8594070 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -606,6 +606,8 @@ static void
 mlx5vf_save_callback_complete(struct mlx5_vf_migration_file *migf,
 			      struct mlx5vf_async_data *async_data)
 {
+	migf->inflight_save = 0;
+	wake_up_interruptible(&migf->poll_wait);
 	kvfree(async_data->out);
 	complete(&migf->save_comp);
 	fput(migf->filp);
@@ -809,6 +811,7 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 
 	async_data->header_buf = header_buf;
 	get_file(migf->filp);
+	migf->inflight_save = 1;
 	err = mlx5_cmd_exec_cb(&migf->async_ctx, in, sizeof(in),
 			       async_data->out,
 			       out_size, mlx5vf_save_callback,
@@ -819,6 +822,8 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	return 0;
 
 err_exec:
+	migf->inflight_save = 0;
+	wake_up_interruptible(&migf->poll_wait);
 	if (header_buf)
 		mlx5vf_put_data_buffer(header_buf);
 	fput(migf->filp);
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index d7821b5ca772..7d2c10be2e60 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -111,6 +111,7 @@ struct mlx5_vf_migration_file {
 	struct completion save_comp;
 	struct mlx5_async_ctx async_ctx;
 	struct mlx5vf_async_data async_data;
+	u8 inflight_save:1;
 };
 
 struct mlx5_vhca_cq_buf {
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index fb541c17c712..68e051c48d40 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -179,7 +179,8 @@ static ssize_t mlx5vf_save_read(struct file *filp, char __user *buf, size_t len,
 				!list_empty(&migf->buf_list) ||
 				migf->state == MLX5_MIGF_STATE_ERROR ||
 				migf->state == MLX5_MIGF_STATE_PRE_COPY_ERROR ||
-				migf->state == MLX5_MIGF_STATE_PRE_COPY ||
+				(migf->state == MLX5_MIGF_STATE_PRE_COPY &&
+				 !migf->inflight_save) ||
 				migf->state == MLX5_MIGF_STATE_COMPLETE))
 			return -ERESTARTSYS;
 	}
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-10 16:40 [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization Yishai Hadas
                   ` (4 preceding siblings ...)
  2026-03-10 16:40 ` [PATCH V1 vfio 5/6] vfio/mlx5: consider inflight SAVE during PRE_COPY Yishai Hadas
@ 2026-03-10 16:40 ` Yishai Hadas
  2026-03-12 17:37   ` Peter Xu
  5 siblings, 1 reply; 16+ messages in thread
From: Yishai Hadas @ 2026-03-10 16:40 UTC (permalink / raw)
  To: alex, jgg
  Cc: kvm, kevin.tian, joao.m.martins, leonro, yishaih, maorg, avihaih,
	clg, peterx, liulongfang, giovanni.cabiddu, kwankhede

When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
value.

The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
caller that new initial data is available in the migration stream.

If the firmware reports a new initial-data chunk, any previously dirty
bytes in memory are treated as initial bytes, since the caller must read
both sets before reaching the end of the initial-data region.

In this case, the driver issues a new SAVE command to fetch the data and
prepare it for a subsequent read() from userspace.
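The byte accounting described above can be sketched as follows. This is
a hypothetical simplification of the driver logic, not code from this
patch: on a firmware-reported re-init, the bytes previously counted as
dirty are folded into the initial set, since the caller must read both
before reaching the end of the initial-data region.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical accounting of the two byte counters reported to userspace. */
struct precopy_bytes {
	uint64_t initial;
	uint64_t dirty;
};

/*
 * On re-init, previously dirty bytes become initial bytes and the new
 * chunk fetched by the fresh SAVE command is added on top; dirty then
 * restarts from zero relative to the new initial_bytes.
 */
void precopy_reinit(struct precopy_bytes *b, uint64_t new_chunk)
{
	b->initial += b->dirty + new_chunk;	/* dirty folds into initial */
	b->dirty = 0;
}
```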

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 20 ++++++--
 drivers/vfio/pci/mlx5/cmd.h  |  5 +-
 drivers/vfio/pci/mlx5/main.c | 97 +++++++++++++++++++++++-------------
 3 files changed, 83 insertions(+), 39 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 18b8d8594070..5fe0621b5fbd 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -87,7 +87,7 @@ int mlx5vf_cmd_resume_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod)
 
 int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 					  size_t *state_size, u64 *total_size,
-					  u8 query_flags)
+					  u8 *mig_state, u8 query_flags)
 {
 	u32 out[MLX5_ST_SZ_DW(query_vhca_migration_state_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(query_vhca_migration_state_in)] = {};
@@ -152,6 +152,10 @@ int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 			MLX5_GET64(query_vhca_migration_state_out, out,
 				   remaining_total_size) : *state_size;
 
+	if (mig_state && mvdev->mig_state_cap)
+		*mig_state = MLX5_GET(query_vhca_migration_state_out, out,
+				      migration_state);
+
 	return 0;
 }
 
@@ -277,6 +281,9 @@ void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
 	if (MLX5_CAP_GEN_2(mvdev->mdev, migration_in_chunks))
 		mvdev->chunk_mode = 1;
 
+	if (MLX5_CAP_GEN_2(mvdev->mdev, migration_state))
+		mvdev->mig_state_cap = 1;
+
 end:
 	mlx5_vf_put_core_dev(mvdev->mdev);
 }
@@ -555,6 +562,7 @@ void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf)
 {
 	spin_lock_irq(&buf->migf->list_lock);
 	buf->stop_copy_chunk_num = 0;
+	buf->pre_copy_init_bytes_chunk = false;
 	list_add_tail(&buf->buf_elm, &buf->migf->avail_list);
 	spin_unlock_irq(&buf->migf->list_lock);
 }
@@ -689,7 +697,8 @@ static void mlx5vf_save_callback(int status, struct mlx5_async_work *context)
 				!next_required_umem_size;
 		if (async_data->header_buf) {
 			status = add_buf_header(async_data->header_buf, image_size,
-						initial_pre_copy);
+						initial_pre_copy ||
+						async_data->buf->pre_copy_init_bytes_chunk);
 			if (status)
 				goto err;
 		}
@@ -708,9 +717,12 @@ static void mlx5vf_save_callback(int status, struct mlx5_async_work *context)
 			}
 		}
 		spin_unlock_irqrestore(&migf->list_lock, flags);
-		if (initial_pre_copy) {
+		if (initial_pre_copy || async_data->buf->pre_copy_init_bytes_chunk) {
 			migf->pre_copy_initial_bytes += image_size;
-			migf->state = MLX5_MIGF_STATE_PRE_COPY;
+			if (initial_pre_copy)
+				migf->state = MLX5_MIGF_STATE_PRE_COPY;
+			if (async_data->buf->pre_copy_init_bytes_chunk)
+				async_data->buf->pre_copy_init_bytes_chunk = false;
 		}
 		if (stop_copy_last_chunk)
 			migf->state = MLX5_MIGF_STATE_COMPLETE;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 7d2c10be2e60..deed0f132f39 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -62,6 +62,7 @@ struct mlx5_vhca_data_buffer {
 	u32 *mkey_in;
 	enum dma_data_direction dma_dir;
 	u8 stop_copy_chunk_num;
+	bool pre_copy_init_bytes_chunk;
 	struct list_head buf_elm;
 	struct mlx5_vf_migration_file *migf;
 };
@@ -97,6 +98,7 @@ struct mlx5_vf_migration_file {
 	u32 record_tag;
 	u64 stop_copy_prep_size;
 	u64 pre_copy_initial_bytes;
+	u64 pre_copy_initial_bytes_start;
 	size_t next_required_umem_size;
 	u8 num_ready_chunks;
 	/* Upon chunk mode preserve another set of buffers for stop_copy phase */
@@ -175,6 +177,7 @@ struct mlx5vf_pci_core_device {
 	u8 mdev_detach:1;
 	u8 log_active:1;
 	u8 chunk_mode:1;
+	u8 mig_state_cap:1;
 	struct completion tracker_comp;
 	/* protect migration state */
 	struct mutex state_mutex;
@@ -199,7 +202,7 @@ int mlx5vf_cmd_suspend_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod);
 int mlx5vf_cmd_resume_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod);
 int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 					  size_t *state_size, u64 *total_size,
-					  u8 query_flags);
+					  u8 *migration_state, u8 query_flags);
 void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
 			       const struct vfio_migration_ops *mig_ops,
 			       const struct vfio_log_ops *log_ops);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 68e051c48d40..de306dee1d1a 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -464,8 +464,10 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 	struct mlx5_vhca_data_buffer *buf;
 	struct vfio_precopy_info info = {};
 	loff_t *pos = &filp->f_pos;
+	u8 migration_state = 0;
 	size_t inc_length = 0;
-	bool end_of_data = false;
+	bool reinit_state;
+	bool end_of_data;
 	int ret;
 
 	ret = vfio_check_precopy_ioctl(&mvdev->core_device.vdev, cmd, arg,
@@ -492,7 +494,8 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 		 * As so, the other code below is safe with the proper locks.
 		 */
 		ret = mlx5vf_cmd_query_vhca_migration_state(mvdev, &inc_length,
-							    NULL, MLX5VF_QUERY_INC);
+							    NULL, &migration_state,
+							    MLX5VF_QUERY_INC);
 		if (ret)
 			goto err_state_unlock;
 	}
@@ -503,41 +506,67 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 		goto err_migf_unlock;
 	}
 
-	if (migf->pre_copy_initial_bytes > *pos) {
-		info.initial_bytes = migf->pre_copy_initial_bytes - *pos;
+	/*
+	 * opt-in for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 serves
+	 * as opt-in for VFIO_PRECOPY_INFO_REINIT as well
+	 */
+	reinit_state = mvdev->core_device.vdev.precopy_info_v2 &&
+			migration_state == MLX5_QUERY_VHCA_MIG_STATE_OPER_MIGRATION_INIT;
+	end_of_data = !(migf->max_pos - *pos);
+	if (reinit_state) {
+		/*
+		 * Any bytes already present in memory are treated as initial
+		 * bytes, since the caller is required to read them before
+		 * reaching the new initial-bytes region.
+		 */
+		migf->pre_copy_initial_bytes_start = *pos;
+		migf->pre_copy_initial_bytes = migf->max_pos - *pos;
+		info.initial_bytes = migf->pre_copy_initial_bytes + inc_length;
+		info.flags |= VFIO_PRECOPY_INFO_REINIT;
 	} else {
-		info.dirty_bytes = migf->max_pos - *pos;
-		if (!info.dirty_bytes)
-			end_of_data = true;
-		info.dirty_bytes += inc_length;
+		if (migf->pre_copy_initial_bytes_start +
+		    migf->pre_copy_initial_bytes > *pos) {
+			WARN_ON_ONCE(end_of_data);
+			info.initial_bytes = migf->pre_copy_initial_bytes_start +
+				migf->pre_copy_initial_bytes - *pos;
+		} else {
+			info.dirty_bytes = (migf->max_pos - *pos) + inc_length;
+		}
 	}
+	mutex_unlock(&migf->lock);
 
-	if (!end_of_data || !inc_length) {
-		mutex_unlock(&migf->lock);
-		goto done;
-	}
+	if ((reinit_state || end_of_data) && inc_length) {
+		/*
+		 * In case we finished transferring the current state and the
+		 * device has a dirty state, or that the device has a new init
+		 * state, save a new state to be ready for.
+		 */
+		buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(inc_length, PAGE_SIZE),
+					     DMA_FROM_DEVICE);
+		if (IS_ERR(buf)) {
+			ret = PTR_ERR(buf);
+			mlx5vf_mark_err(migf);
+			goto err_state_unlock;
+		}
 
-	mutex_unlock(&migf->lock);
-	/*
-	 * We finished transferring the current state and the device has a
-	 * dirty state, save a new state to be ready for.
-	 */
-	buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(inc_length, PAGE_SIZE),
-				     DMA_FROM_DEVICE);
-	if (IS_ERR(buf)) {
-		ret = PTR_ERR(buf);
-		mlx5vf_mark_err(migf);
-		goto err_state_unlock;
-	}
+		buf->pre_copy_init_bytes_chunk = reinit_state;
+		ret = mlx5vf_cmd_save_vhca_state(mvdev, migf, buf, true, true);
+		if (ret) {
+			mlx5vf_mark_err(migf);
+			mlx5vf_put_data_buffer(buf);
+			goto err_state_unlock;
+		}
 
-	ret = mlx5vf_cmd_save_vhca_state(mvdev, migf, buf, true, true);
-	if (ret) {
-		mlx5vf_mark_err(migf);
-		mlx5vf_put_data_buffer(buf);
-		goto err_state_unlock;
+		/*
+		 * SAVE appends a header record via add_buf_header(),
+		 * let's account it as well.
+		 */
+		if (reinit_state)
+			info.initial_bytes += sizeof(struct mlx5_vf_migration_header);
+		else
+			info.dirty_bytes += sizeof(struct mlx5_vf_migration_header);
 	}
 
-done:
 	mlx5vf_state_mutex_unlock(mvdev);
 	if (copy_to_user((void __user *)arg, &info,
 			 offsetofend(struct vfio_precopy_info, dirty_bytes)))
@@ -570,7 +599,7 @@ static int mlx5vf_pci_save_device_inc_data(struct mlx5vf_pci_core_device *mvdev)
 	if (migf->state == MLX5_MIGF_STATE_ERROR)
 		return -ENODEV;
 
-	ret = mlx5vf_cmd_query_vhca_migration_state(mvdev, &length, NULL,
+	ret = mlx5vf_cmd_query_vhca_migration_state(mvdev, &length, NULL, NULL,
 				MLX5VF_QUERY_INC | MLX5VF_QUERY_FINAL);
 	if (ret)
 		goto err;
@@ -636,7 +665,7 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev, bool track)
 	if (ret)
 		goto out;
 
-	ret = mlx5vf_cmd_query_vhca_migration_state(mvdev, &length, &full_size, 0);
+	ret = mlx5vf_cmd_query_vhca_migration_state(mvdev, &length, &full_size, NULL, 0);
 	if (ret)
 		goto out_pd;
 
@@ -1123,7 +1152,7 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 		enum mlx5_vf_migf_state state;
 		size_t size;
 
-		ret = mlx5vf_cmd_query_vhca_migration_state(mvdev, &size, NULL,
+		ret = mlx5vf_cmd_query_vhca_migration_state(mvdev, &size, NULL, NULL,
 					MLX5VF_QUERY_INC | MLX5VF_QUERY_CLEANUP);
 		if (ret)
 			return ERR_PTR(ret);
@@ -1248,7 +1277,7 @@ static int mlx5vf_pci_get_data_size(struct vfio_device *vdev,
 
 	mutex_lock(&mvdev->state_mutex);
 	ret = mlx5vf_cmd_query_vhca_migration_state(mvdev, &state_size,
-						    &total_size, 0);
+						    &total_size, NULL, 0);
 	if (!ret)
 		*stop_copy_length = total_size;
 	mlx5vf_state_mutex_unlock(mvdev);
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-10 16:40 ` [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO Yishai Hadas
@ 2026-03-12 17:37   ` Peter Xu
  2026-03-12 19:08     ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2026-03-12 17:37 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: alex, jgg, kvm, kevin.tian, joao.m.martins, leonro, maorg,
	avihaih, clg, liulongfang, giovanni.cabiddu, kwankhede

Hi, Yishai,

Please feel free to treat my comments as pure questions only.

On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
> When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
> driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
> to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
> value.

Does it also mean that VFIO_PRECOPY_INFO_REINIT is almost only a hint that
can be deduced by the userspace too, if it remembers the last fetch of
initial_bytes?

It definitely sounds a bit weird when some initial_* data can actually
change, because it's not "initial_" anymore.

Another question is, if initial_bytes reached zero, could it be boosted
again to be non-zero?

I don't see what stops it from happening, if the "we get some fresh new
critical data" seems to be able to happen anytime..  but if so, I wonder if
it's a problem to QEMU: when initial_bytes reported to 0 at least _once_ it
means it's possible src QEMU decides to switchover.  Then looks like it
beats the purpose of "don't switchover until we flush the critical data"
whole idea.

Is there a way the HW can report and confidentally say no further critical
data will be generated?

> 
> The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
> caller that new initial data is available in the migration stream.
> 
> If the firmware reports a new initial-data chunk, any previously dirty
> bytes in memory are treated as initial bytes, since the caller must read
> both sets before reaching the end of the initial-data region.

This is unfortunate.  I believe it's a limitation because of the current
single fd streaming protocol, so HW can only append things because it's
kind of a pipeline.

One thing to mention is, I recall VFIO migration suffers from a major
bottleneck on read() of the VFIO FD, it means this streaming whole design
is also causing other perf issues.

Have you or anyone thought about making it not a stream anymore?  Take
example of RAM blocks: it is pagesize accessible, with that we can do a lot
more, e.g. we don't need to streamline pages, we can send pages in whatever
order.  Meanwhile, we can send pages concurrently because they're not
streamlined too.

I wonder if VFIO FDs can provide something like that too, as a start it
doesn't need to be as fine-grained, maybe at least instead of using one
stream it can provide two streams, one for initial_bytes (or, I really
think this should be called "critical data" or something similar, if it
represents that rather than "some initial states", not anymore), another
one for dirty.  Then at least when you attach new critical data you don't
need to flush dirty queue too.

If to extend it a bit more, then we can also make e.g. dirty queue to be
multiple FDs, so that userspace can read() in multiple threads, speeding up
the switchover phase.

I had a vague memory that there are sometimes big kernel locks to block it,
but from interfacing POV it sounds always better to avoid using one fd to
stream everything.

Thanks,

> 
> In this case, the driver issues a new SAVE command to fetch the data and
> prepare it for a subsequent read() from userspace.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-12 17:37   ` Peter Xu
@ 2026-03-12 19:08     ` Alex Williamson
  2026-03-12 20:16       ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2026-03-12 19:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yishai Hadas, jgg, kvm, kevin.tian, joao.m.martins, leonro, maorg,
	avihaih, clg, liulongfang, giovanni.cabiddu, kwankhede, alex

Hey Peter,

On Thu, 12 Mar 2026 13:37:04 -0400
Peter Xu <peterx@redhat.com> wrote:

> Hi, Yishai,
> 
> Please feel free to treat my comments as pure questions only.
> 
> On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
> > When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
> > driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
> > to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
> > value.  
> 
> Does it also mean that VFIO_PRECOPY_INFO_REINIT is almost only a hint that
> can be deduced by the userspace too, if it remembers the last time fetch of
> initial_bytes?

I'll try to answer some of these.  PRECOPY_INFO is already just a hint.
We essentially define initial_bytes as the "please copy this before
migration to avoid high latency setup" and dirty_bytes is "I also have
this much dirty state I could give to you now".  We've defined
initial_bytes as monotonically decreasing, so a user could deduce that
they've passed the intended high latency setup threshold, while
dirty_bytes is purely volatile.

The trouble comes, for example, if the device has undergone a
reconfiguration during migration, which may effectively negate the
initial_bytes and switchover-ack.

A user deducing they've sent enough device data to cover initial_bytes
is essentially what we have now because our protocol doesn't allow the
driver to reset initial_bytes.  The driver may choose to send that
reconfiguration information in dirty_bytes bytes, but we don't
currently have any way to indicate to the user that data remaining
there is of higher importance for startup on the target than any other
dirtying of device state.

Hopefully the user/VMM is already polling the interface for dirty
bytes, where the opt-in for the protocol change here allows the
driver to split out the priority bytes versus the background dirtying. 
 
> It definitely sounds a bit weird when some initial_* data can actually
> change, because it's not "initial_" anymore.

It's just a priority scheme.  In the case I've outlined above it might
be more aptly named setup_bytes or critical_bytes as you've used, but
another driver might just use it for detecting migration compatibility.
Naming is hard.
 
> Another question is, if initial_bytes reached zero, could it be boosted
> again to be non-zero?

Under the new protocol, yes, and the REINIT flag would be set to
indicate it had been reset.  Under the old protocol, no.
 
> I don't see what stops it from happening, if the "we get some fresh new
> critical data" seem to be able to happen anytime..  but if so, I wonder if
> it's a problem to QEMU: when initial_bytes reported to 0 at least _once_ it
> means it's possible src QEMU decides to switchover.  Then looks like it
> beats the purpose of "don't switchover until we flush the critical data"
> whole idea.

The definition of the protocol in the header stops it from happening.
We can't know that there isn't some userspace that follows the
deduction protocol rather than polling.  We don't know there isn't some
userspace that segfaults if initial_bytes doesn't follow the published
protocol.  Therefore opt-in where we have a mechanism to expose a new
initial_bytes session without it becoming a purely volatile value.
 
Is there a way the HW can report and confidently say no further critical
> data will be generated?

So long as there's a guest userspace running that can reconfigure the
device, no.  But if you stop the vCPUs and test PRECOPY_INFO, it should
be reliable.

> > The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
> > caller that new initial data is available in the migration stream.
> > 
> > If the firmware reports a new initial-data chunk, any previously dirty
> > bytes in memory are treated as initial bytes, since the caller must read
> > both sets before reaching the end of the initial-data region.  
> 
> This is unfortunate.  I believe it's a limitation because of the current
> single fd streaming protocol, so HW can only append things because it's
> kind of a pipeline.
> 
> One thing to mention is, I recall VFIO migration suffers from a major
> bottleneck on read() of the VFIO FD, it means this streaming whole design
> is also causing other perf issues.
> 
> Have you or anyone thought about making it not a stream anymore?  Take
> example of RAM blocks: it is pagesize accessible, with that we can do a lot
> more, e.g. we don't need to streamline pages, we can send pages in whatever
> order.  Meanwhile, we can send pages concurrently because they're not
> streamlined too.
> 
> I wonder if VFIO FDs can provide something like that too, as a start it
> doesn't need to be as fine-grained, maybe at least instead of using one
> stream it can provide two streams, one for initial_bytes (or, I really
> think this should be called "critical data" or something similar, if it
> represents that rather than "some initial states", not anymore), another
> one for dirty.  Then at least when you attach new critical data you don't
> need to flush dirty queue too.
> 
> If to extend it a bit more, then we can also make e.g. dirty queue to be
> multiple FDs, so that userspace can read() in multiple threads, speeding up
> the switchover phase.
> 
> I had a vague memory that there's sometimes kernel big locks to block it,
> but from interfacing POV it sounds always better to avoid using one fd to
> stream everything.

I'll leave it to others to brainstorm improvements, but I'll note that
flushing dirty_bytes is a driver policy, another driver could consider
unread dirty bytes as invalidated by new initial_bytes and reset
counters.

It's not clear to me that there's a generic algorithm to use for handling
device state as addressable blocks rather than serialized into a data
stream.  Multiple streams of different priorities seems feasible, but
now we're talking about a v3 migration protocol.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-12 19:08     ` Alex Williamson
@ 2026-03-12 20:16       ` Peter Xu
  2026-03-15 14:19         ` Yishai Hadas
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2026-03-12 20:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, jgg, kvm, kevin.tian, joao.m.martins, leonro, maorg,
	avihaih, clg, liulongfang, giovanni.cabiddu, kwankhede

On Thu, Mar 12, 2026 at 01:08:17PM -0600, Alex Williamson wrote:
> Hey Peter,

Hey, Alex,

> 
> On Thu, 12 Mar 2026 13:37:04 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Yishai,
> > 
> > Please feel free to treat my comments as pure questions only.
> > 
> > On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
> > > When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
> > > driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
> > > to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
> > > value.  
> > 
> > Does it also mean that VFIO_PRECOPY_INFO_REINIT is almost only a hint that
> > can be deduced by the userspace too, if it remembers the last time fetch of
> > initial_bytes?
> 
> I'll try to answer some of these.  PRECOPY_INFO is already just a hint.
> We essentially define initial_bytes as the "please copy this before
> migration to avoid high latency setup" and dirty_bytes is "I also have
> this much dirty state I could give to you now".  We've defined
> initial_bytes as monotonically decreasing, so a user could deduce that
> they've passed the intended high latency setup threshold, while
> dirty_bytes is purely volatile.

I see..  That might be another problem, though, for switchover decisions.

Currently, QEMU relies on dirty reporting to decide when to switchover.

What it does is asking all the modules for how many dirty data left, then
src QEMU do a sum, divide that sum with the estimated bandwidth to guess
the downtime.

When the estimated downtime is small enough so as to satisfy the user
specified downtime, QEMU src will switchover.  This didn't take
switchover_ack for VFIO into account, but it's a separate concept.

Above was based on the fact that the reported values are "total data", not
"what you can collect"..

Is there a possible way to provide a total amount?  It can even be a maximum
total amount just to cap the downtime.  With the current reporting
definition, the VM is destined to have unpredictable live migration downtime
when relevant VFIO devices are involved.

The larger the diff between the currently reported dirty value vs. the "total
data", the larger the downtime estimation error can be.

> 
> The trouble comes, for example, if the device has undergone a
> reconfiguration during migration, which may effectively negate the
> initial_bytes and switchover-ack.

Ah so it's about that, thanks.  IMHO it might be great if Yishai could
mention the source of growing initial_bytes somewhere in the commit log, or
even when documenting the new feature bit.

> 
> A user deducing they've sent enough device data to cover initial_bytes
> is essentially what we have now because our protocol doesn't allow the
> driver to reset initial_bytes.  The driver may choose to send that
> reconfiguration information in dirty_bytes bytes, but we don't
> currently have any way to indicate to the user that data remaining
> there is of higher importance for startup on the target than any other
> dirtying of device state.
> 
> Hopefully the user/VMM is already polling the interface for dirty
> bytes, where the opt-in for the protocol change here allows the
> driver to split out the priority bytes versus the background dirtying. 
>  
> > It definitely sounds a bit weird when some initial_* data can actually
> > change, because it's not "initial_" anymore.
> 
> It's just a priority scheme.  In the case I've outlined above it might
> be more aptly named setup_bytes or critical_bytes as you've used, but
> another driver might just use it for detecting migration compatibility.
> Naming is hard.

Yep. :) initial_bytes is still fine at least to me.  I wonder if we could
still update the document of this field, then it'll be good enough.

>  
> > Another question is, if initial_bytes reached zero, could it be boosted
> > again to be non-zero?
> 
> Under the new protocol, yes, and the REINIT flag would be set to
> indicate it had been reset.  Under the old protocol, no.
>  
> > I don't see what stops it from happening, if the "we get some fresh new
> > critical data" seem to be able to happen anytime..  but if so, I wonder if
> > it's a problem to QEMU: when initial_bytes reported to 0 at least _once_ it
> > means it's possible src QEMU decides to switchover.  Then looks like it
> > beats the purpose of "don't switchover until we flush the critical data"
> > whole idea.
> 
> The definition of the protocol in the header stops it from happening.
> We can't know that there isn't some userspace that follows the
> deduction protocol rather than polling.  We don't know there isn't some
> userspace that segfaults if initial_bytes doesn't follow the published
> protocol.  Therefore opt-in where we have a mechanism to expose a new
> initial_bytes session without it becoming a purely volatile value.

Here, IMHO the problem is QEMU still needs to know when a switchover can
happen.

After a new QEMU probes this new driver feature bit and enables it,
initial_bytes can be incremented when the REINIT flag is set.  This is fine on its
own.  But then, src QEMU still needs to decide when it can switch over.

It seems to me the only way to do it (with/without the new feature bit
enabled) is to rely on initial_bytes being zero.  When it's zero, it
means all possible "critical data" has been moved, then src QEMU can
kickoff that "switchover" message.

After that, IIUC we need to be prepared to trigger switchover anytime.

With the new REINIT, it means we can still observe a REINIT event after
src QEMU makes that decision.  Would that be a problem?

Nowadays, when looking at vfio code, what happens is src QEMU after seeing
initial_bytes==0 send one VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to dest QEMU,
later dst QEMU will ack that by sending back MIG_RP_MSG_SWITCHOVER_ACK.
Then switchover can happen anytime by the downtime calculation above.

Maybe there should be a solution in userspace to fix it, but we'll need
to figure it out.  Likely, we need one way or another to revoke the
switchover message, so ultimately we need to stop the VM, query one last
time, and on seeing initial_bytes==0 proceed with switchover.  If it sees
initial_bytes nonzero again, it will need to restart the VM and revoke the
previous message somehow.

>  
> > Is there a way the HW can report and confidently say no further critical
> > data will be generated?
> 
> So long as there's a guest userspace running that can reconfigure the
> device, no.  But if you stop the vCPUs and test PRECOPY_INFO, it should
> be reliable.

This is definitely an important piece of info.  I recall Zhiyi used to tell
me there's no way to really stop a VFIO device from generating dirty data.
Happy to know there still seems to be a way.  And now I suspect
what Zhiyi observed was exactly seeing dirty_bytes growing even after the VM
stopped.  If that counter means "how much you can read" it all makes more
sense (even though it may suffer from the issue I mentioned above).

> 
> > > The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
> > > caller that new initial data is available in the migration stream.
> > > 
> > > If the firmware reports a new initial-data chunk, any previously dirty
> > > bytes in memory are treated as initial bytes, since the caller must read
> > > both sets before reaching the end of the initial-data region.  
> > 
> > This is unfortunate.  I believe it's a limitation because of the current
> > single fd streaming protocol, so HW can only append things because it's
> > kind of a pipeline.
> > 
> > One thing to mention is, I recall VFIO migration suffers from a major
> > bottleneck on read() of the VFIO FD, it means this streaming whole design
> > is also causing other perf issues.
> > 
> > Have you or anyone thought about making it not a stream anymore?  Take
> > example of RAM blocks: it is pagesize accessible, with that we can do a lot
> > more, e.g. we don't need to streamline pages, we can send pages in whatever
> > order.  Meanwhile, we can send pages concurrently because they're not
> > streamlined too.
> > 
> > I wonder if VFIO FDs can provide something like that too, as a start it
> > doesn't need to be as fine-grained, maybe at least instead of using one
> > stream it can provide two streams, one for initial_bytes (or, I really
> > think this should be called "critical data" or something similar, if it
> > represents that rather than "some initial states", not anymore), another
> > one for dirty.  Then at least when you attach new critical data you don't
> > need to flush dirty queue too.
> > 
> > If to extend it a bit more, then we can also make e.g. dirty queue to be
> > multiple FDs, so that userspace can read() in multiple threads, speeding up
> > the switchover phase.
> > 
> > I had a vague memory that there's sometimes kernel big locks to block it,
> > but from interfacing POV it sounds always better to avoid using one fd to
> > stream everything.
> 
> I'll leave it to others to brainstorm improvements, but I'll note that
> flushing dirty_bytes is a driver policy, another driver could consider
> unread dirty bytes as invalidated by new initial_bytes and reset
> counters.
> 
> It's not clear to me that there's a generic algorithm to use for handling
> device state as addressable blocks rather than serialized into a data
> stream.  Multiple streams of different priorities seems feasible, but
> now we're talking about a v3 migration protocol.  Thanks,

Yep, definitely not a request to invent v3 yet, but just to brainstorm it.
It doesn't need to make everything addressable; being index-able (e.g. via
more than one object) would also be nice even through one fd, and then it
could also be threaded somehow.

It seems the HW designer needs to understand how the hypervisor works on
collecting this HW data, so it does look like a hard problem when it spans
the whole stack from the silicon layer up..

I just had a feeling that v3 (or more) will come at some point when we want
to finally resolve the VFIO downtime problems..

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-12 20:16       ` Peter Xu
@ 2026-03-15 14:19         ` Yishai Hadas
  2026-03-16 19:24           ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Yishai Hadas @ 2026-03-15 14:19 UTC (permalink / raw)
  To: Peter Xu, Alex Williamson
  Cc: jgg, kvm, kevin.tian, joao.m.martins, leonro, maorg, avihaih, clg,
	liulongfang, giovanni.cabiddu, kwankhede

On 12/03/2026 22:16, Peter Xu wrote:
> On Thu, Mar 12, 2026 at 01:08:17PM -0600, Alex Williamson wrote:
>> Hey Peter,
> 
> Hey, Alex,
> 
>>
>> On Thu, 12 Mar 2026 13:37:04 -0400
>> Peter Xu <peterx@redhat.com> wrote:
>>
>>> Hi, Yishai,
>>>
>>> Please feel free to treat my comments as pure questions only.
>>>
>>> On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
>>>> When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
>>>> driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
>>>> to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
>>>> value.
>>>
>>> Does it also mean that VFIO_PRECOPY_INFO_REINIT is almost only a hint that
>>> can be deduced by the userspace too, if it remembers the last time fetch of
>>> initial_bytes?
>>
>> I'll try to answer some of these.  PRECOPY_INFO is already just a hint.
>> We essentially define initial_bytes as the "please copy this before
>> migration to avoid high latency setup" and dirty_bytes is "I also have
>> this much dirty state I could give to you now".  We've defined
>> initial_bytes as monotonically decreasing, so a user could deduce that
>> they've passed the intended high latency setup threshold, while
>> dirty_bytes is purely volatile.
> 
> I see..  That might be another problem though to switchover decisions.
> 
> Currently, QEMU relies on dirty reporting to decide when to switchover.
> 
> What it does is asking all the modules for how many dirty data left, then
> src QEMU do a sum, divide that sum with the estimated bandwidth to guess
> the downtime.
> 
> When the estimated downtime is small enough so as to satisfy the user
> specified downtime, QEMU src will switchover.  This didn't take
> switchover_ack for VFIO into account, but it's a separate concept.
> 
> Above was based on the fact that the reported values are "total data", not
> "what you can collect"..
> 
> Is there a possible way to provide a total amount?  It can even be a maximum
> total amount just to cap the downtime. 

The total amount is already reported today via the 
VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl and QEMU accounts for that in the 
switchover decision.

> With the current reporting
> definition, the VM is destined to have unpredictable live migration downtime
> when relevant VFIO devices are involved.
> 
> The larger the diff between the current reported dirty value v.s. "total
> data", the larger the downtime mistake can happen.
> 
>>
>> The trouble comes, for example, if the device has undergone a
>> reconfiguration during migration, which may effectively negate the
>> initial_bytes and switchover-ack.
> 
> Ah so it's about that, thanks.  IMHO it might be great if Yishai could
> mention the source of growing initial_bytes somewhere in the commit log, or
> even when documenting the new feature bit.

Sure, we can add as part of V2 the below chunk when documenting the new 
feature.

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 90e51e84539d..bb4a2df0550d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1268,6 +1268,8 @@ enum vfio_device_mig_state {
   * value and decrease as migration data is read from the device.
   * The presence of the VFIO_PRECOPY_INFO_REINIT output flag indicates
   * that new initial data is present on the stream.
+ * The new initial data may result, for example, from device reconfiguration
+ * during migration that requires additional initialization data.


> 
>>
>> A user deducing they've sent enough device data to cover initial_bytes
>> is essentially what we have now because our protocol doesn't allow the
>> driver to reset initial_bytes.  The driver may choose to send that
>> reconfiguration information in dirty_bytes bytes, but we don't
>> currently have any way to indicate to the user that data remaining
>> there is of higher importance for startup on the target than any other
>> dirtying of device state.
>>
>> Hopefully the user/VMM is already polling the interface for dirty
>> bytes, where with the opt-in for the protocol change here, allows the
>> driver to split out the priority bytes versus the background dirtying.
>>   
>>> It definitely sounds a bit weird when some initial_* data can actually
>>> change, because it's not "initial_" anymore.
>>
>> It's just a priority scheme.  In the case I've outlined above it might
>> be more aptly named setup_bytes or critical_bytes as you've used, but
>> another driver might just use it for detecting migration compatibility.
>> Naming is hard.
> 
> Yep. :) initial_bytes is still fine at least to me.  I wonder if we could
> still update the document of this field, then it'll be good enough.

As Alex mentioned, initial_bytes can be used for various purposes.

So, I would keep the existing description in the uAPI.

In the context of the new feature, the uAPI commit message refers to 
initial_bytes as 'critical data', to explain the motivation behind the 
feature. Together with the extra chunk in the uAPI suggested above, I 
believe this clarifies the intended usage.

Makes sense?

> 
>>   
>>> Another question is, if initial_bytes reached zero, could it be boosted
>>> again to be non-zero?
>>
>> Under the new protocol, yes, and the REINIT flag would be set to indicate
>> it had been reset.  Under the old protocol, no.
>>   
>>> I don't see what stops it from happening, if the "we get some fresh new
>>> critical data" seem to be able to happen anytime..  but if so, I wonder if
>>> it's a problem to QEMU: when initial_bytes reported to 0 at least _once_ it
>>> means it's possible src QEMU decides to switchover.  Then looks like it
>>> beats the purpose of "don't switchover until we flush the critical data"
>>> whole idea.
>>
>> The definition of the protocol in the header stops it from happening.
>> We can't know that there isn't some userspace that follows the
>> deduction protocol rather than polling.  We don't know there isn't some
>> userspace that segfaults if initial_bytes doesn't follow the published
>> protocol.  Therefore opt-in where we have a mechanism to expose a new
>> initial_bytes session without it becoming a purely volatile value.
> 
> Here, IMHO the problem is QEMU still needs to know when a switchover can
> happen.
> 
> After a new QEMU probes this new driver feature bit and enables it,
> initial_bytes can be incremented when the REINIT flag is set.  This is fine on its
> own.  But then, src QEMU still needs to decide when it can switch over.
> 
> It seems to me the only way to do it (with/without the new feature bit
> enabled) is to rely on initial_bytes being zero.  When it's zero, it
> means all possible "critical data" has been moved, then src QEMU can
> kickoff that "switchover" message.
> 
> After that, IIUC we need to be prepared to trigger switchover anytime.
> 
> With the new REINIT, it means we can still observe REINIT event after src
> QEMU making that decision.  Would that be a problem?
> 
> Nowadays, when looking at vfio code, what happens is src QEMU after seeing
> initial_bytes==0 send one VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to dest QEMU,
> later dst QEMU will ack that by sending back MIG_RP_MSG_SWITCHOVER_ACK.
> Then switchover can happen anytime by the downtime calculation above.
> 
> Maybe there should be a solution in the userspace to fix it, but we'll need
> to figure it out.  Likely, we need one way or another to revoke the
> switchover message, so ultimately we need to stop VM, query the last time,
> seeing initial_bytes==0, then it can proceed with switchover.  If it sees
> initial_bytes nonzero again, it will need to restart the VM and revoke the
> previous message somehow.

The counterpart QEMU series that we pointed to handles that in a similar
way to what you described.

The switchover-ack mechanism is modified to be revocable, and a final
query to check initial_bytes == 0 is added after vCPUs are stopped.

Thanks,
Yishai

> 
>>   
>>> Is there a way the HW can report and confidently say no further critical
>>> data will be generated?
>>
>> So long as there's a guest userspace running that can reconfigure the
>> device, no.  But if you stop the vCPUs and test PRECOPY_INFO, it should
>> be reliable.
> 
> This is definitely an important piece of info.  I recall Zhiyi used to tell
> me there's no way to really stop a VFIO device from generating dirty data.
> Happy to know there still seems to be a way.  And now I suspect
> what Zhiyi observed was exactly seeing dirty_bytes growing even after VM
> stopped.  If that counter means "how much you can read" it all makes more
> sense (even though it may suffer from the issue I mentioned above).
> 
>>
>>>> The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
>>>> caller that new initial data is available in the migration stream.
>>>>
>>>> If the firmware reports a new initial-data chunk, any previously dirty
>>>> bytes in memory are treated as initial bytes, since the caller must read
>>>> both sets before reaching the end of the initial-data region.
>>>
>>> This is unfortunate.  I believe it's a limitation because of the current
>>> single fd streaming protocol, so HW can only append things because it's
>>> kind of a pipeline.
>>>
>>> One thing to mention is, I recall VFIO migration suffers from a major
>>> bottleneck on read() of the VFIO FD, it means this streaming whole design
>>> is also causing other perf issues.
>>>
>>> Have you or anyone thought about making it not a stream anymore?  Take
>>> example of RAM blocks: it is pagesize accessible, with that we can do a lot
>>> more, e.g. we don't need to streamline pages, we can send pages in whatever
>>> order.  Meanwhile, we can send pages concurrently because they're not
>>> streamlined too.
>>>
>>> I wonder if VFIO FDs can provide something like that too, as a start it
>>> doesn't need to be as fine granule, maybe at least instead of using one
>>> stream it can provide two streams, one for initial_bytes (or, I really
>>> think this should be called "critical data" or something similar, if it
>>> represents that rather than "some initial states", not anymore), another
>>> one for dirty.  Then at least when you attach new critical data you don't
>>> need to flush dirty queue too.
>>>
>>> If to extend it a bit more, then we can also make e.g. dirty queue to be
>>> multiple FDs, so that userspace can read() in multiple threads, speeding up
>>> the switchover phase.
>>>
>>> I had a vague memory that there's sometimes kernel big locks to block it,
>>> but from interfacing POV it sounds always better to avoid using one fd to
>>> stream everything.
>>
>> I'll leave it to others to brainstorm improvements, but I'll note that
>> flushing dirty_bytes is a driver policy, another driver could consider
>> unread dirty bytes as invalidated by new initial_bytes and reset
>> counters.
>>
>> It's not clear to me that there's a generic algorithm to use for handling
>> device state as addressable blocks rather than serialized into a data
>> stream.  Multiple streams of different priorities seem feasible, but
>> now we're talking about a v3 migration protocol.  Thanks,
> 
> Yep, definitely not a request to invent v3 yet, but just to brainstorm it.
> It doesn't need to be all-things addressable, index-able (e.g. via >1
> objects) would be also nice even through one fd, then it can also be
> threadified somehow.
> 
> It seems the HW designer needs to understand how hypervisor works on
> collecting these HW data, so it does look like a hard problem when it's all
> across the stack from silicon layer..
> 
> I just had a feeling that v3 (or more) will come at some point when we want
> to finally resolve the VFIO downtime problems..
> 
> Thanks,
> 


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-15 14:19         ` Yishai Hadas
@ 2026-03-16 19:24           ` Peter Xu
  2026-03-17  9:58             ` Avihai Horon
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2026-03-16 19:24 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Alex Williamson, jgg, kvm, kevin.tian, joao.m.martins, leonro,
	maorg, avihaih, clg, liulongfang, giovanni.cabiddu, kwankhede

On Sun, Mar 15, 2026 at 04:19:18PM +0200, Yishai Hadas wrote:
> On 12/03/2026 22:16, Peter Xu wrote:
> > On Thu, Mar 12, 2026 at 01:08:17PM -0600, Alex Williamson wrote:
> > > Hey Peter,
> > 
> > Hey, Alex,
> > 
> > > 
> > > On Thu, 12 Mar 2026 13:37:04 -0400
> > > Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > > Hi, Yishai,
> > > > 
> > > > Please feel free to treat my comments as pure questions only.
> > > > 
> > > > On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
> > > > > When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
> > > > > driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
> > > > > to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
> > > > > value.
> > > > 
> > > > Does it also mean that VFIO_PRECOPY_INFO_REINIT is almost only a hint that
> > > > can be deduced by the userspace too, if it remembers the last fetch of
> > > > initial_bytes?
> > > 
> > > I'll try to answer some of these.  PRECOPY_INFO is already just a hint.
> > > We essentially define initial_bytes as the "please copy this before
> > > migration to avoid high latency setup" and dirty_bytes is "I also have
> > > this much dirty state I could give to you now".  We've defined
> > > initial_bytes as monotonically decreasing, so a user could deduce that
> > > they've passed the intended high latency setup threshold, while
> > > dirty_bytes is purely volatile.
> > 
> > I see..  That might be another problem though to switchover decisions.
> > 
> > Currently, QEMU relies on dirty reporting to decide when to switchover.
> > 
> > What it does is ask all the modules how much dirty data is left; then
> > src QEMU does a sum and divides that sum by the estimated bandwidth to guess
> > the downtime.
> > 
> > When the estimated downtime is small enough so as to satisfy the user
> > specified downtime, QEMU src will switchover.  This didn't take
> > switchover_ack for VFIO into account, but it's a separate concept.
> > 
> > Above was based on the fact that the reported values are "total data", not
> > "what you can collect"..
> > 
> > Is there possible way to provide a total amount?  It can even be a maximum
> > total amount just to cap the downtime.
> 
> The total amount is already reported today via the
> VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl and QEMU accounts that in the
> switchover decision.

Ok, I somehow got the impression that initial+dirty should be the total
previously.  It's likely because I was referring to this piece of code in
QEMU:

static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
                                        uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;

    if (!vfio_device_state_is_precopy(vbasedev)) {
        return;
    }

    *must_precopy +=
        migration->precopy_init_size + migration->precopy_dirty_size;

    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
                                      *can_postcopy,
                                      migration->precopy_init_size,
                                      migration->precopy_dirty_size);
}

After you said so, I found indeed the exact() version is fetching the
stop-size:

static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                     uint64_t *can_postcopy)
{
    VFIODevice *vbasedev = opaque;
    VFIOMigration *migration = vbasedev->migration;
    uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;

    /*
     * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
     * reported so downtime limit won't be violated.
     */
    vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
    *must_precopy += stop_copy_size;

    if (vfio_device_state_is_precopy(vbasedev)) {
        vfio_query_precopy_size(migration);
    }

    trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
                                   stop_copy_size, migration->precopy_init_size,
                                   migration->precopy_dirty_size);
}

Do you know why the estimate version doesn't report a cached stop_size
instead?

Reporting different things will also confuse QEMU in its estimate() and
exact() hooks.  They should report the same thing except that the
estimate() can use a fast path for cached value.

> 
> >  If with the current reporting
> > definition, VM is destined to have unpredictable live migration downtime
> > when relevant VFIO devices are involved.
> > 
> > The larger the diff between the currently reported dirty value and the "total
> > data", the larger the downtime error can be.
> > 
> > > 
> > > The trouble comes, for example, if the device has undergone a
> > > reconfiguration during migration, which may effectively negate the
> > > initial_bytes and switchover-ack.
> > 
> > Ah so it's about that, thanks.  IMHO it might be great if Yishai could
> > mention the source of growing initial_bytes somewhere in the commit log, or
> > even when documenting the new feature bit.
> 
> Sure, we can add as part of V2 the below chunk when documenting the new
> feature.
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 90e51e84539d..bb4a2df0550d 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1268,6 +1268,8 @@ enum vfio_device_mig_state {
>   * value and decrease as migration data is read from the device.
>   * The presence of the VFIO_PRECOPY_INFO_REINIT output flag indicates
>   * that new initial data is present on the stream.
> + * The new initial data may result, for example, from device reconfiguration
> + * during migration that requires additional initialization data.

This is helpful at least to me, thanks.

> 
> 
> > 
> > > 
> > > A user deducing they've sent enough device data to cover initial_bytes
> > > is essentially what we have now because our protocol doesn't allow the
> > > driver to reset initial_bytes.  The driver may choose to send that
> > > reconfiguration information in dirty_bytes bytes, but we don't
> > > currently have any way to indicate to the user that data remaining
> > > there is of higher importance for startup on the target than any other
> > > dirtying of device state.
> > > 
> > > Hopefully the user/VMM is already polling the interface for dirty
> > > bytes, where with the opt-in for the protocol change here, allows the
> > > driver to split out the priority bytes versus the background dirtying.
> > > > It definitely sounds a bit weird when some initial_* data can actually
> > > > change, because it's not "initial_" anymore.
> > > 
> > > It's just a priority scheme.  In the case I've outlined above it might
> > > be more aptly named setup_bytes or critical_bytes as you've used, but
> > > another driver might just use it for detecting migration compatibility.
> > > Naming is hard.
> > 
> > Yep. :) initial_bytes is still fine at least to me.  I wonder if we could
> > still update the document of this field, then it'll be good enough.
> 
> As Alex mentioned, initial_bytes can be used for various purposes.
> 
> So, I would keep the existing description in the uAPI.
> 
> In the context of the new feature, the uAPI commit message refers to
> initial_bytes as 'critical data', to explain the motivation behind the
> feature. Together with the extra chunk in the uAPI suggested above, I
> believe this clarifies the intended usage.
> 
> Makes sense?

As long as Alex is happy with it, I'm OK either way.

> 
> > 
> > > > Another question is, if initial_bytes reached zero, could it be boosted
> > > > again to be non-zero?
> > > 
> > > Under the new protocol, yes, and the REINIT flag would be set to indicate
> > > it had been reset.  Under the old protocol, no.
> > > > I don't see what stops it from happening, if the "we get some fresh new
> > > > critical data" seem to be able to happen anytime..  but if so, I wonder if
> > > > it's a problem to QEMU: when initial_bytes reported to 0 at least _once_ it
> > > > means it's possible src QEMU decides to switchover.  Then looks like it
> > > > beats the purpose of "don't switchover until we flush the critical data"
> > > > whole idea.
> > > 
> > > The definition of the protocol in the header stops it from happening.
> > > We can't know that there isn't some userspace that follows the
> > > deduction protocol rather than polling.  We don't know there isn't some
> > > userspace that segfaults if initial_bytes doesn't follow the published
> > > protocol.  Therefore opt-in where we have a mechanism to expose a new
> > > initial_bytes session without it becoming a purely volatile value.
> > 
> > Here, IMHO the problem is QEMU still needs to know when a switchover can
> > happen.
> > 
> > After a new QEMU probes this new driver feature bit and enables it,
> > initial_bytes can be incremented when the REINIT flag is set.  This is fine on its
> > own.  But then, src QEMU still needs to decide when it can switch over.
> > 
> > It seems to me the only way to do it (with/without the new feature bit
> > enabled) is to rely on initial_bytes being zero.  When it's zero, it
> > means all possible "critical data" has been moved, then src QEMU can
> > kickoff that "switchover" message.
> > 
> > After that, IIUC we need to be prepared to trigger switchover anytime.
> > 
> > With the new REINIT, it means we can still observe REINIT event after src
> > QEMU making that decision.  Would that be a problem?
> > 
> > Nowadays, when looking at vfio code, what happens is src QEMU after seeing
> > initial_bytes==0 send one VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to dest QEMU,
> > later dst QEMU will ack that by sending back MIG_RP_MSG_SWITCHOVER_ACK.
> > Then switchover can happen anytime by the downtime calculation above.
> > 
> > Maybe there should be a solution in the userspace to fix it, but we'll need
> > to figure it out.  Likely, we need one way or another to revoke the
> > switchover message, so ultimately we need to stop VM, query the last time,
> > seeing initial_bytes==0, then it can proceed with switchover.  If it sees
> > initial_bytes nonzero again, it will need to restart the VM and revoke the
> > previous message somehow.
> 
> The counterpart QEMU series that we pointed to handles that in a similar way
> to what you described.
> 
> The switchover-ack mechanism is modified to be revocable, and a final query
> to check initial_bytes == 0 is added after vCPUs are stopped.

I only roughly skimmed the series and overlooked the QEMU branch link.  I
read it; indeed it should do most of the above, except for one possible issue I
found: QEMU shouldn't fail the migration when REINIT happens after
exact() but before vm_stop(); instead, IIUC, it should fall back to iterations,
try to move over the initial_bytes, and retry the switchover.

Thanks,

> 
> Thanks,
> Yishai
> 
> > 
> > > > Is there a way the HW can report and confidently say no further critical
> > > > data will be generated?
> > > 
> > > So long as there's a guest userspace running that can reconfigure the
> > > device, no.  But if you stop the vCPUs and test PRECOPY_INFO, it should
> > > be reliable.
> > 
> > This is definitely an important piece of info.  I recall Zhiyi used to tell
> > me there's no way to really stop a VFIO device from generating dirty data.
> > Happy to know there still seems to be a way.  And now I suspect
> > what Zhiyi observed was exactly seeing dirty_bytes growing even after VM
> > stopped.  If that counter means "how much you can read" it all makes more
> > sense (even though it may suffer from the issue I mentioned above).
> > 
> > > 
> > > > > The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
> > > > > caller that new initial data is available in the migration stream.
> > > > > 
> > > > > If the firmware reports a new initial-data chunk, any previously dirty
> > > > > bytes in memory are treated as initial bytes, since the caller must read
> > > > > both sets before reaching the end of the initial-data region.
> > > > 
> > > > This is unfortunate.  I believe it's a limitation because of the current
> > > > single fd streaming protocol, so HW can only append things because it's
> > > > kind of a pipeline.
> > > > 
> > > > One thing to mention is, I recall VFIO migration suffers from a major
> > > > bottleneck on read() of the VFIO FD, it means this streaming whole design
> > > > is also causing other perf issues.
> > > > 
> > > > Have you or anyone thought about making it not a stream anymore?  Take
> > > > example of RAM blocks: it is pagesize accessible, with that we can do a lot
> > > > more, e.g. we don't need to streamline pages, we can send pages in whatever
> > > > order.  Meanwhile, we can send pages concurrently because they're not
> > > > streamlined too.
> > > > 
> > > > I wonder if VFIO FDs can provide something like that too, as a start it
> > > > doesn't need to be as fine granule, maybe at least instead of using one
> > > > stream it can provide two streams, one for initial_bytes (or, I really
> > > > think this should be called "critical data" or something similar, if it
> > > > represents that rather than "some initial states", not anymore), another
> > > > one for dirty.  Then at least when you attach new critical data you don't
> > > > need to flush dirty queue too.
> > > > 
> > > > If to extend it a bit more, then we can also make e.g. dirty queue to be
> > > > multiple FDs, so that userspace can read() in multiple threads, speeding up
> > > > the switchover phase.
> > > > 
> > > > I had a vague memory that there's sometimes kernel big locks to block it,
> > > > but from interfacing POV it sounds always better to avoid using one fd to
> > > > stream everything.
> > > 
> > > I'll leave it to others to brainstorm improvements, but I'll note that
> > > flushing dirty_bytes is a driver policy, another driver could consider
> > > unread dirty bytes as invalidated by new initial_bytes and reset
> > > counters.
> > > 
> > > It's not clear to me that there's a generic algorithm to use for handling
> > > device state as addressable blocks rather than serialized into a data
> > > stream.  Multiple streams of different priorities seem feasible, but
> > > now we're talking about a v3 migration protocol.  Thanks,
> > 
> > Yep, definitely not a request to invent v3 yet, but just to brainstorm it.
> > It doesn't need to be all-things addressable, index-able (e.g. via >1
> > objects) would be also nice even through one fd, then it can also be
> > threadified somehow.
> > 
> > It seems the HW designer needs to understand how hypervisor works on
> > collecting these HW data, so it does look like a hard problem when it's all
> > across the stack from silicon layer..
> > 
> > I just had a feeling that v3 (or more) will come at some point when we want
> > to finally resolve the VFIO downtime problems..
> > 
> > Thanks,
> > 
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-16 19:24           ` Peter Xu
@ 2026-03-17  9:58             ` Avihai Horon
  2026-03-17 14:06               ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Avihai Horon @ 2026-03-17  9:58 UTC (permalink / raw)
  To: Peter Xu, Yishai Hadas
  Cc: Alex Williamson, jgg, kvm, kevin.tian, joao.m.martins, leonro,
	maorg, clg, liulongfang, giovanni.cabiddu, kwankhede

Hi Peter,

On 3/16/2026 9:24 PM, Peter Xu wrote:
>
> On Sun, Mar 15, 2026 at 04:19:18PM +0200, Yishai Hadas wrote:
>> On 12/03/2026 22:16, Peter Xu wrote:
>>> On Thu, Mar 12, 2026 at 01:08:17PM -0600, Alex Williamson wrote:
>>>> Hey Peter,
>>> Hey, Alex,
>>>
>>>> On Thu, 12 Mar 2026 13:37:04 -0400
>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>>> Hi, Yishai,
>>>>>
>>>>> Please feel free to treat my comments as pure questions only.
>>>>>
>>>>> On Tue, Mar 10, 2026 at 06:40:06PM +0200, Yishai Hadas wrote:
>>>>>> When userspace opts into VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2, the
>>>>>> driver may report the VFIO_PRECOPY_INFO_REINIT output flag in response
>>>>>> to the VFIO_MIG_GET_PRECOPY_INFO ioctl, along with a new initial_bytes
>>>>>> value.
>>>>> Does it also mean that VFIO_PRECOPY_INFO_REINIT is almost only a hint that
>>>>> can be deduced by the userspace too, if it remembers the last fetch of
>>>>> initial_bytes?
>>>> I'll try to answer some of these.  PRECOPY_INFO is already just a hint.
>>>> We essentially define initial_bytes as the "please copy this before
>>>> migration to avoid high latency setup" and dirty_bytes is "I also have
>>>> this much dirty state I could give to you now".  We've defined
>>>> initial_bytes as monotonically decreasing, so a user could deduce that
>>>> they've passed the intended high latency setup threshold, while
>>>> dirty_bytes is purely volatile.
>>> I see..  That might be another problem though to switchover decisions.
>>>
>>> Currently, QEMU relies on dirty reporting to decide when to switchover.
>>>
>>> What it does is ask all the modules how much dirty data is left; then
>>> src QEMU does a sum and divides that sum by the estimated bandwidth to guess
>>> the downtime.
>>>
>>> When the estimated downtime is small enough so as to satisfy the user
>>> specified downtime, QEMU src will switchover.  This didn't take
>>> switchover_ack for VFIO into account, but it's a separate concept.
>>>
>>> Above was based on the fact that the reported values are "total data", not
>>> "what you can collect"..
>>>
>>> Is there possible way to provide a total amount?  It can even be a maximum
>>> total amount just to cap the downtime.
>> The total amount is already reported today via the
>> VFIO_DEVICE_FEATURE_MIG_DATA_SIZE ioctl and QEMU accounts that in the
>> switchover decision.
> Ok, I somehow got the impression that initial+dirty should be the total
> previously.  It's likely because I was referring to this piece of code in
> QEMU:
>
> static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>                                          uint64_t *can_postcopy)
> {
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>
>      if (!vfio_device_state_is_precopy(vbasedev)) {
>          return;
>      }
>
>      *must_precopy +=
>          migration->precopy_init_size + migration->precopy_dirty_size;
>
>      trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>                                        *can_postcopy,
>                                        migration->precopy_init_size,
>                                        migration->precopy_dirty_size);
> }
>
> After you said so, I found indeed the exact() version is fetching the
> stop-size:
>
> static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>                                       uint64_t *can_postcopy)
> {
>      VFIODevice *vbasedev = opaque;
>      VFIOMigration *migration = vbasedev->migration;
>      uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>
>      /*
>       * If getting pending migration size fails, VFIO_MIG_STOP_COPY_SIZE is
>       * reported so downtime limit won't be violated.
>       */
>      vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>      *must_precopy += stop_copy_size;
>
>      if (vfio_device_state_is_precopy(vbasedev)) {
>          vfio_query_precopy_size(migration);
>      }
>
>      trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>                                     stop_copy_size, migration->precopy_init_size,
>                                     migration->precopy_dirty_size);
> }
>
> Do you know why the estimate version doesn't report a cached stop_size
> instead?
>
> Reporting different things will also confuse QEMU in its estimate() and
> exact() hooks.  They should report the same thing except that the
> estimate() can use a fast path for cached value.

Yes, this is because the VFIO device stop_copy_size may hold data that
can be transferred only when the device is stopped, i.e., during
switchover (as opposed to RAM, which is fully precopy-able).
Reporting it as part of estimate() didn't seem right, as precopy
iterations will not reduce it -- if it's big enough (above the
threshold), it may block future exact() calls, since we can't migrate
this data during pre-copy and get below the threshold again.

>
>>>   If with the current reporting
>>> definition, VM is destined to have unpredictable live migration downtime
>>> when relevant VFIO devices are involved.
>>>
>>> The larger the diff between the currently reported dirty value and the "total
>>> data", the larger the downtime error can be.
>>>
>>>> The trouble comes, for example, if the device has undergone a
>>>> reconfiguration during migration, which may effectively negate the
>>>> initial_bytes and switchover-ack.
>>> Ah so it's about that, thanks.  IMHO it might be great if Yishai could
>>> mention the source of growing initial_bytes somewhere in the commit log, or
>>> even when documenting the new feature bit.
>> Sure, we can add as part of V2 the below chunk when documenting the new
>> feature.
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 90e51e84539d..bb4a2df0550d 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -1268,6 +1268,8 @@ enum vfio_device_mig_state {
>>    * value and decrease as migration data is read from the device.
>>    * The presence of the VFIO_PRECOPY_INFO_REINIT output flag indicates
>>    * that new initial data is present on the stream.
>> + * The new initial data may result, for example, from device reconfiguration
>> + * during migration that requires additional initialization data.
> This is helpful at least to me, thanks.
>
>>
>>>> A user deducing they've sent enough device data to cover initial_bytes
>>>> is essentially what we have now because our protocol doesn't allow the
>>>> driver to reset initial_bytes.  The driver may choose to send that
>>>> reconfiguration information in dirty_bytes bytes, but we don't
>>>> currently have any way to indicate to the user that data remaining
>>>> there is of higher importance for startup on the target than any other
>>>> dirtying of device state.
>>>>
>>>> Hopefully the user/VMM is already polling the interface for dirty
>>>> bytes, where with the opt-in for the protocol change here, allows the
>>>> driver to split out the priority bytes versus the background dirtying.
>>>>> It definitely sounds a bit weird when some initial_* data can actually
>>>>> change, because it's not "initial_" anymore.
>>>> It's just a priority scheme.  In the case I've outlined above it might
>>>> be more aptly named setup_bytes or critical_bytes as you've used, but
>>>> another driver might just use it for detecting migration compatibility.
>>>> Naming is hard.
>>> Yep. :) initial_bytes is still fine at least to me.  I wonder if we could
>>> still update the documentation of this field, then it'll be good enough.
>> As Alex mentioned, initial_bytes can be used for various purposes.
>>
>> So, I would keep the existing description in the uAPI.
>>
>> In the context of the new feature, the uAPI commit message refers to
>> initial_bytes as 'critical data', to explain the motivation behind the
>> feature. Together with the extra chunk in the uAPI suggested above, I
>> believe this clarifies the intended usage.
>>
>> Makes sense?
> As long as Alex is happy with it, I'm OK either way.
>
>>>>> Another question is, if initial_bytes reached zero, could it be boosted
>>>>> again to be non-zero?
>>>> Under the new protocol, yes, and the REINIT flag would be set to
>>>> indicate it had been reset.  Under the old protocol, no.
>>>>> I don't see what stops it from happening, if the "we get some fresh new
>>>>> critical data" case can seemingly happen anytime..  but if so, I wonder if
>>>>> it's a problem for QEMU: when initial_bytes is reported as 0 at least
>>>>> _once_, it means src QEMU may decide to switch over.  Then it looks like
>>>>> it defeats the purpose of the whole "don't switchover until we flush the
>>>>> critical data" idea.
>>>> The definition of the protocol in the header stops it from happening.
>>>> We can't know that there isn't some userspace that follows the
>>>> deduction protocol rather than polling.  We don't know there isn't some
>>>> userspace that segfaults if initial_bytes doesn't follow the published
>>>> protocol.  Therefore opt-in where we have a mechanism to expose a new
>>>> initial_bytes session without it becoming a purely volatile value.
>>> Here, IMHO the problem is QEMU still needs to know when a switchover can
>>> happen.
>>>
>>> After a new QEMU probes this new driver feature bit and enables it,
>>> initial_bytes can be incremented when the REINIT flag is set.  This is fine
>>> on its own.  But then, src QEMU still needs to decide when it can switch over.
>>>
>>> It seems to me the only way to do it (with/without the new feature bit
>>> enabled) is to rely on initial_bytes being zero.  When it's zero, it
>>> means all possible "critical data" has been moved, then src QEMU can
>>> kick off that "switchover" message.
>>>
>>> After that, IIUC we need to be prepared to trigger switchover anytime.
>>>
>>> With the new REINIT, it means we can still observe REINIT event after src
>>> QEMU making that decision.  Would that be a problem?
>>>
>>> Nowadays, when looking at vfio code, what happens is src QEMU, after seeing
>>> initial_bytes==0, sends one VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to dest QEMU;
>>> later dst QEMU will ack that by sending back MIG_RP_MSG_SWITCHOVER_ACK.
>>> Then switchover can happen anytime by the downtime calculation above.
>>>
>>> Maybe there should be a solution in userspace to fix it, but we'll need
>>> to figure it out.  Likely, we need one way or another to revoke the
>>> switchover message, so ultimately we need to stop the VM, query one last
>>> time, and only proceed with switchover after seeing initial_bytes==0.  If
>>> it sees initial_bytes nonzero again, it will need to restart the VM and
>>> revoke the previous message somehow.
>> The counterpart QEMU series that we pointed to handles that in a similar way
>> to what you described.
>>
>> The switchover-ack mechanism is modified to be revoke-able and a final query
>> to check initial_bytes == 0 is added after vCPUs are stopped.
> I only roughly skimmed the series and overlooked the QEMU branch link.  I
> read it; indeed it should do most of the above, except one possible issue I
> found: QEMU shouldn't fail the migration when REINIT happens after
> exact() but before vm_stop(); instead IIUC it should fall back to iterations,
> try to move over the initial_bytes, and retry a switchover.

Yes, I agree, but it's a delicate flow in QEMU and we need to get into the
details.
Anyway, this case should be rare, and we can further discuss these
details when I send the QEMU series.

Thanks.

>
> Thanks,
>
>> Thanks,
>> Yishai
>>
>>>>> Is there a way the HW can report and confidently say no further critical
>>>>> data will be generated?
>>>> So long as there's a guest userspace running that can reconfigure the
>>>> device, no.  But if you stop the vCPUs and test PRECOPY_INFO, it should
>>>> be reliable.
>>> This is definitely an important piece of info.  I recall Zhiyi used to tell
>>> me there's no way to really stop a VFIO device from generating dirty data.
>>> Happy to know there seems to still be a way.  And now I suspect
>>> what Zhiyi observed was exactly seeing dirty_bytes growing even after the VM
>>> stopped.  If that counter means "how much you can read" it all makes more
>>> sense (even though it may suffer from the issue I mentioned above).
>>>
>>>>>> The presence of the VFIO_PRECOPY_INFO_REINIT flag indicates to the
>>>>>> caller that new initial data is available in the migration stream.
>>>>>>
>>>>>> If the firmware reports a new initial-data chunk, any previously dirty
>>>>>> bytes in memory are treated as initial bytes, since the caller must read
>>>>>> both sets before reaching the end of the initial-data region.
>>>>> This is unfortunate.  I believe it's a limitation because of the current
>>>>> single-fd streaming protocol, so HW can only append things because it's
>>>>> kind of a pipeline.
>>>>>
>>>>> One thing to mention is, I recall VFIO migration suffers from a major
>>>>> bottleneck on read() of the VFIO FD, which means this whole streaming
>>>>> design also causes other perf issues.
>>>>>
>>>>> Have you or anyone thought about making it not a stream anymore?  Take the
>>>>> example of RAM blocks: they are pagesize-accessible, and with that we can
>>>>> do a lot more, e.g. we don't need to serialize pages into a stream; we can
>>>>> send pages in whatever order.  Meanwhile, we can send pages concurrently
>>>>> because they're not serialized either.
>>>>>
>>>>> I wonder if VFIO FDs can provide something like that too; as a start it
>>>>> doesn't need to be as fine-grained, maybe at least instead of using one
>>>>> stream it can provide two streams, one for initial_bytes (or, I really
>>>>> think this should be called "critical data" or something similar, if it
>>>>> represents that rather than "some initial states"), another
>>>>> one for dirty.  Then at least when you attach new critical data you don't
>>>>> need to flush the dirty queue too.
>>>>>
>>>>> If we extend it a bit more, then we can also make e.g. the dirty queue
>>>>> multiple FDs, so that userspace can read() in multiple threads, speeding up
>>>>> the switchover phase.
>>>>>
>>>>> I had a vague memory that there are sometimes kernel big locks blocking it,
>>>>> but from an interfacing POV it sounds always better to avoid using one fd
>>>>> to stream everything.
>>>> I'll leave it to others to brainstorm improvements, but I'll note that
>>>> flushing dirty_bytes is a driver policy, another driver could consider
>>>> unread dirty bytes as invalidated by new initial_bytes and reset
>>>> counters.
>>>>
>>>> It's not clear to me that there's a generic algorithm to use for handling
>>>> device state as addressable blocks rather than serialized into a data
>>>> stream.  Multiple streams of different priorities seem feasible, but
>>>> now we're talking about a v3 migration protocol.  Thanks,
>>> Yep, definitely not a request to invent v3 yet, just brainstorming.
>>> It doesn't need to be all-things addressable; index-able (e.g. via >1
>>> objects) would also be nice even through one fd, then it can also be
>>> threadified somehow.
>>>
>>> It seems the HW designer needs to understand how the hypervisor works on
>>> collecting these HW data, so it does look like a hard problem when it's all
>>> across the stack from the silicon layer up..
>>>
>>> I just had a feeling that v3 (or more) will come at some point when we want
>>> to finally resolve the VFIO downtime problems..
>>>
>>> Thanks,
>>>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-17  9:58             ` Avihai Horon
@ 2026-03-17 14:06               ` Peter Xu
  2026-03-17 15:22                 ` Avihai Horon
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2026-03-17 14:06 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Yishai Hadas, Alex Williamson, jgg, kvm, kevin.tian,
	joao.m.martins, leonro, maorg, clg, liulongfang, giovanni.cabiddu,
	kwankhede

Hi, Avihai,

On Tue, Mar 17, 2026 at 11:58:28AM +0200, Avihai Horon wrote:
> Yes, this is because the VFIO device stop_copy_size may hold data that can
> be transferred only when the device is stopped, i.e., during switchover (as
> opposed to RAM which is fully precopy-able).
> Reporting it as part of estimate() didn't seem right, as precopy iterations
> will not reduce it -- if it's big enough (above the threshold), it may block
> future exact() calls as we can't migrate this data during pre-copy and reach
> below threshold again.

This doesn't look right either.  I think we should stick with exact()
being only the accurate version of estimate().  The problem might be
that the current interfacing for both may not suit VFIO well enough,
because VFIO is so far the only module that may contain stop-only data.

The other known issue (not directly relevant here, but we will need to look
into it very soon) is that when VFIO is involved the query-migrate results
sometimes do not contain VFIO data, at least the "total" and "remaining" fields.

I'll see if I can prepare some quick patch this week to address them, or at
least raise these problems explicitly so you can have a closer look if possible.

[...]

> Yes, I agree, but it's a delicate flow in QEMU and we need to get into the
> details.
> Anyway, this case should be rare, and we can further discuss these details
> when I send the QEMU series.

Yep, that's fine.

Thanks,

-- 
Peter Xu



* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-17 14:06               ` Peter Xu
@ 2026-03-17 15:22                 ` Avihai Horon
  2026-03-17 15:52                   ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Avihai Horon @ 2026-03-17 15:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yishai Hadas, Alex Williamson, jgg, kvm, kevin.tian,
	joao.m.martins, leonro, maorg, clg, liulongfang, giovanni.cabiddu,
	kwankhede


On 3/17/2026 4:06 PM, Peter Xu wrote:
>
> Hi, Avihai,
>
> On Tue, Mar 17, 2026 at 11:58:28AM +0200, Avihai Horon wrote:
>> Yes, this is because the VFIO device stop_copy_size may hold data that can
>> be transferred only when the device is stopped, i.e., during switchover (as
>> opposed to RAM which is fully precopy-able).
>> Reporting it as part of estimate() didn't seem right, as precopy iterations
>> will not reduce it -- if it's big enough (above the threshold), it may block
>> future exact() calls as we can't migrate this data during pre-copy and reach
>> below threshold again.
> This doesn't look right either.  I think we should stick with exact()
> being only the accurate version of estimate().  The problem might be
> that the current interfacing for both may not suit VFIO well enough,
> because VFIO is so far the only module that may contain stop-only data.

Yes, I don't think either approach fits perfectly. We can discuss this 
in detail as part of my QEMU series or as a separate topic.
However, I don't think this is directly related to, or poses any problem 
for, the precopy REINIT feature.

>
> The other known issue (not directly relevant here, but we will need to look
> into it very soon) is that when VFIO is involved the query-migrate results
> sometimes do not contain VFIO data, at least the "total" and "remaining" fields.
>
> I'll see if I can prepare some quick patch this week to address them, or at
> least raise these problems explicitly so you can have a closer look if possible.

Sure.

Thanks.

>
> [...]
>
>> Yes, I agree, but it's a delicate flow in QEMU and we need to get into the
>> details.
>> Anyway, this case should be rare, and we can further discuss these details
>> when I send the QEMU series.
> Yep, that's fine.
>
> Thanks,
>
> --
> Peter Xu
>


* Re: [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO
  2026-03-17 15:22                 ` Avihai Horon
@ 2026-03-17 15:52                   ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2026-03-17 15:52 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Yishai Hadas, Alex Williamson, jgg, kvm, kevin.tian,
	joao.m.martins, leonro, maorg, clg, liulongfang, giovanni.cabiddu,
	kwankhede

On Tue, Mar 17, 2026 at 05:22:22PM +0200, Avihai Horon wrote:
> Yes, I don't think either approach fits perfectly. We can discuss this in
> detail as part of my QEMU series or as a separate topic.
> However, I don't think this is directly related to, or poses any problem for,
> the precopy REINIT feature.

Oh definitely, it's a separate topic.

I hope what I will propose will make it accurate again, but we can wait
until we see the patches.

Thanks,

-- 
Peter Xu



end of thread, other threads:[~2026-03-17 15:52 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-10 16:40 [PATCH V1 vfio 0/6] Add support for PRE_COPY initial bytes re-initialization Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 1/6] vfio: Define uAPI for re-init initial bytes during the PRE_COPY phase Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 2/6] vfio: Add support for VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 3/6] vfio: Adapt drivers to use the core helper vfio_check_precopy_ioctl Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 4/6] net/mlx5: Add IFC bits for migration state Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 5/6] vfio/mlx5: consider inflight SAVE during PRE_COPY Yishai Hadas
2026-03-10 16:40 ` [PATCH V1 vfio 6/6] vfio/mlx5: Add REINIT support to VFIO_MIG_GET_PRECOPY_INFO Yishai Hadas
2026-03-12 17:37   ` Peter Xu
2026-03-12 19:08     ` Alex Williamson
2026-03-12 20:16       ` Peter Xu
2026-03-15 14:19         ` Yishai Hadas
2026-03-16 19:24           ` Peter Xu
2026-03-17  9:58             ` Avihai Horon
2026-03-17 14:06               ` Peter Xu
2026-03-17 15:22                 ` Avihai Horon
2026-03-17 15:52                   ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox